
Conversation

@ludfjig
Contributor

@ludfjig ludfjig commented Jan 13, 2026

Should speed up memory-heavy operations in Hyperlight, such as restoring snapshots and copying memory parameters. u128 chunks proved to be faster than u64.

Inspired by postgres.

Signed-off-by: Ludvig Liljenberg <[email protected]>
@ludfjig ludfjig added kind/enhancement For PRs adding features, improving functionality, docs, tests, etc. area/performance Addresses performance labels Jan 13, 2026
@ludfjig ludfjig changed the title Optimize shared mem memcpy operations Optimize shared memory memcpy operations Jan 13, 2026
@ludfjig ludfjig changed the title Optimize shared memory memcpy operations Optimize shared memory operations Jan 13, 2026
Signed-off-by: Ludvig Liljenberg <[email protected]>

Change u64 to u128

Signed-off-by: Ludvig Liljenberg <[email protected]>
Contributor

Copilot AI left a comment

Pull request overview

This PR optimizes shared memory operations in Hyperlight by implementing chunked aligned memory access using u128 (16-byte) operations, significantly improving performance for memory-heavy operations like snapshot restoration and parameter copying. The optimization is inspired by PostgreSQL's memory handling techniques.

Changes:

  • Replaced byte-by-byte memory operations with aligned u128 chunk processing
  • Added comprehensive benchmarks for shared memory operations (fill, copy_to_slice, copy_from_slice)
  • Implemented three-phase approach: handle unaligned head bytes, process aligned chunks, handle remaining tail bytes
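
The three-phase approach listed above can be sketched for the fill case as follows. This is a minimal, self-contained illustration and not the PR's actual code: `fill_volatile` is a hypothetical name, and it operates on an ordinary byte slice instead of Hyperlight's shared memory type.

```rust
use std::mem::{align_of, size_of};

/// Hypothetical helper (not Hyperlight's API): fills `buf` with `value`
/// using volatile writes in three phases.
pub fn fill_volatile(buf: &mut [u8], value: u8) {
    const CHUNK: usize = size_of::<u128>();
    // Repeat the byte across all 16 bytes of a u128.
    let pattern = u128::from_ne_bytes([value; CHUNK]);
    let (base, len) = (buf.as_mut_ptr(), buf.len());
    let mut i = 0;

    // Phase 1: byte writes until `base + i` is u128-aligned or the buffer ends.
    // If `align_offset` returns usize::MAX, `min(len)` makes this loop cover
    // the whole buffer.
    let head_len = base.align_offset(align_of::<u128>()).min(len);
    while i < head_len {
        // SAFETY: i < len, so base + i is inside `buf`.
        unsafe { base.add(i).write_volatile(value) };
        i += 1;
    }

    // Phase 2: aligned 16-byte writes.
    while i + CHUNK <= len {
        // SAFETY: base + i is u128-aligned here and i + CHUNK <= len.
        unsafe { (base.add(i) as *mut u128).write_volatile(pattern) };
        i += CHUNK;
    }

    // Phase 3: remaining tail bytes.
    while i < len {
        // SAFETY: i < len, so base + i is inside `buf`.
        unsafe { base.add(i).write_volatile(value) };
        i += 1;
    }
}
```

Calling it on a sub-slice such as `&mut buf[3..90]` exercises all three phases, because the start is generally not 16-byte aligned and the length is not a multiple of 16.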

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| src/hyperlight_host/src/mem/shared_mem.rs | Optimized copy_to_slice, copy_from_slice, and fill methods with aligned u128 chunk operations |
| src/hyperlight_host/benches/benchmarks.rs | Added new benchmark suite for shared memory operations with 1MB and 64MB test sizes |


// Handle unaligned head bytes until we reach u128 alignment
let align_offset = base.align_offset(align_of::<u128>());
let head_len = align_offset.min(len);

Copilot AI Jan 13, 2026

The align_offset method can return usize::MAX if it's impossible to align the pointer (though this is rare in practice). When this happens, head_len = usize::MAX.min(len) = len, so all bytes are processed byte-by-byte in the head loop, and the aligned chunk loop is skipped (since i == len). While this is functionally correct, it means the optimization is completely bypassed in such cases. Consider adding a comment to clarify this behavior, or explicitly check for usize::MAX and document that the fallback to byte-by-byte operations is intentional in this case.

Suggested change
- let head_len = align_offset.min(len);
+ let head_len = if align_offset == usize::MAX {
+     // If alignment is impossible for this pointer, fall back to a fully
+     // byte-wise fill. This is functionally correct but skips the u128
+     // chunk optimization; this behavior is intentional and rare.
+     len
+ } else {
+     align_offset.min(len)
+ };
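
To make the sentinel behavior described in this comment concrete, here is a small sketch. The helper names `head_len` and `head_len_for` are hypothetical and exist only for illustration; the point is that clamping with `min(len)` already turns a `usize::MAX` result from `align_offset` into "process everything byte by byte".

```rust
use std::mem::align_of;

/// Hypothetical helper: number of leading bytes to process one at a time.
/// `offset` is whatever `align_offset` returned for the base pointer.
pub fn head_len(offset: usize, len: usize) -> usize {
    // usize::MAX (the "cannot align" sentinel) is clamped to `len`,
    // so the caller handles the whole range in the byte-wise head loop.
    offset.min(len)
}

/// Hypothetical helper: computes the head length for a real buffer.
pub fn head_len_for(buf: &[u8]) -> usize {
    let offset = buf.as_ptr().align_offset(align_of::<u128>());
    head_len(offset, buf.len())
}
```

With this clamping, the explicit `usize::MAX` branch in the suggestion computes the same value as the original one-liner, so it serves as documentation, not as a behavior change.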

Copilot uses AI. Check for mistakes.
let (hshm, _) = eshm.build();
let mut dst = vec![0u8; size];
b.iter(|| {
hshm.copy_to_slice(&mut dst, 0).unwrap();

Copilot AI Jan 13, 2026

The shared memory objects are created outside the benchmark iteration, which is correct. However, the copy_to_slice benchmark reads the same memory on every iteration, so warm CPU caches may make the results look better than a cold copy would. Consider passing the destination buffer through black_box so the compiler cannot treat the copy as dead code, or writing different data to shared memory between iterations so that cache effects do not dominate the benchmark results.

Suggested change
- hshm.copy_to_slice(&mut dst, 0).unwrap();
+ hshm.copy_to_slice(&mut dst, 0).unwrap();
+ criterion::black_box(&dst);
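
As a rough illustration of what `black_box` does here, the sketch below uses only the standard library (`std::hint::black_box` and `std::time::Instant`) instead of Criterion. `copy_bytes` and `time_copies` are hypothetical stand-ins, not the PR's benchmark code. Note that `black_box` is a hint to the compiler that the value is used; it does not flush or otherwise affect CPU caches.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the benchmarked copy; not Hyperlight's API.
pub fn copy_bytes(src: &[u8], dst: &mut [u8]) {
    dst.copy_from_slice(src);
}

/// Times `iters` copies. The source and the filled destination are passed
/// through `black_box` so the compiler has to assume both are observed.
pub fn time_copies(src: &[u8], dst: &mut [u8], iters: u32) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        copy_bytes(black_box(src), dst);
        black_box(&*dst);
    }
    start.elapsed()
}
```

A real benchmark would leave warm-up, iteration counts, and statistics to the benchmark harness; this only shows where the `black_box` calls sit relative to the measured operation.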

while i + CHUNK <= len {
let chunk: [u8; CHUNK] = slice[i..i + CHUNK]
.try_into()
.map_err(|_| new_error!("slice length mismatch"))?;

Copilot AI Jan 13, 2026

The error message "slice length mismatch" is not particularly helpful for debugging. Consider making it more descriptive, such as "Failed to convert slice to fixed-size array for u128 chunk" or including the expected and actual lengths in the error message.

Suggested change
- .map_err(|_| new_error!("slice length mismatch"))?;
+ .map_err(|_| {
+     new_error!(
+         "Failed to convert slice to fixed-size array for u128 chunk: \
+          expected length {}, got {} (total slice len {}, offset {})",
+         CHUNK,
+         slice[i..i + CHUNK].len(),
+         len,
+         i,
+     )
+ })?;
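
One observation that may help when weighing this suggestion: the loop condition `i + CHUNK <= len` already guarantees that `slice[i..i + CHUNK]` has exactly `CHUNK` bytes, so the conversion should not fail in practice. The sketch below shows the same guarantee expressed with `chunks_exact` from the standard library. `to_u128_chunks` is a hypothetical helper for illustration and is not part of the PR.

```rust
use std::convert::TryInto;

/// Hypothetical helper: decodes as many whole 16-byte chunks as fit in
/// `bytes` and returns them together with the leftover tail.
pub fn to_u128_chunks(bytes: &[u8]) -> (Vec<u128>, &[u8]) {
    const CHUNK: usize = std::mem::size_of::<u128>();
    let mut iter = bytes.chunks_exact(CHUNK);
    let mut out = Vec::with_capacity(bytes.len() / CHUNK);
    for chunk in iter.by_ref() {
        // chunks_exact only yields slices of exactly CHUNK bytes,
        // so this conversion cannot fail.
        let arr: [u8; CHUNK] = chunk.try_into().expect("chunk has CHUNK bytes");
        out.push(u128::from_ne_bytes(arr));
    }
    // Bytes that did not fill a whole chunk.
    (out, iter.remainder())
}
```

Whether the error path is kept with a richer message or restructured away is a design choice for the PR author; the sketch only shows that the length invariant can be carried by the iterator.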

Comment on lines +788 to +815
const CHUNK: usize = size_of::<u128>();
let len = slice.len();
let mut i = 0;

// Handle unaligned head bytes until we reach u128 alignment
let align_offset = base.align_offset(align_of::<u128>());
let head_len = align_offset.min(len);
while i < head_len {
    unsafe {
        slice[i] = base.add(i).read_volatile();
    }
    i += 1;
}

// Read aligned u128 chunks
while i + CHUNK <= len {
    let value = unsafe { (base.add(i) as *const u128).read_volatile() };
    slice[i..i + CHUNK].copy_from_slice(&value.to_ne_bytes());
    i += CHUNK;
}

// Handle remaining tail bytes
while i < len {
    unsafe {
        slice[i] = base.add(i).read_volatile();
    }
    i += 1;
}

Copilot AI Jan 13, 2026

The optimized memory operations should have test coverage for various alignment scenarios to ensure correctness. Consider adding tests that verify:

  1. Copies with different starting offsets (0, 1, 2, ..., 15 bytes) to test all possible alignment cases
  2. Copies of various lengths (less than 16 bytes, exactly 16 bytes, and lengths that don't align to 16-byte boundaries)
  3. Fills with unaligned start and various lengths

The existing tests only check basic functionality and boundary conditions but don't verify the alignment-specific logic paths introduced by this optimization.
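
A test along the lines suggested above could look roughly like the following sketch. It is self-contained and therefore uses a hypothetical stand-in, `copy_volatile`, that mirrors the three-phase read on a plain byte slice; an actual test in the repository would call the shared memory methods instead. `check_all_alignments` is also a hypothetical name.

```rust
use std::mem::{align_of, size_of};

/// Hypothetical stand-in for the optimized copy (not Hyperlight's API):
/// volatile byte reads for head and tail, aligned u128 reads in between.
pub fn copy_volatile(src: &[u8], dst: &mut [u8]) {
    assert_eq!(src.len(), dst.len());
    const CHUNK: usize = size_of::<u128>();
    let (base, len) = (src.as_ptr(), src.len());
    let head_len = base.align_offset(align_of::<u128>()).min(len);
    let mut i = 0;
    while i < head_len {
        // SAFETY: i < len, so base + i is inside `src`.
        dst[i] = unsafe { base.add(i).read_volatile() };
        i += 1;
    }
    while i + CHUNK <= len {
        // SAFETY: base + i is u128-aligned here and i + CHUNK <= len.
        let v = unsafe { (base.add(i) as *const u128).read_volatile() };
        dst[i..i + CHUNK].copy_from_slice(&v.to_ne_bytes());
        i += CHUNK;
    }
    while i < len {
        // SAFETY: i < len, so base + i is inside `src`.
        dst[i] = unsafe { base.add(i).read_volatile() };
        i += 1;
    }
}

/// Checks every start offset in 0..16 combined with every length in 0..=48,
/// which covers lengths below, equal to, and not a multiple of 16 bytes.
/// Returns the number of (offset, length) cases checked.
pub fn check_all_alignments() -> usize {
    // Distinct byte values make misplaced bytes visible.
    let backing: Vec<u8> = (0u8..128).collect();
    let mut cases = 0;
    for offset in 0..16 {
        for len in 0..=48 {
            let src = &backing[offset..offset + len];
            let mut dst = vec![0xFFu8; len];
            copy_volatile(src, &mut dst);
            assert_eq!(dst, src, "offset {}, len {}", offset, len);
            cases += 1;
        }
    }
    cases
}
```

Because the 16 consecutive start offsets cover every residue modulo 16, the head loop runs for 0 through 15 bytes regardless of how the backing allocation happens to be aligned.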

let mut i = 0;

// Handle unaligned head bytes until we reach u128 alignment
let align_offset = base.align_offset(align_of::<u128>());

Copilot AI Jan 13, 2026

The align_offset method can return usize::MAX if it's impossible to align the pointer (though this is rare in practice). When this happens, head_len = usize::MAX.min(len) = len, so all bytes are processed byte-by-byte in the head loop, and the aligned chunk loop is skipped (since i == len). While this is functionally correct, it means the optimization is completely bypassed in such cases. Consider adding a comment to clarify this behavior, or explicitly check for usize::MAX and document that the fallback to byte-by-byte operations is intentional in this case.

Suggested change
- let align_offset = base.align_offset(align_of::<u128>());
+ let align_offset = base.align_offset(align_of::<u128>());
+ if align_offset == usize::MAX {
+     // If the pointer cannot be aligned to u128, fall back to a full
+     // byte-by-byte copy. This is functionally correct and avoids
+     // relying on undefined alignment behavior; it just means the
+     // u128 chunk optimization is skipped in this rare case.
+     while i < len {
+         unsafe {
+             slice[i] = base.add(i).read_volatile();
+         }
+         i += 1;
+     }
+     drop(guard);
+     return Ok(());
+ }

// Handle unaligned head bytes until we reach u128 alignment
let align_offset = base.align_offset(align_of::<u128>());
let head_len = align_offset.min(len);

Copilot AI Jan 13, 2026

The align_offset method can return usize::MAX if it's impossible to align the pointer (though this is rare in practice). When this happens, head_len = usize::MAX.min(len) = len, so all bytes are processed byte-by-byte in the head loop, and the aligned chunk loop is skipped (since i == len). While this is functionally correct, it means the optimization is completely bypassed in such cases. Consider adding a comment to clarify this behavior, or explicitly check for usize::MAX and document that the fallback to byte-by-byte operations is intentional in this case.

Suggested change
- let head_len = align_offset.min(len);
+ // If align_offset is usize::MAX it's impossible to reach u128 alignment
+ // from `base`. In that case we intentionally fall back to byte-by-byte
+ // writes for the entire slice, which is functionally correct but skips
+ // the aligned chunk optimization below.
+ let head_len = if align_offset == usize::MAX {
+     len
+ } else {
+     align_offset.min(len)
+ };
