Optimize shared memory operations #1167
Signed-off-by: Ludvig Liljenberg <[email protected]>
Change u64 to u128 Signed-off-by: Ludvig Liljenberg <[email protected]>
5ac906a to ab50454
Pull request overview
This PR optimizes shared memory operations in Hyperlight by implementing chunked aligned memory access using u128 (16-byte) operations, significantly improving performance for memory-heavy operations like snapshot restoration and parameter copying. The optimization is inspired by PostgreSQL's memory handling techniques.
Changes:
- Replaced byte-by-byte memory operations with aligned u128 chunk processing
- Added comprehensive benchmarks for shared memory operations (fill, copy_to_slice, copy_from_slice)
- Implemented three-phase approach: handle unaligned head bytes, process aligned chunks, handle remaining tail bytes
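The three-phase approach listed above can be sketched as follows. This is a minimal illustration and not the PR's actual code: the function name `volatile_copy_to_slice` and its raw-pointer signature are assumptions made for the example, while the head, chunk, and tail loops mirror the snippets quoted in the review comments.

```rust
use std::mem::{align_of, size_of};

/// Hypothetical stand-in for a shared-memory read: copies `dst.len()` bytes
/// starting at `src` into `dst` using volatile reads.
///
/// # Safety
/// `src` must be valid for reads of `dst.len()` bytes.
pub unsafe fn volatile_copy_to_slice(src: *const u8, dst: &mut [u8]) {
    const CHUNK: usize = size_of::<u128>();
    let len = dst.len();
    let mut i = 0;

    // Phase 1: byte-wise reads until `src + i` is aligned for u128.
    // `align_offset` may return usize::MAX; `min(len)` then turns the
    // whole copy into a byte-wise copy.
    let head_len = src.align_offset(align_of::<u128>()).min(len);
    while i < head_len {
        dst[i] = unsafe { src.add(i).read_volatile() };
        i += 1;
    }

    // Phase 2: aligned u128 reads, one chunk at a time.
    while i + CHUNK <= len {
        let value = unsafe { (src.add(i) as *const u128).read_volatile() };
        dst[i..i + CHUNK].copy_from_slice(&value.to_ne_bytes());
        i += CHUNK;
    }

    // Phase 3: byte-wise reads for whatever is left after the last chunk.
    while i < len {
        dst[i] = unsafe { src.add(i).read_volatile() };
        i += 1;
    }
}
```

Because the chunk is converted with `to_ne_bytes`, the byte order in `dst` matches the byte order in memory regardless of the target's endianness.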
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/hyperlight_host/src/mem/shared_mem.rs | Optimized copy_to_slice, copy_from_slice, and fill methods with aligned u128 chunk operations |
| src/hyperlight_host/benches/benchmarks.rs | Added new benchmark suite for shared memory operations with 1MB and 64MB test sizes |
| // Handle unaligned head bytes until we reach u128 alignment | ||
| let align_offset = base.align_offset(align_of::<u128>()); | ||
| let head_len = align_offset.min(len); |
Copilot
AI
Jan 13, 2026
The align_offset method can return usize::MAX if it's impossible to align the pointer (though this is rare in practice). When this happens, head_len = usize::MAX.min(len) = len, so all bytes are processed byte-by-byte in the head loop, and the aligned chunk loop is skipped (since i == len). While this is functionally correct, it means the optimization is completely bypassed in such cases. Consider adding a comment to clarify this behavior, or explicitly check for usize::MAX and document that the fallback to byte-by-byte operations is intentional in this case.
| let head_len = align_offset.min(len); | |
| let head_len = if align_offset == usize::MAX { | |
| // If alignment is impossible for this pointer, fall back to a fully | |
| // byte-wise fill. This is functionally correct but skips the u128 | |
| // chunk optimization; this behavior is intentional and rare. | |
| len | |
| } else { | |
| align_offset.min(len) | |
| }; |
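To check the claim that the plain `min` already handles the `usize::MAX` case, the explicit branch can be pulled into a small function and compared against `align_offset.min(len)`. The helper name `head_len` is hypothetical and exists only for this sketch.

```rust
/// Hypothetical helper: how many leading bytes are processed one at a time.
/// `align_offset` is the value returned by `pointer::align_offset`, which is
/// `usize::MAX` when the pointer cannot be aligned.
fn head_len(align_offset: usize, len: usize) -> usize {
    if align_offset == usize::MAX {
        // Alignment is impossible: the whole range is handled byte-wise.
        len
    } else {
        align_offset.min(len)
    }
}

/// Returns true when the explicit branch agrees with the plain `min` form.
fn matches_plain_min(align_offset: usize, len: usize) -> bool {
    head_len(align_offset, len) == align_offset.min(len)
}
```

Since `len` can never exceed `usize::MAX`, both forms give the same result, so the explicit check documents intent rather than changing behavior.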
| let (hshm, _) = eshm.build(); | ||
| let mut dst = vec![0u8; size]; | ||
| b.iter(|| { | ||
| hshm.copy_to_slice(&mut dst, 0).unwrap(); |
Copilot
AI
Jan 13, 2026
The shared memory objects are created outside the benchmark iteration, which is correct. However, for the copy_to_slice benchmark, repeatedly reading the same memory might benefit from CPU cache warming, potentially giving unrealistic performance results. Consider using black_box on the destination buffer or writing different data to shared memory between iterations to prevent cache effects from dominating the benchmark results.
| hshm.copy_to_slice(&mut dst, 0).unwrap(); | |
| hshm.copy_to_slice(&mut dst, 0).unwrap(); | |
| criterion::black_box(&dst); |
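The same idea can be shown without Criterion, using only the standard library. This is a rough sketch under stated assumptions: `time_copies` is a hypothetical function, and an ordinary slice copy stands in for `copy_to_slice` on shared memory.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical timing loop: copies `src` into `dst` `iters` times and
/// returns the total elapsed time. Panics if the slice lengths differ.
fn time_copies(src: &[u8], dst: &mut [u8], iters: u32) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        // Hide the source from the optimizer so the copy is not hoisted.
        dst.copy_from_slice(black_box(src));
        // Mark the destination as observed so the copy is not elided.
        black_box(&*dst);
    }
    start.elapsed()
}
```

Note that `black_box` only constrains the compiler. It does not change what is in the CPU cache, so avoiding warm-cache effects would still require touching different data between iterations.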
| while i + CHUNK <= len { | ||
| let chunk: [u8; CHUNK] = slice[i..i + CHUNK] | ||
| .try_into() | ||
| .map_err(|_| new_error!("slice length mismatch"))?; |
Copilot
AI
Jan 13, 2026
The error message "slice length mismatch" is not particularly helpful for debugging. Consider making it more descriptive, such as "Failed to convert slice to fixed-size array for u128 chunk" or including the expected and actual lengths in the error message.
| .map_err(|_| new_error!("slice length mismatch"))?; | |
| .map_err(|_| { | |
| new_error!( | |
| "Failed to convert slice to fixed-size array for u128 chunk: \ | |
| expected length {}, got {} (total slice len {}, offset {})", | |
| CHUNK, | |
| slice[i..i + CHUNK].len(), | |
| len, | |
| i, | |
| ) | |
| })?; |
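A standalone version of this kind of descriptive error might look like the sketch below. It uses a plain `String` error instead of Hyperlight's `new_error!` macro, and the function `chunk_at` is hypothetical. Unlike the suggestion above, it avoids indexing `slice[i..i + CHUNK]` inside the error path, since that expression would panic before any error could be reported if the range were out of bounds.

```rust
const CHUNK: usize = std::mem::size_of::<u128>();

/// Hypothetical helper: returns the CHUNK bytes starting at offset `i`,
/// or an error describing the expected and actual lengths.
fn chunk_at(slice: &[u8], i: usize) -> Result<[u8; CHUNK], String> {
    // Take at most CHUNK bytes starting at `i`, without panicking when
    // `i` is past the end of the slice.
    let window: &[u8] = match slice.get(i..) {
        Some(rest) => &rest[..rest.len().min(CHUNK)],
        None => &[],
    };
    <[u8; CHUNK]>::try_from(window).map_err(|_| {
        format!(
            "failed to convert slice to u128 chunk: expected {} bytes, got {} \
             (total slice len {}, offset {})",
            CHUNK,
            window.len(),
            slice.len(),
            i
        )
    })
}
```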
| const CHUNK: usize = size_of::<u128>(); | ||
| let len = slice.len(); | ||
| let mut i = 0; | ||
| // Handle unaligned head bytes until we reach u128 alignment | ||
| let align_offset = base.align_offset(align_of::<u128>()); | ||
| let head_len = align_offset.min(len); | ||
| while i < head_len { | ||
| unsafe { | ||
| slice[i] = base.add(i).read_volatile(); | ||
| } | ||
| i += 1; | ||
| } | ||
| // Read aligned u128 chunks | ||
| while i + CHUNK <= len { | ||
| let value = unsafe { (base.add(i) as *const u128).read_volatile() }; | ||
| slice[i..i + CHUNK].copy_from_slice(&value.to_ne_bytes()); | ||
| i += CHUNK; | ||
| } | ||
| // Handle remaining tail bytes | ||
| while i < len { | ||
| unsafe { | ||
| slice[i] = base.add(i).read_volatile(); | ||
| } | ||
| i += 1; | ||
| } |
Copilot
AI
Jan 13, 2026
The optimized memory operations should have test coverage for various alignment scenarios to ensure correctness. Consider adding tests that verify:
- Copies with different starting offsets (0, 1, 2, ..., 15 bytes) to test all possible alignment cases
- Copies of various lengths (less than 16 bytes, exactly 16 bytes, and lengths that don't align to 16-byte boundaries)
- Fills with unaligned start and various lengths
The existing tests only check basic functionality and boundary conditions but don't verify the alignment-specific logic paths introduced by this optimization.
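One way such alignment coverage could look is sketched below for the fill path. Everything here is illustrative: `chunked_fill` is a stand-in that works on an ordinary byte slice instead of Hyperlight's shared memory types, and `check_all_alignments` is a hypothetical checker that sweeps offsets 0 through 15 and a range of lengths around the 16-byte chunk size.

```rust
use std::mem::{align_of, size_of};

/// Stand-in for the optimized fill: volatile writes in head, chunk, and
/// tail phases over an ordinary byte slice.
fn chunked_fill(buf: &mut [u8], value: u8) {
    const CHUNK: usize = size_of::<u128>();
    let len = buf.len();
    let base = buf.as_mut_ptr();
    let pattern = u128::from_ne_bytes([value; CHUNK]);
    let mut i = 0;

    // Head: single bytes until `base + i` is aligned for u128.
    let head_len = base.align_offset(align_of::<u128>()).min(len);
    while i < head_len {
        unsafe { base.add(i).write_volatile(value) };
        i += 1;
    }
    // Body: aligned u128 writes.
    while i + CHUNK <= len {
        unsafe { (base.add(i) as *mut u128).write_volatile(pattern) };
        i += CHUNK;
    }
    // Tail: remaining single bytes.
    while i < len {
        unsafe { base.add(i).write_volatile(value) };
        i += 1;
    }
}

/// Fills every (offset, len) window in a zeroed buffer and checks that
/// bytes inside the window are set and bytes outside it are untouched.
fn check_all_alignments() -> bool {
    let mut backing = vec![0u8; 128];
    for offset in 0..16usize {
        for len in [0usize, 1, 15, 16, 17, 31, 32, 33, 64] {
            backing.fill(0);
            chunked_fill(&mut backing[offset..offset + len], 0xAB);
            for (idx, &b) in backing.iter().enumerate() {
                let inside = idx >= offset && idx < offset + len;
                let expected: u8 = if inside { 0xAB } else { 0 };
                if b != expected {
                    return false;
                }
            }
        }
    }
    true
}
```

Sweeping 16 consecutive offsets covers every possible residue modulo 16, so the head, chunk, and tail paths are all exercised no matter how the allocator aligned the backing buffer.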
| let mut i = 0; | ||
| // Handle unaligned head bytes until we reach u128 alignment | ||
| let align_offset = base.align_offset(align_of::<u128>()); |
Copilot
AI
Jan 13, 2026
The align_offset method can return usize::MAX if it's impossible to align the pointer (though this is rare in practice). When this happens, head_len = usize::MAX.min(len) = len, so all bytes are processed byte-by-byte in the head loop, and the aligned chunk loop is skipped (since i == len). While this is functionally correct, it means the optimization is completely bypassed in such cases. Consider adding a comment to clarify this behavior, or explicitly check for usize::MAX and document that the fallback to byte-by-byte operations is intentional in this case.
| let align_offset = base.align_offset(align_of::<u128>()); | |
| let align_offset = base.align_offset(align_of::<u128>()); | |
| if align_offset == usize::MAX { | |
| // If the pointer cannot be aligned to u128, fall back to a full | |
| // byte-by-byte copy. This is functionally correct and avoids | |
| // relying on undefined alignment behavior; it just means the | |
| // u128 chunk optimization is skipped in this rare case. | |
| while i < len { | |
| unsafe { | |
| slice[i] = base.add(i).read_volatile(); | |
| } | |
| i += 1; | |
| } | |
| drop(guard); | |
| return Ok(()); | |
| } |
| // Handle unaligned head bytes until we reach u128 alignment | ||
| let align_offset = base.align_offset(align_of::<u128>()); | ||
| let head_len = align_offset.min(len); |
Copilot
AI
Jan 13, 2026
The align_offset method can return usize::MAX if it's impossible to align the pointer (though this is rare in practice). When this happens, head_len = usize::MAX.min(len) = len, so all bytes are processed byte-by-byte in the head loop, and the aligned chunk loop is skipped (since i == len). While this is functionally correct, it means the optimization is completely bypassed in such cases. Consider adding a comment to clarify this behavior, or explicitly check for usize::MAX and document that the fallback to byte-by-byte operations is intentional in this case.
| let head_len = align_offset.min(len); | |
| // If align_offset is usize::MAX it's impossible to reach u128 alignment | |
| // from `base`. In that case we intentionally fall back to byte-by-byte | |
| // writes for the entire slice, which is functionally correct but skips | |
| // the aligned chunk optimization below. | |
| let head_len = if align_offset == usize::MAX { | |
| len | |
| } else { | |
| align_offset.min(len) | |
| }; |
This should speed up memory-heavy operations in Hyperlight, such as restoring snapshots and copying memory parameters. u128 chunks proved to be faster than u64 chunks.
Inspired by PostgreSQL.