Optimize shared memory operations #1167

ludfjig · 2026-01-13T19:53:55Z

Should speed up memory-heavy operations in Hyperlight, such as restoring snapshots and copying memory parameters. u128 chunks proved to be faster than u64.

Inspired by PostgreSQL.

Signed-off-by: Ludvig Liljenberg <[email protected]>

Change u64 to u128

Signed-off-by: Ludvig Liljenberg <[email protected]>

Copilot

Pull request overview

This PR optimizes shared memory operations in Hyperlight by implementing chunked aligned memory access using u128 (16-byte) operations, significantly improving performance for memory-heavy operations like snapshot restoration and parameter copying. The optimization is inspired by PostgreSQL's memory handling techniques.

Changes:

Replaced byte-by-byte memory operations with aligned u128 chunk processing
Added comprehensive benchmarks for shared memory operations (fill, copy_to_slice, copy_from_slice)
Implemented three-phase approach: handle unaligned head bytes, process aligned chunks, handle remaining tail bytes
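The three-phase approach can be sketched for the fill case roughly as follows. This is a minimal illustration, not the PR's actual code: the name `fill_volatile` and the choice to take a plain `&mut [u8]` rather than Hyperlight's shared memory types are assumptions made so the example stands alone.

```rust
use std::mem::{align_of, size_of};

/// Hypothetical stand-in for the PR's `fill`: writes `value` into every byte
/// of `buf` using volatile writes, 16 bytes at a time where alignment allows.
pub fn fill_volatile(buf: &mut [u8], value: u8) {
    const CHUNK: usize = size_of::<u128>();
    let len = buf.len();
    let base = buf.as_mut_ptr();
    let mut i = 0;

    // Phase 1: byte writes until `base + i` is aligned for u128
    // (or until the buffer ends, whichever comes first).
    let head_len = base.align_offset(align_of::<u128>()).min(len);
    while i < head_len {
        // SAFETY: i < len, so the write stays inside `buf`.
        unsafe { base.add(i).write_volatile(value) };
        i += 1;
    }

    // Phase 2: aligned u128 writes, each covering CHUNK bytes.
    let pattern = u128::from_ne_bytes([value; CHUNK]);
    while i + CHUNK <= len {
        // SAFETY: i + CHUNK <= len, and `base + i` is u128-aligned because
        // phase 1 stopped exactly at the alignment boundary.
        unsafe { (base.add(i) as *mut u128).write_volatile(pattern) };
        i += CHUNK;
    }

    // Phase 3: remaining tail bytes (fewer than CHUNK).
    while i < len {
        // SAFETY: i < len, so the write stays inside `buf`.
        unsafe { base.add(i).write_volatile(value) };
        i += 1;
    }
}
```

If `align_offset` reports that alignment is impossible, `head_len` equals `len`, so phase 1 handles the whole buffer and phases 2 and 3 do nothing.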

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File	Description
src/hyperlight_host/src/mem/shared_mem.rs	Optimized `copy_to_slice`, `copy_from_slice`, and `fill` methods with aligned u128 chunk operations
src/hyperlight_host/benches/benchmarks.rs	Added new benchmark suite for shared memory operations with 1MB and 64MB test sizes

Copilot · 2026-01-13T21:14:08Z

src/hyperlight_host/src/mem/shared_mem.rs

+
+        // Handle unaligned head bytes until we reach u128 alignment
+        let align_offset = base.align_offset(align_of::<u128>());
+        let head_len = align_offset.min(len);


The align_offset method can return usize::MAX if it's impossible to align the pointer (though this is rare in practice). When this happens, head_len = usize::MAX.min(len) = len, so all bytes are processed byte-by-byte in the head loop, and the aligned chunk loop is skipped (since i == len). While this is functionally correct, it means the optimization is completely bypassed in such cases. Consider adding a comment to clarify this behavior, or explicitly check for usize::MAX and document that the fallback to byte-by-byte operations is intentional in this case.

Suggested change

-        let head_len = align_offset.min(len);
+        let head_len = if align_offset == usize::MAX {
+            // If alignment is impossible for this pointer, fall back to a fully
+            // byte-wise fill. This is functionally correct but skips the u128
+            // chunk optimization; this behavior is intentional and rare.
+            len
+        } else {
+            align_offset.min(len)
+        };
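To make the arithmetic in this comment concrete, the sketch below computes how many bytes each phase handles for a given `align_offset` result. The helper `phase_lengths` is hypothetical and exists only for illustration; it mirrors the `align_offset.min(len)` expression from the diff.

```rust
const CHUNK: usize = std::mem::size_of::<u128>();

/// Hypothetical helper: returns (head_bytes, chunk_count, tail_bytes) for a
/// buffer of `len` bytes whose pointer needs `align_offset` bytes to align.
pub fn phase_lengths(align_offset: usize, len: usize) -> (usize, usize, usize) {
    // Same expression as in the PR: usize::MAX collapses to `len` here.
    let head = align_offset.min(len);
    // Whole 16-byte chunks that fit after the head.
    let chunks = (len - head) / CHUNK;
    // Whatever is left over after the last full chunk.
    let tail = len - head - chunks * CHUNK;
    (head, chunks, tail)
}
```

With `align_offset == usize::MAX` and `len == 100`, this yields `(100, 0, 0)`: the head loop covers everything, which matches the behavior the comment describes without needing an explicit `usize::MAX` branch.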

Copilot · 2026-01-13T21:14:08Z

src/hyperlight_host/benches/benchmarks.rs

+                let (hshm, _) = eshm.build();
+                let mut dst = vec![0u8; size];
+                b.iter(|| {
+                    hshm.copy_to_slice(&mut dst, 0).unwrap();


The shared memory objects are created outside the benchmark iteration, which is correct. However, for the copy_to_slice benchmark, repeatedly reading the same memory might benefit from CPU cache warming, potentially giving unrealistic performance results. Consider using black_box on the destination buffer or writing different data to shared memory between iterations to prevent cache effects from dominating the benchmark results.

Suggested change

-                    hshm.copy_to_slice(&mut dst, 0).unwrap();
+                    hshm.copy_to_slice(&mut dst, 0).unwrap();
+                    criterion::black_box(&dst);
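As a dependency-free illustration of the same idea, the sketch below uses `std::hint::black_box` (stable since Rust 1.66) in a hand-rolled timing loop. The function `time_copies` and the use of a plain slice copy in place of `hshm.copy_to_slice` are assumptions for the sake of a standalone example. Note that `black_box` only keeps the optimizer from eliding the copy or its result; it does not flush or defeat CPU caches, so it addresses only part of the concern raised above.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical micro-benchmark loop: copies `src` into `dst` `iters` times
/// and returns the total elapsed time.
pub fn time_copies(src: &[u8], dst: &mut [u8], iters: u32) -> Duration {
    assert_eq!(src.len(), dst.len());
    let start = Instant::now();
    for _ in 0..iters {
        // Treat the source as opaque so the copy cannot be hoisted out.
        dst.copy_from_slice(black_box(src));
        // Treat the destination as "used" so the copy cannot be discarded.
        black_box(&*dst);
    }
    start.elapsed()
}
```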

Copilot · 2026-01-13T21:14:08Z

src/hyperlight_host/src/mem/shared_mem.rs

+        while i + CHUNK <= len {
+            let chunk: [u8; CHUNK] = slice[i..i + CHUNK]
+                .try_into()
+                .map_err(|_| new_error!("slice length mismatch"))?;


The error message "slice length mismatch" is not particularly helpful for debugging. Consider making it more descriptive, such as "Failed to convert slice to fixed-size array for u128 chunk" or including the expected and actual lengths in the error message.

Suggested change

-                .map_err(|_| new_error!("slice length mismatch"))?;
+                .map_err(|_| {
+                    new_error!(
+                        "Failed to convert slice to fixed-size array for u128 chunk: \
+                         expected length {}, got {} (total slice len {}, offset {})",
+                        CHUNK,
+                        slice[i..i + CHUNK].len(),
+                        len,
+                        i,
+                    )
+                })?;
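One observation on this suggestion: `slice[i..i + CHUNK]` either panics or has length exactly `CHUNK`, so the "got" value in the message can never differ from the expected one. A possible alternative, sketched below under the assumption that a `String` error stands in for Hyperlight's `new_error!`, is to use `slice.get(..)` so an out-of-range request becomes a descriptive error instead of a panic. The helper name `chunk_at` is hypothetical.

```rust
use std::convert::TryFrom;

const CHUNK: usize = std::mem::size_of::<u128>();

/// Hypothetical helper: reads CHUNK bytes starting at offset `i` as a u128,
/// returning a descriptive error if the range does not fit in `slice`.
pub fn chunk_at(slice: &[u8], i: usize) -> Result<u128, String> {
    // Guard against `i + CHUNK` overflowing usize.
    let end = i
        .checked_add(CHUNK)
        .ok_or_else(|| format!("offset {} overflows when adding chunk size {}", i, CHUNK))?;
    // `get` returns None for an out-of-range request instead of panicking.
    let bytes = slice
        .get(i..end)
        .and_then(|s| <[u8; CHUNK]>::try_from(s).ok())
        .ok_or_else(|| {
            format!(
                "expected {} bytes at offset {}, but slice length is {}",
                CHUNK,
                i,
                slice.len()
            )
        })?;
    Ok(u128::from_ne_bytes(bytes))
}
```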

Copilot · 2026-01-13T21:14:09Z

src/hyperlight_host/src/mem/shared_mem.rs

+        const CHUNK: usize = size_of::<u128>();
+        let len = slice.len();
+        let mut i = 0;
+
+        // Handle unaligned head bytes until we reach u128 alignment
+        let align_offset = base.align_offset(align_of::<u128>());
+        let head_len = align_offset.min(len);
+        while i < head_len {
            unsafe {
-                *b = base.wrapping_add(i).read_volatile();
+                slice[i] = base.add(i).read_volatile();
            }
+            i += 1;
        }
+
+        // Read aligned u128 chunks
+        while i + CHUNK <= len {
+            let value = unsafe { (base.add(i) as *const u128).read_volatile() };
+            slice[i..i + CHUNK].copy_from_slice(&value.to_ne_bytes());
+            i += CHUNK;
+        }
+
+        // Handle remaining tail bytes
+        while i < len {
+            unsafe {
+                slice[i] = base.add(i).read_volatile();
+            }
+            i += 1;
+        }


The optimized memory operations should have test coverage for various alignment scenarios to ensure correctness. Consider adding tests that verify:

Copies with different starting offsets (0, 1, 2, ..., 15 bytes) to test all possible alignment cases
Copies of various lengths (less than 16 bytes, exactly 16 bytes, and lengths that don't align to 16-byte boundaries)
Fills with unaligned start and various lengths

The existing tests only check basic functionality and boundary conditions but don't verify the alignment-specific logic paths introduced by this optimization.
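A rough sketch of what such alignment coverage might look like is shown below. It does not use Hyperlight's shared memory types: `copy_volatile` is a hypothetical stand-in that reproduces the three-phase read loop from the diff over an ordinary slice, and `check_alignment_cases` is an illustrative driver that sweeps every starting offset from 0 to 15 against several lengths.

```rust
use std::mem::{align_of, size_of};

/// Hypothetical stand-in for the PR's `copy_to_slice`, reading from `src`
/// with volatile loads using the head / aligned-chunk / tail structure.
fn copy_volatile(src: &[u8], dst: &mut [u8]) {
    assert_eq!(src.len(), dst.len());
    const CHUNK: usize = size_of::<u128>();
    let base = src.as_ptr();
    let len = dst.len();
    let mut i = 0;

    // Head: byte reads until `base + i` is u128-aligned.
    let head_len = base.align_offset(align_of::<u128>()).min(len);
    while i < head_len {
        // SAFETY: i < len == src.len().
        dst[i] = unsafe { base.add(i).read_volatile() };
        i += 1;
    }
    // Middle: aligned 16-byte reads.
    while i + CHUNK <= len {
        // SAFETY: range is in bounds and `base + i` is u128-aligned.
        let value = unsafe { (base.add(i) as *const u128).read_volatile() };
        dst[i..i + CHUNK].copy_from_slice(&value.to_ne_bytes());
        i += CHUNK;
    }
    // Tail: remaining bytes.
    while i < len {
        // SAFETY: i < len == src.len().
        dst[i] = unsafe { base.add(i).read_volatile() };
        i += 1;
    }
}

/// Checks every starting offset 0..16 against a range of lengths and
/// returns how many (offset, length) combinations were verified.
pub fn check_alignment_cases() -> usize {
    let backing: Vec<u8> = (0u8..96).collect();
    let mut checked = 0;
    for offset in 0..16 {
        for len in [0usize, 1, 15, 16, 17, 31, 32, 47] {
            let src = &backing[offset..offset + len];
            let mut dst = vec![0xFFu8; len];
            copy_volatile(src, &mut dst);
            assert_eq!(&dst[..], src, "offset {}, len {}", offset, len);
            checked += 1;
        }
    }
    checked
}
```

Because the sweep covers all 16 residues modulo the chunk size, it exercises the head, chunk, and tail paths regardless of how the allocator happens to align the backing buffer.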

Copilot · 2026-01-13T21:14:09Z

src/hyperlight_host/src/mem/shared_mem.rs

+        let mut i = 0;
+
+        // Handle unaligned head bytes until we reach u128 alignment
+        let align_offset = base.align_offset(align_of::<u128>());


The align_offset method can return usize::MAX if it's impossible to align the pointer (though this is rare in practice). When this happens, head_len = usize::MAX.min(len) = len, so all bytes are processed byte-by-byte in the head loop, and the aligned chunk loop is skipped (since i == len). While this is functionally correct, it means the optimization is completely bypassed in such cases. Consider adding a comment to clarify this behavior, or explicitly check for usize::MAX and document that the fallback to byte-by-byte operations is intentional in this case.

Suggested change

-        let align_offset = base.align_offset(align_of::<u128>());
+        let align_offset = base.align_offset(align_of::<u128>());
+        if align_offset == usize::MAX {
+            // If the pointer cannot be aligned to u128, fall back to a full
+            // byte-by-byte copy. This is functionally correct and avoids
+            // relying on undefined alignment behavior; it just means the
+            // u128 chunk optimization is skipped in this rare case.
+            while i < len {
+                unsafe {
+                    slice[i] = base.add(i).read_volatile();
+                }
+                i += 1;
+            }
+            drop(guard);
+            return Ok(());
+        }

Copilot · 2026-01-13T21:14:09Z

src/hyperlight_host/src/mem/shared_mem.rs

+
+        // Handle unaligned head bytes until we reach u128 alignment
+        let align_offset = base.align_offset(align_of::<u128>());
+        let head_len = align_offset.min(len);


The align_offset method can return usize::MAX if it's impossible to align the pointer (though this is rare in practice). When this happens, head_len = usize::MAX.min(len) = len, so all bytes are processed byte-by-byte in the head loop, and the aligned chunk loop is skipped (since i == len). While this is functionally correct, it means the optimization is completely bypassed in such cases. Consider adding a comment to clarify this behavior, or explicitly check for usize::MAX and document that the fallback to byte-by-byte operations is intentional in this case.

Suggested change

-        let head_len = align_offset.min(len);
+        // If align_offset is usize::MAX it's impossible to reach u128 alignment
+        // from `base`. In that case we intentionally fall back to byte-by-byte
+        // writes for the entire slice, which is functionally correct but skips
+        // the aligned chunk optimization below.
+        let head_len = if align_offset == usize::MAX {
+            len
+        } else {
+            align_offset.min(len)
+        };

Benchmark for sharedmemory operations

d9e68b1

Signed-off-by: Ludvig Liljenberg <[email protected]>

ludfjig added kind/enhancement For PRs adding features, improving functionality, docs, tests, etc. area/performance Addresses performance labels Jan 13, 2026

ludfjig changed the title ~~Optimize shared mem memcpy operations~~ Optimize shared memory memcpy operations Jan 13, 2026

ludfjig changed the title ~~Optimize shared memory memcpy operations~~ Optimize shared memory operations Jan 13, 2026

Optimize shared memory operations

ab50454

Change u64 to u128

Signed-off-by: Ludvig Liljenberg <[email protected]>

ludfjig force-pushed the optimize_shared_mem branch from 5ac906a to ab50454 Compare January 13, 2026 20:54

ludfjig requested a review from Copilot January 13, 2026 21:09

Copilot started reviewing on behalf of ludfjig January 13, 2026 21:10
