Performance Tuning¶
Guidelines for optimizing milk pipeline throughput and
latency, covering OS-level configuration, process
scheduling, memory layout, and GPU acceleration.
See also: Process Info · Streams · FPS · Debugging · FAQ · Code-Level Optimization Rules · Code Assist Tools
1. Measuring Performance¶
1.1. Loop Frequency¶
Use milk-procinfo-list to monitor the Hz column for
each compute unit. This is the most direct measure of
pipeline throughput.
1.2. Semaphore Latency¶
Benchmark the fundamental IPC latency with milk-semloopspeed:
$ milk-semloopspeed 10000
Typical semaphore loop rates on modern hardware (higher is better):
- x86_64 (bare metal): 200–500 kHz
- ARM (embedded): 50–150 kHz
- VM / container: 30–100 kHz (overhead from virtualization)
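The figures above are rates, not times. To read them as latency, a rate of f kHz corresponds to 1000/f microseconds per iteration. A minimal sketch of that conversion (the helper name `khz_to_us` is hypothetical and not part of milk):

```shell
# Convert a loop rate in kHz to the time per iteration in microseconds.
khz_to_us() {
    awk -v khz="$1" 'BEGIN { printf "%.2f\n", 1000 / khz }'
}

khz_to_us 200   # bare-metal lower bound
khz_to_us 30    # VM / container lower bound
```

So 200 kHz is about 5 µs per iteration, while 30 kHz is about 33 µs.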
1.3. Stream Timing¶
Use milk-streamCTRL to observe per-stream frame rates
and detect bottlenecks in the pipeline.
2. CPU Pinning¶
Bind compute-critical processes to dedicated CPU cores to avoid context-switch overhead and cache thrashing.
2.1. Using taskset¶
# Pin to core 4
$ taskset -c 4 milk-fpsexec-mymodule fpsinit
# Pin to cores 4-7
$ taskset -c 4-7 milk-fpsexec-mymodule fpsinit
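To confirm that a pin took effect, the kernel reports each process's allowed CPUs in the `Cpus_allowed_list` field of /proc/<pid>/status (`taskset -cp <pid>` prints the same information). A small sketch that extracts the field; the helper name `allowed_cpus` is hypothetical:

```shell
# Print the allowed-CPU list from /proc/<pid>/status text supplied on stdin.
allowed_cpus() {
    awk -F':[ \t]*' '$1 == "Cpus_allowed_list" { print $2 }'
}

# On a live Linux system: allowed_cpus < /proc/<pid>/status
# Demonstration with sample status text:
printf 'Name:\tdemo\nCpus_allowed_list:\t4-7\n' | allowed_cpus
```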
2.2. Kernel Isolation (isolcpus)¶
For maximum determinism, isolate cores from the Linux scheduler at boot time by adding the isolcpus kernel parameter (for example, isolcpus=4-7) to the kernel command line.
Then only explicitly pinned processes will run on those cores.
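[!IMPORTANT] Isolated cores are excluded from normal scheduling, so system services and other unpinned processes are confined to the remaining cores. Hardware interrupts are not moved by isolcpus alone; steer them away from the isolated cores separately (for example with the irqaffinity= boot parameter or /proc/irq/<n>/smp_affinity).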
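After rebooting, it is worth checking that the parameter actually reached the kernel. The sketch below pulls the isolcpus value out of a kernel command line; the helper name `isolcpus_value` is hypothetical. On GRUB-based systems the parameter is typically added to GRUB_CMDLINE_LINUX in /etc/default/grub, though the exact procedure depends on the distribution.

```shell
# Extract the isolcpus= value from a kernel command line read on stdin.
isolcpus_value() {
    tr ' ' '\n' | sed -n 's/^isolcpus=//p'
}

# On a live Linux system: isolcpus_value < /proc/cmdline
# Demonstration with a sample command line:
echo 'BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro isolcpus=4-7 quiet' | isolcpus_value
```

The kernel also lists the currently isolated cores in /sys/devices/system/cpu/isolated.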
3. Real-Time Scheduling¶
3.1. SCHED_FIFO¶
For latency-critical loops, elevate the scheduling policy to SCHED_FIFO with chrt:
$ sudo chrt -f 49 <cmd>
Priority range: 1 (lowest) to 99 (highest). Avoid 99, which critical kernel threads use; a user loop at that priority can starve them.
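The priority rule can be enforced in a launcher so that a typo never starts a loop at priority 99. A minimal sketch; both `valid_rt_prio` and `run_fifo` are hypothetical helpers, and the wrapper assumes `chrt` and `sudo` are available:

```shell
# Accept only integer priorities from 1 to 98 (99 is left to kernel threads).
valid_rt_prio() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    [ "$1" -ge 1 ] && [ "$1" -le 98 ]
}

# Hypothetical wrapper: run_fifo PRIO CMD [ARGS...]
run_fifo() {
    prio=$1; shift
    valid_rt_prio "$prio" || { echo "priority must be 1-98" >&2; return 1; }
    sudo chrt -f "$prio" "$@"
}
```

For example, `run_fifo 49 <cmd>` behaves like `sudo chrt -f 49 <cmd>`, while `run_fifo 99 <cmd>` is refused.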
3.2. Permissions¶
To allow non-root real-time scheduling, grant the realtime group permission to use real-time priorities, then add users to that group.
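On systems that use PAM limits, this permission is commonly granted through the `rtprio` item in a file under /etc/security/limits.d/. The sketch below prints such a fragment. The file name 99-realtime.conf, the cap of 98, and the optional memlock line are assumptions for illustration, not milk requirements.

```shell
# Print a PAM limits fragment for a group (default: realtime).
# Assumed destination: /etc/security/limits.d/99-realtime.conf (hypothetical name).
rt_limits_conf() {
    group=${1:-realtime}
    printf '@%s - rtprio 98\n' "$group"           # allow RT priorities up to 98
    printf '@%s - memlock unlimited\n' "$group"   # optional: allow locking memory
}

rt_limits_conf realtime
# Install and enroll a user (requires root; applies at the next login):
#   rt_limits_conf realtime | sudo tee /etc/security/limits.d/99-realtime.conf
#   sudo usermod -aG realtime "$USER"
```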
4. Shared Memory Configuration¶
4.1. tmpfs Size¶
By default, /dev/shm is a tmpfs limited to 50% of RAM.
For large stream buffers, increase it:
# Temporary
$ sudo mount -o remount,size=16G /dev/shm
# Permanent: add to /etc/fstab
tmpfs /dev/shm tmpfs defaults,size=16G 0 0
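Before resizing, it helps to see how full /dev/shm already is. A small sketch that reads POSIX-format `df -P` output and prints the use percentage; the helper name `df_use_pct` is hypothetical:

```shell
# Read `df -P` output on stdin and print the use percentage of the first filesystem.
df_use_pct() {
    awk 'NR == 2 { sub(/%$/, "", $5); print $5 }'
}

# On a live system: df -P /dev/shm | df_use_pct
# Demonstration with sample output (4 GiB used out of 16 GiB):
printf '%s\n' \
    'Filesystem 1024-blocks    Used Available Capacity Mounted on' \
    'tmpfs         16777216 4194304  12582912      25% /dev/shm' | df_use_pct
```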
4.2. Huge Pages¶
For very large streams, huge pages reduce TLB misses.
To enable huge pages for ImageStreamIO shared
memory streams ≥ 2 MB, set MILK_SHM_HUGETLB=1 in the
environment (export MILK_SHM_HUGETLB=1).
When set, ImageStreamIO_createIm_gpu() uses
MAP_HUGETLB for its mmap() call. If huge
pages are unavailable, it falls back to normal
pages automatically.
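MAP_HUGETLB mappings draw from a pool that the kernel must have reserved beforehand, so the fallback is taken whenever the pool is too small. The sketch below estimates how many pages a stream needs, assuming the 2 MiB huge page size that is the x86_64 default. The helper name `hugepages_needed` is hypothetical, and the result is a lower bound because stream metadata adds to the buffer size.

```shell
# Number of 2 MiB huge pages needed for a buffer of the given size in bytes (rounded up).
hugepages_needed() {
    page=$((2 * 1024 * 1024))
    echo $(( ($1 + page - 1) / page ))
}

hugepages_needed 16777216    # a 16 MiB stream
# Reserve and inspect the pool (requires root; Linux only):
#   sudo sysctl -w vm.nr_hugepages="$(hugepages_needed 16777216)"
#   grep HugePages_ /proc/meminfo
```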
5. GPU Acceleration (CUDA)¶
5.1. Build with CUDA¶
Configure the build with the USE_CUDA option enabled:
$ cmake .. -DUSE_CUDA=ON
5.2. GPU Memory Pinning¶
Use cudaHostRegister() to pin shared memory buffers
for zero-copy GPU access. This is handled automatically
by GPU-enabled modules (e.g., linalgebra with cuBLAS).
5.3. Multi-GPU¶
Set the GPU device before launching:
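One common way to do this for CUDA programs in general is the standard CUDA_VISIBLE_DEVICES environment variable, which restricts the devices the CUDA runtime exposes to a process. The sketch below assumes the milk module relies on the runtime's default device selection; the helper name `with_gpu` is hypothetical.

```shell
# Hypothetical helper: run a command with only the given GPU index visible to CUDA.
with_gpu() {
    gpu=$1; shift
    CUDA_VISIBLE_DEVICES="$gpu" "$@"
}

# Example: with_gpu 1 <cmd>
# Demonstration that the variable reaches the child process:
with_gpu 1 sh -c 'echo "visible GPUs: $CUDA_VISIBLE_DEVICES"'
```

Inside the launched process, the selected GPU then appears as device 0.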
6. Network Tuning (for Valkey sync)¶
When using milk-fps-valkey across hosts, minimize
network latency:
# Nagle's algorithm is disabled per socket with the TCP_NODELAY option;
# Linux has no system-wide sysctl for it
# Raise the maximum socket buffer sizes to 16 MiB
$ sudo sysctl -w net.core.rmem_max=16777216
$ sudo sysctl -w net.core.wmem_max=16777216
See also: Valkey Integration
7. Quick Checklist¶
| Item | Command / Config |
|---|---|
| Check loop Hz | milk-procinfo-list |
| Semaphore benchmark | milk-semloopspeed 10000 |
| Pin to core 4 | taskset -c 4 <cmd> |
| RT priority 49 | sudo chrt -f 49 <cmd> |
| Increase /dev/shm | sudo mount -o remount,size=16G /dev/shm |
| Huge pages for streams | export MILK_SHM_HUGETLB=1 |
| Enable CUDA | cmake .. -DUSE_CUDA=ON |
| Missed-vectorization report | cmake .. -DVEC_REPORT=ON |
| PGO build (profile generation) | cmake .. -DUSE_PGO=GENERATE |
| Static LTO build | cmake .. -DUSE_STATIC_LTO=ON |
| PGO + Static LTO | cmake .. -DUSE_PGO=USE -DUSE_STATIC_LTO=ON |
| Valkey socket buffers | sudo sysctl -w net.core.rmem_max=16777216 net.core.wmem_max=16777216 |