Performance Tuning

Guidelines for optimizing milk pipeline throughput and latency, covering measurement, CPU pinning, real-time scheduling, shared memory configuration, GPU acceleration, and network tuning.

1. Measuring Performance

1.1. Loop Frequency

Use milk-procinfo-list to monitor the Hz column for each compute unit. This is the most direct measure of pipeline throughput.

1.2. Semaphore Latency

Benchmark the fundamental IPC latency:

$ milk-semloopspeed 10000

Typical loop rates on modern hardware (higher is better):

x86_64 (bare metal): 200–500 kHz
ARM (embedded): 50–150 kHz
VM / container: 30–100 kHz (overhead from virtualization)
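
Because these figures are loop rates, a latency budget is easier to check after converting a rate to a time per iteration. The sketch below assumes the benchmark figure is a rate in kHz; `khz_to_us` is a hypothetical helper name, not a milk command.

```shell
# Hypothetical helper (not part of milk): convert a loop rate in kHz
# to the time spent per loop iteration, in microseconds.
khz_to_us() {
    # period [us] = 1000 / rate [kHz]; LC_ALL=C keeps "." as the decimal mark
    LC_ALL=C awk -v khz="$1" 'BEGIN {
        if (khz + 0 <= 0) {
            print "rate must be a positive number" > "/dev/stderr"
            exit 1
        }
        printf "%.2f\n", 1000 / khz
    }'
}
```

For example, `khz_to_us 200` prints `5.00`: a 200 kHz loop spends 5 microseconds per iteration.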

1.3. Stream Timing

Use milk-streamCTRL to observe per-stream frame rates and detect bottlenecks in the pipeline.

2. CPU Pinning

Bind compute-critical processes to dedicated CPU cores to avoid context-switch overhead and cache thrashing.

2.1. Using `taskset`

# Pin to core 4
$ taskset -c 4 milk-fpsexec-mymodule fpsinit

# Pin to cores 4-7
$ taskset -c 4-7 milk-fpsexec-mymodule fpsinit
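
To confirm that a pin took effect, the kernel reports each process's allowed CPUs in `/proc/<pid>/status`. A minimal sketch follows; `allowed_cpus` is a hypothetical helper, and its optional second argument (an alternate status file) exists only so the function can be tested without a live process.

```shell
# Hypothetical helper: print the CPU list a process is allowed to run on.
# $1: PID (default: the current shell)
# $2: status file to read (default: /proc/<PID>/status)
allowed_cpus() {
    pid="${1:-$$}"
    status_file="${2:-/proc/$pid/status}"
    # The kernel prints the list in range form, e.g. "4-7"
    awk '/^Cpus_allowed_list:/ { print $2 }' "$status_file"
}
```

`taskset -cp <pid>` reports the same information and can also change the affinity of a process that is already running.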

2.2. Kernel Isolation (`isolcpus`)

For maximum determinism, isolate cores from the Linux scheduler at boot time:

# Add to kernel command line (GRUB)
isolcpus=4,5,6,7

Then only explicitly pinned processes will run on those cores.

Important

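Isolated cores are excluded from the scheduler's load balancing, so ordinary user processes, including system services, stay on the remaining cores. `isolcpus` does not move hardware interrupts: steer those separately, for example with the `irqaffinity=` boot parameter or `/proc/irq/<n>/smp_affinity`. Per-CPU kernel threads also keep running on isolated cores. Because load balancing is disabled there, a process pinned to several isolated cores is not spread across them automatically; pin each thread to its own core instead.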
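
After a reboot, the kernel lists the isolated cores in `/sys/devices/system/cpu/isolated`, using the same range syntax as `taskset -c`. The sketch below checks whether a given core appears in such a list; `cpu_in_list` is a hypothetical helper and handles only plain comma-separated ranges such as `0-3,8`.

```shell
# Hypothetical helper: succeed if CPU number $2 appears in the
# kernel-style CPU list $1 (comma-separated ranges, e.g. "0-3,8").
cpu_in_list() {
    list="$1"; cpu="$2"
    old_ifs="$IFS"; IFS=','
    for range in $list; do
        lo="${range%-*}"   # "4-7" -> 4, "8" -> 8
        hi="${range#*-}"   # "4-7" -> 7, "8" -> 8
        if [ "$cpu" -ge "$lo" ] && [ "$cpu" -le "$hi" ]; then
            IFS="$old_ifs"
            return 0
        fi
    done
    IFS="$old_ifs"
    return 1
}
```

A possible check before pinning: `cpu_in_list "$(cat /sys/devices/system/cpu/isolated)" 4 && echo "core 4 is isolated"`.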

3. Real-Time Scheduling

3.1. `SCHED_FIFO`

For latency-critical loops, elevate the scheduling policy:

$ sudo chrt -f 49 milk-fpsexec-mymodule fpsinit

Priority range: 1 (lowest) to 99 (highest). Avoid 99, which is used by critical kernel threads such as the per-CPU migration threads.
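
Pinning and real-time scheduling are usually applied together, since `taskset` and `chrt` can be chained. Below is a sketch of a wrapper that does both; `rt_launch` and the `DRY_RUN` variable are hypothetical names introduced here, and the 1 to 98 bound simply encodes the advice above.

```shell
# Hypothetical wrapper: run a command pinned to a CPU list under SCHED_FIFO.
# Usage: rt_launch <cpu-list> <priority> <command> [args...]
# Set DRY_RUN=1 to print the resulting command line instead of running it.
rt_launch() {
    if [ "$#" -lt 3 ]; then
        echo "usage: rt_launch <cpu-list> <priority> <command> [args...]" >&2
        return 2
    fi
    cpus="$1"; prio="$2"; shift 2
    # Reject anything that is not a plain non-negative integer
    case "$prio" in
        ''|*[!0-9]*) echo "priority must be an integer" >&2; return 1 ;;
    esac
    # Stay below 99, which critical kernel threads use
    if [ "$prio" -lt 1 ] || [ "$prio" -gt 98 ]; then
        echo "priority must be in 1..98" >&2
        return 1
    fi
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "taskset -c $cpus chrt -f $prio $*"
    else
        taskset -c "$cpus" chrt -f "$prio" "$@"
    fi
}
```

`DRY_RUN=1 rt_launch 4 49 milk-fpsexec-mymodule fpsinit` prints `taskset -c 4 chrt -f 49 milk-fpsexec-mymodule fpsinit`. Without `DRY_RUN` the command is executed, which requires root or an adequate `rtprio` limit.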

3.2. Permissions

To allow non-root real-time scheduling:

# /etc/security/limits.d/milk.conf
@realtime  -  rtprio  95
@realtime  -  memlock unlimited

Add users to the realtime group. The new limits take effect at each user's next login.
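
In a fresh login session, `ulimit -r` (in bash) shows the real-time priority ceiling granted to the session. Below is a sketch of a check that a planned priority fits under it; `rtprio_allows` is a hypothetical helper that takes the limit as an argument so it can be exercised without changing system limits.

```shell
# Hypothetical helper: succeed if an rtprio limit permits a given priority.
# $1: limit as printed by `ulimit -r` (a number or "unlimited")
# $2: SCHED_FIFO priority you intend to request
rtprio_allows() {
    limit="$1"; prio="$2"
    [ "$limit" = "unlimited" ] && return 0
    [ "$prio" -le "$limit" ]
}
```

A possible use before launching: `rtprio_allows "$(ulimit -r)" 49 || echo "rtprio limit too low for priority 49" >&2`.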

4. Shared Memory Configuration

4.1. tmpfs Size

By default, /dev/shm may be limited to 50% of RAM. For large stream buffers, increase it:

# Temporary
$ sudo mount -o remount,size=16G /dev/shm

# Permanent: add to /etc/fstab
tmpfs /dev/shm tmpfs defaults,size=16G 0 0

4.2. Huge Pages

For very large streams, huge pages reduce TLB misses:

# Allocate 1024 huge pages (2 MB each = 2 GB)
$ echo 1024 | sudo tee \
    /proc/sys/vm/nr_hugepages
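
The page count follows from the amount of memory to back, divided by the huge page size and rounded up. Below is a sketch of that arithmetic; `hugepages_needed` is a hypothetical helper, and its 2048 kB default matches the 2 MB pages assumed above (the actual size on a given machine is the `Hugepagesize` field of `/proc/meminfo`).

```shell
# Hypothetical helper: number of huge pages needed to back $1 bytes.
# $2: huge page size in kB (default 2048, i.e. 2 MB pages)
hugepages_needed() {
    bytes="$1"; page_kb="${2:-2048}"
    page_bytes=$(( page_kb * 1024 ))
    # Ceiling division: a partially used page still occupies a whole page
    echo $(( (bytes + page_bytes - 1) / page_bytes ))
}
```

`hugepages_needed 2147483648` prints `1024`, matching the 2 GB example. The `HugePages_Total` and `HugePages_Free` fields of `/proc/meminfo` show how many pages the kernel actually reserved.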

To enable huge pages for ImageStreamIO shared memory streams ≥ 2 MB:

$ export MILK_SHM_HUGETLB=1

When set, ImageStreamIO_createIm_gpu() uses MAP_HUGETLB for its mmap() call. If huge pages are unavailable, it falls back to normal pages automatically.

5. GPU Acceleration (CUDA)

5.1. Build with CUDA

$ cmake .. -DUSE_CUDA=ON \
    -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda
$ make -j$(nproc)

5.2. GPU Memory Pinning

Pinning shared memory buffers with cudaHostRegister() gives the GPU zero-copy access to them. GPU-enabled modules (e.g., linalgebra with cuBLAS) do this automatically, so no manual step is normally needed.

5.3. Multi-GPU

Select the GPU by setting CUDA_VISIBLE_DEVICES when launching. The process then sees only the listed devices, renumbered from 0:

$ CUDA_VISIBLE_DEVICES=1 milk-fpsexec-gpumodule fpsinit

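6. Network Tuning (for Valkey sync)

When using milk-fps-valkey across hosts, minimize network latency. Nagle's algorithm is controlled per socket through the TCP_NODELAY option, which an application sets with setsockopt(); Linux has no net.ipv4.tcp_nodelay sysctl, so it cannot be disabled system-wide with sysctl -w. What can be tuned system-wide is the socket buffer ceiling: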

# Increase the maximum socket buffer sizes (16 MiB); requires root
$ sudo sysctl -w net.core.rmem_max=16777216
$ sudo sysctl -w net.core.wmem_max=16777216
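
Values set with `sysctl -w` are lost at reboot. One common way to persist them is a drop-in file under `/etc/sysctl.d/`. The sketch below only prints the fragment; both the `sysctl_fragment` helper and the file name in the usage note are hypothetical.

```shell
# Hypothetical helper: print a sysctl.d fragment that sets both socket
# buffer ceilings to $1 bytes (default 16777216, i.e. 16 MiB).
sysctl_fragment() {
    max="${1:-16777216}"
    printf 'net.core.rmem_max = %s\nnet.core.wmem_max = %s\n' "$max" "$max"
}
```

A possible use: `sysctl_fragment | sudo tee /etc/sysctl.d/90-milk-net.conf`, followed by `sudo sysctl --system` to load it without rebooting.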

7. Quick Checklist

| Item | Command / Config |
|---|---|
| Check loop Hz | `milk-procinfo-list` |
| Semaphore benchmark | `milk-semloopspeed 10000` |
| Pin to core 4 | `taskset -c 4 <cmd>` |
| RT priority 49 | `sudo chrt -f 49 <cmd>` |
| Increase `/dev/shm` | `sudo mount -o remount,size=16G /dev/shm` |
| Huge pages for streams | `export MILK_SHM_HUGETLB=1` |
| Enable CUDA | `cmake .. -DUSE_CUDA=ON` |
| Vec missed report | `cmake .. -DVEC_REPORT=ON` |
| PGO build | `cmake .. -DUSE_PGO=GENERATE` |
| Static LTO build | `cmake .. -DUSE_STATIC_LTO=ON` |
| PGO + Static LTO | `-DUSE_PGO=USE -DUSE_STATIC_LTO=ON` |
| Valkey socket buffers | `sudo sysctl -w net.core.rmem_max=16777216` (same for `wmem_max`) |
