Profile-Guided Optimization (PGO) & Link-Time Optimization (LTO)
The milk build system supports two complementary compiler optimization techniques for fpsexec standalone executables:
| Technique | CMake Flag | Typical Speedup |
|---|---|---|
| PGO | -DUSE_PGO=GENERATE/USE | 10–30% |
| LTO | -DUSE_STATIC_LTO=ON | 5–15% |
| PGO + LTO | Both | 15–40% |
Tip
For maximum performance on production AO systems, enable both PGO and static LTO. The techniques are complementary — LTO exposes cross-library code to GCC, and PGO then trains the optimizer with real branch/call data across that larger scope.
1. Link-Time Optimization (LTO)
1.1. What LTO Does
LTO (-flto=auto) allows GCC to optimize across translation-unit boundaries. Instead of compiling each .c file in isolation, GCC serializes its intermediate representation (GIMPLE IR) into object files and performs whole-program optimization at link time.
Without static LTO, standalone executables link shared libraries (.so), which are opaque — the linker cannot see inside them:
fpsexec.c → fpsexec.o → fpsexec
↓ (dynamic)
libmilkfps.so ← opaque to LTO
libImageStreamIO.so ← opaque
With static LTO (USE_STATIC_LTO=ON), the same libraries are linked as static archives (.a). GCC can now inline across all library boundaries:
fpsexec.c → fpsexec.o ──┐
libmilkfps.a ──────────────┼→ LTO link → fpsexec
libImageStreamIO.a ────────┤ (full visibility)
libCOREMODmemory_compute.a ─┘
1.2. Why Static Linking Is Faster
Statically linked executables eliminate several layers of runtime overhead present in dynamically linked binaries:
PLT/GOT elimination. Every call to a shared library function goes through the Procedure Linkage Table (PLT) and Global Offset Table (GOT), which adds an indirect jump and a GOT load to each call. In a tight real-time loop calling ImageStreamIO_sempost() or fps_to_local() at tens of kHz, these indirect jumps add measurable latency. Static linking replaces them with direct calls or inlined code.
Reduced dynamic loader overhead. At startup, ld.so must resolve all symbol relocations, map .so pages, and apply RELRO protections. Standalone executables with 14 shared libraries pay this cost on every launch. A static LTO binary resolves only its few remaining libc-family dependencies (see section 1.7), so it starts faster, which matters for rapid fault recovery in production AO loops.
Improved branch prediction. Direct calls have fixed target addresses known at link time, which the CPU front end predicts cheaply, whereas PLT stubs occupy the branch target buffer (BTB) with additional indirect-branch entries.
1.3. How LTO Keeps Static Binaries Small
A naïve static link pulls in, whole, every archive member (.o file) that defines a symbol the executable needs, including functions in that member which are never called. This would bloat binaries unacceptably. LTO solves this:
Dead-code elimination. With -flto=auto, GCC sees the entire program (executable + all static archives) as a single optimization unit. It traces all reachable call paths from main() and discards every function, variable, and data structure that is not reachable. Entire translation units that are unused vanish from the final binary.
Cross-module inlining + elimination. LTO first inlines small hot functions across library boundaries (e.g., ImageStreamIO_sempost() into your compute loop). After inlining, the original library function may have zero remaining callers — LTO then eliminates it entirely. The net effect: the binary contains only the machine code that actually executes, inlined at the call sites where it's needed.
Measured example:
| Binary | Dynamic | Static+LTO | Ratio |
|---|---|---|---|
| arith-crop2D | 52 KB (.so deps: ~2.4 MB) | 173 KB (self-contained) | 3.3× larger on disk |
The static binary is 3.3× larger than the dynamic stub, but far smaller than the total code footprint of 14 shared libraries (2.4 MB mapped in aggregate). LTO stripped the vast majority of library code that crop2D never calls.
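The two comparisons above can be reproduced with a little arithmetic. This is a minimal sketch using the sizes from the table, and it assumes 2.4 MB is approximated as 2400 KB; the helper names ratio and percent_of are illustrative only.

```shell
# ratio A B: print A/B to one decimal place.
ratio() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'; }

# percent_of A B: print A as a whole-number percentage of B.
percent_of() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.0f\n", 100 * a / b }'; }

# Static+LTO binary (173 KB) versus the dynamic stub alone (52 KB).
ratio 173 52

# Static+LTO binary versus stub plus shared libraries (52 KB + ~2400 KB, assumed).
percent_of 173 $((52 + 2400))
```

The first figure is the 3.3× on-disk ratio; the second shows the static binary is only a few percent of the aggregate dynamic footprint.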
1.4. Why Small Binaries Run Faster
Modern CPUs execute code from the instruction cache (L1i), typically 32–64 KB. When the executable's hot path fits in icache, the CPU never stalls waiting for instruction fetches from L2/L3:
┌──────────────────────────────────────┐
│ L1 Instruction Cache (32-64 KB) │
│ │
│ Dynamic link: │
│ fpsexec code + PLT stubs │
│ + scattered .so code pages │
│ → icache thrashing, frequent misses │
│ │
│ Static LTO: │
│ fpsexec code + inlined hot paths │
│ → compact, sequential, cache-warm │
└──────────────────────────────────────┘
Dynamic linking scatters hot code. The compute function lives in the executable, but sem_post() is in libpthread.so, ImageStreamIO_sempost() is in libImageStreamIO.so, and fps_to_local() is in libmilkfps.so. Each call jumps to a different memory region, evicting other hot code from icache.
Static LTO consolidates hot code. After inlining, the compute function, semaphore operations, and FPS parameter access are all compiled into a single contiguous code region. The CPU's instruction prefetcher can stream this code sequentially, and the entire hot loop fits in L1i.
Important
For real-time AO loops running at 1–10 kHz, icache pressure is the dominant performance bottleneck after algorithmic optimization. Reducing the hot-path footprint from scattered .so pages to a compact inlined binary is one of the most impactful optimizations available.
1.5. Summary: Static LTO Benefits
| Benefit | Mechanism | Impact |
|---|---|---|
| No PLT indirection | Direct calls replace GOT lookups | Lower per-call latency |
| Cross-library inlining | GCC inlines across .a boundaries | Eliminates call overhead entirely |
| Dead-code elimination | Unreachable code removed at link | Smaller binary, less icache pressure |
| Constant propagation | GCC propagates constants across modules | Simpler generated code |
| Compact hot path | Inlined code is sequential in memory | L1i cache stays warm |
| Lower startup overhead | Far less ld.so symbol resolution | Faster process launch |
1.6. Build Modes
Two approaches are available, depending on whether you need full static linking or just LTO on the executable itself:
Option A: Static LTO (recommended)
Builds static archive variants (.a) of every milk library and links them into standalone executables. Gives GCC maximum cross-module visibility.
cd /path/to/milk-perf/_build
cmake .. \
-DCMAKE_BUILD_TYPE=Release \
-DUSE_STATIC_LTO=ON \
-DCMAKE_C_FLAGS="-O3 -march=native"
make -j$(nproc)
sudo make install
milk-perfbench build tag: O3 LTO-static [x86_64]
Option B: Dynamic LTO (manual flags)
Keeps the default dynamic .so linking but passes -flto explicitly. LTO operates only within the executable's own compilation units — cross-library inlining is not available, but the executable's hot path is still LTO-optimized.
cmake .. \
-DCMAKE_BUILD_TYPE=Release \
-DUSE_STATIC_LTO=OFF \
-DCMAKE_C_FLAGS="-O3 -march=native -flto=auto" \
-DCMAKE_EXE_LINKER_FLAGS="-flto=auto -Wl,-O2" \
-DCMAKE_SHARED_LINKER_FLAGS="-flto=auto"
make -j$(nproc)
sudo make install
milk-perfbench build tag: O3 LTO [x86_64]
Important
Always pass -DUSE_STATIC_LTO=OFF explicitly when switching to Option B. CMake caches values between runs — if USE_STATIC_LTO=ON was set previously, it remains active until explicitly cleared. Forgetting this causes a link error: cannot find -lImageStreamIO_static.
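Before switching build modes, it can help to look at what CMake has actually cached. The following is a minimal sketch, assuming the standard CMakeCache.txt entry format NAME:TYPE=VALUE; the helper name cached_option is hypothetical and not part of the milk build system.

```shell
# cached_option NAME [CACHE_FILE]
# Print the value CMake has cached for NAME, or nothing if NAME is not cached.
# Cache entries look like NAME:TYPE=VALUE (for example USE_STATIC_LTO:BOOL=ON).
cached_option() {
    sed -n "s/^$1:[A-Z]*=//p" "${2:-CMakeCache.txt}"
}

# Example, run from the build directory:
#   cached_option USE_STATIC_LTO    # prints ON, OFF, or nothing
```

If this prints ON while you intend an Option B build, pass -DUSE_STATIC_LTO=OFF on the next cmake invocation.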
Restore Normal Build
After any optimization build, clear flags so subsequent builds are unaffected:
cmake .. \
-DCMAKE_BUILD_TYPE=Release \
-DUSE_STATIC_LTO=OFF \
-DCMAKE_C_FLAGS="" \
-DCMAKE_EXE_LINKER_FLAGS="" \
-DCMAKE_SHARED_LINKER_FLAGS=""
1.7. Verifying Static Linking
Check that a standalone has minimal dynamic dependencies:
$ ldd /usr/local/bin/milk-fpsexec-arith-crop2D
## Default: 14 shared libs (milkfps.so, ImageStreamIO.so, ...)
## Static LTO: 3 deps (libc, ld-linux, vdso)
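To compare the two cases numerically, the ldd output can be counted rather than read by eye. This is a small sketch; count_deps is a hypothetical helper that counts every line naming a .so, so the vdso and the dynamic loader are included in the total.

```shell
# count_deps: read `ldd` output on stdin and print how many lines name a
# shared object. Prints 0 for "not a dynamic executable".
count_deps() {
    grep -c '\.so' || true
}

# Example:
#   ldd /usr/local/bin/milk-fpsexec-arith-crop2D | count_deps
```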
1.8. Verifying the Build Mode
Every fpsexec binary embeds a build-tag sentinel string that can be inspected with strings(1):
$ strings milk-fpsexec-imggen-mkrandom | grep 'MILK_BUILD:'
MILK_BUILD:VER=1,...,ARCH=x86_64,OPT=3,LTO=STATIC,END
milk-perfbench reads and displays this automatically in its header line:
Possible Build: values:
| Shown | Meaning |
|---|---|
| default (no PGO/LTO) | Plain Release build |
| O3 [x86_64] | -O3, no LTO |
| O3 LTO [x86_64] | Option B (dynamic LTO) |
| O3 LTO-static [x86_64] | Option A (static LTO) |
| O3 PGO [x86_64] | PGO pass-2, no LTO |
| O3 PGO LTO [x86_64] | PGO + dynamic LTO |
| O3 PGO LTO-static [x86_64] | PGO + static LTO (maximum) |
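For scripted checks (for example in a deployment step), individual fields can be pulled out of the sentinel string. The sketch below assumes the sentinel is a comma-separated list of KEY=VALUE pairs as in the example in this section; the full field list is not shown here, and build_field is a hypothetical helper name.

```shell
# build_field KEY: read MILK_BUILD sentinel lines on stdin and print the
# value of KEY (for example LTO or OPT) from the first matching line.
# Assumes comma-separated KEY=VALUE pairs after the "MILK_BUILD:" prefix.
build_field() {
    sed -n "s/.*MILK_BUILD.*[:,]$1=\([^,]*\).*/\1/p" | head -n 1
}

# Example against an installed binary:
#   strings milk-fpsexec-imggen-mkrandom | grep 'MILK_BUILD:' | build_field LTO
```

A key that is absent from the sentinel produces no output, which makes the helper easy to use in a shell test.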
2. Profile-Guided Optimization (PGO)
PGO trains GCC with real runtime profiles to optimize branch prediction, function layout, and inlining. Typical speedups: 10–30% on branch-heavy real-time loops.
2.1. Quick Start
$ cd _build
## Step 1 — Instrument
$ cmake .. -DUSE_PGO=GENERATE
$ make -j$(nproc) && sudo make install
## Step 2 — Run representative workloads
$ milk-fpsexec-streamcopy -n scopy01
$ # ... exercise your typical AO loop patterns
## Step 3 — Rebuild with profiles
$ cmake .. -DUSE_PGO=USE
$ make -j$(nproc) && sudo make install
2.2. How It Works
| Step | CMake Flag | GCC Flags | Effect |
|---|---|---|---|
| 1 | -DUSE_PGO=GENERATE | -fprofile-generate | Emits .gcda profile data at runtime |
| 2 | (run workload) | — | Collects branch/call counts |
| 3 | -DUSE_PGO=USE | -fprofile-use -fprofile-correction | Optimizes using collected data |
2.3. Per-Executable Profile Isolation
Each standalone binary gets its own profile subdirectory under _build/pgo/:
_build/pgo/
├── shared/ ← shared libs
├── milk-fpsexec-streamcopy/ ← streamcopy
├── milk-fpsexec-linalg-SGEMM/ ← SGEMM
├── cacao-fpsexec-cacaoloop-WFS/ ← WFS
└── ...
This isolation is automatic — the milk_pgo_target() CMake helper (called by add_milk_standalone() / add_cacao_standalone()) sets per-target -fprofile-dir.
2.4. Optimizing Specific Executables
$ cd _build
## Step 1 — Build everything with instrumentation
$ cmake .. -DUSE_PGO=GENERATE
$ make -j$(nproc) && sudo make install
## Step 2 — Run ONLY the executables you want to
## optimize, with realistic workloads
$ milk-fpsexec-streamcopy -n scopy01
$ # let it process several thousand frames, then ^C
$ cacao-fpsexec-cacaoloop-WFS -n wfs01
$ # exercise another workload
## Step 3 — Rebuild with profiles
$ cmake .. -DUSE_PGO=USE
$ make -j$(nproc) && sudo make install
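Only the executables exercised in Step 2 receive PGO optimization. The others compile normally: when GCC finds no .gcda file for a translation unit, it builds it without profile feedback (and may print a missing-profile warning). -fprofile-correction separately compensates for inconsistent counts from multi-threaded runs.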
| Component | Profile directory | Scope |
|---|---|---|
| Standalone .c | pgo/<exe-name>/ | Independent per executable |
| Shared libraries | pgo/shared/ | Aggregated across all runs |
Tip
For the best results, run each fpsexec with a workload that closely matches production use: same stream sizes, same number of modes, same loop rate. The more representative the training run, the better the optimization.
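Before running Step 3, it is worth confirming which executables actually produced profile data. The following is a minimal sketch, assuming the per-target directory layout described in section 2.3; profile_summary is a hypothetical helper, not part of the milk tooling.

```shell
# profile_summary [PGO_DIR]
# For each subdirectory of PGO_DIR, print "<count> <name>", where count is
# the number of .gcda files found anywhere beneath that subdirectory.
profile_summary() {
    pgo_dir=${1:-_build/pgo}
    for d in "$pgo_dir"/*/; do
        [ -d "$d" ] || continue
        n=$(find "$d" -type f -name '*.gcda' | wc -l | tr -d ' ')
        printf '%s %s\n' "$n" "$(basename "$d")"
    done
}

# Example, run from the source tree:
#   profile_summary _build/pgo
```

A count of 0 next to an executable means its training run wrote no profile, so that target will be built without profile feedback.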
3. Combined PGO + Static LTO
For maximum performance, combine both techniques. Static LTO makes library code visible to PGO's profile-guided optimizer, amplifying both effects:
$ mkdir _build_opt && cd _build_opt
## Step 1 — Instrument with static LTO
$ cmake .. -DUSE_PGO=GENERATE -DUSE_STATIC_LTO=ON
$ make -j$(nproc) && sudo make install
## Step 2 — Run representative workloads
$ cacao-fpsexec-dmcomb -n dmcomb01
$ # exercise your production AO loop
## Step 3 — Rebuild with profiles + LTO
$ cmake .. -DUSE_PGO=USE -DUSE_STATIC_LTO=ON
$ make -j$(nproc) && sudo make install
Why They Complement Each Other
graph LR
subgraph "Without Static LTO"
A1["fpsexec.c"] --> B1["GCC sees<br/>1 file"]
C1["libmilkfps.so"] -.-> |"opaque"| B1
end
subgraph "With Static LTO"
A2["fpsexec.c"] --> B2["GCC sees<br/>all code"]
C2["libmilkfps.a"] --> B2
D2["libImageStreamIO.a"] --> B2
end
subgraph "With PGO + Static LTO"
A3["fpsexec.c"] --> B3["GCC sees<br/>all code +<br/>runtime data"]
C3[".a archives"] --> B3
E3[".gcda profiles"] --> B3
end
| Optimization | Scope | What it does |
|---|---|---|
| LTO alone | Cross-module | Inlines library calls, removes dead code |
| PGO alone | Per-module | Optimizes branch layout from runtime data |
| PGO + LTO | Cross-module + runtime | Inlines AND profile-optimizes across all libraries |
PGO needs to see the function bodies to optimize them. Static LTO makes library function bodies visible. Together, PGO can profile-optimize code paths that span fpsexec.c → ImageStreamIO → milkfps — the entire hot path of a real-time loop becomes a single optimization unit.
4. Dual Library Architecture
The milk build system compiles two variants of every library to support both the interactive CLI and standalone fpsexec executables:
4.1. Shared Libraries (.so): for CLI
- Linked by milk-cli and module shared libraries
- Contain full CLI registration code (RegisterModule, RegisterCLIcommand, etc.)
- Loaded at runtime via dlopen() for module hot-loading
4.2. Compute Libraries (_compute.so): for Standalones
- Compiled with -DMILK_NO_CLI: pure computation
- No dependency on CLIcore; CLI registration stubs are excluded
- Linked by milk-fpsexec-* / cacao-fpsexec-*
4.3. Static Archives (.a): for Static LTO
When USE_STATIC_LTO=ON, a third variant is built for each library:
libImageStreamIO.a
libmilkfps.a
libmilkfpsStandalone.a
libmilkdata.a
libmilkprocessinfo.a
libCOREMODmemory_compute.a
libCOREMODarith_compute.a
libCOREMODtools_compute.a
libCOREMODiofits_compute.a
- Static archives contain the same .o files as _compute.so, but archived for static linking
- GCC can look inside .a files at link time, enabling cross-module LTO optimization
- Only linked into standalone executables; shared libraries and CLI are unaffected
4.4. Architecture Diagram
┌─────────────────────────────────────┐
│ milk-cli │
│ (interactive shell, module loader) │
│ Links: .so libraries (dynamic) │
└─────────────┬───────────────────────┘
│
┌─────────┴─────────┐
│ Module .so libs │
│ (CLIcore-linked) │
└────────────────────┘
┌─────────────────────────────────────┐
│ milk-fpsexec-* / cacao-fpsexec-* │
│ (standalone compute units) │
│ │
│ Default: links _compute.so (dyn) │
│ Static LTO: links .a (static) │
│ + PGO: adds profiling/use flags │
└─────────────┬───────────────────────┘
│
┌─────────┴──────────────────────┐
│ _compute variants │
│ (MILK_NO_CLI, no CLIcore) │
│ │
│ .so → default dynamic link │
│ .a → static LTO link │
└────────────────────────────────┘
5. CMake Build Options Summary
| Option | Default | Effect |
|---|---|---|
| USE_PGO | (off) | GENERATE or USE: profile-guided optimization |
| USE_STATIC_LTO | OFF | Static archives + LTO for standalones |
| PGO_DIR | _build/pgo/ | Profile data directory |
Build configurations:
## Default (shared libs, no LTO)
cmake .. -DUSE_STATIC_LTO=OFF
## Option A — Static LTO only
cmake .. \
-DUSE_STATIC_LTO=ON \
-DCMAKE_C_FLAGS="-O3 -march=native"
## Option B — Dynamic LTO (manual flags)
cmake .. \
-DUSE_STATIC_LTO=OFF \
-DCMAKE_C_FLAGS="-O3 -march=native -flto=auto" \
-DCMAKE_EXE_LINKER_FLAGS="-flto=auto -Wl,-O2" \
-DCMAKE_SHARED_LINKER_FLAGS="-flto=auto"
## PGO only
cmake .. -DUSE_PGO=GENERATE # step 1
cmake .. -DUSE_PGO=USE # step 3
## Maximum optimization (PGO + static LTO)
cmake .. -DUSE_STATIC_LTO=ON -DUSE_PGO=USE
## Restore normal build (clear all flags)
cmake .. \
-DUSE_STATIC_LTO=OFF \
-DCMAKE_C_FLAGS="" \
-DCMAKE_EXE_LINKER_FLAGS="" \
-DCMAKE_SHARED_LINKER_FLAGS=""
Danger
CMake caches all -D options between runs. Always pass -DUSE_STATIC_LTO=OFF explicitly when switching away from static LTO. Omitting it leaves USE_STATIC_LTO=ON in the cache and causes cannot find -lImageStreamIO_static.
CMake Policy
Add cmake_policy(SET CMP0069 NEW) to any CMakeLists.txt that calls add_library() to suppress the INTERPROCEDURAL_OPTIMIZATION policy warning when -DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON is set. This is already applied to CLIcore.
6. Manual CMake Flags (Dynamic Libs)
This is Option B from section 1.6 — applying PGO on top of dynamic-lib LTO when USE_STATIC_LTO is not used.
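One way to express this combination is to add -DUSE_PGO to the Option B flag set. The sketch below is an assumption built only from flags that appear elsewhere on this page, not a command taken from the milk sources; pgo_dynamic_lto is a hypothetical helper that prints the cmake command by default so it can be reviewed first.

```shell
# pgo_dynamic_lto MODE [run]
# MODE is GENERATE (instrumented pass) or USE (optimized pass).
# Without "run", the cmake command is only printed (shell quoting is not
# reproduced in the printed form); with "run" it is executed in the
# current build directory.
pgo_dynamic_lto() {
    mode=$1
    action=${2:-print}
    case "$mode" in
        GENERATE|USE) ;;
        *) echo "usage: pgo_dynamic_lto GENERATE|USE [run]" >&2; return 2 ;;
    esac
    # Option B flags from section 1.6, plus the PGO mode (assumed combination).
    set -- cmake .. \
        -DCMAKE_BUILD_TYPE=Release \
        -DUSE_STATIC_LTO=OFF \
        "-DUSE_PGO=$mode" \
        "-DCMAKE_C_FLAGS=-O3 -march=native -flto=auto" \
        "-DCMAKE_EXE_LINKER_FLAGS=-flto=auto -Wl,-O2" \
        "-DCMAKE_SHARED_LINKER_FLAGS=-flto=auto"
    if [ "$action" = run ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Example: review the instrumented-pass command.
pgo_dynamic_lto GENERATE
```

After the instrumented build and a representative training run, the same helper with USE would configure the optimized pass; the resulting binary should then report the O3 PGO LTO build tag listed in section 1.8.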
7. Notes
- Profile data (.gcda files) is written to PGO_DIR (default: _build/pgo/). Override with -DPGO_DIR=/path/to/profiles.
- -fprofile-correction handles minor count mismatches from multi-threaded execution.
- Re-run the full PGO cycle whenever you make significant code changes.
- To disable PGO in a fresh build directory, omit the -DUSE_PGO flag; in an existing one, set it to empty (-DUSE_PGO=) so the cached value is cleared.
- Static LTO increases binary sizes (typically 2–3×; 3.3× for the arith-crop2D example in section 1.3) since library code is embedded. This is expected.
- Build time with static LTO is longer due to whole-program optimization at link time.
- CMake cache: always pass -DUSE_STATIC_LTO=OFF when switching back to dynamic builds. Cached ON causes cannot find -lXxx_static errors.
- Build tag: every fpsexec embeds a MILK_BUILD: sentinel string in .rodata readable via strings | grep MILK_BUILD:. The milk-perfbench header reports this as the Build: line, allowing unambiguous verification that the right optimization level was applied.
8. Fully Static Binaries with musl libc
For maximum portability — deploying a single self-contained binary to a target machine without installing any runtime libraries — you can build fpsexec executables against musl libc instead of glibc.
Note
The standard static LTO build (section 1.6 Option A) still depends on the system glibc at runtime (three dynamic dependencies, including libc.so and ld-linux.so; see the ldd check in section 1.7). The musl build here produces a true zero-dependency binary.
8.1. Prerequisites
Verify the toolchain is present:
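A minimal check, assuming the only hard requirement is that the musl-gcc wrapper used in section 8.2 is on PATH (have_tool is a hypothetical helper name):

```shell
# have_tool NAME: succeed if NAME is found on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool musl-gcc; then
    echo "musl-gcc found: $(command -v musl-gcc)"
else
    echo "musl-gcc not found; install your distribution's musl toolchain first"
fi
```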
8.2. Build a Static musl Binary
cd /path/to/milk-perf
mkdir -p _build_musl && cd _build_musl
cmake .. \
-DCMAKE_C_COMPILER=musl-gcc \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_FLAGS="-O3 -march=native -D_GNU_SOURCE \
-I/path/to/milk-perf/src/coremods \
-I/path/to/milk-perf/src" \
-DCMAKE_EXE_LINKER_FLAGS="-static" \
-DUSE_STATIC_LTO=ON \
-DUSE_CFITSIO=OFF \
-DUSE_CLI=OFF \
-DUSE_NCURSES=OFF \
-DUSE_READLINE=OFF \
-DUSE_OPENBLAS=OFF \
-DBUILD_SHARED_LIBS=OFF
## Build a specific standalone executable
make milk-fpsexec-imggen-mkrandom -j$(nproc)
Replace /path/to/milk-perf with the absolute path to your source tree. The -D_GNU_SOURCE flag is required to expose cpu_set_t and thread affinity APIs; the extra include paths expose COREMOD_memory headers needed by plugins built with -DUSE_CLI=OFF.
8.3. Verify the Binary is Fully Static
$ ldd _build_musl/plugins/.../milk-fpsexec-imggen-mkrandom
not a dynamic executable
$ file milk-fpsexec-imggen-mkrandom
ELF 64-bit LSB executable, x86-64,
statically linked, with debug_info, not stripped
$ ls -lh milk-fpsexec-imggen-mkrandom
291K
No shared library dependencies — deploy by copying the single binary file.
8.4. Install
Copy to system bin:
Copy to user bin (no sudo):
Deploy to a remote machine:
Because the binary is fully self-contained, no library installation is needed on the target.
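The three install paths listed above could look like the following sketch. The destination directories (/usr/local/bin, $HOME/.local/bin) and the remote host user@ao-host are assumptions chosen for illustration, and install_binary is a hypothetical helper.

```shell
# install_binary SRC DEST_DIR
# Copy SRC into DEST_DIR with mode 0755, creating DEST_DIR if needed.
install_binary() {
    mkdir -p "$2" && install -m 755 "$1" "$2/"
}

# Copy to system bin (needs root; path assumed):
#   sudo install -m 755 milk-fpsexec-imggen-mkrandom /usr/local/bin/
# Copy to user bin (no sudo; path assumed):
#   install_binary milk-fpsexec-imggen-mkrandom "$HOME/.local/bin"
# Deploy to a remote machine (user@ao-host is a placeholder):
#   scp milk-fpsexec-imggen-mkrandom user@ao-host:/usr/local/bin/
```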
8.5. Required CMake Flags and Why
| Flag | Value | Reason |
|---|---|---|
| CMAKE_C_COMPILER | musl-gcc | Wrapper that redirects includes/libs to musl |
| CMAKE_EXE_LINKER_FLAGS | -static | Tells the linker to produce a static binary |
| USE_STATIC_LTO | ON | Builds .a archives; required by -static |
| USE_CFITSIO | OFF | System cfitsio is glibc-linked; incompatible with musl static |
| USE_CLI | OFF | CLI requires ncurses/readline dynamic libs |
| USE_NCURSES | OFF | ncurses has no musl static variant by default |
| USE_READLINE | OFF | Same as ncurses |
| USE_OPENBLAS | OFF | System OpenBLAS is glibc-linked |
| BUILD_SHARED_LIBS | OFF | Prevents cmake from building .so targets that would fail the static link |
| -D_GNU_SOURCE (C flag) | set | Exposes cpu_set_t, pthread_setaffinity_np and other GNU extensions in musl |
8.6. Known Warnings (Non-Fatal)
_GNU_SOURCE redefined:
fps_standalone_data.c defines _GNU_SOURCE internally; passing it as a CMake flag causes a harmless redefinition. No action required.
LTO type mismatch on copy_image_ID:
warning: type of 'copy_image_ID' does not match original declaration [-Wlto-type-mismatch]
image_copy.h: return value: imageID vs int
fps_loadmemstream_lite.c declares copy_image_ID as int while image_copy.h uses typedef imageID. This is a pre-existing mismatch in the codebase (not introduced by the musl build). imageID is a typedef for long, and on x86-64 both int and long are returned in the same register (EAX/RAX), so the caller sees the correct value as long as the returned ID fits in 32 bits. No runtime impact in practice.
8.7. Limitations Compared to Standard Static LTO
| Feature | Standard static LTO | musl static |
|---|---|---|
| CFITSIO support | ✓ | ✗ (glibc-only) |
| OpenBLAS/MKL | ✓ | ✗ |
| ncurses TUI | ✓ | ✗ |
| Dynamic library deps | 3 (libc family) | 0 |
| Deploy without runtime | ✗ | ✓ |
| Binary portability | Same glibc version | Any Linux x86-64 |
| PGO compatible | ✓ | ✓ (same workflow) |
8.8. musl vs glibc strerror_r ABI
musl implements the POSIX (XSI) variant of strerror_r which returns int, whereas glibc defaults to the GNU variant returning char *. The milk source correctly detects this via #ifndef __GLIBC__ in ImageStreamIO.c — no user action needed.