# Profile-Guided Optimization (PGO) & Link-Time Optimization (LTO)
The milk build system supports two complementary
compiler optimization techniques for fpsexec
standalone executables:
| Technique | CMake Flag | Typical Speedup |
|---|---|---|
| PGO | `-DUSE_PGO=GENERATE/USE` | 10–30% |
| LTO | `-DUSE_STATIC_LTO=ON` | 5–15% |
| PGO + LTO | Both | 15–40% |
> [!TIP]
> For maximum performance on production AO systems, enable both PGO and static LTO. The techniques are complementary — LTO exposes cross-library code to GCC, and PGO then trains the optimizer with real branch/call data across that larger scope.
## 1. Link-Time Optimization (LTO)

### 1.1. What LTO Does
LTO (`-flto=auto`) allows GCC to optimize across translation-unit boundaries. Instead of compiling each `.c` file in isolation, GCC serializes its intermediate representation (GIMPLE IR) into object files and performs whole-program optimization at link time.
Without static LTO, standalone executables link shared libraries (`.so`), which are opaque — the linker cannot see inside them:

```
fpsexec.c → fpsexec.o → fpsexec
                ↓ (dynamic)
    libmilkfps.so       ← opaque to LTO
    libImageStreamIO.so ← opaque
```
With static LTO (`USE_STATIC_LTO=ON`), the same libraries are linked as static archives (`.a`). GCC can now inline across all library boundaries:

```
fpsexec.c → fpsexec.o ──────┐
libmilkfps.a ───────────────┼→ LTO link → fpsexec
libImageStreamIO.a ─────────┤  (full visibility)
libCOREMODmemory_compute.a ─┘
```
### 1.2. Why Static Linking Is Faster
Statically linked executables eliminate several layers of runtime overhead present in dynamically linked binaries:
**PLT/GOT elimination.** Every call to a shared library function goes through the Procedure Linkage Table (PLT) and Global Offset Table (GOT) — an indirect jump that the CPU branch predictor cannot fully resolve. In a tight real-time loop calling `ImageStreamIO_sempost()` or `fps_to_local()` at tens of kHz, these indirect jumps add measurable latency. Static linking replaces them with direct calls or inlined code.
**No dynamic loader overhead.** At startup, `ld.so` must resolve all symbol relocations, map `.so` pages, and apply RELRO protections. Standalone executables with 14 shared libraries pay this cost on every launch. A statically linked binary is ready to execute immediately — important for rapid fault recovery in production AO loops.
**Improved branch prediction.** Direct calls from static linking have fixed target addresses known at link time. The CPU's branch target buffer (BTB) can predict these perfectly after the first execution, whereas PLT stubs pollute the BTB with indirect entries.
### 1.3. How LTO Keeps Static Binaries Small
A naïve static link pulls in every object file
from each .a archive — even functions the
executable never calls. This would bloat binaries
unacceptably. LTO solves this:
**Dead-code elimination.** With `-flto=auto`, GCC sees the entire program (executable + all static archives) as a single optimization unit. It traces all reachable call paths from `main()` and discards every function, variable, and data structure that is not reachable. Entire translation units that are unused vanish from the final binary.
**Cross-module inlining + elimination.** LTO first inlines small hot functions across library boundaries (e.g., `ImageStreamIO_sempost()` into your compute loop). After inlining, the original library function may have zero remaining callers — LTO then eliminates it entirely. The net effect: the binary contains only the machine code that actually executes, inlined at the call sites where it's needed.
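Both effects can be reproduced with plain gcc, outside the milk build. In this sketch (`ltolib.c`/`ltomain.c` are invented), `used()` is inlined into `main()` and `unused()` is discarded, so the dead function leaves no symbol behind:

```shell
cat > ltolib.c <<'EOF'
int used(int x)   { return x * 2; }
int unused(int x) { return x + 99; }   /* never called anywhere */
EOF
cat > ltomain.c <<'EOF'
int used(int x);
int main(void) { return used(21) == 42 ? 0 : 1; }
EOF
gcc -O2 -flto -c ltolib.c ltomain.c
gcc -O2 -flto ltolib.o ltomain.o -o ltoprog
./ltoprog      ## exit status 0: used() was inlined and works
## `nm ltoprog` no longer lists unused() — the dead code is gone
```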
Measured example:

| Binary | Dynamic | Static+LTO | Ratio |
|---|---|---|---|
| `arith-crop2D` | 52 KB (`.so` deps: ~2.4 MB) | 173 KB (self-contained) | 3.3× larger on disk |

The static binary is 3.3× larger than the dynamic stub, but far smaller than the total code footprint of 14 shared libraries (2.4 MB mapped in aggregate). LTO stripped the vast majority of library code that `crop2D` never calls.
### 1.4. Why Small Binaries Run Faster
Modern CPUs execute code from the instruction cache (L1i), typically 32–64 KB. When the executable's hot path fits in icache, the CPU never stalls waiting for instruction fetches from L2/L3:
```
┌───────────────────────────────────────┐
│ L1 Instruction Cache (32-64 KB)       │
│                                       │
│ Dynamic link:                         │
│   fpsexec code + PLT stubs            │
│   + scattered .so code pages          │
│   → icache thrashing, frequent misses │
│                                       │
│ Static LTO:                           │
│   fpsexec code + inlined hot paths    │
│   → compact, sequential, cache-warm   │
└───────────────────────────────────────┘
```
**Dynamic linking scatters hot code.** The compute function lives in the executable, but `sem_post()` is in `libpthread.so`, `ImageStreamIO_sempost()` is in `libImageStreamIO.so`, and `fps_to_local()` is in `libmilkfps.so`. Each call jumps to a different memory region, evicting other hot code from icache.
**Static LTO consolidates hot code.** After inlining, the compute function, semaphore operations, and FPS parameter access are all compiled into a single contiguous code region. The CPU's instruction prefetcher can stream this code sequentially, and the entire hot loop fits in L1i.
> [!IMPORTANT]
> For real-time AO loops running at 1–10 kHz, icache pressure is the dominant performance bottleneck after algorithmic optimization. Reducing the hot-path footprint from scattered `.so` pages to a compact inlined binary is one of the most impactful optimizations available.
### 1.5. Summary: Static LTO Benefits

| Benefit | Mechanism | Impact |
|---|---|---|
| No PLT indirection | Direct calls replace GOT lookups | Lower per-call latency |
| Cross-library inlining | GCC inlines across `.a` boundaries | Eliminates call overhead entirely |
| Dead-code elimination | Unreachable code removed at link time | Smaller binary, less icache pressure |
| Constant propagation | GCC propagates constants across modules | Simpler generated code |
| Compact hot path | Inlined code is sequential in memory | L1i cache stays warm |
| Zero startup overhead | No ld.so symbol resolution | Faster process launch |
### 1.6. Usage

```shell
$ mkdir _build_lto && cd _build_lto
$ cmake .. -DUSE_STATIC_LTO=ON
$ make -j$(nproc) && sudo make install
```
> [!NOTE]
> Static LTO only affects `fpsexec` standalone executables. Shared libraries and the `milk-cli` binary are unchanged.
### 1.7. Verifying Static Linking

Check that a standalone has minimal dynamic dependencies:

```shell
$ ldd /usr/local/bin/milk-fpsexec-arith-crop2D
## Default: 14 shared libs (milkfps.so, ImageStreamIO.so, ...)
## Static LTO: 3 deps (libc, ld-linux, vdso)
```
## 2. Profile-Guided Optimization (PGO)
PGO trains GCC with real runtime profiles to optimize branch prediction, function layout, and inlining. Typical speedups: 10–30% on branch-heavy real-time loops.
### 2.1. Quick Start

```shell
$ cd _build

## Step 1 — Instrument
$ cmake .. -DUSE_PGO=GENERATE
$ make -j$(nproc) && sudo make install

## Step 2 — Run representative workloads
$ milk-fpsexec-streamcopy -n scopy01
$ # ... exercise your typical AO loop patterns

## Step 3 — Rebuild with profiles
$ cmake .. -DUSE_PGO=USE
$ make -j$(nproc) && sudo make install
```
### 2.2. How It Works
| Step | CMake Flag | GCC Flags | Effect |
|---|---|---|---|
| 1 | `-DUSE_PGO=GENERATE` | `-fprofile-generate` | Emits `.gcda` profile data at runtime |
| 2 | (run workload) | — | Collects branch/call counts |
| 3 | `-DUSE_PGO=USE` | `-fprofile-use -fprofile-correction` | Optimizes using collected data |
### 2.3. Per-Executable Profile Isolation

Each standalone binary gets its own profile subdirectory under `_build/pgo/`:

```
_build/pgo/
├── shared/                      ← shared libs
├── milk-fpsexec-streamcopy/     ← streamcopy
├── milk-fpsexec-linalg-SGEMM/   ← SGEMM
├── cacao-fpsexec-cacaoloop-WFS/ ← WFS
└── ...
```
This isolation is automatic — the `milk_pgo_target()` CMake helper (called by `add_milk_standalone()` / `add_cacao_standalone()`) sets a per-target `-fprofile-dir`.
### 2.4. Optimizing Specific Executables

```shell
$ cd _build

## Step 1 — Build everything with instrumentation
$ cmake .. -DUSE_PGO=GENERATE
$ make -j$(nproc) && sudo make install

## Step 2 — Run ONLY the executables you want to
## optimize, with realistic workloads
$ milk-fpsexec-streamcopy -n scopy01
$ # let it process several thousand frames, then ^C
$ cacao-fpsexec-cacaoloop-WFS -n wfs01
$ # exercise another workload

## Step 3 — Rebuild with profiles
$ cmake .. -DUSE_PGO=USE
$ make -j$(nproc) && sudo make install
```
Only the executables exercised in Step 2 receive PGO optimization. The rest compile normally: when GCC finds no profile data for a translation unit it falls back to its static heuristics, and `-fprofile-correction` smooths over partial or inconsistent profiles.
| Component | Profile directory | Scope |
|---|---|---|
| Standalone `.c` | `pgo/<exe-name>/` | Independent per executable |
| Shared libraries | `pgo/shared/` | Aggregated across all runs |
> [!TIP]
> For the best results, run each fpsexec with a workload that closely matches production use: same stream sizes, same number of modes, same loop rate. The more representative the training run, the better the optimization.
## 3. Combined PGO + Static LTO
For maximum performance, combine both techniques. Static LTO makes library code visible to PGO's profile-guided optimizer, amplifying both effects:
```shell
$ mkdir _build_opt && cd _build_opt

## Step 1 — Instrument with static LTO
$ cmake .. -DUSE_PGO=GENERATE -DUSE_STATIC_LTO=ON
$ make -j$(nproc) && sudo make install

## Step 2 — Run representative workloads
$ cacao-fpsexec-dmcomb -n dmcomb01
$ # exercise your production AO loop

## Step 3 — Rebuild with profiles + LTO
$ cmake .. -DUSE_PGO=USE -DUSE_STATIC_LTO=ON
$ make -j$(nproc) && sudo make install
```
### Why They Complement Each Other

```mermaid
graph LR
    subgraph "Without Static LTO"
        A1["fpsexec.c"] --> B1["GCC sees<br/>1 file"]
        C1["libmilkfps.so"] -.->|"opaque"| B1
    end
    subgraph "With Static LTO"
        A2["fpsexec.c"] --> B2["GCC sees<br/>all code"]
        C2["libmilkfps.a"] --> B2
        D2["libImageStreamIO.a"] --> B2
    end
    subgraph "With PGO + Static LTO"
        A3["fpsexec.c"] --> B3["GCC sees<br/>all code +<br/>runtime data"]
        C3[".a archives"] --> B3
        E3[".gcda profiles"] --> B3
    end
```
| Optimization | Scope | What it does |
|---|---|---|
| LTO alone | Cross-module | Inlines library calls, removes dead code |
| PGO alone | Per-module | Optimizes branch layout from runtime data |
| PGO + LTO | Cross-module + runtime | Inlines AND profile-optimizes across all libraries |
PGO needs to see the function bodies to optimize them. Static LTO makes library function bodies visible. Together, PGO can profile-optimize code paths that span `fpsexec.c` → ImageStreamIO → milkfps — the entire hot path of a real-time loop becomes a single optimization unit.
## 4. Dual Library Architecture
The milk build system compiles two variants of
every library to support both the interactive CLI
and standalone fpsexec executables:
### 4.1. Shared Libraries (.so) — for CLI

- Linked by `milk-cli` and module shared libraries
- Contain full CLI registration code (`RegisterModule`, `RegisterCLIcommand`, etc.)
- Loaded at runtime via `dlopen()` for module hot-loading
### 4.2. Compute Libraries (_compute.so) — for Standalones

- Compiled with `-DMILK_NO_CLI` — pure computation
- No dependency on CLIcore — CLI registration stubs are excluded
- Linked by `milk-fpsexec-*` / `cacao-fpsexec-*`
### 4.3. Static Archives (.a) — for Static LTO

When `USE_STATIC_LTO=ON`, a third variant is built for each library:

```
libImageStreamIO.a
libmilkfps.a
libmilkfpsStandalone.a
libmilkdata.a
libmilkprocessinfo.a
libCOREMODmemory_compute.a
libCOREMODarith_compute.a
libCOREMODtools_compute.a
libCOREMODiofits_compute.a
```
- Static archives contain the same `.o` files as `_compute.so`, but archived for static linking
- GCC can look inside `.a` files at link time, enabling cross-module LTO optimization
- Only linked into standalone executables — shared libraries and CLI are unaffected
### 4.4. Architecture Diagram

```
┌─────────────────────────────────────┐
│              milk-cli               │
│ (interactive shell, module loader)  │
│ Links: .so libraries (dynamic)      │
└─────────────┬───────────────────────┘
              │
    ┌─────────┴─────────┐
    │  Module .so libs  │
    │ (CLIcore-linked)  │
    └───────────────────┘

┌─────────────────────────────────────┐
│  milk-fpsexec-* / cacao-fpsexec-*   │
│    (standalone compute units)       │
│                                     │
│ Default:    links _compute.so (dyn) │
│ Static LTO: links .a (static)       │
│ + PGO:      adds profiling flags    │
└─────────────┬───────────────────────┘
              │
    ┌─────────┴──────────────────────┐
    │       _compute variants        │
    │   (MILK_NO_CLI, no CLIcore)    │
    │                                │
    │ .so → default dynamic link     │
    │ .a  → static LTO link          │
    └────────────────────────────────┘
```
## 5. CMake Build Options Summary

| Option | Default | Effect |
|---|---|---|
| `USE_PGO` | (off) | `GENERATE` or `USE` — profile-guided optimization |
| `USE_STATIC_LTO` | `OFF` | Static archives + LTO for standalones |
| `PGO_DIR` | `_build/pgo/` | Profile data directory |
Build configurations:
```shell
## Default (dynamic, LTO within modules)
$ cmake ..

## Static LTO only
$ cmake .. -DUSE_STATIC_LTO=ON

## PGO only
$ cmake .. -DUSE_PGO=GENERATE   # step 1
$ cmake .. -DUSE_PGO=USE        # step 3

## Maximum optimization
$ cmake .. -DUSE_STATIC_LTO=ON -DUSE_PGO=USE
```
## 6. Notes
- Profile data (`.gcda` files) is written to `PGO_DIR` (default: `_build/pgo/`). Override with `-DPGO_DIR=/path/to/profiles`.
- `-fprofile-correction` handles minor mismatches from multi-threaded execution and missing profiles.
- Re-run the full PGO cycle whenever you make significant code changes.
- To disable PGO, omit the `-DUSE_PGO` flag (or set it to empty).
- Static LTO increases binary sizes (2–3×) since library code is embedded — this is expected.
- Build time with static LTO is longer due to whole-program optimization at link time.