perf(1)

perf list
  show supported hw/sw events & metrics
  -v ........ print longer event descriptions
  --details . print information on the perf event names and expressions
              used internally by events

perf stat
  -p <pid> ..... show stats for running process
  -o <file> .... write output to file (default stderr)
  -I <ms> ...... show stats periodically over interval <ms>
  -e <ev> ...... select event(s)
  -M <met> ..... print metric(s), this adds the metric events
  --all-user ... configure all selected events for user space
  --all-kernel . configure all selected events for kernel space

perf top
  -p <pid> .. show stats for running process
  -F <hz> ... sampling frequency
  -K ........ hide kernel threads

perf record
  -p <pid> ............... record stats for running process
  -o <file> .............. write output to file (default perf.data)
  -F <hz> ................ sampling frequency
  --call-graph <method> .. [fp, dwarf, lbr] method how to capture backtrace
                           fp   : use frame-pointer, need to compile with
                                  -fno-omit-frame-pointer
                           dwarf: use .cfi debug information
                           lbr  : use hardware last branch record facility
  -g ..................... short-hand for --call-graph fp
  -e <ev> ................ select event(s)
  --all-user ............. configure all selected events for user space
  --all-kernel ........... configure all selected events for kernel space
  -M intel ............... use intel disassembly in annotate

perf report
  -n .................... annotate symbols with nr of samples
  --stdio ............... report to stdio, if not present use tui mode
  -g graph,0.5,callee ... show callee based call chains with value >0.5
Useful <ev>: page-faults minor-faults major-faults cpu-cycles task-clock

Select specific events

Events to sample are specified with the -e option; either pass a comma-separated list or pass -e multiple times.

Events are specified in the form name[:modifier]. The list and description of the modifiers can be found in the perf-list(1) manpage under EVENT MODIFIERS.

# L1 i$ misses in user space.
# L2 i$ stats in user/kernel space mixed.
# Sample specified events.
perf stat -e L1-icache-load-misses:u \
          -e l2_rqsts.all_code_rd:uk,l2_rqsts.code_rd_hit:k,l2_rqsts.code_rd_miss:k \
          -- stress -c 2

The --all-user and --all-kernel options append a :u and :k modifier to all specified events. Therefore the following two command lines are equivalent.

# 1)
perf stat -e cycles:u,instructions:u -- ls
# 2)
perf stat --all-user -e cycles,instructions -- ls

Raw events

In case perf does not provide a symbolic name for an event, the event can be specified in a raw form as r + UMask + EventCode.

The following is an example for the L2_RQSTS.CODE_RD_HIT event with EventCode=0x24 and UMask=0x10 on my laptop with a sandybridge uarch.

perf stat -e l2_rqsts.code_rd_hit -e r1024 -- ls

# Performance counter stats for 'ls':
#
#     33.942   l2_rqsts.code_rd_hit
#     33.942   r1024
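How perf resolves a symbolic name internally can also be cross-checked with perf list --details (see the cheat sheet above); the exact output shape varies between perf versions, so the grep below is only a sketch:

# Show the internally used event terms for the symbolic name.
perf list --details | grep -i -A 1 'l2_rqsts.code_rd_hit'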

Find raw performance counter events (intel)

The intel/perfmon repository provides a performance event database for the different intel uarchs.

The table in mapfile.csv can be used to look up the corresponding uarch; just grab the family and model from procfs.

cat /proc/cpuinfo | awk '/^vendor_id/ { V=$3 } /^cpu family/ { F=$4 } /^model\s*:/ { printf "%s-%d-%x\n",V,F,$3 }'
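On the sandybridge laptop from the raw event example this prints GenuineIntel-6-2a (family 6, model 0x2a), which can then be matched against the database index; the checkout path below is an assumption:

# Each mapfile.csv row maps a family-model string to the event JSON files of a uarch.
grep -i 'GenuineIntel-6-2A' perfmon/mapfile.csv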

The table in performance monitoring events describes how events are sorted into the different files.
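Once the right JSON file is known, a plain grep is enough to pull out the encoding of a single event; the file path below follows from the sandybridge example but is an assumption about the repository layout:

# Dump the record for L2_RQSTS.CODE_RD_HIT; EventCode and UMask live
# next to the EventName in the JSON event record.
grep -C 4 '"L2_RQSTS.CODE_RD_HIT"' perfmon/SNB/events/sandybridge_core.json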

Raw events for perf's own symbolic names

Perf also defines some symbolic names of its own for events. An example is the cache-references event. The perf_event_open(2) manpage gives the following description.

perf_event_open(2)

PERF_COUNT_HW_CACHE_REFERENCES
    Cache accesses. Usually this indicates Last Level Cache accesses
    but this may vary depending on your CPU. This may include
    prefetches and coherency messages; again this depends on the
    design of your CPU.

Sysfs can be consulted to see which concrete performance counter such a symbolic name maps to on a given system.

cat /sys/devices/cpu/events/cache-misses
# event=0x2e,umask=0x41
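The terms read from sysfs can be fed straight back into perf via the PMU event syntax, which is a quick way to double-check that the symbolic name and the explicit encoding count the same thing:

# Symbolic name vs explicit PMU terms taken from sysfs.
perf stat -e cache-misses -e cpu/event=0x2e,umask=0x41/ -- ls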

Flamegraph

Flamegraph with single event trace

perf record -g -e cpu-cycles -p <pid>
perf script | FlameGraph/stackcollapse-perf.pl | FlameGraph/flamegraph.pl > cycles-flamegraph.svg
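The stackcollapse-perf.pl and flamegraph.pl helpers used above come from Brendan Gregg's FlameGraph repository:

# Fetch the FlameGraph helper scripts.
git clone https://github.com/brendangregg/FlameGraph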

Flamegraph with multiple event traces

perf record -g -e cpu-cycles,page-faults -p <pid>
perf script --per-event-dump
# fold & generate as above
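--per-event-dump writes one perf script output per event; assuming the usual perf.data.<event>.dump naming, the folding then runs per file:

# Fold & render each per-event dump separately.
FlameGraph/stackcollapse-perf.pl perf.data.cpu-cycles.dump \
    | FlameGraph/flamegraph.pl > cycles-flamegraph.svg
FlameGraph/stackcollapse-perf.pl perf.data.page-faults.dump \
    | FlameGraph/flamegraph.pl > page-faults-flamegraph.svg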

Examples

Estimate max instructions per cycle

#define NOP4    "nop\nnop\nnop\nnop\n"
#define NOP32   NOP4   NOP4   NOP4   NOP4   NOP4   NOP4   NOP4   NOP4
#define NOP256  NOP32  NOP32  NOP32  NOP32  NOP32  NOP32  NOP32  NOP32
#define NOP2048 NOP256 NOP256 NOP256 NOP256 NOP256 NOP256 NOP256 NOP256

int main() {
  for (unsigned i = 0; i < 2000000; ++i) {
    asm volatile(NOP2048);
  }
}
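A build along these lines is assumed (the source file name noploop.c is made up):

# The asm volatile block keeps the nops in place at any optimization level.
gcc -o noploop noploop.c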
perf stat -e cycles,instructions ./noploop

# Performance counter stats for './noploop':
#
#     1.031.075.940      cycles
#     4.103.534.341      instructions   # 3,98 insn per cycle

Caller vs callee callstacks

The following gives an example for a scenario with these call chains (a matching record invocation is sketched after the list):

  • main -> do_foo() -> do_work()
  • main -> do_bar() -> do_work()
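Assuming bench was compiled with -fno-omit-frame-pointer (see --call-graph fp in the cheat sheet above), a recording that produces the reports below could look like this:

# Record the hypothetical bench binary with frame-pointer call graphs.
perf record -g -- ./bench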
perf report --stdio -g graph,caller

# Children      Self  Command  Shared Object  Symbols
# ........  ........  .......  .............  ...............
#
#   49.71%    49.66%  bench    bench          [.] do_work
#            |
#            --49.66%--_start                 <- callstack bottom
#                      __libc_start_main
#                      0x7ff366c62ccf
#                      main
#                      |
#                      |--25.13%--do_bar
#                      |          do_work     <- callstack top
#                      |
#                       --24.53%--do_foo
#                                 do_work

perf report --stdio -g graph,callee

# Children      Self  Command  Shared Object  Symbols
# ........  ........  .......  .............  ...............
#
#   49.71%    49.66%  bench    bench          [.] do_work
#            |
#            ---do_work                       <- callstack top
#               |
#               |--25.15%--do_bar
#               |          main
#               |          0x7ff366c62ccf
#               |          __libc_start_main
#               |          _start             <- callstack bottom
#               |
#                --24.55%--do_foo
#                          main
#                          0x7ff366c62ccf
#                          __libc_start_main
#                          _start             <- callstack bottom
