GCC optimization flags

A short record of GCC's optimization options, plus a few details.

Normally, the only compiler optimizations you can switch on or off individually are the ones that have their own flag.

You can run:

gcc -c -Q -O3 --help=optimizers > /tmp/O3-opts
gcc -c -Q -O2 --help=optimizers > /tmp/O2-opts
diff /tmp/O2-opts /tmp/O3-opts | grep enabled

This shows which optimizations are enabled and how O2 and O3 differ.
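The diff-plus-grep pipeline keeps [enabled] lines from both sides of the diff. As a minimal sketch, assuming every flag line in the dump ends with [enabled] or [disabled], a small helper can list only the flags that the second level turns on. The function name newly_enabled is made up for this example.

```shell
#!/bin/sh
# newly_enabled BASE NEW
# Print the flags marked [enabled] in NEW but not in BASE.
# BASE and NEW are dumps produced by: gcc -c -Q -Ox --help=optimizers
newly_enabled() {
    base_flags=$(mktemp) || return 1
    new_flags=$(mktemp) || { rm -f "$base_flags"; return 1; }

    # Keep only the flag name (first field) of lines whose last field is [enabled].
    awk '$NF == "[enabled]" { print $1 }' "$1" | sort -u > "$base_flags"
    awk '$NF == "[enabled]" { print $1 }' "$2" | sort -u > "$new_flags"

    # comm -13 prints the lines that appear only in the second sorted file.
    comm -13 "$base_flags" "$new_flags"

    rm -f "$base_flags" "$new_flags"
}

# Example (not run here):
#   newly_enabled /tmp/O2-opts /tmp/O3-opts
```

Because only the flag name is compared, a flag that goes from [disabled] to [enabled] shows up once instead of as a pair of diff lines.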

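One thing to note: for debug builds, prefer -Og, or use -O1 or -O0, and keep inlining out of the debug build. With inlined code, breakpoints jump around strangely and the source lines no longer line up with what is executing, so you may end up reading the assembly to analyze a problem. Note that -O1 still enables -finline-functions-called-once (see the -O1 list below), so add -fno-inline if that gets in the way.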
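A minimal sketch of keeping debug and release flags apart, assuming a single-file program. The function name build, the variables DEBUG_CFLAGS and RELEASE_CFLAGS, and the file names are all made up for the example; -fno-inline is added on top of -Og as one way to keep inlining out of the debug binary.

```shell
#!/bin/sh
# Hypothetical flag sets: the debug set avoids inlining, the release set does not.
DEBUG_CFLAGS="-Og -g -fno-inline"
RELEASE_CFLAGS="-O2"

# build MODE SRC OUT
# Compile SRC into OUT with the flag set for MODE (debug or release).
build() {
    case "$1" in
        debug)   flags=$DEBUG_CFLAGS ;;
        release) flags=$RELEASE_CFLAGS ;;
        *) echo "usage: build debug|release SRC OUT" >&2; return 2 ;;
    esac
    # $flags is left unquoted on purpose so it splits into separate arguments.
    # CC can be set to use a different compiler driver.
    ${CC:-gcc} $flags -o "$3" "$2"
}

# Example (not run here):
#   build debug main.c app_debug
```

Setting CC=echo prints the command line instead of compiling, which is a quick way to check which flags a mode would use.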

Default optimization: -O0

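-O0 is the default optimization level. In theory it performs no optimization, but -Q --help=optimizers still reports the flags below as enabled. The GCC manual notes that most optimizations are completely disabled at -O0 even when their individual flags are set, so an [enabled] entry here does not necessarily mean the pass runs.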

  -faggressive-loop-optimizations       [enabled]
  -fallocation-dce                      [enabled]
  -fasynchronous-unwind-tables          [enabled]
  -fauto-inc-dec                        [enabled]
  -fbit-tests                           [enabled]
  -fdce                                 [enabled]
  -fearly-inlining                      [enabled]
  -ffp-int-builtin-inexact              [enabled]
  -ffunction-cse                        [enabled]
  -fgcse-lm                             [enabled]
  -finline-atomics                      [enabled]
  -fipa-stack-alignment                 [enabled]
  -fipa-strict-aliasing                 [enabled]
  -fira-hoist-pressure                  [enabled]
  -fira-share-save-slots                [enabled]
  -fira-share-spill-slots               [enabled]
  -fivopts                              [enabled]
  -fjump-tables                         [enabled]
  -flifetime-dse                        [enabled]
  -fmath-errno                          [enabled]
  -fpeephole                            [enabled]
  -fplt                                 [enabled]
  -fprintf-return-value                 [enabled]
  -freg-struct-return                   [enabled]
  -fsched-critical-path-heuristic       [enabled]
  -fsched-dep-count-heuristic           [enabled]
  -fsched-group-heuristic               [enabled]
  -fsched-interblock                    [enabled]
  -fsched-last-insn-heuristic           [enabled]
  -fsched-rank-heuristic                [enabled]
  -fsched-spec                          [enabled]
  -fsched-spec-insn-heuristic           [enabled]
  -fsched-stalled-insns-dep             [enabled]
  -fschedule-fusion                     [enabled]
  -fsemantic-interposition              [enabled]
  -fshort-enums                         [enabled]
  -fshrink-wrap-separate                [enabled]
  -fsigned-zeros                        [enabled]
  -fsplit-ivs-in-unroller               [enabled]
  -fssa-backprop                        [enabled]
  -fstdarg-opt                          [enabled]
  -ftrapping-math                       [enabled]
  -ftree-forwprop                       [enabled]
  -ftree-loop-im                        [enabled]
  -ftree-loop-ivcanon                   [enabled]
  -ftree-loop-optimize                  [enabled]
  -ftree-phiprop                        [enabled]
  -ftree-reassoc                        [enabled]
  -ftree-scev-cprop                     [enabled]
  -funreachable-traps                   [enabled]
  -funwind-tables                       [enabled]

-O1 optimization

A quick look at the description of -O1:

Optimize. Optimizing compilation takes somewhat more time, and a lot more memory for a large function. With -O, the compiler tries to reduce code size and execution time, without performing any optimizations that take a great deal of compilation time.

In practice, the following optimizations are enabled:

  -faggressive-loop-optimizations       [enabled]
  -fallocation-dce                      [enabled]
  -fasynchronous-unwind-tables          [enabled]
  -fauto-inc-dec                        [enabled]
  -fbit-tests                           [enabled]
  -fbranch-count-reg                    [enabled]
  -fcombine-stack-adjustments           [enabled]
  -fcompare-elim                        [enabled]
  -fcprop-registers                     [enabled]
  -fdce                                 [enabled]
  -fdefer-pop                           [enabled]
  -fdse                                 [enabled]
  -fearly-inlining                      [enabled]
  -fforward-propagate                   [enabled]
  -ffp-int-builtin-inexact              [enabled]
  -ffunction-cse                        [enabled]
  -fgcse-lm                             [enabled]
  -fguess-branch-probability            [enabled]
  -fif-conversion                       [enabled]
  -fif-conversion2                      [enabled]
  -finline                              [enabled]
  -finline-atomics                      [enabled]
  -finline-functions-called-once        [enabled]
  -fipa-modref                          [enabled]
  -fipa-profile                         [enabled]
  -fipa-pure-const                      [enabled]
  -fipa-reference                       [enabled]
  -fipa-reference-addressable           [enabled]
  -fipa-stack-alignment                 [enabled]
  -fipa-strict-aliasing                 [enabled]
  -fira-hoist-pressure                  [enabled]
  -fira-share-save-slots                [enabled]
  -fira-share-spill-slots               [enabled]
  -fivopts                              [enabled]
  -fjump-tables                         [enabled]
  -flifetime-dse                        [enabled]
  -fmath-errno                          [enabled]
  -fmove-loop-invariants                [enabled]
  -fmove-loop-stores                    [enabled]
  -fomit-frame-pointer                  [enabled]
  -fpeephole                            [enabled]
  -fplt                                 [enabled]
  -fprintf-return-value                 [enabled]
  -freg-struct-return                   [enabled]
  -freorder-blocks                      [enabled]
  -fsched-critical-path-heuristic       [enabled]
  -fsched-dep-count-heuristic           [enabled]
  -fsched-group-heuristic               [enabled]
  -fsched-interblock                    [enabled]
  -fsched-last-insn-heuristic           [enabled]
  -fsched-rank-heuristic                [enabled]
  -fsched-spec                          [enabled]
  -fsched-spec-insn-heuristic           [enabled]
  -fsched-stalled-insns-dep             [enabled]
  -fschedule-fusion                     [enabled]
  -fsemantic-interposition              [enabled]
  -fshort-enums                         [enabled]
  -fshrink-wrap                         [enabled]
  -fshrink-wrap-separate                [enabled]
  -fsigned-zeros                        [enabled]
  -fsplit-ivs-in-unroller               [enabled]
  -fsplit-wide-types                    [enabled]
  -fssa-backprop                        [enabled]
  -fssa-phiopt                          [enabled]
  -fstdarg-opt                          [enabled]
  -fthread-jumps                        [enabled]
  -ftoplevel-reorder                    [enabled]
  -ftrapping-math                       [enabled]
  -ftree-bit-ccp                        [enabled]
  -ftree-builtin-call-dce               [enabled]
  -ftree-ccp                            [enabled]
  -ftree-ch                             [enabled]
  -ftree-coalesce-vars                  [enabled]
  -ftree-copy-prop                      [enabled]
  -ftree-dce                            [enabled]
  -ftree-dominator-opts                 [enabled]
  -ftree-dse                            [enabled]
  -ftree-forwprop                       [enabled]
  -ftree-fre                            [enabled]
  -ftree-loop-im                        [enabled]
  -ftree-loop-ivcanon                   [enabled]
  -ftree-loop-optimize                  [enabled]
  -ftree-phiprop                        [enabled]
  -ftree-pta                            [enabled]
  -ftree-reassoc                        [enabled]
  -ftree-scev-cprop                     [enabled]
  -ftree-sink                           [enabled]
  -ftree-slsr                           [enabled]
  -ftree-sra                            [enabled]
  -ftree-ter                            [enabled]
  -funwind-tables                       [enabled]

-O2 optimization

Compared with -O1, the description of -O2 is somewhat more aggressive:

Optimize even more. GCC performs nearly all supported optimizations that do not involve a space-speed tradeoff. As compared to -O, this option increases both compilation time and the performance of the generated code.

On top of -O1, -O2 also enables the following optimizations (the leading < is left over from the diff output):

<   -falign-functions                           [enabled]
<   -falign-jumps                               [enabled]
<   -falign-labels                              [enabled]
<   -falign-loops                               [enabled]
<   -fcaller-saves                              [enabled]
<   -fcode-hoisting                             [enabled]
<   -fcrossjumping                              [enabled]
<   -fcse-follow-jumps                          [enabled]
<   -fdevirtualize                              [enabled]
<   -fdevirtualize-speculatively                [enabled]
<   -fexpensive-optimizations                   [enabled]
<   -fgcse                                      [enabled]
<   -fhoist-adjacent-loads                      [enabled]
<   -findirect-inlining                         [enabled]
<   -finline-functions                          [enabled]
<   -finline-small-functions                    [enabled]
<   -fipa-bit-cp                                [enabled]
<   -fipa-cp                                    [enabled]
<   -fipa-icf                                   [enabled]
<   -fipa-icf-functions                         [enabled]
<   -fipa-icf-variables                         [enabled]
<   -fipa-ra                                    [enabled]
<   -fipa-sra                                   [enabled]
<   -fipa-vrp                                   [enabled]
<   -fisolate-erroneous-paths-dereference       [enabled]
<   -flra-remat                                 [enabled]
<   -foptimize-sibling-calls                    [enabled]
<   -foptimize-strlen                           [enabled]
<   -fpartial-inlining                          [enabled]
<   -fpeephole2                                 [enabled]
<   -free                                       [enabled]
<   -freorder-blocks-and-partition              [enabled]
<   -freorder-functions                         [enabled]
<   -frerun-cse-after-loop                      [enabled]
<   -fschedule-insns2                           [enabled]
<   -fstore-merging                             [enabled]
<   -fstrict-aliasing                           [enabled]
<   -ftree-loop-distribute-patterns             [enabled]
<   -ftree-loop-vectorize                       [enabled]
<   -ftree-pre                                  [enabled]
<   -ftree-slp-vectorize                        [enabled]
<   -ftree-switch-conversion                    [enabled]
<   -ftree-tail-merge                           [enabled]
<   -ftree-vrp                                  [enabled]
<   -funroll-loops                              [enabled]

-O3 optimization

-O3 adds a few more on top of -O2 (in the diff output, > marks lines that appear only in the -O3 dump):

>   -fgcse-after-reload                         [enabled]
>   -fipa-cp-clone                              [enabled]
>   -floop-interchange                          [enabled]
>   -floop-unroll-and-jam                       [enabled]
>   -fpeel-loops                                [enabled]
>   -fpredictive-commoning                      [enabled]
>   -fsplit-loops                               [enabled]
>   -fsplit-paths                               [enabled]
>   -ftree-loop-distribution                    [enabled]
>   -ftree-partial-pre                          [enabled]
>   -funroll-completely-grow-size               [enabled]
>   -funswitch-loops                            [enabled]
>   -fversion-loops-for-strides                 [enabled]

I will look into the remaining optimization options later.

Clang optimization

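Clang is built on LLVM, whose optimization mechanism is based on a pass pipeline. -O2 does not simply flip a set of boolean switches; it builds a specific pipeline of optimization passes that run over the LLVM IR.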

To get the passes for each optimization level:

# Get the passes (requires a .cpp file)
clang -O0 -mllvm -print-pipeline-passes -c main.cpp
clang -O1 -mllvm -print-pipeline-passes -c main.cpp
clang -O2 -mllvm -print-pipeline-passes -c main.cpp
clang -O3 -mllvm -print-pipeline-passes -c main.cpp
clang -Os -mllvm -print-pipeline-passes -c main.cpp
clang -Oz -mllvm -print-pipeline-passes -c main.cpp

To generate diffs for comparison:

# Split the passes one per line, sort them, and save them to files
clang -O0 -mllvm -print-pipeline-passes -c main.cpp 2>&1 | tr ',' '\n' | sort > /tmp/o0.txt
clang -O1 -mllvm -print-pipeline-passes -c main.cpp 2>&1 | tr ',' '\n' | sort > /tmp/o1.txt
clang -O2 -mllvm -print-pipeline-passes -c main.cpp 2>&1 | tr ',' '\n' | sort > /tmp/o2.txt
clang -O3 -mllvm -print-pipeline-passes -c main.cpp 2>&1 | tr ',' '\n' | sort > /tmp/o3.txt
clang -Os -mllvm -print-pipeline-passes -c main.cpp 2>&1 | tr ',' '\n' | sort > /tmp/os.txt
clang -Oz -mllvm -print-pipeline-passes -c main.cpp 2>&1 | tr ',' '\n' | sort > /tmp/oz.txt

# Compare (show only added/removed lines)
diff /tmp/o0.txt /tmp/o1.txt | grep "^[<>]"
diff /tmp/o1.txt /tmp/o2.txt | grep "^[<>]"
diff /tmp/o2.txt /tmp/o3.txt | grep "^[<>]"
diff /tmp/o2.txt /tmp/os.txt | grep "^[<>]"
diff /tmp/o2.txt /tmp/oz.txt | grep "^[<>]"

O0 -> O1 Diff

Added passes (> marks passes only in O1, < marks passes only in O0):

> adce
> alignment-from-assumptions
> bdce
> called-value-propagation
> constmerge
> coro-elide
> deadargelim
> div-rem-pairs
> early-cse<memssa>
> function-attrs
> globalopt
> indvars
> infer-alignment
> instcombine (multiple times)
> instsimplify
> ipsccp
> libcalls-shrinkwrap
> licm
> loop-deletion
> loop-distribute
> loop-unroll-full
> loop-unroll<O1>
> loop-vectorize
> memcpyopt
> reassociate
> sccp
> simple-loop-unswitch
> simplifycfg (multiple times)
> sroa (multiple times)
> tailcallelim
> vector-combine
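Several passes in the list above run more than once in the pipeline. As a rough sketch, assuming the pipeline string is a comma-separated list with nested groups in parentheses, the repeats can be counted as follows. The function name pass_counts is made up for this example, and wrapper names such as function or cgscc are counted like any other entry.

```shell
#!/bin/sh
# pass_counts
# Read a pipeline string on stdin and print "COUNT NAME" lines, most frequent first.
pass_counts() {
    # Split on commas and parentheses so that nested passes, e.g.
    # function(instcombine,simplifycfg), are counted individually.
    tr ',()' '\n\n\n' |
        grep -v '^[[:space:]]*$' |      # drop empty pieces left by the split
        sort | uniq -c |                # count identical names
        awk '{ print $1, $2 }' |        # normalize to "COUNT NAME"
        sort -k1,1nr -k2,2              # by count descending, then by name
}

# Example (not run here):
#   clang -O2 -mllvm -print-pipeline-passes -c main.cpp 2>&1 | pass_counts
```

Passes with different parameters, such as loop-unroll<O1> and loop-unroll<O2>, are counted as separate names.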

O1 -> O2 Diff

< libcalls-shrinkwrap
< openmp-opt-cgscc

O2 -> O3 Diff

> argpromotion
> callsite-splitting
> chr
> loop-unroll<O3>
> simple-loop-unswitch<nontrivial;trivial>
< loop-unroll<O2>
< simple-loop-unswitch<no-nontrivial;trivial>

O2 -> Os Diff

Based on O2, optimizing for code size:

< libcalls-shrinkwrap
< openmp-opt-cgscc
  • Disables libcalls-shrinkwrap and openmp-opt-cgscc

O2 -> Oz Diff

Based on O2, minimizing code size:

< libcalls-shrinkwrap
< loop-vectorize<no-interleave-forced-only;no-vectorize-forced-only;>
< openmp-opt-cgscc
> loop-vectorize<no-interleave-forced-only;vectorize-forced-only;>

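Key differences:

  • Oz turns off automatic loop vectorization: loop-vectorize switches to vectorize-forced-only, so only loops that are explicitly forced to vectorize are considered
  • Os keeps no-vectorize-forced-only, so vectorization is not limited to forced loops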

REF

  1. Options That Control Optimization
  2. Options Controlling the Kind of Output
This post is licensed under CC BY 4.0 by the author.