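GCC optimize flags
A quick note on GCC's optimization options and a few of their details.
Normally, the only compiler optimizations you can switch on or off individually are the ones that have their own flag.
You can run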
gcc -c -Q -O3 --help=optimizers > /tmp/O3-opts
gcc -c -Q -O2 --help=optimizers > /tmp/O2-opts
diff /tmp/O2-opts /tmp/O3-opts | grep enabled
to see which optimizations are enabled and how O2 and O3 differ.
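The same comparison can be done without reading raw diff output. Below is a minimal sketch, assuming the two listings were saved as in the commands above; `newly_enabled` is a hypothetical helper name, not part of GCC.

```shell
# newly_enabled LOW HIGH
# LOW and HIGH are saved `gcc -Q -Ox --help=optimizers` listings.
# Prints every flag marked [enabled] in HIGH but not in LOW.
# (Hypothetical helper; the name is an assumption for illustration.)
newly_enabled() {
    awk '
        $NF != "[enabled]"  { next }                # keep only enabled flags
        FILENAME == ARGV[1] { low[$1] = 1; next }   # remember the lower level
        !($1 in low)        { print $1 }            # enabled only at the higher level
    ' "$1" "$2"
}

# Usage (requires gcc):
#   gcc -c -Q -O2 --help=optimizers > /tmp/O2-opts
#   gcc -c -Q -O3 --help=optimizers > /tmp/O3-opts
#   newly_enabled /tmp/O2-opts /tmp/O3-opts
```

Swapping the two arguments lists flags that the higher level turns off, if there are any.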
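One thing to watch: for debug builds, prefer -Og, or use -O1 or -O0, and keep inlining out of them. Otherwise breakpoints jump around oddly and no longer line up with the source, and you may have to read the assembly to work out what is going on.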
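One way to keep debug and release flags apart is a small helper in the build script. This is only a sketch: `cflags_for` and its two mode names are hypothetical. Adding -fno-inline in debug mode is my own extra guard, motivated by the fact that the -O1 listing further down already shows -finline and -finline-functions-called-once as enabled.

```shell
# cflags_for MODE: print compiler flags for a build mode.
# (Hypothetical helper; mode names and flag choices are assumptions.)
cflags_for() {
    case "$1" in
        debug)   printf '%s\n' "-Og -g -fno-inline" ;;  # debuggable, no inlining
        release) printf '%s\n' "-O2 -DNDEBUG" ;;        # optimized build
        *)       echo "unknown build mode: $1" >&2; return 1 ;;
    esac
}

# Usage (assumes main.c exists and gcc is installed):
#   gcc $(cflags_for debug) -o main main.c
```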
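Default optimization: -O0
O0 is the default optimization level. In theory it performs no optimization at all, but after some digging I found that a number of optimization flags are still listed as enabled: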
-faggressive-loop-optimizations [enabled]
-fallocation-dce [enabled]
-fasynchronous-unwind-tables [enabled]
-fauto-inc-dec [enabled]
-fbit-tests [enabled]
-fdce [enabled]
-fearly-inlining [enabled]
-ffp-int-builtin-inexact [enabled]
-ffunction-cse [enabled]
-fgcse-lm [enabled]
-finline-atomics [enabled]
-fipa-stack-alignment [enabled]
-fipa-strict-aliasing [enabled]
-fira-hoist-pressure [enabled]
-fira-share-save-slots [enabled]
-fira-share-spill-slots [enabled]
-fivopts [enabled]
-fjump-tables [enabled]
-flifetime-dse [enabled]
-fmath-errno [enabled]
-fpeephole [enabled]
-fplt [enabled]
-fprintf-return-value [enabled]
-freg-struct-return [enabled]
-fsched-critical-path-heuristic [enabled]
-fsched-dep-count-heuristic [enabled]
-fsched-group-heuristic [enabled]
-fsched-interblock [enabled]
-fsched-last-insn-heuristic [enabled]
-fsched-rank-heuristic [enabled]
-fsched-spec [enabled]
-fsched-spec-insn-heuristic [enabled]
-fsched-stalled-insns-dep [enabled]
-fschedule-fusion [enabled]
-fsemantic-interposition [enabled]
-fshort-enums [enabled]
-fshrink-wrap-separate [enabled]
-fsigned-zeros [enabled]
-fsplit-ivs-in-unroller [enabled]
-fssa-backprop [enabled]
-fstdarg-opt [enabled]
-ftrapping-math [enabled]
-ftree-forwprop [enabled]
-ftree-loop-im [enabled]
-ftree-loop-ivcanon [enabled]
-ftree-loop-optimize [enabled]
-ftree-phiprop [enabled]
-ftree-reassoc [enabled]
-ftree-scev-cprop [enabled]
-funreachable-traps [enabled]
-funwind-tables [enabled]
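To check a single flag rather than scan the whole list, a small lookup over a saved listing is enough. The sketch below assumes the listing was produced with `gcc -Q -O0 --help=optimizers`; `flag_state` is a hypothetical helper name. One caveat: the GCC manual notes that most optimizations are completely disabled at -O0 even when individual flags are given, so [enabled] here likely reflects the flag's default value rather than a pass that actually runs.

```shell
# flag_state FLAG FILE
# Prints the state column ([enabled], [disabled], or a value) for FLAG
# in a saved `gcc -Q --help=optimizers` listing.
# Returns non-zero when the flag is not listed at all.
# (Hypothetical helper; the name is an assumption for illustration.)
flag_state() {
    awk -v flag="$1" '
        $1 == flag { print (NF > 1 ? $NF : "[unknown]"); found = 1 }
        END        { exit !found }
    ' "$2"
}

# Usage (requires gcc):
#   gcc -c -Q -O0 --help=optimizers > /tmp/O0-opts
#   flag_state -fdce /tmp/O0-opts
```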
-O1 optimization
A quick look at the description of -O1:
Optimize. Optimizing compilation takes somewhat more time, and a lot more memory for a large function. With -O, the compiler tries to reduce code size and execution time, without performing any optimizations that take a great deal of compilation time.
In practice it enables the following optimizations:
-faggressive-loop-optimizations [enabled]
-fallocation-dce [enabled]
-fasynchronous-unwind-tables [enabled]
-fauto-inc-dec [enabled]
-fbit-tests [enabled]
-fbranch-count-reg [enabled]
-fcombine-stack-adjustments [enabled]
-fcompare-elim [enabled]
-fcprop-registers [enabled]
-fdce [enabled]
-fdefer-pop [enabled]
-fdse [enabled]
-fearly-inlining [enabled]
-fforward-propagate [enabled]
-ffp-int-builtin-inexact [enabled]
-ffunction-cse [enabled]
-fgcse-lm [enabled]
-fguess-branch-probability [enabled]
-fif-conversion [enabled]
-fif-conversion2 [enabled]
-finline [enabled]
-finline-atomics [enabled]
-finline-functions-called-once [enabled]
-fipa-modref [enabled]
-fipa-profile [enabled]
-fipa-pure-const [enabled]
-fipa-reference [enabled]
-fipa-reference-addressable [enabled]
-fipa-stack-alignment [enabled]
-fipa-strict-aliasing [enabled]
-fira-hoist-pressure [enabled]
-fira-share-save-slots [enabled]
-fira-share-spill-slots [enabled]
-fivopts [enabled]
-fjump-tables [enabled]
-flifetime-dse [enabled]
-fmath-errno [enabled]
-fmove-loop-invariants [enabled]
-fmove-loop-stores [enabled]
-fomit-frame-pointer [enabled]
-fpeephole [enabled]
-fplt [enabled]
-fprintf-return-value [enabled]
-freg-struct-return [enabled]
-freorder-blocks [enabled]
-fsched-critical-path-heuristic [enabled]
-fsched-dep-count-heuristic [enabled]
-fsched-group-heuristic [enabled]
-fsched-interblock [enabled]
-fsched-last-insn-heuristic [enabled]
-fsched-rank-heuristic [enabled]
-fsched-spec [enabled]
-fsched-spec-insn-heuristic [enabled]
-fsched-stalled-insns-dep [enabled]
-fschedule-fusion [enabled]
-fsemantic-interposition [enabled]
-fshort-enums [enabled]
-fshrink-wrap [enabled]
-fshrink-wrap-separate [enabled]
-fsigned-zeros [enabled]
-fsplit-ivs-in-unroller [enabled]
-fsplit-wide-types [enabled]
-fssa-backprop [enabled]
-fssa-phiopt [enabled]
-fstdarg-opt [enabled]
-fthread-jumps [enabled]
-ftoplevel-reorder [enabled]
-ftrapping-math [enabled]
-ftree-bit-ccp [enabled]
-ftree-builtin-call-dce [enabled]
-ftree-ccp [enabled]
-ftree-ch [enabled]
-ftree-coalesce-vars [enabled]
-ftree-copy-prop [enabled]
-ftree-dce [enabled]
-ftree-dominator-opts [enabled]
-ftree-dse [enabled]
-ftree-forwprop [enabled]
-ftree-fre [enabled]
-ftree-loop-im [enabled]
-ftree-loop-ivcanon [enabled]
-ftree-loop-optimize [enabled]
-ftree-phiprop [enabled]
-ftree-pta [enabled]
-ftree-reassoc [enabled]
-ftree-scev-cprop [enabled]
-ftree-sink [enabled]
-ftree-slsr [enabled]
-ftree-sra [enabled]
-ftree-ter [enabled]
-funwind-tables [enabled]
-O2 optimization
Compared with O1, the description of O2 is somewhat more aggressive:
Optimize even more. GCC performs nearly all supported optimizations that do not involve a space-speed tradeoff. As compared to -O, this option increases both compilation time and the performance of the generated code.
On top of O1, O2 also enables the following optimizations:
< -falign-functions [enabled]
< -falign-jumps [enabled]
< -falign-labels [enabled]
< -falign-loops [enabled]
< -fcaller-saves [enabled]
< -fcode-hoisting [enabled]
< -fcrossjumping [enabled]
< -fcse-follow-jumps [enabled]
< -fdevirtualize [enabled]
< -fdevirtualize-speculatively [enabled]
< -fexpensive-optimizations [enabled]
< -fgcse [enabled]
< -fhoist-adjacent-loads [enabled]
< -findirect-inlining [enabled]
< -finline-functions [enabled]
< -finline-small-functions [enabled]
< -fipa-bit-cp [enabled]
< -fipa-cp [enabled]
< -fipa-icf [enabled]
< -fipa-icf-functions [enabled]
< -fipa-icf-variables [enabled]
< -fipa-ra [enabled]
< -fipa-sra [enabled]
< -fipa-vrp [enabled]
< -fisolate-erroneous-paths-dereference [enabled]
< -flra-remat [enabled]
< -foptimize-sibling-calls [enabled]
< -foptimize-strlen [enabled]
< -fpartial-inlining [enabled]
< -fpeephole2 [enabled]
< -free [enabled]
< -freorder-blocks-and-partition [enabled]
< -freorder-functions [enabled]
< -frerun-cse-after-loop [enabled]
< -fschedule-insns2 [enabled]
< -fstore-merging [enabled]
< -fstrict-aliasing [enabled]
< -ftree-loop-distribute-patterns [enabled]
< -ftree-loop-vectorize [enabled]
< -ftree-pre [enabled]
< -ftree-slp-vectorize [enabled]
< -ftree-switch-conversion [enabled]
< -ftree-tail-merge [enabled]
< -ftree-vrp [enabled]
< -funroll-loops [enabled]
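Any flag in a list like this can be switched off individually on top of -O2 by using its -fno- form, and the -Q listing should then report it as [disabled]. The sketch below only derives the negated spelling; `negate_flag` is a hypothetical helper, and it assumes the usual -f / -fno- naming convention applies to the flag in question.

```shell
# negate_flag FLAG: print the opposite spelling of an -f flag.
#   -fstrict-aliasing    -> -fno-strict-aliasing
#   -fno-strict-aliasing -> -fstrict-aliasing
# (Hypothetical helper; assumes the plain -f / -fno- naming convention.)
negate_flag() {
    case "$1" in
        -fno-*) printf '%s\n' "-f${1#-fno-}" ;;
        -f*)    printf '%s\n' "-fno-${1#-f}" ;;
        *)      echo "not an -f flag: $1" >&2; return 1 ;;
    esac
}

# Usage (requires gcc):
#   gcc -c -Q -O2 "$(negate_flag -fstrict-aliasing)" --help=optimizers \
#       | grep strict-aliasing
```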
-O3 optimization
O3 extends the list a little further:
> -fgcse-after-reload [enabled]
> -fipa-cp-clone [enabled]
> -floop-interchange [enabled]
> -floop-unroll-and-jam [enabled]
> -fpeel-loops [enabled]
> -fpredictive-commoning [enabled]
> -fsplit-loops [enabled]
> -fsplit-paths [enabled]
> -ftree-loop-distribution [enabled]
> -ftree-partial-pre [enabled]
> -funroll-completely-grow-size [enabled]
> -funswitch-loops [enabled]
> -fversion-loops-for-strides [enabled]
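The remaining optimization options are left for me to look into later.
Clang optimizations
Clang is built on LLVM, whose optimization mechanism is based on a pass pipeline. -O2 does not simply flip a set of boolean switches; it builds a specific pipeline of optimization passes over LLVM IR.
Getting the passes for each optimization level: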
# Get the passes (requires a .cpp file)
clang -O0 -mllvm -print-pipeline-passes -c main.cpp
clang -O1 -mllvm -print-pipeline-passes -c main.cpp
clang -O2 -mllvm -print-pipeline-passes -c main.cpp
clang -O3 -mllvm -print-pipeline-passes -c main.cpp
clang -Os -mllvm -print-pipeline-passes -c main.cpp
clang -Oz -mllvm -print-pipeline-passes -c main.cpp
Generating diffs for comparison:
# Split the passes into files
clang -O0 -mllvm -print-pipeline-passes -c main.cpp 2>&1 | tr ',' '\n' | sort > /tmp/o0.txt
clang -O1 -mllvm -print-pipeline-passes -c main.cpp 2>&1 | tr ',' '\n' | sort > /tmp/o1.txt
clang -O2 -mllvm -print-pipeline-passes -c main.cpp 2>&1 | tr ',' '\n' | sort > /tmp/o2.txt
clang -O3 -mllvm -print-pipeline-passes -c main.cpp 2>&1 | tr ',' '\n' | sort > /tmp/o3.txt
clang -Os -mllvm -print-pipeline-passes -c main.cpp 2>&1 | tr ',' '\n' | sort > /tmp/os.txt
clang -Oz -mllvm -print-pipeline-passes -c main.cpp 2>&1 | tr ',' '\n' | sort > /tmp/oz.txt
# Compare (show only added/removed)
diff /tmp/o0.txt /tmp/o1.txt | grep "^[<>]"
diff /tmp/o1.txt /tmp/o2.txt | grep "^[<>]"
diff /tmp/o2.txt /tmp/o3.txt | grep "^[<>]"
diff /tmp/o2.txt /tmp/os.txt | grep "^[<>]"
diff /tmp/o2.txt /tmp/oz.txt | grep "^[<>]"
O0 -> O1 Diff
Added passes (> means only in O1, < means only in O0):
> adce
> alignment-from-assumptions
> bdce
> called-value-propagation
> constmerge
> coro-elide
> deadargelim
> div-rem-pairs
> early-cse<memssa>
> function-attrs
> globalopt
> indvars
> infer-alignment
> instcombine (multiple times)
> instsimplify
> ipsccp
> libcalls-shrinkwrap
> licm
> loop-deletion
> loop-distribute
> loop-unroll-full
> loop-unroll<O1>
> loop-vectorize
> memcpyopt
> reassociate
> sccp
> simple-loop-unswitch
> simplifycfg (multiple times)
> sroa (multiple times)
> tailcallelim
> vector-combine
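A sorted diff hides how often a pass such as instcombine is scheduled. The sketch below counts repeated passes directly from the raw pipeline string, ignoring their parameters. `pass_counts` is a hypothetical helper name, and the same assumption applies as before: parameters inside <...> contain no commas or parentheses.

```shell
# pass_counts: read a -print-pipeline-passes string on stdin and print
# "name xN" for every entry that occurs more than once. Parameters in <...>
# are stripped first, so instcombine<a> and instcombine<b> count together.
# (Hypothetical helper; the name is an assumption for illustration.)
pass_counts() {
    tr ',(' '\n\n' | tr -d ')' \
        | sed -e 's/<.*//' -e '/^$/d' \
        | sort | uniq -c \
        | awk '$1 > 1 { print $2 " x" $1 }'
}

# Usage (requires clang and a main.cpp):
#   clang -O1 -mllvm -print-pipeline-passes -c main.cpp 2>&1 | pass_counts
```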
O1 -> O2 Diff
< libcalls-shrinkwrap
< openmp-opt-cgscc
O2 -> O3 Diff
> argpromotion
> callsite-splitting
> chr
> loop-unroll<O3>
> simple-loop-unswitch<nontrivial;trivial>
< loop-unroll<O2>
< simple-loop-unswitch<no-nontrivial;trivial>
O2 -> Os Diff
Based on O2, optimizing for code size:
< libcalls-shrinkwrap
< openmp-opt-cgscc
- Disables libcalls-shrinkwrap and openmp-opt-cgscc
O2 -> Oz Diff
Based on O2, minimizing code size:
< libcalls-shrinkwrap
< loop-vectorize<no-interleave-forced-only;no-vectorize-forced-only;>
< openmp-opt-cgscc
> loop-vectorize<no-interleave-forced-only;vectorize-forced-only;>
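Key differences:
- Oz disables automatic loop vectorization (loop-vectorize), switching it to vectorize-forced-only
- Os keeps no-vectorize-forced-only (vectorization is not limited to explicitly forced loops)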
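To compare how one pass is parameterized across levels, its <...> options can be pulled out of the per-level files. This is a minimal sketch: `pass_params` is a hypothetical helper name, and it assumes the files contain one pipeline entry per line, as produced by the commands earlier in this post.

```shell
# pass_params NAME FILE
# Prints each ';'-separated parameter of pass NAME found in FILE,
# one per line. Lines that do not mention "NAME<" are ignored.
# (Hypothetical helper; the name is an assumption for illustration.)
pass_params() {
    awk -v name="$1" '
        {
            i = index($0, name "<")
            if (i == 0) next
            rest = substr($0, i + length(name) + 1)   # text after "NAME<"
            j = index(rest, ">")
            if (j == 0) next
            n = split(substr(rest, 1, j - 1), p, ";")
            for (k = 1; k <= n; k++)
                if (p[k] != "") print p[k]
        }
    ' "$2"
}

# Usage (after generating /tmp/os.txt and /tmp/oz.txt):
#   pass_params loop-vectorize /tmp/os.txt
#   pass_params loop-vectorize /tmp/oz.txt
```

Piping the result through `grep -qx vectorize-forced-only` gives a yes/no check on whether a level restricts vectorization to forced loops.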
This post is licensed under CC BY 4.0 by the author.