Yours truly is not an expert in compilers or assembly, but I do enjoy delving into these topics and learning something new from time to time.
What’s this even about?
While developing my toy compiler, I found the book “Introduction to Compilers and Language Design” to be an excellent resource. Among other things, the following paragraph caught my attention.
> A funny outcome of this approach is that if you generate code for an if-statement like `if (x>y) {...}` in the obvious way, you will end up with two conditional structures in the assembly code. The first conditional computes the result of `x>y` and places that in a register. The second conditional compares that result against zero and then jumps to the `true` or `false` branch of the statement. With a little careful coding, you can check for this common case… (11.5 Conditional Expressions)
(By the way, feel free to make jokes about funny assembly code at parties.) That made me wonder how "grown-up" compilers deal with such a scenario. Clang is a mature compiler, and its authors strive to keep the code base as understandable as possible. So, I decided to take a look into its sources.
Code samples
Before actually diving into Clang's code, let's first have a look at the assembly it emits for a simple conditional expression and an `if` statement.
A conditional expression
Here’s a bit of C++ code.
void f(int a, int b) { bool value; value = a == b; }
It contains a boolean expression `a == b`. And an assignment `value = ...`, which is an expression too. Clang translates this assignment into the assembly code below.
*(assembly listing for `f` not reproduced here; its lines 4-7 are discussed below)*
Time to take a quick tour of this code.
On line 4 the `mov` instruction loads 4 bytes located at the memory address `rbp - 4` into the register `eax`. The compiler allocated these 4 bytes at `rbp - 4` to store the value of `a` (a parameter of the function `f`).
Let's also have a closer look at the address `rbp - 4`. `rbp` is a register that points to (i.e., contains the address of) the current function's stack frame. What is a stack frame? A function reserves a portion of space on the stack for its parameters and local variables. People call this portion of the stack a frame. And so, `rbp` simply points at the start of the function's frame; the compiler makes sure of it.

With `rbp` set up that way, to find a parameter's address we just need to know its position in the function's parameter list and the size of each parameter. And this is where addresses like `rbp - 4` and `rbp - 8` come from.
OK, where were we? The value of `a` is in `eax`. On line 5, `cmp` compares `eax` (i.e., `a`) to the value in memory at `rbp - 8`[^1]. This is the address of `b`. Thus, on line 5 we compare `a` to `b`.
On line 6, we see `sete al`. What's it doing? In contrast to `mov`, `cmp` doesn't modify its operands; the comparison result is encoded in a special `EFLAGS` register. Thus, the compiler has to emit `sete al`, which sets the `al` register's value either to `0x00` or `0x01` according to the comparison result in `EFLAGS`. In other words, `sete al` materializes the comparison result in `al`.
Finally, on line 7, `mov` stores `al` (i.e., the result of `a == b`) into memory at `rbp - 9`, which is the address of `value`.
In general, when Clang is translating a boolean expression into assembly[^2], it emits instructions to place intermediate values into registers. So, at the next step, the compiler can emit assembly which does something with these intermediate values. E.g., stores such a value from a register into a variable's memory (`value = a == b`) or compares it to the value of another expression (`(a == b) != value`), etc.
An `if` statement
Here is another snippet of C++. The `a == b` comparison is still there, but instead of an assignment we have an `if` statement using the comparison result.
void g(int a, int b) { if (a == b) {} else {} }
And the assembly:
*(assembly listing for `g` not reproduced here)*
Clang generated instructions for `a == b` a little differently this time. Can you spot it?.. You're right on the money, dear reader! `sete al` is present in the first assembly fragment, but it's nowhere to be found in the second.
In the case of `if (a == b)`, the compiler notices the `if` statement is the only place where the result of `a == b` is used. That means the expression's value is not going to end up in a variable or be compared to another value. As a result, the compiler doesn't need to allocate a register and produce the assembly instructions to store the intermediate value into that register. And so, it doesn't.
If Clang didn't elide these redundant instructions, what would the funny assembly the book is talking about look like? The compiler could've ended up translating an `if (a == b) {} else {}` statement into the assembly below:
; `rbp - 4` is the address of `a`
; `rbp - 8` is the address of `b`
mov eax, dword ptr [rbp - 4] ; load `a` into `eax`
cmp eax, dword ptr [rbp - 8] ; compare `a` to `b`
sete al ; set `al` to `0x01` if equal
; or to `0x00` if not equal
test al, 1 ; this is the funny part:
; it effectively does `(a == b) == 1`
jz .LBB0_2 ; go to the else branch, if `al == 0x00`
; the empty then branch
jmp .LBB0_3 ; go to endif
.LBB0_2:
; the empty else branch
jmp .LBB0_3
.LBB0_3: ; endif
So, where exactly is Clang’s code responsible for eliding the redundant instructions?
Where is that code, already?!
We’re going to take it slow.
If you’d like to run the commands from this section and step through Clang/LLVM’s code along with me, either use a Clang installed on your system or build it from the sources.
A piece of C/C++ code goes through a lot of transformations to eventually end up encoded in assembly[^3]. These transformations are logically grouped into stages. A single transformation stage is commonly referred to as a pass. Clang reads the C/C++ source, makes sure it doesn't contain any syntax or semantic errors, and builds an AST out of it along the way. Then the compiler walks the AST and emits LLVM IR code representing the original C/C++ source[^4].
To a first approximation, LLVM IR is a machine-independent assembly[^5]. Imagine someone is writing a compiler and decides to target LLVM IR, i.e., their compiler will emit LLVM IR instead of a machine-specific assembly (x86 assembly is an example of the latter). Thanks to this decision, the developer can focus on their language's syntax and semantics and let LLVM do optimizations (as in, making the compiled program faster and/or smaller) and the conversion from LLVM IR to hardware-specific code.
Anyway, once generation of LLVM IR is complete, Clang hands the IR off to LLVM, which passes it through multiple stages. The majority of these perform optimizations, and a couple of final ones produce machine-specific code from the LLVM IR.
There’s a way to see how each pass changes the source. (Thank you, @arnt!) And it seems like a good starting point in our search: let’s figure out which exact pass elided the redundant materialization of the comparison result. Once we know the pass, we can zoom in on it, to see how it’s implemented.
Looking for the pass
To ask Clang/LLVM to show us the result of each pass, we invoke it like this:
clang -mllvm -print-after-all -c setcc-if.cc > setcc-if.ll 2>&1
Where setcc-if.cc contains a single line of C++:
void g(int a, int b) { if (a == b) {} else {} }
In response, the compiler writes a ton of useful info into the file setcc-if.ll. I'll only show excerpts relevant to the question at hand.
Originally, the intermediate representation of the code in setcc-if.cc is:
*(LLVM IR dump not reproduced here; its key lines 9 and 10 are discussed below)*
It indeed looks a lot like a dialect of assembly language. What does it do? On line 9, `icmp` compares `a` (`%0`) to `b` (`%1`) and assigns the result to `%cmp`. Next, on line 10, `br` inspects `%cmp` and jumps either to the label `if.then` or `if.else` accordingly. As we've seen earlier, materialization of this comparison's result (`%cmp`) translates into the `sete al` x86 instruction. Thus, we're looking for the pass that elides `%cmp`.
The thing to notice here is that LLVM IR doesn't have a concept of the `EFLAGS` register or anything similar. Which means, for any pass operating on LLVM IR, it's impossible to elide `%cmp`. Therefore, we should move on to the first pass that doesn't express its output in LLVM IR.
Here you go.
# *** IR Dump After X86 DAG->DAG Instruction Selection (x86-isel) ***:
# Machine code for function ?g@@YAXHH@Z: IsSSA, TracksLiveness
Frame Objects:
fi#0: size=4, align=4, at location [SP+8]
fi#1: size=4, align=4, at location [SP+8]
Function Live Ins: $ecx in %0, $edx in %1
bb.0.entry:
successors: %bb.2, %bb.1
liveins: $ecx, $edx
%1:gr32 = COPY $edx
%0:gr32 = COPY $ecx
MOV32mr %stack.0.b.addr, 1, $noreg, 0, $noreg, %1:gr32 :: (store (s32) into %ir.b.addr)
MOV32mr %stack.1.a.addr, 1, $noreg, 0, $noreg, %0:gr32 :: (store (s32) into %ir.a.addr)
%4:gr32 = MOV32rm %stack.1.a.addr, 1, $noreg, 0, $noreg :: (load (s32) from %ir.a.addr)
CMP32rm %4:gr32, %stack.0.b.addr, 1, $noreg, 0, $noreg, implicit-def $eflags :: (load (s32) from %ir.b.addr)
JCC_1 %bb.2, 5, implicit $eflags
bb.1.if.then:
; predecessors: %bb.0
successors: %bb.3
JMP_1 %bb.3
bb.2.if.else:
; predecessors: %bb.0
successors: %bb.3
JMP_1 %bb.3
bb.3.if.end:
; predecessors: %bb.2, %bb.1
RET64
# End machine code for function ?g@@YAXHH@Z.
This is Machine IR (MIR), "a human readable serialization format that is used to represent LLVM's machine specific intermediate representation". It's much closer to the actual x86 assembly, though, obviously, not quite the same.
Drumroll, please! `%cmp` is gone, and `sete al` (or rather `%X:gr8 = SETCCr ...`) is nowhere to be found! Looks like it's the source code of the pass "X86 DAG->DAG Instruction Selection (x86-isel)" we're going to be delving into.
Examining the code
Clang/LLVM's code base is not a small one. How are we going to find the source code of the pass? Attempting a couple of text searches across the entire source tree sounds worth a try. Searching for "X86 DAG->DAG Instruction Selection (x86-isel)" returns no results, but "X86 DAG->DAG Instruction Selection" gives two.
Clearly, X86ISelDAGToDAG.cpp is the search result we're looking for. The file contains the `X86DAGToDAGISel` class, which implements the "X86 DAG->DAG Instruction Selection" pass. A comment in the code describes its job: "X86-specific code to select X86 machine instructions for `SelectionDAG` operations".
Now, how do we find the exact place inside of it where the elision of `%cmp` happens? One powerful technique of reasoning about code is to randomly browse around and stare at it intensely. Applying it to `X86DAGToDAGISel` and its parent `SelectionDAGISel` has proven to be very effective. The method `X86DAGToDAGISel::runOnMachineFunction` looks like it could be the entry point of the pass: a good place to start following the call chain that should sooner or later lead us to the point where the next emitted MIR instruction is decided.

`X86DAGToDAGISel::runOnMachineFunction` eventually calls the parent's method it overrides, `SelectionDAGISel::runOnMachineFunction`. This is a relatively big method, so let's step through it in a debugger. This way we aren't going to be distracted by branches of code that don't actually get executed, on the one hand, and won't miss anything that does get executed, on the other. Eventually, the execution reaches line 482, where a call to `SelectionDAGISel::SelectAllBasicBlocks` is made. This method "[p]erform[s] instruction selection on all basic blocks in the function". Looks like we're headed in the right direction.
Stepping through `SelectAllBasicBlocks`, we reach the `for` loop on line 1563 of SelectionDAGISel.cpp, which "[d]o[es] `FastISel` on as many instructions as possible":
// Do FastISel on as many instructions as possible.
for (; BI != Begin; --BI) {
// [The loop body's edited out for brevity]
}
Notice, the loop goes backwards: from the last instruction of a basic block to the first one. A little later we arrive at line 1578:
if (FastIS->selectInstruction(Inst)) { /* [ Edited out for brevity ] */ }
It calls into `FastISel::selectInstruction`, and that one, in turn, invokes `X86FastISel::fastSelectInstruction` on line 1559. This is where the execution goes from `FastISel`, containing a generic instruction selection algorithm, to `X86FastISel`, which is target-specific.
Peeking inside of `X86FastISel::fastSelectInstruction`, we see something interesting.
bool
X86FastISel::fastSelectInstruction(const Instruction *I) {
switch (I->getOpcode()) {
// [ Edited out for brevity ]
case Instruction::ICmp:
case Instruction::FCmp:
return X86SelectCmp(I);
// [ Edited out for brevity ]
case Instruction::Br:
return X86SelectBranch(I);
// [ Edited out for brevity ]
}
return false;
}
Apparently, each `case` corresponds to an LLVM IR instruction `fastSelectInstruction` can deal with. Now, let's take a step back for a second. We can make an educated guess: the redundant value `%cmp` is potentially elided either by the `icmp` or the `br` LLVM IR instruction handler. Getting back to the method, it has branches for both of these. Studying the corresponding methods `X86FastISel::X86SelectCmp` and `X86FastISel::X86SelectBranch` is an obvious next step.
`X86FastISel::X86SelectCmp` doesn't seem to contain any logic for eliding the redundant `sete al`/`%X:gr8 = SETCCr ...`. Very much to the contrary, it unconditionally emits `X86::SETCCr` on line 1520.
BuildMI(*FuncInfo.MBB, FuncInfo.InsertPt, MIMD, TII.get(X86::SETCCr),
ResultReg).addImm(CC);
Maybe we'll have better luck with `X86FastISel::X86SelectBranch`? (Yeah, alright, you know I know we will. And I know you know… And I'm here all week.) Indeed, on line 1635 we have:
// Fold the common case of a conditional branch with a comparison
// in the same block (values defined on other blocks may not have
// initialized registers).
X86::CondCode CC;
if (const CmpInst *CI = dyn_cast<CmpInst>(BI->getCondition())) {
if (CI->hasOneUse() && CI->getParent() == I->getParent()) {
// [ Edited out for brevity ]
}
// [ Edited out for brevity ]
}
`BI` stands for branch instruction and points to the object representing this line of IR: `br i1 %cmp, label %if.then, label %if.else`. `CI` stands for comparison instruction and points to this line of IR: `%cmp = icmp eq i32 %0, %1`. `CI->hasOneUse()` is `true` in case `%cmp` is referenced only once in the IR (aside from assigning it a value, of course). For example, if the IR contained an instruction `%frombool = zext i1 %cmp to i8`, it would constitute another use, and as a result, `CI->hasOneUse()` would evaluate to `false`. `CI->getParent() == I->getParent()` checks whether the comparison and branching instructions reside in the same basic block; they do in our case (this post isn't going into more detail about it).
If a branching instruction satisfies the above conditions, LLVM emits the MIR comparison instruction `CMP32rm` directly instead of relying on `X86FastISel::X86SelectCmp`, which emits an unnecessary `SETCCr` unconditionally. Hence, LLVM avoids generating `SETCCr` when there is nobody "waiting" to use a materialized result of the comparison. Bingo!
But wait: the next iteration of the `for` loop in `SelectionDAGISel::SelectAllBasicBlocks` is going to pick up `%cmp = icmp eq i32 %0, %1` (remember, the loop is going backwards). It will pass the instruction to `X86FastISel::X86SelectCmp`, and we know that one emits `SETCCr`. What gives?
To answer this, we need to focus on two more spots in the LLVM sources. The first one is inside of `X86FastISel::X86SelectBranch`. The execution reaches it when it's impossible to fold a conditional branch with a comparison.
// Otherwise do a clumsy setcc and re-test it.
// Note that i1 essentially gets ANY_EXTEND'ed to i8 where it isn't used
// in an explicit cast, so make sure to handle that correctly.
Register OpReg = getRegForValue(BI->getCondition());
The method `FastISel::getRegForValue`, through a chain of calls, puts the comparison instruction `BI->getCondition()` into a map, `SelectionDAGISel::FuncInfo::ValueMap`.
The second one is a call to `isFoldedOrDeadInstruction` at the very start of the `for` loop's body in `SelectionDAGISel::SelectAllBasicBlocks`. `isFoldedOrDeadInstruction` runs a number of checks to determine if the instruction should be skipped. As one of these checks, it looks up the instruction in `SelectionDAGISel::FuncInfo::ValueMap`. If the instruction is there, the appropriate handler is invoked. If it is not, the loop skips to the next instruction.
// Do FastISel on as many instructions as possible.
for (; BI != Begin; --BI) {
const Instruction *Inst = &*std::prev(BI);
// If we no longer require this instruction, skip it.
if (isFoldedOrDeadInstruction(Inst, *FuncInfo) ||
ElidedArgCopyInstrs.count(Inst)) {
--NumFastIselRemaining;
continue;
}
// [ ... ]
}
To reiterate: if `X86FastISel::X86SelectBranch` can't fold a comparison into a conditional branch, it places the comparison instruction into `SelectionDAGISel::FuncInfo::ValueMap`. So when `SelectionDAGISel::SelectAllBasicBlocks` picks that instruction up for processing, it knows the instruction is "alive" and invokes `X86FastISel::X86SelectCmp` to emit the corresponding MIR instructions as usual. But in case a comparison has actually been folded into a conditional branch, the comparison is not placed into `ValueMap`, and thus `SelectionDAGISel::SelectAllBasicBlocks` skips it on the next step.
Conclusion
I guess it's fair to say there is no generalized optimization technique in LLVM that gets rid of redundant comparison-result materializations. Just "a little careful coding"…
And thank you for reading! (Hey, if you just jumped to the bottom without reading, that’s OK too â seriously, don’t feel bad about it.)
[^1]: How come we loaded `a` into a register, but `b` we're referencing simply by its address? Why go the extra mile when we can compare the two memory locations directly? Well, that's because we can't. (Try the veal, by the way.) x86 machine code can't encode an instruction with two explicit memory operands.

[^2]: Clang, of course, doesn't directly generate assembly; it produces LLVM IR.

[^3]: Or rather machine code, but we won't touch on this topic here.

[^4]: Clang doesn't actually wait to parse the entire source file to build its IR representation. Instead, it parses one top-level declaration at a time, emits IR instructions for it, and then moves on to the next one. Once the IR module for the source file is complete, it gets handed off to LLVM's codegen pipeline. If you'd like a deeper dive, a good spot to start from is `void clang::ParseAST(Sema &S, bool PrintStats, bool SkipFunctionBodies)`.

[^5]: Well, according to LLVM's documentation, it actually is an assembly language.