🐌
Yours truly is not an expert in compilers or assembly, but I do enjoy delving into these topics and learning something new from time to time.

What’s this even about?

While developing my toy compiler, I found the book “Introduction to Compilers and Language Design” to be an excellent resource. Among other things, the following paragraph caught my attention.

A funny outcome of this approach is that if you generate code for an if-statement like if (x>y) {...} in the obvious way, you will end up with two conditional structures in the assembly code. The first conditional computes the result of x>y and places that in a register. The second conditional compares that results against zero and then jumps to the true or false branch of the statement. With a little careful coding, you can check for this common case…

— 11.5 Conditional Expressions

(By the way, feel free to make jokes about funny assembly code at parties). That made me wonder, how “grown-up” compilers deal with such a scenario. Clang is a mature compiler. And its authors strive to keep the code base as understandable as possible. So, I decided to take a look into its sources.

Code samples

Before actually diving into Clang’s code, let’s first have a look at the assembly it emits for a simple conditional expression and an if statement.

A conditional expression

Here’s a bit of C++ code.

void f(int a, int b) { bool value; value = a == b; }

It contains a boolean expression a == b. And an assignment value = ..., which is an expression too. Clang translates this assignment into the assembly code below.

1
2
3
4
5
6
7
8
; `rbp - 4` is the address of `a`
; `rbp - 8` is the address of `b`
; `rbp - 9` is the address of `value`
mov     eax, dword ptr [rbp - 4] ; load `a` into `eax`
cmp     eax, dword ptr [rbp - 8] ; compare `a` to `b`
sete    al                       ; set `al` to `0x01` if equal
                                 ;       or to `0x00` if not equal
mov     byte ptr [rbp - 9], al   ; store the result in `value`

Time to take a quick tour of this code.

On the line 4 the mov instruction loads 4 bytes located at the memory address rbp - 4 into the register eax. The compiler allocated these 4 bytes at rbp - 4 to store the value of a (a parameter of the function f).

Let’s also have a closer look at the address rbp - 4. rbp is a register that points to (i.e., contains the address of) the current function’s stack frame. What is a stack frame? A function reserves a portion of space on the stack for its parameters and local variables. People call this portion of the stack a frame. And so, rbp simply points at the start of the function’s frame, a compiler makes sure of it.

With rbp set up that way, to find a parameter’s address we just need to know it’s position in the function’s parameter list and the size of each parameter. And this is where addresses like rbp - 4 and rbp - 8 are coming from.

OK, where were we? The value of a is in eax. On the line 5 cmp compares eax (i.e., a) to the value in memory at rbp - 81. This is the address of b. Thus, on the line 5 we compare a to b.

On the line 6, we see sete al. What’s it doing? In contrast to mov, cmp doesn’t modify its operands — the comparison result is encoded in a special EFLAGS register. Thus, the compiler has to emit sete al, which sets the al register’s value either to 0x00 or 0x01 according the the comparison result in EFLAGS. In other words, sete al materializes the comparison result in al.

Finally, on the line 7, mov stores al (i.e., the result of a == b) into memory at rbp - 9, which is the address of value.

In general, when Clang is translating a boolean expression into assembly2, it emits instructions to place intermediate values into registers. So, at the next step, the compiler can emit assembly which does something with these intermediate values. E.g., stores such a value from the register into a variable’s memory (value = a == b) or compares it to the value of another expression ((a == b) != value), etc.

An if statement

Here is another snippet of C++. The a == b comparison is still there, but instead of an assignment we have an if statement using the comparison result.

void g(int a, int b) { if (a == b) {} else {} }

And the assembly:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
  ; `rbp - 4` is the address of `a`
  ; `rbp - 8` is the address of `b`
  mov     eax, dword ptr [rbp - 4] ; load `a` into `eax`
  cmp     eax, dword ptr [rbp - 8] ; compare `a` to `b`
  jne     .LBB0_2                  ; go to the else branch, if not equal
  ; the empty then branch
  jmp     .LBB0_3                  ; go to endif
.LBB0_2:
  ; the empty else branch
  jmp     .LBB0_3                  ; go to endif
.LBB0_3:                           ; endif

Clang generated instructions for a == b a little differently this time. Can you spot it?.. You’re right on the money, dear reader! sete al is present in the first assembly fragment, but it’s nowhere to be found in the second.

In the case of if (a == b), the compiler notices the if statement is the only place where the result of a == b is used. That means, the expression’s value is not going to end up in a variable or compared to another value. As a result, the compiler doesn’t need to allocate a register and produce the assembly instructions to store the intermediate value into that register. And so, it doesn’t.

If Clang didn’t elide these redundant instructions, how would that funny assembly the book is talking about look like? The compiler could’ve ended up translating an if (a == b) {} else {} statement into the assembly below:

  ; `rbp - 4` is the address of `a`
  ; `rbp - 8` is the address of `b`
  mov     eax, dword ptr [rbp - 4] ; load `a` into `eax`
  cmp     eax, dword ptr [rbp - 8] ; compare `a` to `b`
  sete    al                       ; set `al` to `0x01` if equal
                                   ;       or to `0x00` if not equal 
  test    al, 1                    ; this is the funny part:
                                   ; it effectively does `(a == b) == 1`
  jz     .LBB0_2                   ; go to the else branch, if `al == 0x00`
  ; the empty then branch
  jmp     .LBB0_3                  ; go to endif
.LBB0_2:
  ; the empty else branch
  jmp     .LBB0_3
.LBB0_3:                           ; endif

So, where exactly is Clang’s code responsible for eliding the redundant instructions?

Where is that code, already?!

We’re going to take it slow.

📝
If you’d like to run the commands from this section and step through Clang/LLVM’s code along with me, either use a Clang installed on your system or build it from the sources.

A piece of C/C++ code goes through a lot of transformations to eventually end up encoded in assembly3. These transformations are logically grouped up into stages. A single transformation stage is commonly referred to as a pass. Clang reads the C/C++ source, makes sure it doesn’t contain any syntax or semantic errors, and builds an AST out of it along the way. Then the compiler walks the AST and emits an LLVM IR code representing the original C/C++ source4.

To the first approximation, LLVM IR is a machine-independent assembly5. Imagine, someone is writing a compiler and decides to target LLVM IR — i.e., their compiler will emit LLVM IR instead of a machine-specific assembly. (x86 assembly is an example of the latter). Thanks to this decision, the developer can focus on their language’s syntax and semantics and let LLVM do optimizations (as in, making the compiled program faster and/or smaller) and conversion from LLVM IR to hardware-specific code.

Anyway, once generation of LLVM IR is complete, Clang hands off the IR code to LLVM. Which passes it through multiple stages. Majority of these perform optimizations and a couple of final ones produce machine-specific code from the LLVM IR.

There’s a way to see how each pass changes the source. (Thank you, @arnt!) And it seems like a good starting point in our search: let’s figure out which exact pass elided the redundant materialization of the comparison result. Once we know the pass, we can zoom in on it, to see how it’s implemented.

Looking for the pass

To ask Clang/LLVM to show us the result of each pass, we invoke it like this:

clang -mllvm -print-after-all -c setcc-if.cc > setcc-if.ll 2>&1

Where setcc-if.cc contains a single line of C++:

void g(int a, int b) { if (a == b) {} else {} }

In response the compiler writes a ton of useful info into the file setcc-if.ll. I’ll only show excerpts relevant to the question at hand.

Originally, the intermediate representation of the code in setcc-if.cc is:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
define dso_local void @"?g@@YAXHH@Z"(i32 noundef %a, i32 noundef %b) #0 {
entry:
  %b.addr = alloca i32, align 4
  %a.addr = alloca i32, align 4
  store i32 %b, ptr %b.addr, align 4
  store i32 %a, ptr %a.addr, align 4
  %0 = load i32, ptr %a.addr, align 4
  %1 = load i32, ptr %b.addr, align 4
  %cmp = icmp eq i32 %0, %1
  br i1 %cmp, label %if.then, label %if.else

if.then:                                      ; preds = %entry
  br label %if.end
if.else:                                      ; preds = %entry
  br label %if.end
if.end:                                       ; preds = %if.else, %if.then
  ret void
}

It indeed looks a lot like a dialect of assembly language. What does it do? On the line 9 icmp compares a (%0) to b (%1) and assigns the result to %cmp. Next, on the line 10, br inspects %cmp and jumps either to the label if.then or if.else accordingly. As we’ve seen earlier, materialization of this comparison’s result (%cmp) translates into sete al x86 instruction. Thus, we’re looking for the pass that elides %cmp.

The thing to notice here is LLVM IR doesn’t have a concept of the EFLAGS register or anything similar. Which means, for any pass operating on LLVM IR it’s impossible to elide %cmp. Therefore, we should move on to the first one, which doesn’t express its output in LLVM IR.

Here you go.

# *** IR Dump After X86 DAG->DAG Instruction Selection (x86-isel) ***:
# Machine code for function ?g@@YAXHH@Z: IsSSA, TracksLiveness
Frame Objects:
  fi#0: size=4, align=4, at location [SP+8]
  fi#1: size=4, align=4, at location [SP+8]
Function Live Ins: $ecx in %0, $edx in %1

bb.0.entry:
  successors: %bb.2, %bb.1
  liveins: $ecx, $edx
  %1:gr32 = COPY $edx
  %0:gr32 = COPY $ecx
  MOV32mr %stack.0.b.addr, 1, $noreg, 0, $noreg, %1:gr32 :: (store (s32) into %ir.b.addr)
  MOV32mr %stack.1.a.addr, 1, $noreg, 0, $noreg, %0:gr32 :: (store (s32) into %ir.a.addr)
  %4:gr32 = MOV32rm %stack.1.a.addr, 1, $noreg, 0, $noreg :: (load (s32) from %ir.a.addr)
  CMP32rm %4:gr32, %stack.0.b.addr, 1, $noreg, 0, $noreg, implicit-def $eflags :: (load (s32) from %ir.b.addr)
  JCC_1 %bb.2, 5, implicit $eflags

bb.1.if.then:
; predecessors: %bb.0
  successors: %bb.3

  JMP_1 %bb.3

bb.2.if.else:
; predecessors: %bb.0
  successors: %bb.3

  JMP_1 %bb.3

bb.3.if.end:
; predecessors: %bb.2, %bb.1

  RET64

# End machine code for function ?g@@YAXHH@Z.

This is Machine IR (MIR) — “a human readable serialization format that is used to represent LLVM’s machine specific intermediate representation”. It’s much closer to the actual x86 assembly, though, obviously, not quite the same.

Drumroll, please! %cmp is gone, while sete al (or rather %X:gr8 = SETCCr ...) is nowhere to be found! Looks like, it’s the source code of the pass “X86 DAG->DAG Instruction Selection (x86-isel)” we’re going to be delving in.

Examining the code

Clang/LLVM’s code base is not a small one. How are we going to find the source code of the pass? Attempting a couple of text searches in the entire solutions sounds worth a try. Searching for “X86 DAG->DAG Instruction Selection (x86-isel)” returns no results, but “X86 DAG->DAG Instruction Selection” gives two.

Find in files results

Clearly, X86ISelDAGToDAG.cpp is the search result we’re looking for. The file contains the X86DAGToDAGISel class, which implements the “X86 DAG->DAG Instruction Selection” pass. A comment in the code describes its job: “X86-specific code to select X86 machine instructions for SelectionDAG operations”.

Now, how do we find the exact place inside of it where the elision of %cmp happens? One powerful technique of reasoning about code is to randomly browse around and stare at it intensely. Applying it to X86DAGToDAGISel and its parent SelectionDAGISel has proven to be very effective. The method X86DAGToDAGISel::runOnMachineFunction looks like it could be the entry point of the pass — a good place to start following the call chain, that should sooner or later lead us to the point where the next emitted MIR instruction is decided.

X86DAGToDAGISel::runOnMachineFunction eventually calls the parent’s method it overrides — SelectionDAGISel::runOnMachineFunction. This is a relatively big method, so let’s step through it in a debugger. This way we aren’t going to be distracted by the branches of code that don’t actually get executed on the one hand, and won’t miss anything that does get executed on the other. Eventually, the execution reaches the line 482, where a call to SelectionDAGISel::SelectAllBasicBlocks is made. This method “[p]erform[s] instruction selection on all basic blocks in the function”. Looks like, we’re headed in the right direction.

Stepping through SelectAllBasicBlocks, we reach the for loop on the line 1563 of SelectionDAGISel.cpp, which “[d]o[es] FastISel on as many instructions as possible”:

// Do FastISel on as many instructions as possible.
for (; BI != Begin; --BI) {
    // [The loop body's edited out for brevity]
}

Notice, the loop goes backwards: from the last instruction of a basic block to the first one. A little later we arrive at the line 1578

if (FastIS->selectInstruction(Inst)) { /* [ Edited out for brevity ] */ }

It calls into FastISel::selectInstruction, and that one, in turn, invokes X86FastISel::fastSelectInstruction on the line 1559. This is where the execution goes from FastISel containing a generic instruction algorithm to X86FastISel, which is target-specific.

Peeking inside of X86FastISel::fastSelectInstruction, we see something interesting.

bool
X86FastISel::fastSelectInstruction(const Instruction *I)  {
  switch (I->getOpcode()) {
  // [ Edited out for brevity ]
  case Instruction::ICmp:
  case Instruction::FCmp:
    return X86SelectCmp(I);
  // [ Edited out for brevity ]
  case Instruction::Br:
    return X86SelectBranch(I);
    // [ Edited out for brevity ]
  }

  return false;
}

Apparently, each case corresponds to an LLVM IR instruction fastSelectInstruction can deal with. Now, let’s take a step back for a second. We can make an educated guess: the redundant value %cmp is potentially elided either by the icmp or br LLVM IR instruction handler. Getting back to the method, it has branches for both of these. Studying the corresponding methods X86FastISel::X86SelectCmp and X86FastISel::X86SelectBranch is an obvious next step.

X86FastISel::X86SelectCmp doesn’t seem to contain any logic for eliding the redundant sete al/%X:gr8 = SETCCr .... Very much to the contrary, it unconditionally emits X86::SETCCr on the line 1520.

BuildMI(*FuncInfo.MBB, FuncInfo.InsertPt, MIMD, TII.get(X86::SETCCr),
        ResultReg).addImm(CC);

Maybe, we’ll have a better luck with X86FastISel::X86SelectBranch? (Yeah, alright, you known I know we will. And I know you know… And I’m here all week.) Indeed, on the line 1635 we have:

// Fold the common case of a conditional branch with a comparison
// in the same block (values defined on other blocks may not have
// initialized registers).
X86::CondCode CC;
if (const CmpInst *CI = dyn_cast<CmpInst>(BI->getCondition())) {
  if (CI->hasOneUse() && CI->getParent() == I->getParent()) {
    // [ Edited out for brevity ]
  }
  // [ Edited out for brevity ]
}

BI stands for branch instruction and points to the object representing this line of IR: br i1 %cmp, label %if.then, label %if.else. CI stands for conditional instruction and points to this line of IR: %cmp = icmp eq i32 %0, %1. CI->hasOneUse() is true in case %cmp is referenced only once in the IR (aside from assigning it a value, of course). For example, if the IR contained an instruction %frombool = zext i1 %cmp to i8, it would constitute another use, and as a result, CI->hasOneUse() would evaluate to false. CI->getParent() == I->getParent() checks whether the conditional and branching instructions reside in the same basic block — they do in our case (this post isn’t going into more detail about it).

If a branching instruction satisfies the above conditions, LLVM emits the MIR comparison instruction CMP32rm directly instead of relying on X86FastISel::X86SelectCmp, which emits an unnecessary SETCCr unconditionally. Hence, LLVM avoids generating SETCCr when there is nobody “waiting” to use a materialized result of the comparison. Bingo!

But wait, the next iteration of the for loop in SelectionDAGISel::SelectAllBasicBlocks is going to pick up %cmp = icmp eq i32 %0, %1 (remember, the loop is going backwards). It will pass the instruction to X86FastISel::X86SelectCmp and we know that one emits SETCCr. What gives?

To answer this, we need to focus on two more spots in the LLVM sources. The first one is inside of X86FastISel::X86SelectBranch. The execution reaches it when it’s impossible to fold a conditional branch with a comparison.

// Otherwise do a clumsy setcc and re-test it.
// Note that i1 essentially gets ANY_EXTEND'ed to i8 where it isn't used
// in an explicit cast, so make sure to handle that correctly.
Register OpReg = getRegForValue(BI->getCondition());

The method FastISel::getRegForValue — through a chain of calls — puts the comparison instruction BI->getCondition() into a map SelectionDAGISel::FuncInfo::ValueMap.

The second one is a call to isFoldedOrDeadInstruction at the very start of the for loop’s body in SelectionDAGISel::SelectAllBasicBlocks. isFoldedOrDeadInstruction runs a number of checks to determine if the instruction should be skipped. As one of these checks, it looks up the instruction in SelectionDAGISel::FuncInfo::ValueMap. If the instruction is there, the appropriate handler is invoked. If it is not, the loop skips to the next instruction.

// Do FastISel on as many instructions as possible.
for (; BI != Begin; --BI) {
  const Instruction *Inst = &*std::prev(BI);

  // If we no longer require this instruction, skip it.
  if (isFoldedOrDeadInstruction(Inst, *FuncInfo) ||
      ElidedArgCopyInstrs.count(Inst)) {
    --NumFastIselRemaining;
    continue;
  }
  // [ ... ]
}

To reiterate, if X86FastISel::X86SelectBranch can’t fold a comparison into a conditional branching, it places the comparison instruction into the SelectionDAGISel::FuncInfo::ValueMap. So when SelectionDAGISel::SelectAllBasicBlocks picks that instruction for processing, it known the instruction is “alive” and invokes X86FastISel::X86SelectCmp to emit the corresponding MIR instructions as usual. But in case a comparison has actually been folded into a conditional branching, the comparison is not placed into ValueMap, thus SelectionDAGISel::SelectAllBasicBlocks skips it on the next step.

Conclusion

I guess, it’s fair to say, there is no generalized optimization technique in LLVM that gets rid of redundant comparison result materializations. Just “a little careful coding”…

And thank you for reading! (Hey, if you just jumped to the bottom without reading, that’s OK too — seriously, don’t feel bad about it.)


  1. How come, we loaded a into a register, but b we’re referencing simply by its address? Why go the extra mile when we can compare two addresses directly? Well, that’s because we can’t. (Try the veal, by the way). x86 machine code can’t encode an instruction with two explicit memory operands↩︎

  2. Clang, of course, doesn’t directly generate assembly — it produces LLVM IR↩︎

  3. Or rather machine code, but we won’t touch on this topic here. ↩︎

  4. Clang doesn’t actually wait to parse the entire source file to build its IR representation. Instead, it parses one top-level declaration at a time, emits IR instructions for it, and then moves on to the next one. Once the IR module for the source file is complete, it gets handed off to LLVM’s codegen pipeline. If you’d like a deeper dive, a good spot to start from is void clang::ParseAST(Sema &S, bool PrintStats, bool SkipFunctionBodies)↩︎

  5. Well, according to LLVM’s documentation it actually is an assembly language↩︎