It is no secret that C and C++ are chock-full of pitfalls of undefined behavior. The C++ standard describes it as:

behavior for which this International Standard imposes no requirements

Modern compilers, with optimizations on, often assume that undefined behavior is never invoked in the program and try to make sense of the code under that assumption. When a program actually does invoke undefined behavior, it violates the assumption under which the compiler generated code, often with strange or even outright paradoxical results. (I like to call those cases “paranormal behavior”, but it still hasn’t caught on.)
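To make that assumption concrete before the main example, here is a minimal sketch using a different and very common source of undefined behavior: signed integer overflow. The function names are hypothetical and only serve the illustration.

```cpp
#include <climits>

// Signed overflow is undefined behavior, so an optimizer may assume that
// x + 1 never wraps around and fold this comparison to `true` for every x.
// Calling it with INT_MAX is exactly the case the compiler assumes away.
bool IncrementIsGreater(int x) {
    return x + 1 > x;
}

// A well-defined alternative: ask whether x can be incremented at all,
// without ever performing the overflowing addition.
bool CanIncrement(int x) {
    return x < INT_MAX;
}
```

For any x below INT_MAX both functions are well defined and return true. Only IncrementIsGreater(INT_MAX) steps into undefined territory, and there an optimized build may well answer true even though the wrapped value would compare smaller.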

An example

When warning newbies against code that invokes undefined behavior, some people like to say that the compiler is free to do anything, including “making demons come out of your nose” or “emitting code that erases your hard disk”.

Now, while the summoning of demons is a non-trivial I/O operation that would rarely be accidentally emitted by production compilers, Twitter user @andreasdotorg shared a concise example where undefined behavior literally erases your hard disk (if it’s run under a GNU/Linux OS and with appropriate privileges):

I found it interesting, but I sent it to a few people and they didn’t quite get it — so I’ll go through it step-by-step here.

The code

The source code is as follows:

#include <cstdlib>

typedef int (*Function)();

static Function Do;

static int EraseAll() {
    return system("rm -rf /");
}

void NeverCalled() {
    Do = EraseAll;  
}

int main() {
    return Do();
}

A quick explanation of what we’re looking at:

  • We have a function EraseAll() that will erase your hard disk when it gets called.
  • We have a function pointer Do that is not explicitly initialized at the beginning of the program.
  • We have a function NeverCalled() that would have initialized Do to point to EraseAll().

By examining main(), we can see that nothing in this program calls NeverCalled(), so Do is never assigned and keeps its initial value throughout the entire execution of the program. With that in mind, it would be perfectly legal for the compiler to throw away the functions NeverCalled() and EraseAll() from the resulting binary, because neither is reachable by valid code.

gcc

The GNU Compiler Collection frontend gcc (let’s say 5.1) with -Os -std=c++11 -Wall emits relatively straightforward code:

.LC0:
        .string "rm -rf /"
EraseAll():
        movl    $.LC0, %edi
        jmp     system
NeverCalled():
        movq    $EraseAll(), Do(%rip)
        ret
main:
        jmp     *Do(%rip)

The unreachable functions EraseAll() and NeverCalled() did not actually get removed, but otherwise the entire listing seems very predictable.

clang

Now, if we compile this using LLVM’s C++ frontend, clang (3.4.1 and later), with -Os -std=c++11 -Wall, we get this assembly listing instead:

NeverCalled():                          # @NeverCalled()
        ret

main:                                   # @main
        movl    $.L.str, %edi
        jmp     system                  # TAILCALL

.L.str:
        .asciz  "rm -rf /"

In C++, this would be:

#include <cstdlib>

void NeverCalled() {
}

int main() {
    return system("rm -rf /");
}

Strangely, not only did EraseAll() not get thrown away, it actually got inlined into main(). In other words, our entire program got replaced by a function that should, by all rights, have been unreachable.

According to the C++ standard, the function pointer’s value is a null pointer value initially:

If constant initialization is not performed, a variable with static storage duration or thread storage duration is zero-initialized.
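As a small illustration of that rule, here is a sketch with a harmless stand-in for EraseAll(). The names Handler, Harmless, Install and CallIfInstalled, as well as the -1 sentinel, are hypothetical and not part of the original example.

```cpp
typedef int (*Function)();

// No initializer: because Handler has static storage duration it is
// zero-initialized, so it starts out as a null function pointer.
static Function Handler;

// Hypothetical harmless stand-in for EraseAll().
static int Harmless() {
    return 42;
}

// Plays the role of NeverCalled(): the only place that assigns Handler.
void Install() {
    Handler = Harmless;
}

// Checks the pointer before calling through it, so the null case is
// handled explicitly instead of being left undefined.
int CallIfInstalled() {
    if (Handler == nullptr) {
        return -1;  // hypothetical "nothing installed" sentinel
    }
    return Handler();
}
```

Before Install() runs, CallIfInstalled() returns -1; afterwards it returns 42. With the explicit check in place, both paths are well defined, so the compiler has no license to pretend that the null case cannot happen.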

We know that calling a function through a null pointer invokes undefined behavior. As we said in the beginning, an optimizing compiler may try to work under the assumption that there are no constructs that invoke undefined behavior in the program. Let’s follow that logic and see where it gets us.

  • If there is no undefined behavior, then either Do is not used in this program, or Do is not null at the moment when it’s used. Most compilers would pick the second assumption.
  • In order for Do to not be null, it must have been initialized with some other value.
  • The only thing that initializes Do is the function NeverCalled().
  • In a program free of undefined behavior, whenever and wherever Do() is called for the first time, NeverCalled() must have been called beforehand.
  • Do() is called at the beginning of the program.
  • Thus, the effects of NeverCalled() must be in place before Do(), therefore before the beginning of the program.
  • There is no other value ever assigned to Do, therefore NeverCalled() is idempotent in this program.
  • NeverCalled() must be implicitly “called” before the beginning of the program, and any other call to it can be a no-op.
  • EraseAll() can be inlined at all places where Do() is called. This means that we can conveniently have our hard disk erased without even paying the performance penalty of an extra function call!
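The idea of NeverCalled() running “before the beginning of the program” is less exotic than it sounds. The following sketch shows one way it can legitimately happen: a global object whose constructor calls NeverCalled() during dynamic initialization. EarlyCaller, early_caller, Harmless and CallDo are hypothetical names, Harmless stands in for EraseAll(), and the global sits in the same file only to keep the sketch self-contained. Since NeverCalled() has external linkage, such an object could just as well live in another translation unit that the compiler never sees while compiling main().

```cpp
typedef int (*Function)();

static Function Do;

// Hypothetical harmless stand-in for EraseAll().
static int Harmless() {
    return 7;
}

// External linkage: code outside this file is allowed to call it.
void NeverCalled() {
    Do = Harmless;
}

// The constructor of an object with static storage duration runs during
// dynamic initialization, which mainstream implementations complete
// before the body of main() starts.
struct EarlyCaller {
    EarlyCaller() { NeverCalled(); }
};
static EarlyCaller early_caller;

// Well defined here: by the time any caller reaches this function from
// main(), Do already points at Harmless.
int CallDo() {
    return Do();
}
```

In a program like this one, calling Do() first thing in main() is entirely valid, and it is the only kind of program in which the original main() has defined behavior. Assuming that Do already points at its single possible target is therefore consistent with every valid execution.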

And that’s the line of reasoning that could lead a very intelligent and perfectly standards-compliant compiler to produce this hazardous binary.

Epilogue

This was a nice little demonstration of how undefined behavior can bite you under the right circumstances: it can actually replace your program with code that should have been unreachable!

I also find it interesting that clang performs this dramatic transformation, while gcc does not. Hopefully, in a future post we will take a closer look at the model of undefined behavior employed throughout LLVM.