Undefined Behavior in C / C++
Systems programming languages such as C grant compiler writers freedom to generate efficient code for a specific instruction set by defining certain language constructs as undefined behavior. Unfortunately, the rules for what counts as undefined behavior are subtle, and programmers make mistakes that sometimes lead to security vulnerabilities. As an example of undefined behavior in the C programming language, consider integer division with zero as the divisor.
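To make the division example concrete, here is a minimal sketch of a guard that refuses to divide when the result would be undefined. The name checked_div is hypothetical and chosen only for illustration. Besides a zero divisor, it also rejects INT_MIN / -1, whose mathematical result does not fit in an int.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: divides a by b only when the result is defined.
 * Returns false and leaves *out untouched in the two undefined cases:
 * a zero divisor, and INT_MIN / -1 (the quotient does not fit in an int). */
bool checked_div(int a, int b, int *out)
{
    if (b == 0)
        return false;
    if (a == INT_MIN && b == -1)
        return false;
    *out = a / b;
    return true;
}
```

A caller would test the return value before using the quotient, so a bad divisor becomes an ordinary error path rather than undefined behavior.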
Some languages such as C/C++ define many constructs as undefined behavior, while other languages, for example Java, have less undefined behavior. But the existence of undefined behavior in higher-level languages such as Java shows this trade-off is not limited to low-level system languages alone.
C compilers trust the programmer not to submit code that has undefined behavior, and they optimize code under that assumption.
The C FAQ defines “undefined behavior” like this:
Anything at all can happen; the Standard imposes no requirements. The program may fail to compile, or it may execute incorrectly (either crashing or silently generating incorrect results), or it may fortuitously do exactly what the programmer intended.
As a quick example let’s take this program:
#include <limits.h>
#include <stdio.h>
int main (void)
{
    printf ("%d\n", (INT_MAX+1) < 0);
    return 0;
}
The program is asking the C implementation to answer a simple question: if we add one to the largest representable integer, is the result negative? This is perfectly legal behavior for a C implementation:
$ cc test.c -o test
$ ./test
1
So is this:
$ cc test.c -o test
$ ./test
0
And this:
$ cc test.c -o test
$ ./test
42
And this:
$ cc test.c -o test
$ ./test
Formatting root partition, chomp chomp
One might say: Some of these compilers are behaving improperly because the C standard says a relational operator must return 0 or 1. But since the program has no meaning at all, the implementation can do whatever it likes. Undefined behavior trumps all other behaviors of the C abstract machine.
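If the real question is "would this addition overflow?", it can be asked without ever performing the overflowing addition. The following is a minimal sketch under that assumption; add_would_overflow is a hypothetical name, and every comparison in it stays inside the range of int.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: reports whether a + b would overflow an int.
 * It only evaluates INT_MAX - b for b > 0 and INT_MIN - b for b <= 0,
 * and neither of those subtractions can overflow. */
bool add_would_overflow(int a, int b)
{
    if (b > 0)
        return a > INT_MAX - b;
    return a < INT_MIN - b;
}
```

With this check, add_would_overflow(INT_MAX, 1) is true on every conforming implementation, which is the well-defined version of the question the test program was trying to ask.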
Why Is Undefined Behavior Good?
The good thing about undefined behavior in C/C++ is that it simplifies the compiler’s job, making it possible to generate very efficient code in certain situations. Usually these situations involve tight loops.
For example, high-performance array code doesn’t need to perform bounds checks, avoiding the need for tricky optimization passes to hoist these checks outside of loops. Similarly, when compiling a loop that increments a signed integer, the C compiler does not need to worry about the case where the variable overflows and becomes negative: this facilitates several loop optimizations.
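As a rough illustration of the kind of loop meant here, consider the sketch below. The function name and its stride parameter are made up for the example. Because signed overflow is undefined, the compiler is allowed to assume that i += stride never wraps around, which lets it reason about the trip count and, on 64-bit targets, keep the index in a wider register without re-checking it on every iteration.

```c
/* Sums every stride-th element of a[0..n-1].
 * Preconditions (assumed, not checked): stride > 0, and n is small enough
 * that i += stride never exceeds INT_MAX. Under those conditions the loop
 * is fully defined; the compiler may rely on the "no overflow" assumption
 * when optimizing it. */
long sum_strided(const int *a, int n, int stride)
{
    long total = 0;
    for (int i = 0; i < n; i += stride)
        total += a[i];
    return total;
}
```

If the index were an unsigned type, wraparound would be defined behavior and the compiler would have to account for it, which can block some of these optimizations.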
Why Is Undefined Behavior Bad?
When programmers cannot be trusted to reliably avoid undefined behavior, we end up with programs that silently misbehave. This has turned out to be a really bad problem for code such as web servers and web browsers that deals with hostile data, because these programs end up being compromised and running code that arrived over the wire.
A less serious problem, more of an annoyance, is where behavior is undefined in cases where all it does is make the compiler writer’s job a bit easier, and no performance is gained. For example, a C implementation has undefined behavior when:
An unmatched ' or " character is encountered on a logical source line during tokenization.
With all due respect to the C standard committee, this is just lazy. Would it really impose an undue burden on C implementers to require that they emit a compile-time error message when quote marks are unmatched? Even a 30-year-old (at the time C99 was standardized) systems programming language can do better than this. One suspects that the C standards body simply got used to throwing behaviors into the “undefined” bucket and got a little carried away. Actually, since the C99 standard lists 191 different kinds of undefined behavior, it’s fair to say they got a lot carried away.
So let’s consider another example, one that appeared on one of my exam papers. :)
#include <stdio.h>

/* f and g each return the address of a local variable: that is the bug. */
int *f(int x) {
    int p;       /* automatic variable, lives only until f returns */
    p = x;
    return &p;   /* dangling pointer as soon as f returns */
}

int *g(int x) {
    int y;       /* automatic variable, lives only until g returns */
    y = x;
    return &y;   /* dangling pointer as soon as g returns */
}

int main(void) {
    int *x, *y;
    x = f(100);    /* x points to storage that no longer exists */
    y = g(2500);   /* g may reuse the stack slot that f used */
    /* Undefined behavior: x is dangling. How does it print 2500? */
    printf("%d \n", *x);
    return 0;
}
The output:
2500
The reason for the weird output is undefined behavior. I’m returning the address of an automatic local variable, which no longer exists once the function returns.
The output can still be explained in terms of the stack frames of the function calls, at least on a typical implementation with optimizations turned off. When f returns, its stack frame is released, but x still holds the address of the slot where p lived. The call to g then builds its frame in the same region of the stack, so its local y lands in that slot and 2500 is written there. After g returns, nothing overwrites the slot before *x is read, so 2500 comes out. None of this is guaranteed: a different compiler or optimization level may print something else or crash.
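For comparison, here are two ways the function could be written without the dangling pointer. This is only a sketch; the names f_value and f_into are hypothetical and not part of the exam question.

```c
/* Option 1: return the value itself, so there is no pointer to dangle. */
int f_value(int x)
{
    return x;
}

/* Option 2: the caller owns the storage. The returned pointer stays valid
 * for as long as the caller's variable does. */
int *f_into(int x, int *out)
{
    *out = x;
    return out;
}
```

With the second version, main would declare two ints of its own, call f_into(100, &a) and f_into(2500, &b), and the first pointer would still refer to 100 after the second call.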
Living with Undefined Behavior
In the long run, unsafe programming languages will not be used by mainstream developers, but rather reserved for situations where high performance and a low resource footprint are critical. In the meantime, dealing with undefined behavior is not totally straightforward and a patchwork approach seems to be best:
- Enable and heed compiler warnings, preferably using multiple compilers
- Use static analyzers (like Clang’s, Coverity, etc.) to get even more warnings
- Use compiler-supported dynamic checks; for example, gcc’s -ftrapv flag generates code to trap signed integer overflows
- Use tools like Valgrind to get additional dynamic checks
- When a function’s behavior is defined for some inputs and undefined for others, document its preconditions and postconditions
- Use assertions to verify that functions’ preconditions or postconditions actually hold
- Particularly in C++, use high-quality data structure libraries
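The points about documenting and asserting preconditions can be sketched as follows. The function element_at is a hypothetical example, not something from a real library. The comment states the contract, and the assertions turn a violated precondition into a reliable abort in debug builds.

```c
#include <assert.h>
#include <stddef.h>

/* Precondition:  a != NULL and i < len.
 * Postcondition: returns a[i].
 * The asserts are active unless NDEBUG is defined at compile time. */
int element_at(const int *a, size_t len, size_t i)
{
    assert(a != NULL);
    assert(i < len);
    return a[i];
}
```

Without the assertions, calling this function with an out-of-range index would read past the array, which is undefined behavior. With them, the mistake is caught at the call that made it, at least in builds where assertions are enabled.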