Secure programming

Today I had an interesting talk with a student. He was interested in the new features of the AMD64 architecture. When I mentioned the non-exec page support, he started to think about writing secure programs. And then he got me thinking... and what follows is the result of this thinking.

First I'd like to clarify the difference between an incorrect and an invalid program. [What I'm using may not be the standard naming, but I was too lazy to look it up on the web.] The following simple code demonstrates both points:

float average(int *a, int n)
{
    int sum = 0, i;
    for(i = 0; i <= n; i++)   /* reads a[n]: one past the end */
        sum += a[i];
    return sum;               /* the sum, not the average */
}

The program is incorrect because it does not return the average at all - it just returns the sum of the elements in the array. What makes this program invalid is that it accesses the array a beyond its supposed limit: the loop condition i <= n also reads a[n], one element past the end. There are also examples of invalid programs that behave correctly in a certain runtime environment, but break in another - e.g. ones that access free()d memory.
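For comparison, here is a minimal sketch of a version that is both valid and correct. The choices of size_t for the length, a wider accumulator, and returning 0 for an empty array are my assumptions for illustration, not something the original code specifies.

```c
#include <stddef.h>

/* Returns the arithmetic mean of the first n elements of a.
   Returning 0.0f for an empty or NULL array is an assumed convention. */
float average(const int *a, size_t n)
{
    long long sum = 0;            /* wider accumulator, less likely to overflow */
    size_t i;

    if (a == NULL || n == 0)
        return 0.0f;
    for (i = 0; i < n; i++)       /* i < n: never touches a[n] */
        sum += a[i];
    return (float)sum / (float)n; /* divide to get the average, not the sum */
}
```

With the array {1, 2, 3, 4} and n = 4 this returns 2.5.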

There is no way that any programming language or runtime environment can identify an incorrect program. Doing so would require both:

  1. An understanding of the programmer's intentions (some kind of AI?), and

  2. Detecting that the code does not do what the programmer intended. This is provably impossible in the general case: deciding any non-trivial property of a program's behavior is as hard as the halting problem (Rice's theorem).

Many of today's exploits are due to invalid programs (e.g. web servers, FTP servers, public forums, etc.), and there exist many mechanisms to defend against them: the processor's hardware protection features, the basic UNIX security model and its extensions like ACLs, and various security patches and frameworks like grsecurity, SELinux (integrated into the Linux 2.6 kernel) or TrustedBSD.

Given proper support by a programming language, invalid programs may become nearly (if not completely) impossible to produce. Examples of such languages are Python, Java, Haskell, OCaml, Lisp, the languages targeting the .NET runtime, etc. All of them are slower than C in raw performance, to varying degrees (check out this page for exact numbers). Some of them come very close to C. But in many real-world scenarios (moderately loaded web and FTP servers) raw performance is of little importance, while security and data integrity are of paramount importance.
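To make "proper support" concrete: runtimes such as Java's or Python's keep the length together with the array and check every index, so an out-of-range access raises an error instead of touching unrelated memory. The following is only a rough sketch of that idea in C; the names int_array and int_array_get are hypothetical and not taken from any real library.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical "fat" array: the length travels with the data. */
struct int_array {
    const int *data;
    size_t len;
};

/* Checked read: stores a.data[i] in *out and returns true when i is in
   range; returns false and leaves *out untouched otherwise. */
bool int_array_get(struct int_array a, size_t i, int *out)
{
    if (a.data == NULL || out == NULL || i >= a.len)
        return false;
    *out = a.data[i];
    return true;
}
```

The price is one comparison per access, which is part of the performance gap mentioned above. In C nothing forces the programmer to go through such a function, whereas a safe language leaves no other way to reach the array.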

Now, I'm not saying that grsecurity or SELinux are useless and that they can be replaced altogether by choosing a better programming language. I think that they should be the second line of defense against security breaches, not the first as it is today. Reasons why the second line of defense is needed at all:

  1. The compiler and the runtime system can be buggy themselves, and

  2. There are always authors of malicious programs that do their deeds mostly in unsafe languages and assembly.

What follows are, in my view, some of the reasons for still using an unsafe language today:

  • Portability. I claim that C is the most portable programming language in existence today. (If someone gives a "counter-example" along the lines of compiling Win32 source on a Linux system, they fail to see the difference between an API and a programming language). Because of that,

  • C is the main language for interfacing to foreign libraries and the operating system itself. On every platform it has a well-defined ABI, something that all other languages either lack or express through C interfaces.

  • Sometimes it is easier to code the whole program in C than to make bindings for your particular language of choice. (Although tools like SWIG help considerably with the task.)

  • Raw performance.

  • Not wanting to use Java, while managers consider other languages too exotic.

C is the lingua franca of the computing world. IMHO, it is here to stay until a feasible alternative appears. Personally, I don't see it on the horizon yet.


Alan said...

Interesting ideas, but I wonder why you think it is so difficult to discover a programmer's intentions - surely these can be automatically discovered from a test suite ?

Granted, there may be no test suite, and then we are back to AI, but that is just a reason to insist on unit testing.

OTOH - if there is no test suite then I would despair of any AI being able to discover intentions. I've been a developer and a teacher for decades, and I would find it hard to determine intentions from pure code - I don't think any AI is going to be doing it soon ;-)

zvrba said...

So, how do you think a computer could identify a certain piece of code as e.g. a sorting routine?

As for unit testing: while it is certainly useful, it is not a way of proving that something is correct. The fact that a piece of software works for N inputs, doesn't guarantee that it will work for all possible inputs. Unless your N test cases really do exhaust the domain of your problem.
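A small illustration of this point, using a hypothetical leap-year check (the function names are made up for the example). The naive version passes a plausible-looking test set of 2000, 2001 and 2004, yet it is still incorrect:

```c
#include <stdbool.h>

/* Deliberately incomplete: ignores the century rules of the
   Gregorian calendar. */
bool is_leap_naive(int year)
{
    return year % 4 == 0;
}

/* Full Gregorian rule, for comparison. */
bool is_leap(int year)
{
    return (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
}
```

Both functions agree on 2000, 2001 and 2004, so a test suite containing only those years cannot tell them apart. They disagree on 1900, which the naive version wrongly reports as a leap year.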

Even if the domain of your problem is sufficiently small that it can be exhaustively tested, you are still left with an (in the general case, infinite) number of invalid inputs that should also be properly handled.

In the end, who is supposed to debug the unit tests? :)

Anonymous said...

One thing you're overlooking is that it's wrong to contrast C (and C++) as "unsafe" languages to Python, Perl, Ruby, Lisp, and such as "safe" ones. While these other languages are more impervious to typical programmer fuckups, such as writing past the end of a simple array, they too make it possible to write invalid (as well as incorrect) code. Of all the mentioned alternatives to C, I believe only Java actually aims to make it impossible to write invalid code, at least as long as you stick to 100% Java.

For example, in Lisp, it is undefined behavior to modify a constant list, such as one created with (let ((foo '(a b c))) ...). In both Python and Perl you can very easily end up with invalid code, typically by incorrectly using a "thin" interface to a native facility. Another way is by pushing the interpreters to their limit, for example by abusing infinite recursion. Or by encountering implementation deficiencies, such as regexp engine coredumps with large regexps. (The "toy" languages such as Perl and Python typically have most of their low-level functionality implemented in C to make up for their atrocious execution; this makes their internals susceptible to exactly the kind of bugs any other large C program faces. Lisp and Java at least have real native code compilers which remove the necessity for their innards containing large chunks of unsafe C code.)

Invalid code is a possibility in all languages (except in 100% Java, barring JVM bugs), it's just that C and C++ make it easy to screw up while doing things that should really be easy -- like accessing arrays.