2005-08-05

The case for binary I/O

In C- and C++-related Usenet groups I have lately seen much advice from programmers to other programmers to avoid binary I/O and use text I/O, with arguments such as:
  • it is portable,
  • can be easily viewed and edited with an ordinary text editor,
  • is easier to debug.
But text I/O involves a lot of unnecessary formatting and parsing compared to binary I/O. I recently chose C++ for a programming project, although it is inferior to LISP for the task at hand, just because I couldn't find any open-source LISP implementation that can easily deal with C binary data types (or other 'raw' machine data types, for that matter).

Today I said to myself: maybe you are wrong. Maybe you are again wrongly assuming that it would be slow. Maybe you could have used text I/O and LISP. So I decided to set up an experiment, which showed that I hadn't assumed wrong this time. Read on.

I have written four trivial C programs (two for text and two for binary I/O; sketched below) that
  1. Write a file of 25 million single-precision (4-byte IEEE) floating-point random numbers between 0 and 1 (both writers were started with the default random seed, so they generate identical sequences). In the text file, the delimiter was a single newline character.
  2. Read those numbers back, calculating an average along the way.
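Here is a minimal sketch of what the two pairs boil down to (a reconstruction, not the exact programs; error handling is omitted and the file names are incidental):

    #include <stdio.h>
    #include <stdlib.h>

    #define N 25000000L  /* 25 million floats */

    /* Text pair: one number per line, formatted and parsed by stdio. */
    static void write_text(const char *path) {
        FILE *f = fopen(path, "w");
        long i;
        for (i = 0; i < N; i++)
            fprintf(f, "%f\n", (float)rand() / RAND_MAX);
        fclose(f);
    }

    static double read_text(const char *path) {
        FILE *f = fopen(path, "r");
        double sum = 0;
        float x;
        while (fscanf(f, "%f", &x) == 1)
            sum += x;
        fclose(f);
        return sum / N;
    }

    /* Binary pair: raw 4-byte IEEE floats, no formatting or parsing. */
    static void write_binary(const char *path) {
        FILE *f = fopen(path, "wb");
        long i;
        float x;
        for (i = 0; i < N; i++) {
            x = (float)rand() / RAND_MAX;
            fwrite(&x, sizeof x, 1, f);
        }
        fclose(f);
    }

    static double read_binary(const char *path) {
        FILE *f = fopen(path, "rb");
        double sum = 0;
        float x;
        while (fread(&x, sizeof x, 1, f) == 1)
            sum += x;
        fclose(f);
        return sum / N;
    }

    int main(void) {
        srand(1);  /* default seed */
        write_text("floats.txt");
        srand(1);  /* reset, so both writers emit identical sequences */
        write_binary("floats.bin");
        printf("text avg:   %f\n", read_text("floats.txt"));
        printf("binary avg: %f\n", read_binary("floats.bin"));
        return 0;
    }

A "%f"-style format also explains the file sizes below: values in [0, 1) print as nine bytes per line, giving 225 MB of text versus 4 x 25 M = 100 MB of raw floats.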
The platform on which the experiment was conducted is a dual 2 GHz Xeon with 3 GB of RAM, running NetBSD 2.0.2. Results may vary with other stdio implementations, but I think they give a good overall picture.

The results are most interesting; all timings are averages of three measurements, and the deviation was noticeable only in the 2nd decimal place:
  1. Writing: text file = 43.7 seconds, binary file = 5.2 seconds; binary I/O comes out about 8.4 times faster.
  2. Reading: text file = 13.4 seconds, binary file = 3.0 seconds; binary I/O comes out about 4.4 times faster.
  3. File size: the binary file comes out 2.25 times smaller (100 MB vs. 225 MB).
I was surprised by how large the writing ratio came out, so I wrote a third pair of programs that just write constants to the output file, so that no time is spent in the rand function. This time binary I/O came out 5.6 times faster. This is a much fairer comparison, because it doesn't include the rand overhead.
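The change is trivial; the text variant of the constant writer is essentially the following (the binary one analogously replaces fprintf with fwrite; the constant itself is of course arbitrary):

    #include <stdio.h>

    #define N 25000000L

    /* Constant writer, text variant: the same I/O path as before,
     * but with no rand() call inside the loop. */
    int main(void) {
        FILE *f = fopen("const.txt", "w");
        long i;
        if (f == NULL)
            return 1;
        for (i = 0; i < N; i++)
            fprintf(f, "%f\n", 0.5);
        fclose(f);
        return 0;
    }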

Overall, I think that "use text I/O" cannot be given as a general recommendation. I certainly don't want to use it, because I have files with gigabytes of binary data. In that case text I/O would be even slower, due to more complicated (more than one field) input-record parsing.

As I see it, the only thing in favor of text I/O is its portability. The other two arguments at the beginning of this text are just a consequence of the lack of adequate tools for binary file viewing/editing. There is a bunch of hex editors, but none of them will interpret the data for you (i.e., you can't tell one "display these 4 bytes as an IEEE floating-point number") or let you define your own structures composed of different data types.
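That missing feature isn't even hard to write for a single fixed type. Here is an improvised stand-alone sketch (the name, interface and output format are made up) that prints the 4 bytes at a given offset as a native float:

    #include <stdio.h>
    #include <stdlib.h>

    /* Usage: fpeek <file> <byte-offset>
     * Interprets the 4 bytes at the given offset as a single-precision
     * float. Assumes the file was written on a machine with the same
     * endianness and float format -- the usual binary-portability caveat. */
    int main(int argc, char **argv) {
        FILE *f;
        float x;
        long off;
        if (argc != 3)
            return 1;
        f = fopen(argv[1], "rb");
        if (f == NULL)
            return 1;
        off = atol(argv[2]);
        if (fseek(f, off, SEEK_SET) != 0 || fread(&x, sizeof x, 1, f) != 1) {
            fclose(f);
            return 1;
        }
        printf("%ld: %g\n", off, x);
        fclose(f);
        return 0;
    }

A real tool would of course loop over a region and handle user-defined record layouts, but even this much is more than the hex editors I found will do.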

There is GNU Data Workshop. I can't comment on it: it is written in Java, so I do not want to (and, on NetBSD, can't) use it. Not to mention that it is a GUI application and I do all my programming on the server via ssh.

After some searching, I came across a console-based hex editor that offers some of the capabilities I need: bed. However, although it is menu-driven, its user interface is unintuitive.

If you know of a good, non-Java, console-based, flexible binary editor, please drop me a note. Thanks :)
