Proven again

Once again I've proven that programmers really are notoriously bad at guessing where their programs are slow. I wrote a C program to sequentially read records from BerkeleyDB, and it was as slow as Python, meaning BerkeleyDB itself is the real bottleneck. I had genuinely believed that sequential reading from a recno database should be almost as fast as reading a plain sequential file. But it turns out it can retrieve only about 1M records per 10 seconds, i.e. roughly 100k records/sec.

Anyway, that was just an experiment. I'll be using flat files for mass data analysis.

So why would one use BerkeleyDB recno databases instead of flat files? I can think of several reasons:
  • automatic management of concurrent access,
  • logical record numbering and quick access to record by number,
  • support for variable-length records (the previous point is hard to achieve with flat files once records vary in length).
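To illustrate the second point: with fixed-length records, a flat file gives you quick access by record number for free, since record n simply lives at byte offset n * record_size. This is a minimal sketch; the record layout (an int32 plus an 8-byte string) and file name are made up for the example:

```python
import os
import struct
import tempfile

# Hypothetical fixed-length record: int32 + 8-byte name, little-endian.
REC = struct.Struct("<i8s")

def write_records(path, records):
    # Write fixed-length records sequentially.
    with open(path, "wb") as f:
        for num, name in records:
            f.write(REC.pack(num, name))

def read_record(path, n):
    # Random access by record number: seek to n * record_size.
    with open(path, "rb") as f:
        f.seek(n * REC.size)
        return REC.unpack(f.read(REC.size))

path = os.path.join(tempfile.mkdtemp(), "records.bin")
write_records(path, [(i, b"rec%05d" % i) for i in range(1000)])
print(read_record(path, 42))  # → (42, b'rec00042')
```

Once records can vary in length, this offset arithmetic breaks down, and you need an index mapping record numbers to offsets, which is exactly what recno provides.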
A digression: based on these measurements, I wonder how recno is implemented internally. I guess it is some kind of B-tree mapping record numbers to file offsets. I'm too lazy to look into the source.

Another note: Knuth, in one of his volumes (I don't remember which), mentions that balanced AVL trees can be used to simulate linear lists, with much better time complexity for random access to an item by position.
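The trick, as I understand it (this sketch is my own reconstruction, not Knuth's code), is to store in each node the size of its subtree; the element at position i can then be found in time proportional to the tree height. The sketch below shows only the rank machinery and does no rebalancing, so it can degenerate to O(n); a real implementation would keep the tree balanced with AVL rotations to guarantee O(log n):

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None
        self.size = 1  # number of nodes in this subtree

def size(node):
    return node.size if node else 0

def insert_at(node, index, value):
    # Insert value so that it becomes element `index` of the list.
    # NOTE: no rebalancing here -- a real AVL version would rotate.
    if node is None:
        return Node(value)
    if index <= size(node.left):
        node.left = insert_at(node.left, index, value)
    else:
        node.right = insert_at(node.right, index - size(node.left) - 1, value)
    node.size = 1 + size(node.left) + size(node.right)
    return node

def get_at(node, index):
    # Walk down using subtree sizes: O(height) per lookup.
    left = size(node.left)
    if index < left:
        return get_at(node.left, index)
    if index == left:
        return node.value
    return get_at(node.right, index - left - 1)

root = None
for i, ch in enumerate("abcdef"):
    root = insert_at(root, i, ch)
print(get_at(root, 3))  # → d
```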

Update: I'm testing Python's performance at reading fixed-length records from a raw binary data file. It manages to read the raw data at ~14 MB/s. With unpacking the data into an internal representation using the struct module, the speed drops to ~3 MB/s. Using psyco doesn't help either.
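For reference, this is roughly the kind of unpacking loop I was timing; the field layout here is invented (the real record format differs), and this is a modern sketch using a precompiled struct.Struct, which at least avoids re-parsing the format string on every record:

```python
import struct

# Hypothetical fixed-length record: int32 + two doubles, little-endian.
REC = struct.Struct("<idd")

def unpack_all(data):
    # Decode every fixed-length record from a bytes buffer into tuples.
    # The unpacking itself, not the file I/O, dominates the cost.
    return list(REC.iter_unpack(data))

# Build a small in-memory "file" of three records.
buf = b"".join(REC.pack(i, i * 0.5, i * 2.0) for i in range(3))
records = unpack_all(buf)
print(records[2])  # → (2, 1.0, 4.0)
```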


Anonymous said...

I said, years ago, that flatfile is a LAW :) Nobody believed me :-)