2005-07-12

PostgreSQL experience and how it made me use BerkeleyDB

I've tried using PostgreSQL 8 to process a large amount of web server access logs. The processing keeps a large internal state that has to be carried over between successive accesses. I have to say I was disappointed at how unsuitable PostgreSQL is for the task. Namely:
  • The PL/pgSQL procedural language doesn't have static variables. This means that I either have to reinitialize some variables upon each entry into a function, or pass a bunch of values to helper functions.
  • The Python and Perl procedural languages don't expose cursors: they first store the whole query result in the server's memory and only then operate on it. Should I mention that I ran out of memory on the server?
  • PostgreSQL doesn't provide cursors for update, so I had to execute a separate UPDATE for each returned row after calculating some statistics. This SELECT/calculate/UPDATE cycle (sketched below) really kills the database, even with all the indexes in place. Also, if it had cursors for update, I could keep state in temporary tables and pass cursors to helper functions instead of large parameter lists.
All of these points make it extremely hard to write maintainable server-side data-processing functions. After having to debug a ~250-line PL/pgSQL function (and I had to use PL/pgSQL because it is the only procedural language that supports cursors) with a bunch of RAISE NOTICE statements, I got fed up. Not to mention it was unbearably slow. This experience made me realize why companies still buy commercial databases like Oracle.
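
To illustrate the per-row round trip, here is a minimal sketch of the SELECT/calculate/UPDATE cycle, written against libpq from a C++ client rather than inside the PL/pgSQL function; the table and column names (hits, url, count) are invented and the "calculation" is a trivial stand-in:

    // Illustrative only: the table and column names are invented, and the
    // per-row "calculation" is a trivial stand-in for the real statistics.
    #include <libpq-fe.h>
    #include <cstdio>
    #include <cstdlib>
    #include <string>

    int main()
    {
        PGconn *conn = PQconnectdb("dbname=weblogs");
        if (PQstatus(conn) != CONNECTION_OK) {
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
            return 1;
        }

        // Step 1: SELECT everything that the statistics need.
        PGresult *res = PQexec(conn, "SELECT url, count FROM hits");

        for (int i = 0; i < PQntuples(res); ++i) {
            std::string url = PQgetvalue(res, i, 0);
            long count = atol(PQgetvalue(res, i, 1));

            // Step 2: calculate the new statistic.
            long new_count = count + 1;

            // Step 3: a separate UPDATE per row, because the result set that
            // produced the row cannot itself be updated in place.
            char buf[32];
            snprintf(buf, sizeof(buf), "%ld", new_count);
            const char *params[2] = { buf, url.c_str() };
            PGresult *upd = PQexecParams(conn,
                "UPDATE hits SET count = $1 WHERE url = $2",
                2, NULL, params, NULL, NULL, 0);
            PQclear(upd);
        }

        PQclear(res);
        PQfinish(conn);
        return 0;
    }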

I have also experimented with GigaBASE, but it has problems of its own. You have a choice of either:
  • Index access that gets rapidly slower with each insertion, or
  • Giving up on insert/select programming: if you want fast index insertion, insertions are not visible to SELECTs until you commit the transaction, and committing often, in turn, becomes rapidly slow.
Finally, I decided on a proven solution: BerkeleyDB. It offers a hash access method (the official PostgreSQL docs discourage using hash indexes) and it is extremely fast (I've managed to do ~90k search/insert pairs per second). The drawback is not having a convenient query language like SQL, but I can live with that: the queries I have to perform are not too complex.
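
Here is a minimal sketch of what the BerkeleyDB C++ API looks like for exactly that kind of search/insert pair (the file name and record layout are invented for the example):

    // Bare-bones sketch: open a DB_HASH database and do one search/insert pair.
    #include <db_cxx.h>
    #include <cstring>
    #include <iostream>

    int main()
    {
        Db db(NULL, 0);                       // stand-alone database, no environment
        db.open(NULL, "hits.db", NULL, DB_HASH, DB_CREATE, 0664);

        const char *url = "/index.html";
        Dbt key((void *)url, strlen(url) + 1);

        // Search: fetch the current counter for this key, if it exists.
        long count = 0;
        Dbt found;
        if (db.get(NULL, &key, &found, 0) == 0)
            count = *(long *)found.get_data();

        // Insert: write the updated counter back under the same key.
        ++count;
        Dbt data(&count, sizeof(count));
        db.put(NULL, &key, &data, 0);

        std::cout << url << " -> " << count << std::endl;
        db.close(0);
        return 0;
    }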

Combining C++ STL containers with a few wrapper classes for convenient access to BerkeleyDB lets me write custom query programs pretty quickly. Not to mention the processing speed: orders of magnitude faster than PostgreSQL.
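
As a rough sketch of such a wrapper (the class and method names here are invented for the illustration, and error handling is omitted): counts pile up in an ordinary std::map and get flushed into the hash database in one pass.

    // Sketch of a wrapper: aggregate counts in std::map, flush to DB_HASH.
    #include <db_cxx.h>
    #include <map>
    #include <string>

    class CounterStore {
    public:
        explicit CounterStore(const char *file) : db_(NULL, 0) {
            db_.open(NULL, file, NULL, DB_HASH, DB_CREATE, 0664);
        }
        ~CounterStore() { db_.close(0); }

        // Pure STL work: no database round trip per log line.
        void add(const std::string &key, long n) { counts_[key] += n; }

        // One put per distinct key once a batch of log lines is done.
        // (Merging with totals already stored in the database would add a
        // get() before each put(), as in the sketch above.)
        void flush() {
            for (std::map<std::string, long>::const_iterator it = counts_.begin();
                 it != counts_.end(); ++it) {
                Dbt k((void *)it->first.data(), it->first.size());
                long value = it->second;
                Dbt v(&value, sizeof(value));
                db_.put(NULL, &k, &v, 0);
            }
            counts_.clear();
        }

    private:
        Db db_;
        std::map<std::string, long> counts_;
    };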
