2005-07-31

A GSM routing experiment

Once, a few years ago, two of my colleagues and I (I'll call us A, B, C) deliberately set up a circular call diversion on our GSM phones: A -> B -> C -> A. We were curious what would happen if any of us got called. All of us could still call anybody without a problem, so we asked a fourth colleague to call one of us from a fixed line. The result: an automated message saying that the person is temporarily unavailable.

Since we were allowed to set up circular diversions in the first place, I suspect there is no central database of all diversions; they are resolved only when needed. So, how smart are these GSM switches at loop detection, and what are their limits?

Consider the same experiment, extended in several ways:
  1. There also exists an 'unavailable' diversion which takes effect only if the person is, well, not available. This makes it possible to create more complicated graph topologies than a simple closed ring. The more complicated the topology, the longer it takes a DFS to find a 'leaf' node (one not involved in any cycle).
  2. Invent a suitably complicated topology.
  3. Involve more, many more, people - on the order of a few hundred to a few thousand. Instruct them how to arrange their call diversions. Explain to them that the experiment will cost them nothing.
  4. Get those people spatially close, in groups of about 40. I think that a single base station cannot handle more than about 30 users, so all of them will (maybe) overload a single base station, depending on the software.
  5. Maybe involve people from different countries and/or different providers in the same country. The providers can't share data about the (un)availability of a user.
  6. Perform a call from the fixed line to a well-chosen user in the arranged 'diversion graph' and observe what happens.
I can envision several outcomes:
  • The software simply does not respect the 'not available' diversion if it detects a loop. This leaves us with a simple ring topology and nothing interesting will happen. The caller will simply get the 'user not available' message.
  • The software has a maximum limit on the depth of the DFS search for the final divert destination. So at some point the cycle is artificially broken by the built-in limit and the phone of some user in the cycle rings (see the toy sketch after this list).
  • The software is really smart and traverses the whole graph before concluding that it is cyclic and replying with 'user not available'. A lot of the operator's bandwidth could be consumed in the process, maybe temporarily disrupting the service.
  • The operator service totally breaks down, either due to software failure or due to the overloading of base stations (too busy communicating trying to find the end user and not servicing 'real' calls).
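This is pure speculation, but here is a toy C++ sketch of how a switch might resolve diversions hop by hop, with both a depth limit (the second outcome) and a visited set for loop detection. All names, numbers and the limit are made up for illustration.

#include <iostream>
#include <map>
#include <set>
#include <string>

// Toy model of diversion resolution: follow 'diverted-to' links until we hit
// a subscriber with no diversion (ring them), detect a loop (give up), or
// exhaust the depth budget (ring whoever we ended up at).
typedef std::map<std::string, std::string> DiversionTable;

std::string resolve(const DiversionTable &divert, std::string callee, int max_depth)
{
    std::set<std::string> visited;
    for (int depth = 0; depth < max_depth; ++depth) {
        DiversionTable::const_iterator it = divert.find(callee);
        if (it == divert.end())
            return callee;                     // a 'leaf': this phone rings
        if (!visited.insert(callee).second)
            return "<user not available>";     // loop detected
        callee = it->second;                   // follow the diversion
    }
    return callee;                             // depth limit reached
}

int main()
{
    DiversionTable d;
    d["A"] = "B"; d["B"] = "C"; d["C"] = "A";  // our circular experiment
    std::cout << resolve(d, "A", 30) << "\n";  // prints "<user not available>"
}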
If this is read by some engineer who has designed a GSM switch and/or its software... please do not hesitate to leave a comment about its inner workings!

On the other hand, if someone seriously decides to carry out the experiment I have just described.. I volunteer to invent an 'interesting' topology :)

2005-07-26

Strange NetBSD behaviour, C++ allocators and containers

I guess I'm one of the few people who have stumbled upon a very unpleasant implementation detail (some even call it a bug) of the NetBSD (2.0.2) buffer cache. The symptom is that the ioflush daemon (actually a kernel thread) responsible for flushing dirty pages back to disk sometimes goes crazy.

I did some internet searches before posting to the tech-kern NetBSD mailing list, but didn't come up with anything useful. You can read my complaint and the discussion it generated here.

The scenario in short: I have a data-processing application that uses BerkeleyDB. I set up the BDB environment so that it uses an mmap-allocated cache (512MB) backed by a real file on the file system. Pages belonging to this file get dirty quite quickly and ioflush becomes active relatively often. When that happens, my program stops (in the 'vnlock' wait channel) and does no useful work. I can't even access the file system where these files reside because of some locking issue (also mentioned in the thread above).

I recognized that the problem might be in locking, and at first I blamed the in-kernel NFS server. So I recompiled the kernel without it, but the problem persisted.

As you can read in the referenced thread, I was so desperate about the problem that I was even considering installing another OS. Fortunately, in a moment of inspiration (after a good night's sleep) I remembered seeing SYSV SHM mentioned in the BDB documentation I had browsed some time ago. Recompiling the kernel to allow 1GB of SYSV shared memory and reconfiguring the BDB 'environment' to use SYSV SHM instead of mmap()ed file-backed storage worked around the kernel problem.

You might ask why this didn't happen before - because I hadn't been using the BDB 'environment' until a few days ago. Without an environment, BDB allocates its cache memory with malloc(), which eventually uses sbrk(). Unfortunately, in NetBSD you can't have more than 1GB of sbrk() heap (though you can mmap() almost all of the 3GB available to user applications), and BDB competed for the heap memory with my memory-hungry application.

One day maybe I'll implement a C++ allocator on top of AT&T's vmalloc, with a corresponding discipline that allocates memory by anonymous mmap() (a rough mmap-based sketch follows the list below). In C++, every allocator is bound to an object type of fixed size, which is an ideal fit for vmalloc's vmpool method for allocating objects of the same size within a memory region. This has several advantages:
  • no need to maintain free block lists
  • it does not produce internal fragmentation (so it wastes less memory than an 'ordinary' malloc implementation)
  • it is faster
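I don't have vmalloc at hand to sketch the vmpool discipline, but the other half of the idea - getting the memory from anonymous mmap() instead of the sbrk()-limited heap - can be illustrated with a minimal allocator sketch. This is bare-bones: a production allocator needs the full set of typedefs (pointer, rebind, construct/destroy, etc.) to satisfy every STL implementation, and MAP_ANON is the BSD spelling of the flag.

#include <sys/mman.h>
#include <cstddef>
#include <new>
#include <vector>

// Allocator whose blocks come from anonymous mmap() instead of the heap.
template <class T>
class mmap_allocator {
public:
    typedef T value_type;

    mmap_allocator() {}
    template <class U> mmap_allocator(const mmap_allocator<U> &) {}

    T *allocate(std::size_t n) {
        void *p = mmap(0, n * sizeof(T), PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANON, -1, 0);
        if (p == MAP_FAILED)
            throw std::bad_alloc();
        return static_cast<T *>(p);
    }
    void deallocate(T *p, std::size_t n) { munmap(p, n * sizeof(T)); }
};

template <class T, class U>
bool operator==(const mmap_allocator<T> &, const mmap_allocator<U> &) { return true; }
template <class T, class U>
bool operator!=(const mmap_allocator<T> &, const mmap_allocator<U> &) { return false; }

// Usage: a vector whose storage bypasses the 1GB sbrk() limit.
// std::vector<int, mmap_allocator<int> > big(100 * 1000 * 1000);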
Writing a corresponding C++ wrapper around Cdt might also be a good idea. When I was developing the hashed text utilities (you can find them on freshmeat) I found that Cdt is both faster and has less memory overhead than the C++ STL.

Oh well. Now I have everything working reasonably efficiently (both in time and memory) so that I can do my work. I'm leaving these plans for later. Or I'll work on them while waiting, if I'm forced to switch to disk-based hashes.

It's a shame not to be able to use the full 3GB virtual address space in NetBSD with the standard C and C++ libraries.

2005-07-23

Note to self: lexicographic ordering

The following C++ code is the correct way of writing a lexicographic "less than" function for a tuple of two items:

#include <utility>

template <class T1, class T2>
bool less_than(const std::pair<T1, T2> &p1, const std::pair<T1, T2> &p2)
{
    if (p1.first == p2.first)
        return p1.second < p2.second;
    return p1.first < p2.first;
}
This cannot be collapsed into a single expression that just combines the two '<' comparisons with a boolean operator. My mistake was writing

return
    (p1.first < p2.first) ||
    (p1.second < p2.second);
This does not work! For example, with p1 = (2, 1) and p2 = (1, 3) it returns true even though p1 > p2 lexicographically. The analogous combination with && doesn't work either. It could be rewritten to use the ?: operator, but with doubtful benefits. Note to self: don't try being too smart again!

2005-07-21

GNU - The Microsoft of the UN*X world

[Remark: please read 'extreme' adjectives/adverbs/pronouns like e.g. 'everyone' with an implicit "almost" in front]

GNU has produced some truly great pieces of high-quality software like gcc, binutils, emacs, screen, texinfo, GnuPG, wget... But IMHO it has some grave failures that have unfortunately become too widespread to go away soon. And it's not only these failures that worry me.

My main complaint goes to the autotools suite (autoconf, automake, libtool). I have seen configure scripts break badly too many times, on somewhat obscure platforms (e.g. SFU - Microsoft Services for UNIX) and on not so obscure ones. When the configure script breaks or "guesses" something wrong - God help you in trying to debug the monstrous script (some of them grow to over 1MB in size)! There are many little utilities that are smaller than their corresponding configure script.

And if you want to fix the original configure.in and regenerate the script, you have to find very specific versions of auto{conf,make}. But the first obstacle you will face is the M4 macro processor, which you have to learn before you can even try to repair the broken configure.in.

A friend of mine (also a programmer) once defended the autotools as excellent and blamed the programmers who can't write a proper configure.in that would work on all platforms. And his argument brings me to the title of this post.

What has made things worse for us non-Linux users is the vast popularity of GNU/Linux. Now everyone just assumes that you are running Linux. Developers write programs that don't compile (mostly thanks to autojunk, er, sorry, autotools) on UNIX-like platforms which are not Linux. They just assume you have the GNU suite of tools or, worse, that you are running Linux. Companies develop their software only for Linux (for a few very specific distributions) and usually don't support rarer platforms like FreeBSD. Some good open-source software, e.g. the Cyclone compiler I mentioned in one of my previous posts, is also Linux-only and fails to compile under FreeBSD.

The sad thing is that, once you are already supporting Linux, it usually isn't much trouble to support other OSes (sometimes maybe with fewer features).

Do you see the similarity? Until very recently, everyone assumed that you were running Windows!

I understand that it's hard for commercial software producers to support the myriad Linux distributions and several BSD flavors. But there is also a bunch of programmers who contribute to the 'GNU effect': those who don't put in the effort to write a nice, truly portable program, but just assume that "everyone is GNU/Linux".

In that respect GNU has become the Microsoft of the UNIX world. You either run GNU/Linux or you don't run the software you would like to. Being a BSD user, I don't like that. And no, I don't want to run Linux. If I were choosing my platform based on software availability, I would run Windows.

I'm not a Windows user. And it has nothing to do with its price, (non-)freedom or disliking Microsoft. I could even afford a legal copy of Windows and some of Microsoft's cheaper developer tools. I'm not using Windows because I don't like its philosophy - the Windows GUI "shell" is written with the assumption of a lame (er, sorry, technically inexperienced) user.

[Side note: hm, people didn't complain about DOS, and I guess they were more productive with it. I'm amazed how a 40-ish year old lady can make a correct plane-ticket reservation by typing into the mainframe console something that looks like gibberish to me.. I guess it has to do with concentrating on what you're doing and not just clicking around.. but that's a theme for another post.]

Using Windows I feel like I'm running around a small 1x1 m^2 room, while using UNIX is like.. well, I could compare it to piloting a shuttle - it requires great skill, but you are free to do whatever you want, the way you want.

[Another side note: for the same reason I dislike Java - it is crippled C++, stripped of features that Java's designers deemed "too dangerous". The result is a crippled, single-minded language playing your nanny.. But this is also a theme for another post.]

Now you probably expect me to write why I prefer BSD to GNU/Linux. I won't. The short story: it has to do with excellent software engineering. For the long story, take a look, for example, here. I fully agree with everything Matt says in his article.

Until next post...

2005-07-19

GnuPG and PKCS#11 and GNU extremism

Some time ago (November 2004, to be exact) I released a patch for GnuPG which makes it possible to use third-party smart cards, through their PKCS#11 drivers, with GnuPG.

Mr. Werner Koch (as I understand it, he is one of the main developers of GnuPG) didn't like it, saying there are licensing issues with run-time linking of proprietary PKCS#11 DLLs into GnuPG, which is, well, GNU copylefted. You can find the whole discussion thread here. Nevertheless, I continued to work on it, and the state in which I abandoned development is documented here (as well as in the patch itself). The reasons for abandoning the project can be summarized in the following points:
  • Lack of a suitable-quality hardware/software PKCS#11 implementation under Linux. I tried to find a software PKCS#11 token which offers relatively complete API support, but.. I found only Mozilla's, which is way too complicated to install.
  • Nobody on the gnupg-users list seemed really interested in the project.
So because of Werner's opinion on licensing, the patch did not make it into the GnuPG mainstream (even though I had offered to polish it). IMHO, what he failed to take into account is that there might one day be a GNU PKCS#11 implementation which, when linked into GnuPG, wouldn't raise any licensing issues. In the meantime I have completely switched my areas of interest and don't even have the hardware available any more. So for the time being, development is frozen on my part.

What's funny is that in the last week (8 months later) I have been contacted twice by different people asking about the development status of the patch.

To one person from a certain company I offered help with integrating the patch (i.e. explaining my design decisions and how it works). After I pointed him to Werner's explanation of the licensing issues, he asked whether I knew if GnuPG was perhaps dual-licensed. After I replied 'to my knowledge, it's GNU-only', I never heard from him again.

Another person has made modifications to my patch and actually made it work with the Schlumberger Cryptoflex 32k, both with OpenSC's and MUSCLE's PKCS#11 libraries. Great news! I don't know the precise details yet, e.g. whether he made it possible to use more than one key pair. With his modifications he solved some things more elegantly, but his changes conflict with my initial design decisions. And he also doesn't agree with Werner's interpretation of the licensing.

This story is just another example of GNU extremism hurting the end user. The latest example is Microsoft's libraries for parsing their new Office XML formats. As far as I understand, these libraries are essentially free (as in freedom and beer), except that Microsoft wants credit in programs that use them. To the GNU people, this is not 'free' enough. Personally, I see nothing wrong in asking for credit if somebody is using your work, nor do I consider a work less 'free' because of such a requirement. Even if I were required to return all changes to the original author - it would still be free to me.

I guess the extreme point of view makes it much easier to construct a GPL-style license: the lawyers need not bother thinking about reasonable exceptions and wording them in a manner that can survive a court challenge. Sadly, the lawyers' laziness combined with GNU extremism hurts only the end user..

Currently I'm employed in academia, and I'll try to release my work under a BSD-like license. Academia lives on money coming from taxes paid by companies and citizens. It is only fair to give back to society work unencumbered by the GNU license, so that it can be freely used by industry. If industry is able to make money from the work of academia - it is good for everyone. And industry is reluctant to use GNU software for a good reason - the license would require them to open up their complete source, which may contain trade secrets. The LGPL is IMHO a step in the right direction, but it is too rarely used.

Speaking of freedom - GNU should not be called free, as in freedom. It requires[1] you to make available any source code changes to any GNU-licensed program, if you distribute that program. What's worse, if you use only a bit of GNU code in an otherwise original work of yours, your complete program is considered GNU.

[1] Since when is requiring someone to do anything called freedom? Oh yes, you have the freedom to not use the software in the first place. But then again, you have such a freedom with any kind of software.

Don't get me wrong - the GNU project has produced some marvelous pieces of software. But people like Mr. Werner Koch are overdoing it. Of course, he should take care that GnuPG contains only GPL code. But what happens when it is executed and runtime-linked to some proprietary PKCS#11 library should not be his concern. It should be the concern of the end user only.

2005-07-18

Proven again

Once again I've proven that programmers really are notoriously bad at guessing where their program is slow. I wrote a program in C to sequentially read records from BerkeleyDB, and it was as slow as the Python version. Meaning: BerkeleyDB is the real bottleneck. I had really believed that sequential reading from a recno database should be almost as fast as reading from a sequential file. But it seems that it can retrieve only about 1 million records per 10 seconds.
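For the record, the C program was essentially the classic Berkeley DB cursor loop - this is not my exact program, just a sketch of the kind of scan I timed, written against the 4.x C API (it compiles as C++ too); the filename is made up.

#include <db.h>
#include <cstdio>
#include <cstring>

int main()
{
    DB *db;
    DBC *cur;
    DBT key, data;
    unsigned long count = 0;

    // Open the recno database read-only and walk it record by record.
    if (db_create(&db, NULL, 0) != 0)
        return 1;
    if (db->open(db, NULL, "records.db", NULL, DB_RECNO, DB_RDONLY, 0) != 0)
        return 1;
    if (db->cursor(db, NULL, &cur, 0) != 0)
        return 1;

    std::memset(&key, 0, sizeof key);
    std::memset(&data, 0, sizeof data);
    while (cur->c_get(cur, &key, &data, DB_NEXT) == 0)
        ++count;                    // 'processing' is just counting here

    std::printf("%lu records\n", count);
    cur->c_close(cur);
    db->close(db, 0);
    return 0;
}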

Anyway, that was just an experiment. I'll be using flat files for mass data analysis.

So why would one use BerkeleyDB recno databases instead of flat files? I can think of several reasons:
  • automatic management of concurrent access,
  • logical record numbering and quick access to record by number,
  • allows for variable-length records. The previous point is hard to realize with flat files and variable-length records.
A digression: based on these measurements, I wonder how they implemented recno internally. I guess it is some kind of B-tree mapping record numbers to file offsets. I'm too lazy to look into the source.

Another note: Knuth in one of his volumes (I don't remember which) mentions that balanced AVL trees can be used for simulating linear lists, but with much better time complexity for randomly accessing an item.

Current update: I'm just testing the Python performance of reading fixed-length records from a raw binary data file. It manages to read the data at ~14MB/sec. With unpacking the data into an internal representation using the struct module, the speed drops to ~3MB/sec. Using psyco doesn't help either.

2005-07-16

On kids and programming

For the past two days I've been fighting on a Croatian #linux IRC channel with a guy nicknamed Neuromanx. He is a teacher at a high school in the small town of Ivanić Grad. He also works in a local computer club with bright elementary-school pupils, preparing them for programming competitions. He boasts of having made many of his pupils champions at national programming competitions. All of his pupils achieved their excellent results programming in Basic. So what am I fighting with him about? He wants to teach them C and/or C++ before they enroll in high school. And we're talking about kids 13-14 years old!

I think that it is a bad idea, both as a way of teaching them a more "powerful" language and as a way of making them more successful in competitions. These are my reasons:
  • One needs to know the fundamentals of computer architecture to be successful in C or C++ - you know, stuff like memory addressing, two's complement and such. Unless you're comfortable with low-level concepts like bits, bytes and machine addresses - don't even try to understand how pointers, multiple pointers, function pointers or the integer arithmetic rules work (e.g. you can easily get a negative number by adding two positive numbers!).
  • Knowing assembler helps. I grasped how pointers work in C only after I had done some programming in assembler. After writing a few programs in assembler and trying to interface ASM functions with C.. came that 'a-ha!' moment after which I really understood pointers.
  • C compiler messages are often baffling and unintuitive (just try forgetting a terminating semicolon somewhere.. you often get anything but a simple error message along the lines of "semicolon omitted at line..."). C++ is even worse in that respect, especially if you mess up something with the STL.
  • C and C++ allow you to do much, but rarely warn you (unless you explicitly tell them to) that what you're doing is probably wrong. They are very forgiving. In other words, it is extremely easy to shoot yourself in the foot, and it can be extremely hard for a novice to figure out why the program is failing. It is possible to get almost any syntactically correct C code (a bit harder with C++) to compile and execute.. but chances are that it won't do what the programmer intended. Examples of such traps are signed-unsigned comparisons, implicit type promotion rules and operator precedence (see the snippet after this list). Even I, who consider myself an expert in C after a decade of experience with it, make stupid mistakes like writing !x&1 instead of !(x&1) to test whether an integer is even.
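To make two of these traps concrete, a throwaway snippet (not from any real program):

#include <iostream>

int main()
{
    int x = 6;                        // an even number
    std::cout << (!x & 1) << "\n";    // prints 0: '!' binds tighter than '&', so this is (!x) & 1
    std::cout << (!(x & 1)) << "\n";  // prints 1: this really tests whether x is even

    unsigned int u = 1;
    std::cout << (-1 < u) << "\n";    // prints 0(!): -1 is converted to a huge unsigned value
    return 0;
}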
Yes, C and C++ are powerful. Coming from the UNIX world, they share the UNIX philosophy - the user (programmer) is always right. That's why they provide dangerous constructs like type casts. For the same reason (among others) there is no undelete on UNIX (and rm asks "Are you sure?" only if you explicitly tell it to!) - because the user knows what (s)he is doing! This is excellent if you are an expert and really do know what you are doing. The language just gets out of the way and lets you do what you want, the way you want. But it is very unforgiving if you don't know exactly what you are trying to achieve, and how.

I am myself a strong proponent of Pascal for teaching someone to program. It was my next language after C64 BASIC and I have never regretted it. I first used Oxford Pascal on the C64, and after that I moved to Borland Pascal 7 on a 486 under DOS. The IDE was the best I have ever worked in (to this day, I have to say!), it included an intuitive debugger, and the compiler was blazingly fast with friendly error messages. And it was extremely easy to use - just run the IDE, type your program and compile it! Modern C IDEs come nowhere near BP in any of the aspects I have just mentioned.

Let's return to Pascal. Why would I recommend it to anyone wanting to learn to program? Because it teaches you discipline. I believe that you have to discipline yourself if you want to become a master of anything, including programming. Pascal forces that discipline on you. Unlike C:
  • it was designed for teaching programming,
  • it is a very strict language in which you can't accidentally make a semantic error like in C, therefore
  • it doesn't tolerate sloppiness (e.g. no implicit conversions),
  • it encourages you to learn structured programming,
  • it doesn't bog you down with the details of pointers and memory management (which in C you encounter at the very beginning - just reading something from the keyboard with scanf needs a pointer),
  • it has a more verbose but, for the newcomer, clearer syntax than C,
  • and it has one feature that I sometimes really miss in C: the built-in set data type.
In my younger days, I also participated in national programming competitions. And yes, I used Pascal every time. No, I never placed even among the top three contestants. But I don't blame that on Pascal. I simply did not have enough theoretical knowledge and imagination. If I have an idea of how to solve a problem, I can implement it in any of the many languages I know now (C, C++, Perl, Python, Pascal, a bit of Scheme, LISP and Java, Prolog, etc.)

Neuromanx says that he (and everyone around him) is 'allergic' to Pascal and that C is more 'powerful'. He never gave a rational argument for why Pascal is 'bad'. After I said that you can do anything in Pascal (especially Borland Pascal, Delphi and their extensions) as well as in C, his response was along the lines of: "You can also do with a sheep anything you can do with a woman." If he was alluding to sex, he forgot that you can't have a child with a sheep :)

After this sentence (coming from a school teacher!) and his short-sightedness, I wonder whether he really cares what happens to these kids later, or whether he just wants a larger number in his personal record of 'produced' state champions in programming.

And my statement holds firmly for the kind of programs you write at competitions: a bit of file input/output (simple formats, no parsing problems) and algorithmic tasks.

Let me also say that I have been teaching an algorithms and data structures course at (a Croatian equivalent of) a college. At that college the students first use Visual Basic and then immediately move on to Visual Studio. And most of them have a lot of trouble. C is too overwhelming for them with its details, and they struggle as much with a complicated IDE as with C itself. And these are older people (at least 20) who did learn about two's complement and similar things before coming to C.

Neuromanx claims that these kids are smart, but... I doubt they are so smart that they can successfully grasp and apply C with their level of background knowledge.

I would compare his plan to.. putting someone at the controls of a plane when all they have done before is ride a bicycle.

So what am I currently programming in? C and C++, of course! Why not Pascal? Well, unlike Neuromanx, I have my arguments, and this article summarizes most of them nicely. Let me quote two sentences from that paper:
"I feel that it is a mistake to use Pascal for anything much beyond its original target. In its pure form, Pascal is a toy language, suitable for teaching but not for real programming."

I fully agree. And yet, when I was learning to program in Pascal, I never really noticed its shortcomings. I have to add that some of Kernighan's points no longer hold with Borland's extensions to Pascal.

And of course, as I have written earlier, I consider myself an expert in programming; I know what I'm doing, and now I want the language to stay out of my way. In my learning days, I needed all the help I could get, and the Pascal compiler was really helpful.

The jump from Pascal to C.. was pretty easy once I had grasped the concepts of procedures, functions, loops, etc... you know, the basic elements of structured programming. And the programming discipline I acquired with Pascal saved me a lot of grief with the typical beginner errors in C (which the students at the aforementioned college are also often making).

To conclude: I am really glad I learned Pascal before C. It has made me a better programmer overall.

2005-07-14

STL debugger

In the meantime, one reader has brought to my attention the Etnus TotalView debugger. Too bad I don't have the time to play with it. Maybe some time in the future.

Doing actual work...

Remember that yesterday I promised to tell you how long that Python program actually took to execute? Well, I didn't wait long enough for it to finish - I killed it after two hours of running. I also tried running it with psyco, and actually obtained an almost 10x increase in speed in that simple program - disk usage was at ~5MB/sec. However, this is still too slow: for 6GB at that rate, it would take around 20 minutes. This can go a lot faster!

Today it turned out that I've uncovered a bug in the Phoenix-2 code. Joel fixed it really quickly. However, I'm still struggling with Phoenix. Each time I think of some novel way of composing functions, a new problem arises. I've checked Joel's time zone in the mail headers: he's in -7, I'm in +2, meaning that I wait half a day for an answer to a problem. (Hey, I'm not saying he's slow.. he has to sleep, after all! :))

Playing with Phoenix and simulating LISP in C++ is way cool. But I have real work to get done. So I have fallen back to writing STL-style functors, and this is going very smoothly. When Joel manages to answer my questions, I will adapt it to Phoenix. That's what I'd like to use eventually, because it is much more expressive than standard C++ STL-style functors. BTW, he thinks that what I'm trying to do is cool.
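For the curious, this is roughly what I mean by an STL-style functor - a trivial made-up example, not the actual filtering code:

#include <algorithm>
#include <iostream>
#include <iterator>
#include <vector>

// An STL-style functor: state (the factor) lives in the object and
// operator() does the work, so it can be passed to std::transform.
struct MultiplyBy {
    int factor;
    explicit MultiplyBy(int f) : factor(f) {}
    int operator()(int x) const { return x * factor; }
};

int main()
{
    std::vector<int> records;
    for (int i = 1; i <= 5; ++i)
        records.push_back(i);

    // The hand-written equivalent of the Phoenix expression _1 * 2:
    std::transform(records.begin(), records.end(),
                   std::ostream_iterator<int>(std::cout, "\n"),
                   MultiplyBy(2));
    return 0;
}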

A digression: yesterday evening when I was experimenting with Phoenix, I received a 1MB error message (checked with the wc program) from g++! My next thought was: "with a sufficiently complicated (many deep template instantiations) and broken C++ source you could almost use the compiler error output as an entropy source!" :)

And C++ definitely needs a new tool: a template metaprogramming debugger. No human can really read and understand a 1MB error message! I can find the major points and the source of the error, but why it happens.. I don't have a clue. Joel knows his library inside-out, so I suppose he knows where to start looking in the code and how to debug it. I'm totally lost.

And GDB is also awful when dealing with templates. Just try to set a breakpoint in a specific instance of some template-generated function.

2005-07-13

On programming languages

These last few days I've been fighting with C++ templates. I'm trying to write a mini-DSL for querying and filtering streams of data. The idea is to have provider operators; each time one is called, it provides a new value. Using the Boost Phoenix library (a part of the Spirit framework) I've come close to being able to write something like the following:

until_eof(
    mapcar(
        binary_istream_provider(f),
        _1 * 2),
    cout << _1 << endl);

However.. writing the custom functors is very hard. The Phoenix-2 library has.. well.. nice documentation for users, but not so nice when you are trying to extend it. And each wrong attempt results in screenfuls of C++ gibberish in the form of template instantiation errors. Actually, I'm starting to understand that error 'garbage', although not yet how to fix it.

Joel de Guzman, the main Spirit and Phoenix developer, is kindly trying to help me, but currently he's the only one on the mailing list answering my questions. I don't like the feeling of depending on a single person.

Some people have suggested doing it in LISP. Writing higher-order functions in LISP is trivial. But.. the data stored in the datafiles are binary C structs, suitable for reading directly into memory. Operating on such a data representation is very cumbersome in LISP.

Another language that comes to mind is Python. It is as convenient as LISP for writing higher-order functions, and dealing with binary data is relatively simple (thanks to its struct module). But it is slow. Sequentially reading ~160 million records from a BerkeleyDB RECNO database is taking forever! I'm watching the disk bandwidth usage, and it is a meager 0.5MB/sec, with CPU consumption at 100%. BerkeleyDB can't be the bottleneck since it is very fast with C or C++, and the disk obviously has greater bandwidth. That leaves a single bottleneck: Python itself.
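That binary-struct representation is exactly what makes C and C++ fast here: the records can be fread() straight into memory, with no per-field decoding step. A trivial sketch (the record layout and filename are made up):

#include <cstdio>
#include <cstddef>
#include <vector>

// Hypothetical fixed-length record; the real structs are whatever the
// datafiles actually contain.
struct Record {
    unsigned int id;
    double       value;
};

int main()
{
    std::FILE *f = std::fopen("data.bin", "rb");
    if (!f)
        return 1;

    std::vector<Record> buf(4096);
    std::size_t n;
    // Each fread() fills whole records directly - no struct-module-style unpacking.
    while ((n = std::fread(&buf[0], sizeof(Record), buf.size(), f)) > 0) {
        // process buf[0] .. buf[n-1] here
    }
    std::fclose(f);
    return 0;
}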

Why can't we have a nice, compiled, portable language, compatible with C data types, that makes it easy to write higher-order programs? Something like D or Cyclone. Unfortunately, both fail the premise of portability (e.g. I'm running NetBSD and FreeBSD..)

Oh well.. Maybe I'll just go back to the old, type-unsafe and very powerful way of representing everything with functions taking unspecified parameters (the f() parameter list; in C this is not the same as f(void)!) and void pointers. With a few typecasts inserted here and there, everything should work fine!

Oh, BTW, that Python program is still running. More than half an hour and still not finished. I've managed to write this post while waiting.. I'll post the final time here.

2005-07-12

PostgreSQL experience and how it made me use BerkeleyDB

I've tried using PostgreSQL v8 to process a large amount of web server access logs. The processing has a large internal state that has to be kept between successive accesses. I have to say I was disappointed at how unsuitable PostgreSQL is for the task. Namely:
  • The PL/pgSQL procedural language doesn't have static variables. This means that I have to reinitialize some variables upon each entry into a function, or pass a bunch of values to helper functions.
  • The Python and Perl procedural languages don't make cursors available. They first store the whole query result in the server's memory and then operate on it. Need I say that I ran out of memory on the server?
  • PostgreSQL doesn't provide cursors for update, so I had to execute an UPDATE for each returned row after calculating some statistics. SELECT/calculate/UPDATE really kills the database, even when you have all the indexes. Also, if it had cursors for update, I could save state in temporary tables and pass cursors for update to helper functions instead of large parameter lists.
All of these points make it extremely hard to write maintainable server-side data-processing functions. After having to debug a ~250-line PL/pgSQL function (and I had to use PL/pgSQL because it is the only procedural language that supports cursors) with a bunch of 'RAISE NOTICE' statements, I got fed up. Not to mention it was unbearably slow. This experience made me realize why companies are still buying commercial databases like Oracle.

I have also experimented with gigabase, but it has problems of its own. You have a choice of either:
  • index access getting rapidly slower with each insertion, or
  • giving up on insert/select programming: if you want fast index insertion, then insertions are not visible to SELECTs until you commit the transaction, and committing often again becomes rapidly slow.
Finally, I decided on a proven solution: BerkeleyDB. It offers a hash access method (the official PostgreSQL docs discourage using hash indexes) and is extremely fast (I've managed to do ~90k search/insert pairs per second). The drawback is not having a convenient query language like SQL, but I can live with that. The queries I have to perform are not too complex.

Combining the C++ STL containers with a few wrapper classes for convenient access to BerkeleyDB lets me write custom query programs pretty quickly (a rough sketch of such a wrapper follows below). Not to mention the processing speed - orders of magnitude faster than PostgreSQL.
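To give an idea of what such a wrapper looks like, here is a rough sketch (the class name and error handling are mine, not from any library) around a DB_HASH database with std::string keys and values, using the Berkeley DB 4.x C API:

#include <db.h>
#include <cstring>
#include <stdexcept>
#include <string>

class HashDb {
    DB *db_;
public:
    explicit HashDb(const std::string &path) : db_(0) {
        if (db_create(&db_, NULL, 0) != 0)
            throw std::runtime_error("db_create failed");
        if (db_->open(db_, NULL, path.c_str(), NULL, DB_HASH, DB_CREATE, 0644) != 0)
            throw std::runtime_error("DB->open failed");
    }
    ~HashDb() { if (db_) db_->close(db_, 0); }

    void put(const std::string &k, const std::string &v) {
        DBT key, data;
        std::memset(&key, 0, sizeof key);
        std::memset(&data, 0, sizeof data);
        key.data = (void *)k.data();  key.size = k.size();
        data.data = (void *)v.data(); data.size = v.size();
        db_->put(db_, NULL, &key, &data, 0);
    }

    bool get(const std::string &k, std::string &v) const {
        DBT key, data;
        std::memset(&key, 0, sizeof key);
        std::memset(&data, 0, sizeof data);
        key.data = (void *)k.data();
        key.size = k.size();
        if (db_->get(db_, NULL, &key, &data, 0) != 0)
            return false;
        v.assign((const char *)data.data, data.size);
        return true;
    }
};

// Usage: HashDb db("stats.db"); db.put("key", "value");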