2005-12-10

The nonsense of XML

Rick Jelliffe proposes, in a blog post, "Unicode-aware CPUs" instead of binary XML. Personally, I think that this is nonsense. With the recent talk about binary XML, I'm even more convinced that XML as a data format, including all of its companions such as XML Schema and XPath, should be abandoned.

With XML, the data is transformed from its machine ("binary") representation into XML, greatly inflating its size in the process. It is then transmitted to the other end, where some XSLT rules are possibly applied, the resulting data is checked against some schema, and it is finally reparsed back into binary form. XSLT and XML Schema are programming languages in themselves, just with a hideous syntax. XML was supposed to be editable by hand without any special tools, but this is, IMHO, nearly impossible with XSLT and XML Schema. So their existence defeats the purpose of XML.
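To make the round trip concrete, here is a minimal Python sketch for a single value. The choice of an 8-byte double as the "machine" representation and the tag name height are assumptions for illustration; the actual size ratio depends entirely on the data.

```python
import struct
import xml.etree.ElementTree as ET

height = 20.0

# Machine ("binary") representation: one IEEE 754 double, little-endian.
binary = struct.pack("<d", height)

# XML representation of the same value.
element = ET.Element("height")
element.text = "20"
xml_bytes = ET.tostring(element)

# The receiver has to parse the text back into a number.
parsed = float(ET.fromstring(xml_bytes).text)

# Compare the two sizes and show the value survived the round trip.
print(len(binary), len(xml_bytes), parsed)
```

For one field the XML form is roughly twice the size of the binary one, and that is before any schema validation or XSLT processing takes place.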

Other proponents of XML claim that XML is self-describing. I claim that this is also nonsense. Actually, there are two levels of self-description. The 1st is the description of data types (e.g. the height tag must be a real number greater than or equal to 0). This purpose is fulfilled by the accompanying XML Schema. The 2nd level is the interpretation of the data. This is done exclusively by the programs manipulating that data. Nobody can say (without actually inspecting the underlying program code) that the height tag doesn't actually specify the age of some item. The 2nd level of self-description cannot be achieved by any means.

Now, binary XML is, IMHO, superfluous in itself. There already exists a perfectly good data description format with binary encoding rules: ASN.1 with BER, DER and PER (Basic, Distinguished and Packed Encoding Rules), which also has a good solution for namespace problems (the OID hierarchy). Why invent another one?
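As a rough illustration of how compact these encodings are, the sketch below hand-encodes an ASN.1 INTEGER in DER. The function name der_integer is made up for this example, and only short-form lengths are handled; a real application would use a proper ASN.1 library.

```python
def der_integer(value: int) -> bytes:
    """Encode an int as an ASN.1 DER INTEGER (short-form length only)."""
    # DER requires the shortest two's-complement representation.
    if value >= 0:
        size = value.bit_length() // 8 + 1
    else:
        size = (value + 1).bit_length() // 8 + 1
    content = value.to_bytes(size, "big", signed=True)
    if len(content) > 127:
        raise ValueError("long-form lengths are left out of this sketch")
    # Tag 0x02 (INTEGER), then the length, then the content octets.
    return bytes([0x02, len(content)]) + content


encoded = der_integer(20)
print(encoded.hex(), len(encoded))  # 020114 3
```

The value 20 takes three bytes (tag, length, content), against 19 bytes for <height>20</height>.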

I propose using standardized virtual machine code as a means of data exchange. One good VM and accompanying language for this purpose is Lua, because of its simplicity, portability and power as a programming language.

Data rarely exists in a vacuum. It is often manipulated in many ways. So why not transmit a VM program that creates the data? Instead of transmitting e.g. the XML chunk <height>20</height>, why not transmit a piece of VM code that, when executed, stores the value 20 in the field height of some table? In the same way that programs currently expect specific XML tags in an XML document, they would expect certain names of tables and fields in the VM. Transformation and validation of the data would be other VM programs.
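The idea can be sketched with a toy VM in Python. This is not Lua and not any real VM: the instruction names NEWTABLE and SETFIELD, the table name item, and the functions run_chunk and validate are all invented for illustration.

```python
def run_chunk(chunk):
    """Execute a list of instructions and return the tables it built."""
    tables = {}
    for op, *args in chunk:
        if op == "NEWTABLE":
            (name,) = args
            tables[name] = {}
        elif op == "SETFIELD":
            table, field, value = args
            tables[table][field] = value
        else:
            raise ValueError(f"unknown instruction: {op}")
    return tables


# The "document": code that, when executed, stores 20 in item.height.
chunk = [
    ("NEWTABLE", "item"),
    ("SETFIELD", "item", "height", 20),
]


def validate(tables):
    """Validation is just another program run against the result."""
    height = tables["item"]["height"]
    return isinstance(height, (int, float)) and height >= 0


tables = run_chunk(chunk)
print(tables["item"]["height"], validate(tables))
```

The receiver looks up tables["item"]["height"] in the same way it would otherwise look for a height tag in an XML document.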

Lua has a distinct advantage here: a simple declarative syntax for the textual representation of data and programs, and binary VM code for representing the same program. For debugging, one would send the textual representation of the data, and in production one would send the binary representation. Lua offers a unified interface for loading both binary and textual chunks of code. As far as I remember, the binary format is also portable to various architectures.

With restricted VMs it is also easy to address security concerns. The only remaining concern is denial of service: one can e.g. construct an infinite loop that represents a piece of "data". Detecting infinite loops before executing the program is a provably undecidable problem: the Halting problem. In practice, the denial-of-service concern is easily handled by setting a maximum limit on the execution time of the program (and discarding the data as invalid if it is exceeded), or by forbidding control-transfer instructions in the data description chunks of the code.
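Both defenses can be sketched on a toy instruction set in Python. Everything here is hypothetical (the SETFIELD and JUMP instructions, the ChunkRejected exception, the function names), and the sketch counts executed instructions as a stand-in for a wall-clock time limit.

```python
class ChunkRejected(Exception):
    """Raised when a data chunk is treated as invalid."""


def run_limited(chunk, max_steps=1000):
    """Run a chunk, discarding it if it exceeds the step budget."""
    tables, pc, steps = {}, 0, 0
    while pc < len(chunk):
        steps += 1
        if steps > max_steps:
            raise ChunkRejected("step budget exceeded")
        op, *args = chunk[pc]
        if op == "SETFIELD":
            table, field, value = args
            tables.setdefault(table, {})[field] = value
            pc += 1
        elif op == "JUMP":
            (target,) = args
            if not isinstance(target, int) or target < 0:
                raise ChunkRejected("bad jump target")
            pc = target  # control transfer: may loop forever
        else:
            raise ChunkRejected(f"unknown instruction: {op}")
    return tables


def reject_control_transfer(chunk):
    """Static check: without JUMP, a chunk runs at most len(chunk) steps."""
    if any(op == "JUMP" for op, *_ in chunk):
        raise ChunkRejected("control transfer not allowed in data chunks")


good = [("SETFIELD", "item", "height", 20)]
evil = [("SETFIELD", "item", "height", 20), ("JUMP", 0)]

print(run_limited(good))
for check in (lambda: run_limited(evil, max_steps=50),
              lambda: reject_control_transfer(evil)):
    try:
        check()
    except ChunkRejected as reason:
        print("discarded:", reason)
```

The static check is the stricter of the two: a chunk with no control transfer terminates after at most as many steps as it has instructions, so no budget is needed at all.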

To conclude: XML is far too heavyweight, ugly and overrated. Wake up, people; look around and search for alternatives.

Tags: XML

4 comments:

Anonymous said...

I agree with the general idea that XML sucks. I do not agree that an executable format would be the answer. Executable formats are too powerful for their own good. If you want to transfer data you want to be sure it won't contain some virus so it definitely shouldn't be executable.

While I love S-Expressions for Lisp coding, I don't think XML and S-Expressions are that different; basically you have a slight syntactic difference over the same tree structure.

What is wrong with a simple textual format with a comment describing the fields at the beginning of the file, one field per line, and records separated by empty lines?

All databases use flat tables (no trees) anyway, so the knowledge how to store one in the other is all available and worked out.

IMO a data format can't be simple enough to parse. XML actually did away with all the advantages of text-based data formats as it doesn't really work together that well with the Unix command line tools to manipulate text data (sed, grep, awk,...)

zvrba said...

If you disallow the code to execute control-transfer instructions, then it's not too powerful. VM code can be nicely sandboxed and many properties of it can be proven. I have watched the MS Research video on their Singularity OS. They don't use HW for protecting the kernel from userspace, but static code checkers. If the code compiles, then it is safe. Of course, one could always have bugs in static checkers, etc., but this is a necessary risk. It is no greater than the risk of mistakenly interpreting data.

While I agree that XML did away with all the advantages of text formats (although the XML proponents are unwilling to admit that), text formats are a) too verbose, b) often lack a "standard" parser. This leads to proliferation of ad-hoc and often buggy parsers. Look at how many XML parsers are out there. Speaking of text formats, do you know about YAML (http://www.yaml.org)?

Verbosity of text formats IS becoming a concern today, especially in small wireless devices such as sensors. You have limited bandwidth and limited power (compressing the text uses a lot of CPU and costs battery power).

Anonymous said...

XML solves the problems that it was designed to solve (a simplification of SGML, which was a way to represent the kind of semi-structured data used in multi-format publishing) very well. If it doesn't solve your problems well, it doesn't suck; you're using the wrong thing for the wrong job.

During the dot com boom, transaction-oriented programmers took on XML and built the horrible W3C Schema Language around it. They shoehorned XML into a place where it didn't belong (they wanted element declarations to map easily to Java declarations so that they could automate the creation of their B2B apps, etc.), and now people like you blame the bad fit on the original design because you think it was designed by and for Java programmers in 1990. It wasn't.

"Binary XML" would be an extension of that same mistake, and your idea of standardized virtual machine code is the best alternative I've heard of.

XSLT: it's ironic that I found your post in a link from reddit, which is typically gaga over anything related to LISP and Scheme. XSLT is descended from those, and not from Algol, like all the languages you're comfortable with, and it gets terabytes of useful work done every day by people who are perfectly comfortable with it. Get over it.

zvrba said...

Oh, but I'm very familiar with lisp and scheme and I find them very natural :) I was mostly complaining about the XSLT syntax.

I've never written a larger project in either lisp or scheme because of the lack of many supporting libraries, but I like their philosophy; they changed the way I think about programming. Now I'm mimicking LISP in many of my other programs written in non-LISP languages.