Flutterby™! : XML too hard? 2003-03-19 18:11:02.836069+00

XML too hard?

2003-03-19 18:11:02.836069+00 by Dan Lyke 7 comments

Headlining it as "dan over-estimates intelligence of coders", in reference to my oft-quoted "XML is the subset of SGML that Microsoft's developers could understand" remark, John over at Genehack points to Tim Bray's XML Is Too Hard For Programmers. He's got some good points, but I think he misses the big one: XML is harder to write than to read. If you realize that a given XML format will generally have a specific application, then building a system that just throws exceptions when the XML doesn't map nicely onto the data structures you were trying to build is just fine. Yes, we don't know how to do this easily yet, but we have a bunch of idioms, and the better ones will start to filter out shortly. I've written at least one system which creates C structs and was a joy to work with. Well, as much of a joy as C code can be.
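As a rough illustration of the "just throw an exception when it doesn't map" idiom, here is a minimal Python sketch. The `Order` structure, its tag names, and the `MappingError` class are hypothetical, invented only for this example; nothing here reproduces the C system mentioned above.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


class MappingError(ValueError):
    """Raised when a document does not fit the expected structure."""


@dataclass
class Order:
    # Hypothetical target structure for this sketch.
    order_id: str
    quantity: int


def parse_order(xml_text: str) -> Order:
    # ET.fromstring raises ET.ParseError if the text is not well-formed.
    root = ET.fromstring(xml_text)
    if root.tag != "Order":
        raise MappingError(f"expected <Order>, got <{root.tag}>")

    # Require exactly the children we know how to store, nothing more or less.
    expected = ["OrderId", "Quantity"]
    found = [child.tag for child in root]
    if sorted(found) != sorted(expected):
        raise MappingError(f"expected children {expected}, got {found}")

    try:
        quantity = int(root.findtext("Quantity"))
    except (TypeError, ValueError) as exc:
        raise MappingError("Quantity is not an integer") from exc
    return Order(order_id=root.findtext("OrderId").strip(), quantity=quantity)
```

The reader side stays small because anything unexpected is rejected outright instead of being guessed at.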

The hard bit is that writing valid XML is non-intuitive and often means fixing far-reaching data quality issues that involve high level meetings and getting the database administrators to do stuff. And because the accountability is one level removed (those writing XML often don't have to eat what they've cooked), it's a frustrating feedback loop to get people to get their output correct.

I've still yet to figure out why people think XML is so cool, and I'm not sure if the fault is XML's or mine. I haven't had -- nor have I wanted -- much contact with XML. The impression that I've gotten is that it's a nice idea with some good uses, but has been horribly overhyped and abused in the past few years. Anybody care to enlighten me?

People think XML is cool because it LOOKS easy. To write. The people who have to take it apart later know better.

I maintain a widget that takes commerce data (customer orders) from various vendors' web sites and parses it so a record of the order can be stored in our internal (SAP-based) system. Because this is a web widget and I am in the "scripting camp" whenever possible (god, I loved that part of the article; I'm going to put it on my wall - I have never heard the groups that clearly defined before), it's in Perl. Because every vendor wants to use a different kind of commerce server, it was clear from the beginning that we were going to have to use some sort of transaction standard.

We originally used OBI - and in fact have two vendors still on that; my code has two separate parsing/hashing units, one for OBI, one for XML. Now, if you don't know OBI it is a fairly flat delimited format - with two non-printable separator characters, one to keep apart "segments" or records, the other to keep apart "subsegments" or fields within that record. Parsing this in Perl is really little more than a matter of two split() commands, and because the first subsegment of a segment always describes what type of segment it is (thereby giving you a built-in spec for the uses of the other subsegments, which vary by segment type), it's pretty easy to process further/store/etc.
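The "two split() commands" idea can be sketched as follows, in Python rather than the Perl described here. The separator characters and the segment type names are assumptions made for illustration; the comment does not say which non-printable characters OBI actually uses.

```python
# Assumed separators: the text only says they are two non-printable characters.
SEGMENT_SEP = "\x1e"
SUBSEGMENT_SEP = "\x1f"


def parse_flat(block: str) -> list[tuple[str, list[str]]]:
    """Return (segment_type, fields) pairs from a two-level delimited block."""
    records = []
    for segment in block.split(SEGMENT_SEP):       # first split: segments
        if not segment:
            continue  # tolerate a trailing separator
        # Second split: the first subsegment names the segment type.
        kind, *fields = segment.split(SUBSEGMENT_SEP)
        records.append((kind, fields))
    return records
```

For example, a block containing a hypothetical `HDR` segment followed by an `ITM` segment comes back as two pairs, each ready to be dispatched on its type.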

But the vendors, in order to properly use the segments and put things in meaningful places, have to wade through a hundred-page book which looks like it was written by a very large and quasi-governmental committee, because it was. This thing reads like the worst defense contract you ever saw. Oh, yeah, and since the data block contains unprintables and some other potentially nasty characters for web transmission, the whole thing has to be base64 encoded for transit, which flusters some people. None of them liked using it.

So the most recent four vendors - say, in the last two to three years - have used XML. For this I had to write a rudimentary XML parser from scratch - I couldn't find one at the time although I'm sure someone has made a Perl standard library I could carve down to a useful size since then.

My parser is specialized: I always need to read and store the entire structure; there is no inessential data. Everything the vendor sends is germane to me or they wouldn't be sending it. So here's my dirty hack, for the curious (if you're not, skip the next paragraph and go on to the moral of the story).

I keep track of tag nesting and assemble nested tags as I go in the form of a string which holds the nesting levels with slashes between. That is:
<OrderRequest>
<OrderRequestHeader>
<OrderRequestParty>
is, internally, OrderRequest/OrderRequestHeader/OrderRequestParty. When I dig down to real data, I store it in the data hash using the assembled tag string to that point as the key. This works fine for uniquely tagged data. Most commerce XML specs require either uniquely tagged data or specially tagged repeating data, the latter of which looks like this:
<ListOfPartNumber>
<PartNumber>1</PartNumber>
<PartNumber>2</PartNumber>
</ListOfPartNumber>
So whenever I find a ListOfSomeName, this is a cue for me to catch all the inner SomeName tags (it'll always be the same name, at least in the XML spec I use) and handle them specially. I don't store the ListOfSomething tag at all, I just tack on a sequence number to parse out later:
$myhash{"PrecedingTags/PartNumber:1"} = 1;
$myhash{"PrecedingTags/PartNumber:2"} = 2;
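
The scheme described above (slash-joined tag paths as keys, ListOf wrappers dropped, repeated children numbered) might look roughly like this in Python. This is a sketch, not the commenter's Perl parser: it leans on the standard library's ElementTree instead of a hand-written tokenizer, and the `flatten` name is made up for the example.

```python
import xml.etree.ElementTree as ET


def flatten(xml_text: str) -> dict[str, str]:
    """Flatten an XML document into {"A/B/C": text} keyed by tag path."""
    data: dict[str, str] = {}

    def walk(elem, path, suffix=""):
        if elem.tag.startswith("ListOf"):
            # Drop the ListOf wrapper from the key and number its children.
            for n, child in enumerate(elem, start=1):
                walk(child, path, f":{n}")
            return
        here = path + [elem.tag + suffix]
        children = list(elem)
        if children:
            for child in children:
                walk(child, here)
        else:
            # Leaf element: store its text under the assembled path.
            data["/".join(here)] = (elem.text or "").strip()

    walk(ET.fromstring(xml_text), [])
    return data
```

Values come back as strings here; converting them to numbers is left to whatever consumes the hash.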

OK, and welcome back those who skipped the geeky paragraph. The point here is, it was clear that all those vendors who shunned the OBI we offered were intimidated by its APPARENT complexity and the fact that XML looked so innocuous - I mean, it looks like HTML; it's pretty human-readable; it doesn't appear to contain any hidden pitfalls. And they got slapped in the face with it: the XML vendors, on the whole, have needed two to three times as long to get the data they're sending us up to speed, and they have had immense difficulty wrangling the format and/or making changes to the data later when necessary. The people who bother to absorb that horrible OBI spec, OTOH, find that making changes to it is relatively trivial. It just LOOKS horrible.

I'm not defending OBI exactly - it was a messy, unwieldy standard - but I'm not defending XML either. OBI tried to be all things to all people. XML tries to be nothing to anyone.

Todd: Yep, XML doesn't make the semantic information of the data any easier to deal with. The only thing it does is specify hierarchy and impose some character set restrictions. Thus it can make for an easier framework when you start to use non-ASCII character sets.

It isn't any more human editable than the right tools and a binary format. It embeds no code. It contains little information itself. It's harder than you'd think to write conforming output, which leads to lots of misunderstandings and "oh, I'll just dump the database to a structure that looks like..."

It's the Modula-3 of markup languages. All that strict typing is probably a good idea at some level, but it's a royal pain in the @$ when you're trying to get a team to communicate with it.
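
On the point that conforming output is harder to write than it looks, a small Python sketch shows one common trap: pasting database values straight between tags. The function names and the sample value are made up for illustration; `escape` is the standard library's `xml.sax.saxutils.escape`.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def naive_element(tag: str, value: str) -> str:
    # Pastes the value in untouched: breaks as soon as it contains & or <.
    return f"<{tag}>{value}</{tag}>"


def escaped_element(tag: str, value: str) -> str:
    # escape() rewrites &, < and > as entity references.
    return f"<{tag}>{escape(value)}</{tag}>"


def is_well_formed(xml_text: str) -> bool:
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True
```

With a value such as "Smith & Sons", the naive version produces text that a conforming parser rejects, while the escaped version parses. Escaping is only one of the pitfalls; character encoding and data quality are separate problems this sketch does not touch.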

Ok. Perhaps I'm positively biased since I've never personally had to push bits on this, but it seems to me that everyone makes too much of this, which may be a side effect of not quite *getting* it (which in turn is because no one "gives" it correctly).

XML is a language for defining *other* "little languages" which you use to wrap data for transport.

No, XML per se doesn't help on the semantics front. But, by definition, you can't *work* in "raw" XML, you *work* in the language defined (or maybe declared?) by a DTD... and *that* *does* define semantics.

Other_todd, your piece sounds like it's saying that OBI should be easier than XML, which should be ugly... but then it goes on to say pretty much exactly the opposite. Huh? :-)

Jay: Strictly speaking, I think that the semantics are given by the processing application. And the mixed blessing of XML is that, unlike SGML, it's parseable without a DTD or schema. Mixed because this means that XML can be transmitted without context, but that also means that it's easy to parse garbage.

So XML is a format. XML schemas and DTDs are the language for defining the other little languages. And if you look at XML as just a format, much like comma separated ASCII, then it's an overly complex format with some "gotchas" that jump out of nowhere, and because everyone is using a "conforming parser", those gotchas are fatal rather than just data corrupting.

In some cases "fatal rather than data corrupting" is good, but in lots of applications, web pages, many applications of databases, whatever, it's better to get wrong results than to get no results.
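
The difference between "fatal" and "data corrupting" can be seen in a short Python sketch using the standard `csv` and `xml.etree.ElementTree` modules. The sample inputs and function names are invented for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET


def read_csv(text: str) -> list[list[str]]:
    # csv.reader accepts rows of any length, so a stray comma
    # silently shifts fields instead of raising an error.
    return list(csv.reader(io.StringIO(text)))


def read_xml(text: str) -> list[str]:
    # A conforming parser refuses the whole document on the first
    # well-formedness error, such as a mismatched closing tag.
    return [item.text or "" for item in ET.fromstring(text)]
```

Given "1,Smith, John" under a two-column header, `read_csv` quietly returns a three-field row; given one mistyped closing tag, `read_xml` raises `ET.ParseError` and returns nothing at all. Which behavior is preferable depends on the application, as argued above.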

Heh. Sorry that was unclear. What I was trying to say is that OBI looks ugly and hard but turns out to be fairly simple to work with; XML looks easy and attractive but turns out to be surprisingly hard.

In my XML example there IS a declaration. I mean, I work without a DTD because my DTD is basically built into my hasher - somewhere in the code is a section which says, "Ah, we have read [raw tag named X] and we must store that as [data field named Y]" - my code sets up the correspondence between tags and meanings, which is what the DTD would otherwise do. These days *I* am the DTD provider as I tell new vendors "Here is our spec, do it this way." With our first XML vendor we sort of took the xCBL commerce spec and narrowed it down to something we could agree upon.

By the by, one thing I have heard a number of vendors (who may or may not Get It) express grievances about is the plethora of commerce specs. "Why can't we all standardize on one spec?" they cry. Sometimes flexibility isn't everything, I guess.