Flutterby™! : Parse your XML! 2001-11-09 00:13:57+00

Parse your XML!

2001-11-09 00:13:57+00 by Dan Lyke 15 comments

Recommendation that everyone should follow, but nobody (not even me) ever will: Parse your #*(&^@$)%! XML before you publish it!!!!! This will help ensure that it is XML. Thank you.

This reminds me of another adage: 'Make sure your code compiles before commiting to version control'.

The particular thing that got me today was that XML does not encode binary data. In fact, it only allows for 3, count 'em: '\t', '\r' and '\n': 3, ASCII equivalent control codes. Note that there is no form-feed in this.

There will be a quiz later about how you should filter user input data before passing it into systems which expect XML. It is very important to either:

Because it seems like every time Dan implements a system to import XML data from elsewhere he has to write a pre-parser to make sure that the data is XML, thereby obviating any alleged gains that XML is supposed to give us over, say, flat text files.

We had a perfectly working system using flat text files that was fast and reliable. But, the 'magazine article of the month club' syndrom took over me and we rewrote everything into XML. Since DOM was 'wonderful' we used that - after all it allowed random access of the data, so to speak.

Our output file ran to 18 meg. No problem. Importing the file was a different story, however. Using DOM the 18 meg turned into 1 GIG in RAM of XML-DOM and would run for ever just parsing and loading up. 50 times larger? Yes. So we rewrote using SAX and now its fast and happy. Interestingly its just like the text based system we were using except it has to mush through all the tags and such. In fact we used the orginal code from the text import to save time. But now we are 'buzz word' compatible, and thats important ---- ...

BTW we have added another famous saying to our wall: 'It's easy for me.' A classic if there ever was one <g> Right up there with 'The user interface needs a little work' - while looking at a C:\ prompt...

The real advantage of XML is a more consistent character set handling and heierarchical data structures. But given what you give up for that, I'm still not a convert, and it is way

overhyped.

("XML will enable us to..." No, XML might save you some time writing some of your parse code to handle cases you're not used to handling.)

It's also important to only ever use it as a tool interchange format. If humans are reading XML, something is wrong, and if humans are writing XML something is very, very wrong.

After a month in hell growing a limited-scope XML reader for commerce interchange (purchase orders placed with outside vendors get translated into SAP storage through my widget) I am happy to hear that other people think it's more trouble than it's worth as well. I'm sure there's a use for it. But I haven't encountered it yet.

Beware of MS XP Access 2002. MS says its XML able but we would hand it a beautifully formatted hierarchical data structure and import it into MDB (access) and it becomes a bunch of not connected tables. Then if we export back to XMl it gives us a bunch of files, one for each of our, files, but of course not hierarchical at all.... XML Spy handles things a little better.

It won't help with documents from others, but you can hide damn near anything you want in cdata, can't you?

I've always been a little mystified by the hype over XML. Um, great - so it's a generic, (pseudo)standard way to store structured data. That's a nice thing to have; a neat little invention. But in what way is it going to change the world? It's basically just going to let you use a generic parser instead of custom-writing one for each data format you want to use.

I *like* XML, and I plan to use it any time I design a new file format that could reasonably be based on it. But a technological revolution it is not. I'm amazed that there's enough to say about it to fill even one of the many books written on the subject.

Looking at the charset for base64, it looks like that would be a viable means of embedding binary data (apparently no overlap with XML reserved chars). Just a possibility. Might be worth researching.

The biggest problem with XML (and other specifications) is the lack of validation. There's tons of data masquerading as XML that, as Dan points out, just doesn't validate. I don't hold out much hope for that ever changing. Heck, Adobe owns the TIFF format (after having bought it from Aldus) and they can't even write TIFFs that meet their own specification. There have been TrueType fonts shipped by both Apple and Microsoft (the two companies who have published TrueType specifications) that don't contain valid data. It's an endemic problem in the computer industry, and if you complain to someone who wrote out data that doesn't meet the specification, the most common reply you'll get is "well, it works in OUR parser".

As DaveP points out, the real issue is that this wasn't supposed to be binary data. And control-L is valid ASCII used in a lot of places. But XML is not a superset of ASCII. And the really most annoying part is that there is a free, vyer standard implementation of XML: expat. It's used by almost everything, including the Perl XML layer. It doesn't take a whole lot to type:

The other problem with XML is that now that the hypemonsters have kicked in it's in grave danger of garnering all the flaws of SGML. But that's a different fight.

Finally, I'd argue that if you've really got binary data, then XML is the wrong format to be passing it in, a tagged file like QuickTime (and a gazillion other image formats before it) is probably much more reasonable. Or, if it's part of a response, drop it in a composite file much the way HTML does, where the source file has links to the binary components.