Flutterby™! : HTML and regex 2013-10-09 16:29:47.419539+00

HTML and regex

2013-10-09 16:29:47.419539+00 by Dan Lyke 5 comments

Sigh: Saw the "don't parse HTML/XML with regex" question again: http://stackoverflow.com/quest...except-xhtml-self-contained-tags

The first answer was cute, but I was reminded that anyone who thinks you can parse HTML or XML without doing some serious regex fix-up first has never actually used any XML or HTML in the wild, where the parser's answer is almost always "this isn't valid XML and I can't deal with it, sorry!" Those people can just ST-right-the-FU.

#Comment Re: HTML Parsing made: 2013-10-09 16:59:16.241933+00 by: Jack William Bell

I once had a six-week project to write a custom HTML parser/DOM in C#, meant to run on ASP. I got it working with support for all the elements and most of the ways people could screw up HTML.

It was reasonably fast, but it used a LOT of memory because I wrote it as a recursive descent parser. I coded the basic parser and DOM in a week. The other five weeks went to fixing it with special cases until it passed all the unit tests. Tail recursion to the rescue!

I used to worry about memory consumption, but now I run "top", hit "M", and see that Firefox is using 669m with 3 tabs open.

There are platforms where memory matters, but anything where C# is a reasonable development option probably isn't one of 'em.

#Comment Re: made: 2013-10-09 20:04:48.885032+00 by: Jack William Bell

Sure, but the parser was supposed to run on a server. One of the other devs complained about the memory usage and claimed he could code up something with Regex that would do the same thing but use a lot less memory.

The Flutterby system parser uses regexes, but in a way that's suspiciously parser-like. On the other hand I've tried to rewrite it in Flex and got caught in a maze of twisty little state machines, all weird.

I've been flamed by some engineers at metering companies for the way I parse their big fat XML files. Then I delete a few million records and re-suck in their XML for it. My limit is how fast I can insert into the SQL server. ;) The trick I found is to load only one chunk at a time as XML. It's technically not correct, because I never look at the entire XML file at once, but it sure works well.

Example: 50k records, all chunked by meter#. Split the XML file into chunks running from <meter to /meter>, load each chunk up (usually using simplexml in PHP), and parse it. Repeat until done.
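
For anyone who wants to see the shape of that, here is a rough sketch in Python rather than PHP/simplexml. It is not the code described above, just one way the chunking could look. The names iter_meter_chunks and parse_meters, the read size, and the "skip a bad record" policy are all made up for illustration. It assumes <meter> elements are never nested or self-closing, and that the text </meter> never shows up inside a comment or CDATA section.

```python
import io
import re
import xml.etree.ElementTree as ET

# One complete <meter ...>...</meter> element, found with a regex rather
# than a whole-document parse. Assumption: elements are not nested.
METER_RE = re.compile(r"<meter(?=[\s>]).*?</meter>", re.DOTALL)


def iter_meter_chunks(stream, read_size=64 * 1024):
    """Yield each <meter>...</meter> span as a string, reading incrementally."""
    buf = ""
    while True:
        data = stream.read(read_size)
        if not data:
            break
        buf += data
        pos = 0
        for match in METER_RE.finditer(buf):
            yield match.group(0)
            pos = match.end()
        # Keep only the unmatched tail; it may hold a partial element.
        buf = buf[pos:]


def parse_meters(stream, read_size=64 * 1024):
    """Parse each chunk as its own small XML document."""
    for chunk in iter_meter_chunks(stream, read_size):
        try:
            yield ET.fromstring(chunk)
        except ET.ParseError:
            # Hypothetical policy: one broken record does not sink the file.
            continue


if __name__ == "__main__":
    sample = io.StringIO(
        '<meters><meter id="1"><reading>5</reading></meter>'
        '<meter id="2"><reading>7</reading></meter></meters>'
    )
    for meter in parse_meters(sample):
        print(meter.get("id"), meter.findtext("reading"))
```

Only one meter's worth of XML is handed to the real parser at a time, so a malformed record costs you that record and not the whole upload.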

I've also learned to not try to do it when they upload a file via an API. Just write that input to disk and parse it with a completely different process, usually triggered or just cron'd. This is useful because I can fix code for errant data after the fact. Accept the garbage in, sort through it for what we need later (1 to 5 minutes is real time in this world).
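
A minimal sketch of that "write it down now, parse it later" split, again in Python and again with invented names (accept_upload, process_spool, and the incoming/done/failed directory layout are assumptions, not anything described above). The API side only writes bytes to disk. A separate process, run from cron or a trigger, does the parsing.

```python
import os
import uuid
from pathlib import Path


def accept_upload(data, spool_dir):
    """API side: store the raw upload untouched and return its path."""
    incoming = Path(spool_dir) / "incoming"
    incoming.mkdir(parents=True, exist_ok=True)
    final = incoming / f"{uuid.uuid4().hex}.xml"
    tmp = final.with_suffix(".part")
    tmp.write_bytes(data)
    # Rename after the write finishes so the worker never sees a partial file.
    os.replace(tmp, final)
    return final


def process_spool(spool_dir, handler):
    """Worker side: run handler(path) on each spooled file.

    Returns (succeeded, failed). Failed files are kept, not deleted, so
    they can be moved back to incoming/ once the parsing code is fixed.
    """
    spool = Path(spool_dir)
    incoming, done, failed = spool / "incoming", spool / "done", spool / "failed"
    for directory in (incoming, done, failed):
        directory.mkdir(parents=True, exist_ok=True)
    ok = bad = 0
    for path in sorted(incoming.glob("*.xml")):
        try:
            handler(path)
        except Exception:
            os.replace(path, failed / path.name)
            bad += 1
        else:
            os.replace(path, done / path.name)
            ok += 1
    return ok, bad
```

The failed/ directory is the point of the design: the garbage is already accepted and sitting on disk, so fixing the code and re-running loses nothing.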