Flutterby™! : HTML and regex


HTML and regex

2013-10-09 16:29:47.419539+00 by Dan Lyke 5 comments

Sigh: Saw the "don't parse HTML/XML with regex" question again: http://stackoverflow.com/quest...except-xhtml-self-contained-tags

The first answer was cute, but it reminded me that anyone who thinks you can parse HTML or XML without doing some serious regex fix-up first has never actually used any XML or HTML in the wild, where the parser's answer is almost always "this isn't valid XML and I can't deal with it, sorry!" Those people can just ST-right-the-FU.
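To make "regex fix-up first" concrete, here is a minimal sketch of one such repair, assuming the only damage is unescaped ampersands in text. The `BARE_AMP` pattern and `parse_with_fixup` helper are names made up for this illustration, and it uses Python's standard `xml.etree.ElementTree` as a stand-in for whatever strict parser is rejecting the input. Real-world breakage comes in many more flavors than this one.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical fix-up pattern: an "&" that does not begin a named or
# numeric entity reference such as &amp; or &#38; or &#x26;.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")

def parse_with_fixup(text):
    """Try a strict parse first; on failure, escape bare ampersands and retry."""
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        # The strict parser gave up, so repair the text and parse again.
        return ET.fromstring(BARE_AMP.sub("&amp;", text))

doc = parse_with_fixup("<shop><name>Nuts & Bolts</name></shop>")
print(doc.find("name").text)
```

The strict parse is tried first so that input which is already well-formed is never touched by the regex.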

[ related topics: Web development Content Management Woodworking ]

comments in ascending chronological order:

#Comment Re: HTML Parsing made: 2013-10-09 16:59:16.241933+00 by: Jack William Bell

I once had a six-week project to write a custom HTML parser/DOM in C#, meant to run on ASP. I got it working with support for all the elements and most of the ways people could screw up HTML.

It was reasonably fast, but it used a LOT of memory because I wrote it as a recursive descent parser. I coded the basic parser and DOM in a week. The other five weeks went to patching in special cases until it passed all the unit tests. Tail recursion to the rescue!

#Comment Re: made: 2013-10-09 18:02:16.293182+00 by: Dan Lyke

I used to worry about memory consumption, but now I run "top", hit "M", and see that Firefox is using 669m with 3 tabs open.

There are platforms where memory matters, but anything where C# is a reasonable development option probably isn't one of 'em.

#Comment Re: made: 2013-10-09 20:04:48.885032+00 by: Jack William Bell

Sure, but the parser was supposed to run on a server. One of the other devs complained about the memory usage and claimed he could code up something with Regex that would do the same thing but use a lot less memory.

The lead dev (who was a young guy) was smart enough to say "No you can't."

#Comment Re: made: 2013-10-09 21:56:34.979492+00 by: Dan Lyke

The Flutterby system parser uses regexes, but in a way that's suspiciously parser-like. On the other hand, I've tried to rewrite it in Flex and got caught in a maze of twisty little state machines, all weird.

But even on a server, 6 engineer weeks will buy a shload of RAM.

#Comment Re: made: 2013-10-09 22:25:18.687246+00 by: meuon

I've been flamed by some engineers at metering companies for the way I parse their big fat XML files. Then I delete a few million records and re-suck in their XML for it. My limit is how fast I can insert into the SQL server.. ;) The trick I found is to only load as XML a chunk at a time. Technically not correct, because I am not looking at the entire XML file at the same time, but it sure works well.

Example: 50k records, all chunked by meter#. Split the XML file into chunks running from <meter to /meter>, load each chunk as XML (usually with simplexml in PHP), and parse it. Repeat until done.
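A rough sketch of that chunking idea, written in Python with the standard `xml.etree.ElementTree` instead of PHP's simplexml. The record shape (`<meter id="...">` with a `<kwh>` child), the `METER_CHUNK` pattern, and the `iter_meters` helper are all assumptions for illustration, and it assumes `<meter>` elements are never nested or self-closing.

```python
import io
import re
import xml.etree.ElementTree as ET

# Assumed record shape: <meter ...> ... </meter>, never nested.
METER_CHUNK = re.compile(r"<meter\b.*?</meter>", re.DOTALL)

def iter_meters(stream, block_size=65536):
    """Yield one parsed <meter> element at a time from a text stream."""
    buffer = ""
    while True:
        block = stream.read(block_size)
        buffer += block
        consumed = 0
        for match in METER_CHUNK.finditer(buffer):
            consumed = match.end()
            try:
                yield ET.fromstring(match.group(0))
            except ET.ParseError:
                continue  # skip one bad record, keep going with the rest
        # Keep only the unfinished tail (a partial record, if any).
        buffer = buffer[consumed:]
        if not block:
            break

sample = io.StringIO(
    '<meters>'
    '<meter id="1"><kwh>10</kwh></meter>'
    '<meter id="2"><kwh>7 & 1</kwh></meter>'  # bare "&" makes this one invalid
    '<meter id="3"><kwh>12</kwh></meter>'
    '</meters>'
)
for meter in iter_meters(sample, block_size=16):
    print(meter.get("id"), meter.findtext("kwh"))
```

Only one record's worth of XML is ever handed to the parser, so a single malformed record costs that record and not the whole file.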

I've also learned to not try to do it when they upload a file via an API. Just write that input to disk and parse it with a completely different process, usually triggered or just cron'd. This is useful because I can fix code for errant data after the fact. Accept the garbage in, sort through it for what we need later (1 to 5 minutes is real time in this world).
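The write-now, parse-later split might look something like the following Python sketch. The function names, the spool directory layout, and the `.part` / `.done` suffixes are all made up for this example; `process_spool` stands in for whatever the cron'd or triggered job actually does.

```python
import os
import tempfile
import uuid

def spool_upload(data, spool_dir):
    """Write raw upload bytes to disk untouched and return the final path."""
    os.makedirs(spool_dir, exist_ok=True)
    final_path = os.path.join(spool_dir, uuid.uuid4().hex + ".xml")
    fd, tmp_path = tempfile.mkstemp(dir=spool_dir, suffix=".part")
    with os.fdopen(fd, "wb") as out:
        out.write(data)
    # Rename only once the file is complete, so the worker never sees half a file.
    os.replace(tmp_path, final_path)
    return final_path

def process_spool(spool_dir, handle):
    """Run handle(path) on each spooled file; keep the file if handling fails."""
    done = []
    for name in sorted(os.listdir(spool_dir)):
        if not name.endswith(".xml"):
            continue
        path = os.path.join(spool_dir, name)
        try:
            handle(path)
        except Exception:
            continue  # leave it on disk; fix the code and the next run retries it
        os.rename(path, path + ".done")
        done.append(name)
    return done
```

The API side only calls `spool_upload`, so it never fails on bad data. Files that the handler chokes on stay in the spool directory until the code is fixed and the job runs again.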