Flutterby™! : HTML and regex


HTML and regex

2013-10-09 18:29:47.419539+02 by Dan Lyke 5 comments

Sigh: Saw the "don't parse HTML/XML with regex" question again: http://stackoverflow.com/quest...except-xhtml-self-contained-tags

The first answer was cute, but it reminded me that anyone who thinks you can parse HTML or XML without doing some serious regex fix-up first has never actually dealt with XML or HTML in the wild, where the parser's answer is almost always "this isn't valid XML and I can't deal with it, sorry!" Those people can just ST-right-the-FU.
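As one illustration of the kind of fix-up meant here, the sketch below escapes bare ampersands, a common reason a strict XML parser rejects real-world input, and only then hands the text to the parser. The function names and the sample document are hypothetical, not anything from the Flutterby code, and this handles just one class of breakage.

```python
import re
import xml.etree.ElementTree as ET

# Match '&' that does not start a named, decimal, or hex character reference.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)")


def fix_bare_ampersands(text: str) -> str:
    """Escape stray ampersands so a strict XML parser will accept them."""
    return BARE_AMP.sub("&amp;", text)


def parse_lenient(text: str) -> ET.Element:
    """Try a strict parse first; on failure, apply the regex fix-up and retry."""
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        return ET.fromstring(fix_bare_ampersands(text))


# Hypothetical input: the first ampersand is not valid XML, the second is.
doc = "<note><to>Tom & Jerry</to><from>A &amp; B</from></note>"
root = parse_lenient(doc)
print(root.find("to").text)
```

Named entities the parser does not know (such as `&nbsp;` in plain XML) would still fail after this pass, so a real fix-up stage usually grows more rules over time.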



#Comment Re: HTML Parsing made: 2013-10-09 18:59:16.241933+02 by: Jack William Bell

I once had a six week project to write a custom HTML parser/DOM in C#, meant to run on ASP. I got it working with support for all the elements and most of the ways people could screw up HTML.

It was reasonably fast, but it used a LOT of memory because I wrote it as a recursive descent parser. I coded the basic parser and DOM in a week. The other five weeks were fixing it with special cases until it supported all the unit tests. Tail recursion to the rescue!

#Comment Re: made: 2013-10-09 20:02:16.293182+02 by: Dan Lyke

I used to worry about memory consumption, but now I run "top", hit "M", and see that Firefox is using 669m with 3 tabs open.

There are platforms where memory matters, but anything where C# is a reasonable development option probably isn't one of 'em.

#Comment Re: made: 2013-10-09 22:04:48.885032+02 by: Jack William Bell

Sure, but the parser was supposed to run on a server. One of the other devs complained about the memory usage and claimed he could code up something with Regex that would do the same thing but use a lot less memory.

The lead dev (who was a young guy) was smart enough to say "No you can't."

#Comment Re: made: 2013-10-09 23:56:34.979492+02 by: Dan Lyke

The Flutterby system parser uses regexes, but in a way that's suspiciously parser-like. On the other hand I've tried to rewrite it in Flex and got caught in a maze of twisty little state machines, all weird.

But even on a server, 6 engineer weeks will buy a shload of RAM.

#Comment Re: made: 2013-10-10 00:25:18.687246+02 by: meuon

I've been flamed by some engineers at metering companies for the way I parse their big fat XML files. Then I delete a few million records and re-suck in their XML for it. My limit is how fast I can insert into the SQL server.. ;) The trick I found is to only load as XML a chunk at a time. Technically not correct, because I am not looking at the entire XML file at the same time, but it sure works well.

Example: 50k records all chunked by meter#. Parse the XML file in chunks from <meter to /meter> and load that XML up (using simplexml in PHP usually) and parse it. Repeat until done.
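The chunk-at-a-time approach described above might look roughly like the following. It is a sketch in Python rather than PHP with simplexml, and it assumes things the comment does not state: that each record is a non-nested `<meter ...>...</meter>` element with an explicit closing tag, and that the function name, `block_size` parameter, and sample data are made up for illustration.

```python
import io
import re
import xml.etree.ElementTree as ET
from typing import IO, Iterator

# One record: from "<meter" through the first following "</meter>".
# \b keeps a wrapper element such as <meters> from matching.
METER_CHUNK = re.compile(r"<meter\b.*?</meter>", re.DOTALL)


def iter_meter_chunks(stream: IO[str], block_size: int = 65536) -> Iterator[ET.Element]:
    """Yield each <meter> record as its own small parsed element."""
    buffer = ""
    while True:
        block = stream.read(block_size)
        if not block:
            break
        buffer += block
        last_end = 0
        for match in METER_CHUNK.finditer(buffer):
            # Only this one record is ever handed to the XML parser.
            yield ET.fromstring(match.group(0))
            last_end = match.end()
        # Keep any incomplete trailing record for the next read.
        buffer = buffer[last_end:]


# Hypothetical sample data; a tiny block size forces records to span reads.
sample = io.StringIO(
    "<meters>"
    "<meter id='1'><reading>10</reading></meter>"
    "<meter id='2'><reading>20</reading></meter>"
    "</meters>"
)
for meter in iter_meter_chunks(sample, block_size=16):
    print(meter.get("id"), meter.findtext("reading"))
```

Because the whole file is never parsed as one document, a single malformed record can be caught and skipped without losing the rest, at the cost of never validating the overall structure.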

I've also learned to not try to do it when they upload a file via an API. Just write that input to disk and parse it with a completely different process, usually triggered or just cron'd. This is useful because I can fix code for errant data after the fact. Accept the garbage in, sort through it for what we need later (1 to 5 minutes is real time in this world).
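A minimal sketch of that accept-now, parse-later split, again in Python and with hypothetical names throughout (`accept_upload`, `process_spool`, the spool directory layout): the upload handler only writes bytes to disk, and a separate pass, which could be run from cron, does the parsing and leaves failed files in place so they can be retried once the code is fixed.

```python
import uuid
from pathlib import Path
from typing import Callable


def accept_upload(payload: bytes, spool_dir: Path) -> Path:
    """Called from the upload handler: store the raw bytes and do nothing else."""
    spool_dir.mkdir(parents=True, exist_ok=True)
    final_path = spool_dir / f"{uuid.uuid4().hex}.xml"
    tmp_path = final_path.with_suffix(".part")
    tmp_path.write_bytes(payload)
    # Rename last so the processing pass never sees a half-written file.
    tmp_path.rename(final_path)
    return final_path


def process_spool(spool_dir: Path, handle: Callable[[bytes], None]) -> int:
    """Run separately (e.g. from cron): parse whatever has been spooled."""
    done_dir = spool_dir / "done"
    done_dir.mkdir(parents=True, exist_ok=True)
    processed = 0
    for path in sorted(spool_dir.glob("*.xml")):
        try:
            handle(path.read_bytes())
        except Exception:
            # Leave the file where it is; it is retried on the next run.
            continue
        path.rename(done_dir / path.name)
        processed += 1
    return processed
```

The `handle` callback is where the actual parsing and SQL inserts would go; keeping it out of the upload path means bad data never causes a failed upload.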



Flutterby™ is a trademark claimed by

Dan Lyke
for the web publications at www.flutterby.com and www.flutterby.net.