HTML and regex
2013-10-09 18:29:47.419539+02 by
Dan Lyke
5 comments
Sigh: Saw the "don't parse HTML/XML with regex" question again: http://stackoverflow.com/quest...except-xhtml-self-contained-tags
The first answer was cute, but I was reminded that anyone who thinks you can parse HTML or XML without doing some serious regex fix-up first has never actually used any XML or HTML in the wild, where the parser's answer is almost always "this isn't valid XML and I can't deal with it, sorry!". Those people can just ST-right-the-FU.
Comments in ascending chronological order:
#Comment Re: HTML Parsing made: 2013-10-09 18:59:16.241933+02 by:
Jack William Bell
I once had a six week project to write a custom HTML parser/DOM in C#, meant to run on ASP. I got it
working with support for all the elements and most of the ways people could screw up HTML.
It was reasonably fast, but it used a LOT of memory because I wrote it as a recursive descent parser. I coded
the basic parser and DOM in a week. The other five weeks were fixing it with special cases until it
supported all the unit tests. Tail recursion to the rescue!
#Comment Re: made: 2013-10-09 20:02:16.293182+02 by:
Dan Lyke
I used to worry about memory consumption, but now I run "top", hit "M", and see that Firefox is using 669m with 3 tabs open.
There are platforms where memory matters, but anything where C# is a reasonable development option probably isn't one of 'em.
#Comment Re: made: 2013-10-09 22:04:48.885032+02 by:
Jack William Bell
Sure, but the parser was supposed to run on a server. One of the other devs complained about the memory
usage and claimed he could code up something with Regex that would do the same thing but use a lot less
memory.
The lead dev (who was a young guy) was smart enough to say "No you can't."
#Comment Re: made: 2013-10-09 23:56:34.979492+02 by:
Dan Lyke
The Flutterby system parser uses regexes, but in a way that's suspiciously parser-like. On the other hand I've tried to rewrite it in Flex and got caught in a maze of twisty little state machines, all weird.
But even on a server, 6 engineer weeks will buy a shload of RAM.
#Comment Re: made: 2013-10-10 00:25:18.687246+02 by:
meuon
I've been flamed by some engineers at metering companies for the way I parse their big fat XML files. Then I delete a few million records and re-suck in their XML for it. My limit is how fast I can insert into the SQL server.. ;)
The trick I found is to only load as XML a chunk at a time.
Technically not correct, because I am not looking at the entire XML file at the same time, but it sure works well.
Example: 50k records all chunked by meter#. Parse the XML file in chunks from <meter to </meter>, load each chunk up as XML (usually with SimpleXML in PHP),
and parse it. Repeat until done.
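For anyone who wants to try the chunking trick, here is a rough sketch in Python rather than the PHP/SimpleXML version described above. It is not the original code: the function name `iter_meter_chunks`, the block size, and the assumption that records are non-nested elements literally named `meter` are all made up for illustration.

```python
import re
import xml.etree.ElementTree as ET
from typing import Iterator, TextIO

# Assumption: each record is a non-nested <meter ...>...</meter> element.
# "[\s>]" keeps a wrapper such as <meters> from matching as a record.
METER_RE = re.compile(r"<meter[\s>].*?</meter>", re.DOTALL)


def iter_meter_chunks(stream: TextIO, block_size: int = 65536) -> Iterator[ET.Element]:
    """Yield each <meter> record as a parsed element, one chunk at a time.

    The file as a whole is never handed to the XML parser, so it does not
    have to be well-formed outside the individual records.
    """
    buffer = ""
    while True:
        block = stream.read(block_size)
        buffer += block
        last_end = 0
        for match in METER_RE.finditer(buffer):
            last_end = match.end()
            try:
                yield ET.fromstring(match.group(0))
            except ET.ParseError:
                # A broken record costs only that record, not the whole file.
                continue
        # Keep the unconsumed tail; it may hold the start of the next record.
        buffer = buffer[last_end:]
        if not block:  # end of input
            break
```

Each yielded element can then be turned into a row and inserted, so memory use tracks the size of one record plus one read block instead of the size of the file.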
I've also learned to not try to do it when they upload a file via an API. Just write that input to disk and parse it with a completely different process, usually triggered or just cron'd. This is useful because I can fix code for errant data after the fact. Accept the garbage in, sort through it for what we need later (1 to 5 minutes is real time in this world).
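The write-it-to-disk-first approach can be sketched roughly like this in Python. Everything here is hypothetical: the `incoming`/`done`/`failed` directory layout, the function names, and the `handler` callback are assumptions for illustration, not a description of the actual system.

```python
import os
import uuid
from pathlib import Path
from typing import Callable


def spool_upload(payload: bytes, spool_dir: Path) -> Path:
    """Called from the upload endpoint: save the bytes untouched, parse nothing."""
    incoming = spool_dir / "incoming"
    incoming.mkdir(parents=True, exist_ok=True)
    name = uuid.uuid4().hex + ".xml"
    tmp_path = incoming / (name + ".part")
    tmp_path.write_bytes(payload)
    final_path = incoming / name
    # Rename last so the worker never picks up a half-written file.
    os.replace(tmp_path, final_path)
    return final_path


def process_spool(spool_dir: Path, handler: Callable[[Path], None]) -> None:
    """Called from cron or a trigger: work through whatever has accumulated."""
    incoming = spool_dir / "incoming"
    done = spool_dir / "done"
    failed = spool_dir / "failed"
    for directory in (incoming, done, failed):
        directory.mkdir(parents=True, exist_ok=True)
    for path in sorted(incoming.glob("*.xml")):
        try:
            handler(path)
        except Exception:
            # Keep the raw file so it can be replayed once the code is fixed.
            os.replace(path, failed / path.name)
        else:
            os.replace(path, done / path.name)
```

Moving a file from `failed` back to `incoming` after a code fix is all it takes to replay it on the next run.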