Flutterby™! : Optimizing Python 2012-04-17 18:48:02.082067+00

Optimizing Python

2012-04-17 18:48:02.082067+00 by ebradway 10 comments

I've been writing more and more Python code lately. Being an old C hack, I find Python to be a lot of fun. I can write essentially the same code with fewer syntactical errors due to the over-use of punctuation in C. But Python isn't C and my resulting code tends to be pretty slow. For instance, I recently wrote a function to return the next occurrence of a character in a string that isn't inside double-quotes. In C, it's pretty straight forward:

int findUnquoted(char * s, char f) {
  int q = 0;  // Means "not inside quotes"
  
  for (int i = 0; i < len(s); i++) {
     if (!q && *s == f) return(i);
     
     if (*s == "\"") q = (q + 1) % 2;

     s++;
  }

  return(-1);
}

This is pretty snappy because of the way C let's you work with memory pointers. Of course, this is the source of the worst kind of C bugs. This code assumes 's' points to a valid, nul-terminated string. If it doesn't, this could easily become an infinite loop. It also assumes that len(s) is less than MAX_INT, which is pretty safe on most modern systems. The code is parsing XML and you're not likely to be parsing really long XML files on an 8-bit processor. My initial Python version looked very similar:

def findUnquoted(s, f):
  q = 0
  for i in range(0,length(s)):
    if not q and s[i] == f:
      return(i)

    if s[i] == '"':
      q = (q + 1) % 2

  return(-1)

But this was really, really slow. What's much faster is to rely on the string functions in Python (which are implemented in C):

def findUnquoted(s, f):
  p = s.find(f)

  if p < 0:
    return(p)    # Not found

  q = s[:p].count('"')

  if not qc % 2:
    return(p)    # Found with even number of quotes preceding

  # Special case of being inside quotes
  nq = s[p:].find('"')

  p = p + nq + 1 + findUnquoted(s[p + nq + 1:], f)
  
  return(p)

Which is much, much faster but still has some poorly handled conditions, like if f is found inside quotes but never outside. But I am parsing XML files and looking for the closing '>' so I don't expect that condition to ever occur in valid XML. What is interesting is having to think of the problem on a higher level of abstraction. In C, it's a matter of walking through the characters in the string and monitoring my state. In Python, I am considering what the quotes mean relative to the search. In fact, I probably wouldn't have used the modulus operation in C if I hadn't done this Python version first. This is a fairly subtle example of how Python encourages higher abstractions. But when you combine the higher abstractions with the efficiencies inherent in an interpreted, dynamically-typed language, you start to experience quite noticeable boosts in productivity. And carefully written Python need not be much slower than good C code. Next on the to-do list is to play with map() and list comprehensions. }

I'm finding Lua with LPEG (especially the re module) to be a nice combination for these type of parsing problems. It only took 38 lines of Lua/LPEG to parse email Received headers (per the RFC spec) into an associative array (Received fields all broken out) and the majority of that is BNF-esque production rules.

The problem with XML is that it's too successful. It has all the foundations of a Great Idea (tm) but in practice, people tend to forget what a PITA it can be.

I will say it's better than SDTS which has exactly the opposite problem - no one uses isn't forced to under contract. It originated from the same bureaucratic thinking that said all DOD software must be written in Ada.

#Comment Re: made: 2012-04-18 08:40:03.476042+00 by: meuon [edit history]

<reasonstohatexml> <0>It is fat</0> <1>People do stupid things with it</1> <2>This is not valid XML</2> </reasonstohatexml>

Just in case you did not see the issue: The numbers as tag names are usually invalid.

#Comment Re: made: 2012-04-18 04:58:16.048209+00 by: ebradway [edit history]

+1 to a SAX parser -- this is basically the primary (entire?) use case for them.

#Comment Re: made: 2012-04-17 21:52:07.227408+00 by: Dan Lyke [edit history]

I always get confused about the names of the bits inside XML files. I believe I'm trying to get at the attribute contents. Of course, the file probably shouldn't be XML to begin with. There are only four distinct element types (names?) in the entire file (nodes, ways, relations, changesets) but each element has an arbitrary number of tag elements.

The file is big enough that you can't load the entire thing into a DOM. It's also big enough that I really don't want to just load it into a database. I want to be able to pick what I want out of the file.

I believe it's in XML because the "normal" way of getting at this data is via a REST API that returns XML. But the "normal" way only returns a limited amount of data.

So you're trying to grep the attribute and element names of an XML file? Or the attribute contents?

The XML purists will say "don't parse XML with regular expressions", and as long as you have the luxury of being able to reject non-XML files, rather than having to fix them, I tend to agree...

I thought about doing something similar with split in Python. One hurdle I've never gotten over in Perl is "trusting" the grep() command.

What I really want is the text between < and > in an XML file but properly handling quotes. I'm doing this from an arbitrary buffer into a file that doesn't allow random access may or may not contain line breaks. Yes, I have a 500GB XML file with no line breaks - fun! Since I usually have two or three copies of this file hanging around, I keep them zipped up. So my parser handles plain text, gziped or B2zipped files automatically.

What I am doing is trying to parse out objects inside the XML and then pass objects one-by-one to the DOM parser. I am tempted to drop the (overhead of the) DOM parser altogether because there are only five object types in the entire file. I actually have an older version of the parser that doesn't use the DOM parser but I wanted my function to return data structures that look identical to what the OsmApi module returns after parsing the XML returned by OSM. At this point, if I write a new parser that doesn't use the DOM, I can reuse it in OsmApi and get some performance gain there.

Maybe I need to get my code updated in GitHub instead of BSing about optimizations on blogs...

Python opened my eyes to just how cool lambdas and closures are. I was using them in Perl, kinda, but Python made me think about objects in C++ and C# (and, to a lesser extent, Perl) in a different way, where the object orientation was no longer about the objects themselves, but about operations.

And, yeah, it's not like it wasn't there, but I now go out of my way to avoid 'for' and 'while' when operating on lists, trying to think in terms of operations that happen on entire lists rather than elements of lists (even though it is something that happens to lists). So I might end up thinking of something like this as: