Flutterby™! : Extracting semantics


Extracting semantics

2004-02-25 00:31:32.357793+01 by Dan Lyke 0 comments

One of the problems with having domain names lying around is that you have to figure out something to do with 'em. I've realized recently that I haven't linked to anything at Clean Sheets or Scarlet Letters in ages. I find Nerve completely uninteresting in that snarky New York City "It's more interesting to talk about the thing than to do the thing" way. And Spectator used to have some pretty good articles and calendars, especially surprising given that I find little that their advertisers pitched interesting, but that site was uneven even when it was updated. Sexuality.org has David Steinberg if you're not on his email list, but it seems like there's something missing. Exalte.com sent me the politest link exchange request I've ever gotten, enough that I'll actually link to it here despite the fact that my policy is to automatically ditch any "link exchange" requests, but I still find something missing in the "sex magazine" space.

So anyway, I've got a domain name; now all I need to do is fill it with cool stuff. I'm in catch-22 space: I feel reluctant to print up cards and say "Hi! Let me interview and take pictures of you!" until I have something to show, but, of course, to get the content I'd have to... One of the things I want is a really cool events calendar. I'm on a gazillion mailing lists, I have a folder full of calendar links, but... I'm a software geek, surely there's a way to automate most of this process?

So yesterday on the ferry I wrote a little script that runs through a block of raw text and does a pretty good job of extracting semantic information from it. I'm a good way through extracting show dates and times, performers' names, and venue information including addresses (and finishing that is going to be no mean feat, given that burlesque performers in particular seem to think that nothing exists beyond their home borough or city).
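To give a flavor of the pattern-matching stage, here's a minimal sketch in Python, not the actual script from the ferry. The regular expressions, the `extract_events` name, and the sample listing are all made up for illustration, and it only handles "Month day" dates and 12-hour clock times; performers and venues are a much harder problem.

```python
import re

# Hypothetical patterns: month names (full or abbreviated), a day number,
# an optional ordinal suffix, and an optional four-digit year.
MONTHS = (
    r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|"
    r"Jul(?:y)?|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|"
    r"Nov(?:ember)?|Dec(?:ember)?)"
)
DATE_RE = re.compile(
    rf"\b{MONTHS}\.?\s+\d{{1,2}}(?:st|nd|rd|th)?(?:,\s*\d{{4}})?",
    re.IGNORECASE,
)
# 12-hour clock times such as "9pm", "10:30 PM", or "8 a.m."
TIME_RE = re.compile(
    r"\b(?:1[0-2]|0?[1-9])(?::[0-5]\d)?\s*(?:am|pm|a\.m\.|p\.m\.)",
    re.IGNORECASE,
)


def extract_events(text):
    """Return the date-like and time-like substrings found in text."""
    return {
        "dates": [m.group(0) for m in DATE_RE.finditer(text)],
        "times": [m.group(0) for m in TIME_RE.finditer(text)],
    }


# Made-up listing, not from any real mailing list.
sample = "Burlesque night at the Example Lounge, March 5th, 2004, doors 9pm, show at 10:30 PM."
print(extract_events(sample))
```

This is the easy part: the patterns are local, and each match stands on its own. It says nothing about which time goes with which date, which is where the trouble starts.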

But I'm at the point now where I'm dealing with ambiguity. It's not bad if I'm just looking for patterns in the text, but as soon as I start to actually quantify meaning in phrases (not just words or formatting), the problem becomes one of chasing several possible readings at once, and this quickly explodes in some awful combinatorial ways, especially since English uses "." for both sentence endings and abbreviations, uses capitalization for both sentence starts and proper names (if I'm lucky), and people do things like use "!" in the middle of names. The code complexity, let alone the compute cost, starts to explode. I think it's time to go back into the natural language literature and get a refresher on where things are at. Anyone got a suggestion on an overview of the state of the art?
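Here's a rough Python sketch of why this blows up, using only the sentence-boundary problem. Everything in it is hypothetical: the abbreviation list, the function names, and the "Wham! Bam" act are invented for the example. Every ".", "!" or "?" followed by whitespace is treated as a point that might or might not end a sentence, so n such points give 2^n candidate readings. The `likely_reading` function is one crude way to prune: skip known abbreviations and only cut when the next word is capitalized.

```python
import re
from itertools import product

# Hypothetical abbreviation list; a real one would be far longer.
KNOWN_ABBREVIATIONS = {"st", "ave", "dr", "mr", "ms", "vs"}


def ambiguous_points(text):
    """Indexes of '.', '!' or '?' followed by whitespace."""
    return [m.start() for m in re.finditer(r"[.!?](?=\s)", text)]


def split_at(text, cuts):
    """Split text at the given cut offsets, dropping empty pieces."""
    pieces, start = [], 0
    for cut in cuts:
        pieces.append(text[start:cut].strip())
        start = cut
    tail = text[start:].strip()
    if tail:
        pieces.append(tail)
    return pieces


def all_readings(text):
    """Yield every possible segmentation: 2**n for n ambiguous points."""
    points = ambiguous_points(text)
    for choice in product([False, True], repeat=len(points)):
        cuts = [p + 1 for p, is_end in zip(points, choice) if is_end]
        yield split_at(text, cuts)


def likely_reading(text):
    """Heuristic: cut unless the preceding word is a known abbreviation
    or the following word is not capitalized."""
    cuts = []
    for p in ambiguous_points(text):
        before = re.search(r"(\w+)$", text[:p])
        next_char = text[p + 1:].lstrip()[:1]
        is_abbrev = before is not None and before.group(1).lower() in KNOWN_ABBREVIATIONS
        if not is_abbrev and next_char.isupper():
            cuts.append(p + 1)
    return split_at(text, cuts)


listing = "Doors open at 123 Main St. on Friday. Wham! Bam plays at 9."
print(len(list(all_readings(listing))))  # 3 ambiguous points -> 8 readings
print(likely_reading(listing))
```

The heuristic gets "St." right but still splits "Wham! Bam" in two, which is exactly the "!" in the middle of a name problem. Fixing that means knowing something about performer names, and then the sentence splitter and the name extractor depend on each other.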

[ related topics: Photography Sexual Culture Dan's Life Invention and Design Software Engineering Consumerism and advertising Art & Culture New York Burlesque ]
