Notes for an automated web log manager

On Dave Winer's discussion group (http://www.discuss.userland.com/), Jorn Barger wrote a wishlist for a web-log database, and a later followup describing a system to build a standardized database of changes that would make it easy to link to and search regularly updated articles.

My original hope with Flutterby was that it would be the site I wanted to read. I've gotten distracted from that goal, got to thinking about promoting "Dan Lyke" as a brand, and found that I wasn't writing as much as I was hoping to, so it's become a glorified bookmark site. Which is fine, but it's time to sit down and write a little code to create some tools that at least make the recurring stuff a little easier to handle.

So, here for the world to see is the sort of notes I write before I sit down with an editor and a language to build a system.

What I want is a web page delivered to me every day listing all of the interesting stuff happening, both real-life and web-wise. I might even want it to keep track of things I have and haven't looked at (e.g. comics) and keep them on the list 'til I tell them to go away, but that might be ambitious.

I need a list of URLs with instructions on how to turn each one into something useful, how often it should get "ping"ed (if it's a weekly, hit it only after it usually comes out; if it's irregular and has never come out more often than every two weeks, don't check it every day), when it was last checked, and what the results of that check were. I also need to know what sort of things I find interesting for each class of postings. That structure probably needs to be thought of separately, because it's more a "per-user, per-subject" thing than a "per-URL" thing.
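As a rough sketch of what one entry in that list might look like, here's a bit of Python. The field names (extractor, update_times and so on) and the one-day default are my own placeholders, not anything settled, and the back-off rule is just one reading of "don't check it more often than it has ever come out".

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Source:
    url: str
    extractor: str  # hypothetical label for "how to turn this into something useful"
    update_times: List[datetime] = field(default_factory=list)  # when new content was seen
    last_checked: Optional[datetime] = None
    last_result: str = ""  # whatever the last check reported


def shortest_gap(source: Source) -> Optional[timedelta]:
    """Smallest interval ever observed between two updates, if we have at least two."""
    times = sorted(source.update_times)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return min(gaps) if gaps else None


def next_check(source: Source, default: timedelta = timedelta(days=1)) -> datetime:
    """Earliest time worth pinging the URL again."""
    if source.last_checked is None:
        return datetime.min  # never checked, so it's due right away
    gap = shortest_gap(source)
    if gap is None:
        return source.last_checked + default
    # Wait until the source could plausibly have updated, then poll at the default rate.
    return max(max(source.update_times) + gap, source.last_checked + default)
```

With updates seen on January 1st, January 15th and February 1st, and a last check on February 2nd, this holds off until February 15th and then falls back to daily checks.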

Some sites are easy: Hack The Planet, Whump, and Scripting News all publish XML versions and want to make it easy to automate these sorts of things.

Others are harder: I still haven't gotten into the habit of checking Marylaine Block's Fox News articles because of how they hide them in annoying frames.

And in those cases where I'm looking for raw data (e.g. TV listings, possibly filtered events calendars, etc.), if this gets popular I'm going to be actively thwarted, because I'll be bypassing the advertisements (although I have to confess I already turn off automatic image loading, so I miss lots of ads as it is).

So I need to start with a queue of config files and a database that tells when each one needs to get checked next. I'll run through the file names and look each one up in the database. If the database doesn't have an entry for that file name, or shows it as needing to be checked, then I'll parse the file, do the check, and replace the database entry with the next time to check, based on the results.
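That loop is small enough to sketch. In this Python version the "database" is just a dictionary from file name to next-check time (something like the shelve or dbm modules could stand in for it on disk), and check is a hypothetical callback that parses the config file, does the fetch, and returns when to look again.

```python
from datetime import datetime
from typing import Callable, Dict, Iterable, List


def run_queue(
    config_names: Iterable[str],
    db: Dict[str, datetime],
    now: datetime,
    check: Callable[[str, datetime], datetime],
) -> List[str]:
    """Check every config file that is new or due; return the names that were checked."""
    checked = []
    for name in config_names:
        due = db.get(name)
        if due is not None and due > now:
            continue  # the database says this one isn't due yet
        # No entry, or due: do the check and store the next time to look.
        db[name] = check(name, now)
        checked.append(name)
    return checked
```

Passing now in, rather than reading the clock inside the loop, makes it easy to try the scheduling against made-up dates.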

I need to build rules by which I find something interesting. For starters this will probably be including and excluding keywords, or a simple regular expression search. I also need a tag saying whether I want the entry parsed some way (elements from an events calendar, for instance), or just a link and a "this is how to get the document" (in the future that document will get downloaded so I can drop it in my PalmPilot).
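A first cut at such a rule might look like the Python below. The class name, the fields, and the choice that an exclude keyword always wins are all assumptions on my part; the parse flag is the "parse it or just link to it" tag.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class InterestRule:
    include: List[str] = field(default_factory=list)  # any one of these makes it a candidate
    exclude: List[str] = field(default_factory=list)  # any one of these rejects it
    pattern: Optional[str] = None                     # optional regular expression
    parse: bool = False                               # True: parse the entry; False: just link

    def matches(self, text: str) -> bool:
        lowered = text.lower()
        if any(word.lower() in lowered for word in self.exclude):
            return False
        if self.include and not any(word.lower() in lowered for word in self.include):
            return False
        if self.pattern and not re.search(self.pattern, text, re.IGNORECASE):
            return False
        return True


rule = InterestRule(include=["juggling", "puppet"], exclude=["clown"])
print(rule.matches("Puppet festival in Oakland"))  # True
print(rule.matches("Clown and puppet show"))       # False
```

Keyword matching here is plain substring search, so "puppet" also catches "puppets"; the regular expression is there for when that's too loose.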

Random notes; if you have features or ideas, I'd like to hear them...

There are a couple of different types of sites that I track.

There's NETFUTURE, which has an entrance page that points to the updates via the link whose text is "Current Issue". This is wonderfully easy.
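Just how easy: here's a sketch in Python, using the standard library's html.parser, that pulls out the target of the first link whose text matches. The example.com address and the issue file name in the usage lines are made up for illustration.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class LinkTextFinder(HTMLParser):
    """Remember the href of the first <a> whose text matches `wanted`."""

    def __init__(self, wanted: str):
        super().__init__()
        self.wanted = wanted.strip().lower()
        self.found: Optional[str] = None
        self._href: Optional[str] = None
        self._text: list = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace so line breaks inside the link don't matter.
            text = " ".join("".join(self._text).split()).lower()
            if text == self.wanted and self.found is None:
                self.found = self._href
            self._href = None


def find_link(html: str, link_text: str, base_url: str = "") -> Optional[str]:
    finder = LinkTextFinder(link_text)
    finder.feed(html)
    finder.close()
    if finder.found is None:
        return None
    return urljoin(base_url, finder.found)


page = '<p><A HREF="1999/issue.html">Current\n Issue</A></p>'
print(find_link(page, "Current Issue", "http://example.com/netfuture/"))
```

Fetching the entrance page itself is left out; urllib.request.urlopen would be the obvious way to get the HTML.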

Alewife Bayou is similar, except that the names for the links change, and several new documents can appear in between checks.
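When the link names can't be relied on, one way to handle this is to remember every link target seen on the page and report whatever is new since the last check. A minimal Python sketch, assuming the set of previously seen hrefs is stored somewhere between runs:

```python
from html.parser import HTMLParser
from typing import List, Set


class LinkCollector(HTMLParser):
    """Collect every distinct href on a page, in document order."""

    def __init__(self):
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and href not in self.links:
                self.links.append(href)


def new_links(html: str, seen: Set[str]) -> List[str]:
    """Return the hrefs on the page that weren't there at the last check."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return [href for href in collector.links if href not in seen]
```

After reporting, the caller would add the returned links to the seen set, so several new documents appearing between checks all get picked up exactly once.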

Salon has been wasting my time recently, so what I really want from them is a list of articles that I can parse for writers I know or writers I haven't seen before, along with the direct (frameless) URLs to those documents.

My Word's Worth updates in the same place every week, but then changes name when it's put in the archive. It'd be nice to have a way to automatically find the new name and fix links when an article goes into the archive.

mouthorgan is similar except that nimue gives archive access immediately (much the way Flutterby works, except that mouthorgan is article based).

Spectator's "Up and Coming" events calendar is fairly simple to parse. I'll want to look at Laughing Squid's announcements list too.


Wednesday, March 10th, 1999 danlyke@flutterby.com