Dan rants: What are archives for

In response to Jorn's post to the egroups weblogs mailing list

On Mon, 6 Dec 1999, Jorn Barger wrote:
> An average machine upgrade already includes a 4+ gig drive, I believe,
> so adding a gig of storage a year at this point is trivial. And it will
> soon be economical to harvest not just the links from weblog archives,
> but download the full articles themselves... if you were going to cache
> your favorite five gigs of the Web, what all would you go for?

The reason I don't let "wget" (or Trudger or what-have-you) loose on more sites is that I don't know what to do with 'em when I've got 'em; how to predict what'll be useful or nostalgic in a few years. Heck, I'm having trouble figuring out what to do with the megabyte of text describing the sites I've run into over almost two years that I thought were noteworthy.

To go back to that old standby, hardcopy, I've got the results of (it turns out) two decades of information accumulation in my chosen vocation. I've shelves with years of "Computer Language" and "Dr. Dobb's Journal", and a few file cabinets with articles clipped and filed by subject from old Apple magazines.

And probably over 100 shelf feet of books.

A random folder pull yields articles on Conway's "Life" from the December 1978 and July 1981 issues of "Byte" magazine (I was 10 and 13, respectively). Another yields a copy of Raima's db_FILE software that I bought back in 1989.

I live in California; the space to store all this stuff costs me. And this is just the stuff that's made the cut of numerous moves and purges: I no longer have the KIM-1 manuals, or the first 5 years of "AI" magazine or 3 of "Wired", or 99% of the SciFi novels that have graced my shelves over the years.

There was a time, when my collection was smaller, when I knew where to look for stuff. I could go back through the archives occasionally, reacquaint myself with what was where so that when I thought about finding data my library was the first thing that came to mind.

That's no longer the case.

I don't know what the hell I'd do with a half a gig a year of browser cache until I develop some tools to do better with it than the search engines are currently doing with the 'net in general.

I've got a lot more lessons to learn from the data that I culled through this weekend, but I see a few things gelling:

It's not that I seek the non-linear in experiencing text, it's that I seek the non-linear in writing it. Far more interesting than a month of notes is a slice through years: all the articles that have come from one site, all the articles on gender identities, or all the articles and rants on web design. The selection and classification process is going to involve a mix of tools and a lot of human judgement, but if I can find a way to put more of that classification in *inadvertently* on the front end I'll be better off.
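
A minimal sketch of that kind of slice through years, assuming each archived note has been stored with a date, a link, and some text. The Entry record and both function names are hypothetical illustrations, not the tool described in this rant, and plain substring matching stands in for whatever classification would really be needed.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from urllib.parse import urlparse


@dataclass
class Entry:
    """One archived note (hypothetical layout): date, link, and note text."""
    when: date
    url: str
    text: str


def slice_by_site(entries):
    """Group entries by the host name of their link, oldest first per site."""
    by_site = defaultdict(list)
    for entry in sorted(entries, key=lambda e: e.when):
        by_site[urlparse(entry.url).netloc.lower()].append(entry)
    return dict(by_site)


def slice_by_topic(entries, keywords):
    """Return entries whose text mentions any keyword, oldest first."""
    wanted = [k.lower() for k in keywords]
    hits = [e for e in entries if any(k in e.text.lower() for k in wanted)]
    return sorted(hits, key=lambda e: e.when)
```

Keyword matching this crude will miss most of what human judgement catches; it only shows the shape of cutting the same pile of notes by site or by subject instead of by month.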

That inadvertent part is important. Two years ago I didn't sit down to say "I'm gonna build an index to the cool parts of the web"; today I'm looking at it and saying "Hey, I've got a hell of a start on some indexing and search data that may be useful", and I've done it largely by just having a tool grab and archive the notes I was sending to friends.

But the other thing that leaps out at me is how many of those links I don't miss. Many of them were superseded by better information, better writing. Jokes that were funny the first time around were funny the first time around. It's not like I have to keep anything, because there'll always be a better take coming down the pike.

Tools that learn from me are important. Russ Allbery was prescient in his rant about the death of Usenet: it is going to be just a bunch of machines talking to each other, but if we embrace that and tune our tools, that might not be a bad thing. Everyone should have a personal search engine that knows their browsing history and learns from their notes, but doesn't stop there.

But I've got no clear answers yet. I've got just over 2000 entries culled from probably half again that many, and 137 rants and who knows how many additional e-mail messages (my sent-mail spool for the past year is 6.5 meg), that I need to build some odd-angled indices for and chop in various different ways to get a feel for what's there, in the hopes that I can find some patterns.

Sorry if this isn't all that coherent. I configured a server for a new company today, and a few hours heads down in BIND and Apache and the minutiae of package management, builds on machines with limited resources, and Unix administration does nothing for my lucidity.


(Once again being hit by the irony of the concept of alt.sysadmin.recovery...)

Monday, December 6th, 1999 danlyke@flutterby.com