Flutterby™! : What to do with the archives?


What to do with the archives?

2012-05-24 14:07:06.503643+00 by Dan Lyke 8 comments

A spammer who's trying a little harder sent me an email last night about how one of my archive links was busted, and would I like to replace it with a link to their content farm. I went back to look at the archive entry, and apparently Clay Shirky is breaking the web. Which would be fine if it were one person, but this will probably be the 15568th entry on Flutterby (ignoring that when I went to a database backed blog I coalesced a whole bunch of "hey, these guys updated this week" updates), and a whole lot of that back catalog is now nothing but noise, contributing to the link spam, and not really contributing much.

Back when Flutterby started, I had this naive notion that by pointing out things of value on the web I was adding to the whole, creating a better place. Now, I think I'm just helping the link and content farm creators as they shovel yet more bullshit on us.

There are possible technological hold-overs I could apply to this. I've thought several times of trying to go back and categorize links and do some sort of trick on rendering where I kill links to the spammers and farmers and such, but...

I think the web is different now. Everyone's off to Facebook and Twitter and Metafilter and not updating their own site any more. We're back to more walled gardens rather than a vast interconnected knowledgebase.

I don't think the archives of Flutterby are adding value, but I'm not sure how to move forward. Should I hide everything over a year except to logged-in users? Remove them entirely? Attempt some heavier content management? Any suggestions?

[ related topics: Interactive Drama Content Management Weblogs Spam Monty Python Graphics Databases Archival Gardening ]

comments in descending chronological order (reverse):

#Comment Re: made: 2012-05-25 16:19:48.984397+00 by: Dan Lyke

The comments are shut off after some period (I forget what it is), unless you're already logged in, in which case they're there indefinitely. Comment spam really hasn't been a problem. There have been a few questionable cases that I've left (given some of the things that have shown up in "links to participants", I think there's a bit of an incentive that isn't just community), and I occasionally catch and delete some with prejudice (we do have 5000 or so registered users, I think...), but overall the content people add to this site is less of a concern for me than the degradation of other sites.

And trying to classify what's gone, especially when it doesn't 404 but redirects to a link farm, is hard to do programmatically.

The topic system gives something like WordPress's tags, although in reverse chronological order. The problem is that I got out of the habit of editing them, and nobody else ever did.

#Comment Re: made: 2012-05-25 15:23:44.864+00 by: mkelley

I've kept my blog archive, but comments shut off after 30 days of the original post.

#Comment Re: made: 2012-05-25 11:49:26.622974+00 by: DaveP

Dan, one of the things I keep meaning to do for my own archives is walk them and either flag links as dead (and maybe try to point to archive.org) or give some indication that the link still works.

There's also something to be said for the WordPress style "show me all the entries with the tag 'Foo' in chronological order." I'd be a lot more interested in that for myself if I'd been better about tagging things over the years.

A simple thing for you would be to simply close comments on old posts (or close the content entirely) for non-users. I see a lot of sites (MeFi for example), where old threads are completely read-only.

#Comment Re: made: 2012-05-24 23:42:13.180948+00 by: spc476

Dan, it's nice to know you'll keep the current system. I've never been tempted to switch to another CMS myself, if only because the current crop of CMS software appears to have so many features they interact in odd ways to cause a continuous stream of security holes.

On the archives, one feature I've been threatening to add to my blogging engine is an automatic forward linking mechanism---when I link to a previous entry, that previous entry will have a link to the current entry automatically generated. I haven't added that yet because the thought of having to crawl my own archive to update everything is daunting. But I suppose I could just add the feature and forget about updating everything in the past ...
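A minimal sketch of the forward-linking idea, assuming each entry is known only by an id and the list of earlier entry ids it links to. The function name and the data shape are hypothetical, not taken from any real blogging engine.

```python
from collections import defaultdict


def build_forward_links(entries):
    """Invert the "links back to" relation.

    entries: mapping of entry id -> iterable of entry ids that entry links to.
    Returns a mapping of entry id -> sorted list of entries that link to it.
    """
    forward = defaultdict(set)
    for entry_id, linked_ids in entries.items():
        for target in linked_ids:
            if target != entry_id:  # ignore an entry linking to itself
                forward[target].add(entry_id)
    return {target: sorted(sources) for target, sources in forward.items()}


if __name__ == "__main__":
    # Entry 3 links back to 1 and 2; entry 2 links back to 1.
    print(build_forward_links({1: [], 2: [1], 3: [1, 2]}))
```

Running this over the whole archive is the daunting crawl; calling it only on entries written from now on is the "add the feature and forget about the past" option.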

I would just keep the "nofollow" for links outside flutterby. Google will honor it, and the spammers are optimizing for Google anyway.

#Comment Re: made: 2012-05-24 23:20:29.427173+00 by: Dan Lyke

Oh, and, on "...in much the same way Zork is great", I keep thinking about a re-design, and then think "but... that's not Flutterby...".

However, every time I look at a replacement system, I come down to a few things: All the other platforms I've found suffer from both the "big monoculture = big target" problem, and a set of security holes that I accounted for when I first wrote this system a decade ago. And the CMS I'm using for Flutterby.net uses this formatting engine because every other one I've looked at does things that just seem wrong.


#Comment Re: made: 2012-05-24 20:48:58.242345+00 by: Dan Lyke

Thanks for pointing out the obvious, Ben.

Okay, old links (> 1 year) and comments are now rel="nofollow". At some point I'd like to do something smarter, but that was a 20 minute fix. Now at some point I need to make sure that I'm doing intelligent things with the headers so Google sees that they've been changed. Since I used the "updated" field which I send in the header, maybe I need to muck with etags or something?
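For illustration, here is a rough sketch of that kind of fix in Python. It is not the code Flutterby actually runs. The function name, the `site_host` parameter, and the regex approach are assumptions, and a real renderer would probably want a proper HTML parser.

```python
import re
from datetime import datetime, timedelta
from urllib.parse import urlparse

# Assumed cutoff: links in entries older than this get rel="nofollow".
NOFOLLOW_AGE = timedelta(days=365)

# Deliberately simple patterns; this is not a full HTML parser.
_A_TAG = re.compile(r'<a\b([^>]*)>', re.IGNORECASE)
_HREF = re.compile(r'href\s*=\s*"([^"]*)"', re.IGNORECASE)
_REL = re.compile(r'\brel\s*=', re.IGNORECASE)


def add_nofollow(html, posted, now, site_host):
    """Return html with rel="nofollow" added to off-site links in old entries."""
    if now - posted <= NOFOLLOW_AGE:
        return html  # recent entry: leave links alone

    def fix(match):
        attrs = match.group(1)
        href = _HREF.search(attrs)
        if href is None or _REL.search(attrs):
            return match.group(0)  # not a link, or rel already present
        host = urlparse(href.group(1)).netloc
        if host == "" or host == site_host:
            return match.group(0)  # relative or same-site link
        return '<a' + attrs + ' rel="nofollow">'

    return _A_TAG.sub(fix, html)


if __name__ == "__main__":
    sample = '<a href="http://example.com/x">x</a> <a href="/archives/1">y</a>'
    print(add_nofollow(sample, datetime(2010, 1, 1), datetime(2012, 5, 24),
                       "www.example.org"))
```

Because the decision depends on the current date, the rendered page changes when an entry crosses the cutoff even though nothing was edited, which is why the cache headers need a second look.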

The "smarter" involves trying to track which links are still good and doing an alternate link to a "this is the history, here's where you might find it in archive.org, ..." page for the links which aren't. This probably involves caching the HTML, because re-validating links on the fly is kinda heavyweight, and then I have to build a bot or some mechanism to go back and validate a couple of tens of thousands of URLs.
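A sketch of the bookkeeping half of such a bot, in Python, with the network fetch left out so it can run offline. The names are hypothetical. Treating a redirect to a different host as "suspect" is only an assumed heuristic, since plenty of legitimate sites move domains. The Wayback Machine URL form used here redirects to the capture nearest the given timestamp.

```python
from datetime import datetime
from urllib.parse import urlparse


def _host(url):
    """Lower-cased host name with any leading 'www.' dropped."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def classify_link(original_url, status, final_url):
    """Classify one fetch result as 'ok', 'dead', or 'suspect'.

    status is the final HTTP status code, or None if the fetch failed.
    final_url is where the request ended up after following redirects.
    """
    if status is None or status >= 400:
        return "dead"
    # Assumed heuristic: landing on another host may mean a parked domain
    # or link farm, so flag it for a human to look at.
    if _host(final_url) != _host(original_url):
        return "suspect"
    return "ok"


def wayback_url(url, posted):
    """Build an archive.org URL for the capture nearest the entry's date."""
    stamp = posted.strftime("%Y%m%d%H%M%S")
    return "https://web.archive.org/web/{}/{}".format(stamp, url)


if __name__ == "__main__":
    print(classify_link("http://example.com/a", 200, "http://ads.example.net/"))
    print(wayback_url("http://example.com/a", datetime(2012, 5, 24, 14, 7, 6)))
```

Storing the classification next to each URL would let the renderer pick the original link or the history page without re-validating on the fly.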

Eric, let's talk a bit more about threading through the archives. I have some architecture issues with "forward" and "back" buttons (that, admittedly, could be fixed). I'm also interested in your ideas on referrer linking; I've had both Trackbacks and Referer(sic) logging implemented, but the spam on those was so heavy I couldn't keep up with it. Maybe I need a "next/prev by date" and "next/prev by topic", and then see if I can figure out which ones we haven't cleaned up the topics on...

I'll also see about tweaking the search parameters (and maybe actually indexing that search at some point!)...

#Comment Re: made: 2012-05-24 17:29:11.480185+00 by: Ben Williams

Why not just make old links rel="nofollow"? The content farms probably aren't benefitting so much from the traffic you are sending them as they are from the Google juice they get from the link.

#Comment Re: made: 2012-05-24 16:25:15.303191+00 by: ebradway

scratching

I guess the deeper question is how much of Flutterby is the community and how much is the technology?

The archives add to the community, but, to be frank, the way the archives are implemented makes it harder than necessary to realize the benefit. I'd love to be able to "thread" through the archives in posts. I would also love to be able to refer link into the archives from Facebook and elsewhere.

Yes, I know if I can get the URL for the entry in the archive, these things are (relatively) easy. By (relatively), I mean that using HTML HREF tags in plain text is nicely consistent with the rest of the web but is circa 1996 content management design. And searching the archives has never been great. For instance, using the search for "television" results in an arbitrarily sorted list giving the entry number and title but not the date, who started the thread, etc. Using Google for the same search is a little better but Google positions hits on the keyword in the comments the same as the keyword showing up in the title.
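To make the search complaint concrete, here is a small Python sketch of the ordering being asked for: title hits ahead of comment-only hits, newest first within each group, with the date and thread starter in the result. The entry fields and the function name are made up for the example and say nothing about how Flutterby stores entries.

```python
def search(entries, keyword):
    """Rank entries whose title or comments contain keyword.

    entries: list of dicts with 'id', 'title', 'date' (ISO string),
    'author', and 'comments' (list of strings).
    Returns (date, author, id, title) tuples.
    """
    kw = keyword.lower()
    hits = []
    for entry in entries:
        in_title = kw in entry["title"].lower()
        in_comments = any(kw in c.lower() for c in entry["comments"])
        if in_title or in_comments:
            hits.append((0 if in_title else 1, entry))
    # Two stable sorts: newest first, then title hits ahead of comment hits.
    hits.sort(key=lambda hit: hit[1]["date"], reverse=True)
    hits.sort(key=lambda hit: hit[0])
    return [(e["date"], e["author"], e["id"], e["title"]) for _, e in hits]
```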

A stop-gap for the spam problem may be to kill all links in content over a year old for non-users. Links would be restored when logged in. You could replace all links in old content with a link to a login page on Flutterby. This should stem the spammer/farmer issues while preserving the old content.

But I'd love to see some modern CMS features. I wonder if it makes sense just to port the content to an off-the-shelf CMS. Flutterby is great... in much the same way Zork is great.