Flutterby™! : Simple OCR


Simple OCR

2009-04-22 05:22:07.003959+02 by Dan Lyke 3 comments

Okay, got some older PDF files that are scans of paper documents. Tried:

convert /home/danlyke/Desktop/fund-summaries-part-1-thru-s-68.pdf fundsummaries.pnm

gocr fundsummaries.pnm

and got some leet speak-ish text out of it, but nothing great. Anyone done this?

comments in descending chronological order (reverse):

#Comment Re: made: 2009-04-23 11:12:24.830951+02 by: DaveP

Hey, whenever there's a "let someone else do the work" solution, I'm all over it. :)

#Comment Re: made: 2009-04-22 17:02:37.649896+02 by: Dan Lyke

Aha! Thanks, Dave. I'll need to dig a bit to try to find the particular documents I'm interested in (they've already been indexed), and I did find the "Comprehensive Annual Financial Report" for the "City of pßtalurr^", but other than that the first few I pulled up do seem a bit better.

Also, it seems like I need a -density 300 -units PixelsPerInch in there somewhere, but my first pass created a 3 gig file that gocr wouldn't read, so if I go down that route there are clearly some fine-tunings that need to happen.
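One way to sidestep the single multi-gigabyte file might be to rasterize in grayscale, write one image per page, and run gocr on each page separately. A minimal sketch, assuming ImageMagick's convert and gocr are both on the PATH; the function names run and ocr_pdf, the DRY_RUN variable, and the page-NNN file naming are made up for illustration, and the settings have not been tried against these particular documents:

```shell
#!/bin/sh
# Sketch only: rasterize a scanned PDF at 300 dpi, one grayscale image per
# page, then OCR each page on its own. Names and layout are illustrative.

# Run a command, or just print it when DRY_RUN=1 (handy for checking the
# command lines before committing to a long conversion).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# ocr_pdf INPUT.pdf OUTDIR
ocr_pdf() {
    pdf=$1
    outdir=$2
    mkdir -p "$outdir"

    # -density before the input sets the rasterization resolution.
    # -colorspace Gray keeps one channel instead of three.
    # The %03d in the output name makes convert write one file per page.
    run convert -density 300 -units PixelsPerInch "$pdf" \
        -colorspace Gray "$outdir/page-%03d.pgm"

    # OCR each page image into a matching .txt file.
    for page in "$outdir"/page-*.pgm; do
        [ -e "$page" ] || continue   # no pages were produced
        run gocr -i "$page" -o "${page%.pgm}.txt"
    done
}
```

Something like `ocr_pdf fund-summaries-part-1-thru-s-68.pdf pages` would then leave one page-NNN.txt per page in pages/. For a sense of scale, assuming letter-size pages: 8.5 x 11 inches at 300 dpi is 2550 x 3300 pixels, so roughly 8.4 MB per page as 8-bit grayscale versus about 25 MB as 8-bit RGB, and no single file ever holds more than one page.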

#Comment Re: made: 2009-04-22 11:30:38.216725+02 by: DaveP

Have you tried using Google's OCR? It's not super-speedy (you have to wait for googlebot to spider your PDFs), but the quality seems to be pretty good: http://www.labnol.org/software/convert-scanned-pdf-images-to-text-with-google-ocr/5158/



Flutterby™ is a trademark claimed by Dan Lyke for the web publications at www.flutterby.com and www.flutterby.net.