Simple OCR

2009-04-22 03:22:07.003959+00 by Dan Lyke 3 comments

Okay, got some older PDF files that are scans of paper documents. Tried:

convert /home/danlyke/Desktop/fund-summaries-part-1-thru-s-68.pdf fundsummaries.pnm
gocr fundsummaries.pnm

and got some leet speak-ish text out of it, but nothing great. Anyone done this?

comments in ascending chronological order:

#Comment Re: made: 2009-04-22 09:30:38.216725+00 by: DaveP

Have you tried using Google's OCR? It's not super-speedy (you have to wait for googlebot to spider your PDFs), but the quality seems to be pretty good: http://www.labnol.org/software/convert-scanned-pdf-images-to-text-with-google-ocr/5158/

#Comment Re: made: 2009-04-22 15:02:37.649896+00 by: Dan Lyke

Aha! Thanks, Dave. I'll need to dig a bit to try to find the particular documents I'm interested in (they've already been indexed), and I did find the "Comprehensive Annual Financial Report" for the "City of pßtalurr^", but other than that the first few I pulled up do seem a bit better.

Also, it seems like I need a -density 300 -units PixelsPerInch in there somewhere, but my first pass created a 3 gig file that gocr wouldn't read, so if I go down that route there are clearly some fine-tunings that need to happen.
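One way the higher-density pass might be kept manageable is to write one grayscale image per page rather than a single multi-page file, then run gocr on each page. This is only a rough sketch: the ocr_pdf helper name, the output directory layout, and the choice of grayscale PGM are assumptions for illustration, and it has not been tuned against these particular scans.

```shell
#!/bin/sh
# Sketch only: rasterize a scanned PDF to one image per page, then OCR each page.
# ocr_pdf is a hypothetical helper; 300 dpi and grayscale output are assumptions.
ocr_pdf() {
    pdf=$1
    out=$2
    mkdir -p "$out" || return 1
    # -density goes before the input so it controls the rasterization resolution.
    # The %03d in the output name makes convert write a separate file per page.
    convert -density 300 -units PixelsPerInch "$pdf" \
        -colorspace Gray "$out/page-%03d.pgm" || return 1
    for page in "$out"/page-*.pgm; do
        [ -e "$page" ] || continue
        # -i names the input image, -o the text file to write.
        gocr -i "$page" -o "${page%.pgm}.txt"
    done
}

# Example: ocr_pdf fund-summaries.pdf ocr-out
```

Each page ends up as its own .txt file next to its image, so a page that OCRs badly can be re-run at a different density without redoing the whole document.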

#Comment Re: made: 2009-04-23 09:12:24.830951+00 by: DaveP

Hey, whenever there's a "let someone else do the work" solution, I'm all over it. :)