Simple OCR
2009-04-22 05:22:07.003959+02 by
Dan Lyke
3 comments
Okay, got some older PDF files that are scans of paper documents. Tried:
convert /home/danlyke/Desktop/fund-summaries-part-1-thru-s-68.pdf fundsummaries.pnm
gocr fundsummaries.pnm
and got some leet speak-ish text out of it, but nothing great. Anyone done this?
comments in descending chronological order (reverse):
#Comment Re: made: 2009-04-23 11:12:24.830951+02 by:
DaveP
Hey, whenever there's a "let someone else do the work" solution, I'm all over it. :)
#Comment Re: made: 2009-04-22 17:02:37.649896+02 by:
Dan Lyke
Aha! Thanks, Dave. I'll need to dig a bit to try to find the particular documents I'm interested in (they've already been indexed), and I did find the "Comprehensive Annual Financial Report" for the "City of pßtalurr^", but other than that the first few I pulled up do seem a bit better.
Also, it seems like I need a -density 300 -units PixelsPerInch in there somewhere, but my first pass created a 3 gig file that gocr wouldn't read, so if I go down that route there are clearly some fine-tunings that need to happen.
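One possible way around the oversized file is to rasterize and OCR one page at a time, in grayscale, rather than converting the whole PDF at once. The following is a rough sketch, not a tested recipe for these documents: the function names, the page-N.pnm scratch files, and the US Letter page size used in the size estimate are all assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: function names, scratch file names and the US Letter
# page size are assumptions, not anything taken from the original post.

# Estimate the uncompressed size in bytes of one rasterized page.
# $1 = dpi, $2 = bytes per pixel (3 for 8-bit RGB, 1 for 8-bit gray)
page_bytes() {
    width_px=$(( 85 * $1 / 10 ))   # 8.5 inches wide
    height_px=$(( 11 * $1 ))       # 11 inches tall
    echo $(( width_px * height_px * $2 ))
}

# Rasterize a single page (0-based index) as grayscale and OCR it,
# so no intermediate file ever holds the whole document.
# $1 = PDF file, $2 = page index, $3 = output text file
ocr_page() {
    convert -density 300 -units PixelsPerInch "$1[$2]" \
        -colorspace Gray -depth 8 "page-$2.pnm" &&
    gocr -i "page-$2.pnm" -o "$3" &&
    rm -f "page-$2.pnm"
}
```

For example, `ocr_page fundsummaries.pdf 0 page-0.txt` would process only the first page (the file name is a placeholder). `page_bytes 300 3` prints 25245000, so at roughly 25 MB per uncompressed RGB page a document of a hundred-odd pages could plausibly reach the 3 gig range, while grayscale cuts that to about a third.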
#Comment Re: made: 2009-04-22 11:30:38.216725+02 by:
DaveP
Have you tried using Google's OCR? It's not super-speedy (you have to wait for googlebot to spider your PDFs), but the quality seems to be pretty good: http://www.labnol.org/software/convert-scanned-pdf-images-to-text-with-google-ocr/5158/