Re: [CODE4LIB] Scanned PDF to text

2014-12-11 Thread David J. Fiander
Art Rhyno talked about doing this with scans of old community newspapers a few years ago (https://www.youtube.com/watch?v=gcjCiS9pJ3A). Yes, it's very compute-intensive and slow. He set up Hadoop to farm jobs out to the PCs in the library's public lab while the library was closed at night. - David

Re: [CODE4LIB] Scanned PDF to text

2014-12-11 Thread Chris Fitzpatrick
Tesseract is going to be slow, and there might not be much you can do about that. You can do a couple of things, like set up processes that run on AWS EC2 spot instances, so you can put a standing bid order on AWS instances and only run your OCR when the price drops. Or you can buy ABBYY, which i

Re: [CODE4LIB] Scanned PDF to text

2014-12-09 Thread Kyle Banerjee
> I’m not quite sure if I understand the question, but if all you want to do is
> pull the text out of an OCR’ed PDF file, then I have found both Tika and
> PDFtotext to be useful tools
>
> On the other hand, if you need to do the OCR itself, then employing Tesseract
> is probably the way t
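
For the first case in the quoted message (the PDF has already been OCR'ed), a minimal Python sketch of pulling the text layer out with pdftotext might look like this. It assumes the Poppler `pdftotext` binary is installed and on the PATH. The function names and the 100-character threshold are made up for illustration and do not come from this thread.

```python
import subprocess


def pdftotext_cmd(pdf_path):
    """Build a pdftotext command that writes the extracted text to stdout."""
    # Passing "-" as the output file tells pdftotext to print to stdout.
    return ["pdftotext", pdf_path, "-"]


def looks_ocred(text, min_chars=100):
    """Guess whether a PDF already carries a usable text layer.

    min_chars is an arbitrary illustrative threshold, not a recommended value.
    """
    return len(text.strip()) >= min_chars


def extract_text(pdf_path):
    """Run pdftotext (assumed to be installed) and return whatever text it finds."""
    result = subprocess.run(
        pdftotext_cmd(pdf_path), capture_output=True, text=True, check=True
    )
    return result.stdout
```

If `looks_ocred(extract_text(path))` comes back false, the file probably has no text layer and would need to go through OCR instead.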

Re: [CODE4LIB] Scanned PDF to text

2014-12-09 Thread Eric Lease Morgan
On Dec 9, 2014, at 8:25 AM, Kyle Banerjee wrote:

> I've just started a project that involves harvesting large numbers of
> scanned PDF's and extracting information from the text from the OCR output.
> The process I've started with -- use imagemagick to convert to tiff and
> tesseract to pull out

Re: [CODE4LIB] Scanned PDF to text

2014-12-09 Thread Mads Villadsen
On 2014-12-09 14:25, Kyle Banerjee wrote: Howdy all, I've just started a project that involves harvesting large numbers of scanned PDF's and extracting information from the text from the OCR output. The process I've started with -- use imagemagick to convert to tiff and tesseract to pull out the

[CODE4LIB] Scanned PDF to text

2014-12-09 Thread Kyle Banerjee
Howdy all, I've just started a project that involves harvesting large numbers of scanned PDF's and extracting information from the text from the OCR output. The process I've started with -- use imagemagick to convert to tiff and tesseract to pull out the OCR -- is more system intensive than I hope
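
The two-step process described above (ImageMagick to TIFF, then Tesseract) could be scripted roughly as follows. This is only a sketch under some assumptions: the `convert` and `tesseract` binaries are on the PATH (ImageMagick 7 installs name the binary `magick` instead), ImageMagick has Ghostscript available to read PDFs, and 300 dpi is a reasonable scan density. The function names and the default density are illustrative and not taken from the original post.

```python
import subprocess
from pathlib import Path


def convert_cmd(pdf_path, tiff_path, dpi=300):
    """Build the ImageMagick command that rasterizes a PDF to TIFF."""
    # -density has to come before the input file to affect PDF rasterization.
    return ["convert", "-density", str(dpi), str(pdf_path), str(tiff_path)]


def tesseract_cmd(tiff_path, out_base):
    """Build the Tesseract command; Tesseract appends ".txt" to out_base."""
    return ["tesseract", str(tiff_path), str(out_base)]


def ocr_pdf(pdf_path, work_dir):
    """Rasterize one scanned PDF and OCR it, returning the text file path."""
    pdf_path = Path(pdf_path)
    work_dir = Path(work_dir)
    tiff_path = work_dir / (pdf_path.stem + ".tiff")
    out_base = work_dir / pdf_path.stem
    subprocess.run(convert_cmd(pdf_path, tiff_path), check=True)
    subprocess.run(tesseract_cmd(tiff_path, out_base), check=True)
    return Path(str(out_base) + ".txt")
```

Both steps are CPU-bound, and the intermediate TIFFs can be large, so any speedup from a script like this would come from running several files at once rather than from the wrapper itself.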