Re: [CODE4LIB] pdf2txt [tesseract]

2013-10-18 Thread Christian Pietsch
Hi Padraic, I have uploaded a shell script which happens to implement Robert Haschart's recipe: https://github.com/pietsch/Data-Munging/blob/master/ocr4pdf.sh Enjoy! Christian On Fri, Oct 18, 2013 at 10:22:17AM +0100, Padraic Stack wrote: I would love to see that bash script if you could

Re: [CODE4LIB] pdf2txt [tesseract]

2013-10-17 Thread Eric Lease Morgan
On Oct 16, 2013, at 10:56 AM, Robert Haschart rh...@virginia.edu wrote: The abstract extraction routine I have been working on does use tesseract internally for doing OCR when it encounters a document that doesn't have usable full-text. I agree that tesseract is not that easy to install,

Re: [CODE4LIB] pdf2txt [tesseract]

2013-10-17 Thread Christian Pietsch
Hi Eric, On Thu, Oct 17, 2013 at 09:43:04AM -0400, Eric Lease Morgan wrote: Robert, can you outline the process you used to get Tesseract to do OCR against PDF documents? I installed Tesseract a few months ago, but I couldn't figure out how to get it to work against PDF, only some image files.

Re: [CODE4LIB] pdf2txt [tesseract]

2013-10-17 Thread Robert Haschart
On 10/17/2013 9:43 AM, Eric Lease Morgan wrote: On Oct 16, 2013, at 10:56 AM, Robert Haschart rh...@virginia.edu wrote: The abstract extraction routine I have been working on does use tesseract internally for doing OCR when it encounters a document that doesn't have usable full-text. I agree