On Tue, Aug 5, 2014 at 9:40 AM, Tom <[email protected]> wrote:

>
> On Tuesday, August 5, 2014 at 09:25:35 UTC+2, zdenop wrote:
>
>> Hello,
>>
>> if you are referring to some code ("inspecting the code I think found
>> some pieces...") please make a reference/link to it.
>>
>> Tesseract is able to OCR everything that Leptonica is able to open, or
>> everything that you or a programmer can convert to a Leptonica PIX
>> structure ;-)
>>
>> I did not have a chance to test Leptonica 1.71, but 1.70 was not able to
>> open PDF. So the answer to your question 1 is no: Leptonica/Tesseract do
>> not support OCR-ing of multi-page PDFs, nor of single-page PDFs. But they
>> do support multi-page TIFF.
>>
>
> I have tried this twice, but this approach failed (as far as I remember I
> got these messages
> http://stackoverflow.com/questions/5083492/problem-with-tesseract-and-tiff-format
> ). I will try to investigate why (or what I did wrong) and, in case
> the problem persists, post it as a regular bug report. Currently, I am
> unsure what really happened.
>

1. That would be a Leptonica issue, not a Tesseract issue.
2. There are already solutions for this, so converting PDF to TIFF should
   not be a problem.
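A minimal sketch of point 2, not something tested in this thread: it assumes Ghostscript (`gs`) is installed and that the Tesseract build provides the "pdf" config mentioned in this discussion. The function names, the file names, the `tiff24nc` device and the 300 dpi resolution are illustrative choices, not settings recommended by anyone here. Ghostscript is expected to write all pages into a single TIFF because the output file name contains no page-number pattern.

```python
import subprocess


def pdf_to_tiff_cmd(pdf_path, tiff_path, dpi=300):
    """Build a Ghostscript command that renders all PDF pages into one TIFF file."""
    return [
        "gs", "-dNOPAUSE", "-dBATCH", "-q",
        "-sDEVICE=tiff24nc",          # 24-bit colour TIFF; assumption, choose per input
        "-r%d" % dpi,                 # rendering resolution in dpi (assumed value)
        "-sOutputFile=" + tiff_path,  # no %d pattern, so a single output file
        pdf_path,
    ]


def ocr_to_pdf_cmd(tiff_path, output_base):
    """Build a Tesseract command using the "pdf" config; output is output_base + ".pdf"."""
    return ["tesseract", tiff_path, output_base, "pdf"]


def ocr_pdf(pdf_path, output_base, tiff_path="pages.tif"):
    """Convert the PDF to an intermediate TIFF, then OCR that TIFF into a PDF."""
    subprocess.check_call(pdf_to_tiff_cmd(pdf_path, tiff_path))
    subprocess.check_call(ocr_to_pdf_cmd(tiff_path, output_base))
```

Calling `ocr_pdf("scan.pdf", "scan_ocr")` would then be expected to leave the result in `scan_ocr.pdf`, provided both tools are on the PATH. Whether the multi-page TIFF is read correctly depends on the Leptonica/libtiff build, which is exactly the issue discussed above.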
>
>> Regarding your question 2: I am not aware of any such initiative.
>> Tesseract OCRs images, and PDF is not an image format but a document
>> format (e.g. a request to OCR PDF is the same as a request to OCR odt,
>> doc, docx, html, etc.).
>>
>
> Uh, yes, I fully overlooked this, you are right!
>
> According to the documentation and to what you said, Tesseract is able to
> OCR multi-page TIFF, and it can also create a (dual-layer) PDF file with
> the input image(s) and the OCR-ed text. So the only missing thing is the
> conversion of a multi-page PDF to a multi-page TIFF; this would then enable
> Tesseract to accept multi-page PDFs [sic] as input. My current
> investigation showed that Leptonica cannot convert a multi-page input PDF
> to a multi-page TIFF.
>
>>
>> Zdenko
>>
>>
>> On Tue, Aug 5, 2014 at 7:52 AM, Tom <[email protected]> wrote:
>>
>>> I am heavily using the new "pdf" option for OCR-ing single PDF pages
>>> (or their image equivalents), which works very well. Thanks for the new
>>> option in Tesseract svn trunk.
>>>
>>> When inspecting the code I think I found some pieces indicating
>>> "multi-page" actions.
>>>
>>> - My question 1: Does Tesseract already support the OCR-ing of
>>>   multi-page PDFs?
>>> - My question 2: If the answer is no: Are there initiatives to
>>>   integrate this into Tesseract?
>>>
>>> I would appreciate it if Tesseract "pdf" also worked for multi-page PDFs.
>>>
>>>
>>> Remark:
>>>
>>> This is how I currently process multi-page PDFs:
>>>
>>> At the moment I have a script (using pdftk/PDFToolkit) to split a PDF
>>> into single image files, which I then convert one by one via Tesseract's
>>> "pdf" option; the single-page outputs I then have to collate again with
>>> another script into the final single mixed-mode output PDF file.
>>>
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/f85d93e3-ea49-47bc-aab9-5af9b4a268b1%40googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.

