Hello,
if you are referring to specific code ("inspecting the code I think found some
pieces..."), please include a reference or link to it.
Tesseract can OCR anything that Leptonica is able to open, or anything that
you (or a programmer) can convert to a Leptonica PIX structure ;-)
I have not had a chance to test Leptonica 1.71, but 1.70 was not able to
open PDF files. So the answer to your question 1 is no: Leptonica/Tesseract
does not support OCR of PDF input, whether multi-page or single-page. It
does support multi-page TIFF.
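
Since multi-page TIFF is accepted as input, one possible workaround is to
rasterize the PDF into a single multi-page TIFF first and hand that to
Tesseract. The sketch below is only an illustration under assumptions, not
something that ships with Tesseract: it assumes Ghostscript ("gs") and
"tesseract" are on the PATH, the helper names build_commands and ocr_pdf and
the file names pages.tif / ocr_output are made up for this example, and
whether the "pdf" output covers every page of the TIFF depends on your build.

```python
import shutil
import subprocess
from pathlib import Path


def build_commands(pdf_path, work_dir, dpi=300):
    """Return two argv lists: PDF -> multi-page TIFF, then TIFF -> PDF via OCR.

    File names are hypothetical; only the command lines are built here.
    """
    work = Path(work_dir)
    tiff = work / "pages.tif"
    out_base = work / "ocr_output"  # tesseract appends ".pdf" to the output base

    # Ghostscript's TIFF devices write all pages into one file when the
    # output name contains no page-number pattern such as %d.
    rasterize = [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=tiff24nc",
        f"-r{dpi}",
        f"-sOutputFile={tiff}",
        str(pdf_path),
    ]

    # "pdf" is the config name that selects Tesseract's PDF output.
    ocr = ["tesseract", str(tiff), str(out_base), "pdf"]
    return rasterize, ocr


def ocr_pdf(pdf_path, work_dir, dpi=300):
    """Run both steps; raises if a tool is missing or a command fails."""
    for tool in ("gs", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    Path(work_dir).mkdir(parents=True, exist_ok=True)
    for argv in build_commands(pdf_path, work_dir, dpi):
        subprocess.run(argv, check=True)
    return Path(work_dir) / "ocr_output.pdf"
```

Calling ocr_pdf("input.pdf", "work") would run both commands in order.
Building the command lines in a separate function makes it easy to print and
check them before anything is executed.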
Regarding your question 2: I am not aware of any such initiative.
Tesseract OCRs images, and PDF is not an image format but a document format
(i.e. a request to OCR PDF is the same as a request to OCR odt, doc, docx,
html, etc.).
Zdenko
On Tue, Aug 5, 2014 at 7:52 AM, Tom <[email protected]> wrote:
> I am heavily using the new "pdf" option for ocr-ing single PDF pages (or
> their image equivalents), which works very well. Thanks for the new option
> in Tesseract svn trunk.
>
> When inspecting the code I think found some pieces indicating
> "multi-page" actions.
>
> - My question 1: Is Tesseract already supporting the OCR-ing of
> multi-page PDFs ?
> - My question 2: If the answer is no: Are there initiatives to integrate
> this into Tesseract ?
>
> I would appreciate if Tesseract "pdf" works also for multi-page PDFs.
>
>
> Remark:
>
> This is how I process multi-page PDFs currently:
>
> At the moment I have a script (using pdftk/PDFToolkit) to split a PDF
> into single image files, which I then convert one-by-one via Tesseract's
> "pdf" option. Another script then collates the single-page outputs
> into the final single mixed-mode output PDF file.
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/f85d93e3-ea49-47bc-aab9-5af9b4a268b1%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/f85d93e3-ea49-47bc-aab9-5af9b4a268b1%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>