On Tuesday, 5 August 2014 09:25:35 UTC+2, zdenop wrote:
>
> Hello,
>
> if you are referring to some code ("inspecting the code I think found
> some pieces...") please include a reference or link to it.
>
> Tesseract can OCR anything that Leptonica can open, or anything that
> you or a programmer can convert to a Leptonica PIX structure ;-)
>
> I did not have a chance to test Leptonica 1.71, but 1.70 was not able to
> open PDF. So the answer to your question 1 is no: Leptonica/Tesseract
> support OCR-ing of neither multi-page nor single-page PDFs. They do
> support multi-page TIFF.
>
I have tried this twice, but the approach failed (as far as I remember, I
got the messages described at
http://stackoverflow.com/questions/5083492/problem-with-tesseract-and-tiff-format
). I will investigate why (or what I did wrong) and, if the problem
persists, post a regular bug report. At the moment I am unsure what
really happened.
> Regarding your question 2: I am not aware of any such initiative.
> Tesseract OCRs images, and PDF is not an image format but a document
> format (a request to OCR PDF is the same as a request to OCR odt, doc,
> docx, html, etc.).
>
Uh, yes, I fully overlooked this, you are right!
According to the documentation and to what you said, Tesseract can OCR
multi-page TIFF, and it can also create a (dual-layer) PDF file containing
the input image(s) and the OCR-ed text. So the only missing piece is the
conversion of a multi-page PDF to a multi-page TIFF, which would then
enable Tesseract to accept multi-page PDFs [sic] as input. My investigation
so far showed that Leptonica cannot convert a multi-page PDF to a
multi-page TIFF.
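
A minimal sketch of that missing conversion step, assuming an external
renderer is acceptable. It uses Ghostscript (not mentioned earlier in this
thread, so treat it as my assumption) to render all PDF pages into one
multi-page TIFF, and then passes that TIFF to Tesseract with the "pdf"
option. The function names, the 300 dpi default and the "_ocr" output
suffix are hypothetical choices for illustration. Whether a given Tesseract
build writes all pages into a single output PDF would still need to be
checked.

```python
import subprocess
from pathlib import Path


def ghostscript_cmd(pdf_path, tiff_path, dpi=300):
    """Build the Ghostscript call that renders every page of a PDF
    into a single multi-page 24-bit colour TIFF."""
    return [
        "gs", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=tiff24nc",          # one of Ghostscript's TIFF devices
        f"-r{dpi}",                   # render resolution in dpi
        f"-sOutputFile={tiff_path}",
        str(pdf_path),
    ]


def tesseract_cmd(tiff_path, output_base):
    """Build the Tesseract call; the trailing "pdf" selects the PDF
    output config, so the result is written to <output_base>.pdf."""
    return ["tesseract", str(tiff_path), str(output_base), "pdf"]


def ocr_pdf(pdf_path, work_dir="."):
    """Run both steps; assumes `gs` and `tesseract` are on PATH."""
    pdf_path = Path(pdf_path)
    tiff_path = Path(work_dir) / (pdf_path.stem + ".tif")
    output_base = Path(work_dir) / (pdf_path.stem + "_ocr")
    subprocess.run(ghostscript_cmd(pdf_path, tiff_path), check=True)
    subprocess.run(tesseract_cmd(tiff_path, output_base), check=True)
    return Path(str(output_base) + ".pdf")
```

Calling ocr_pdf("scan.pdf") would then be expected to leave scan.tif and
scan_ocr.pdf in the working directory, provided both tools are installed.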
>
> Zdenko
>
>
> On Tue, Aug 5, 2014 at 7:52 AM, Tom <[email protected]> wrote:
>
>> I am heavily using the new "pdf" option for ocr-ing single PDF pages (or
>> their image equivalents), which works very well. Thanks for the new option
>> in Tesseract svn trunk.
>>
>> When inspecting the code, I think I found some pieces indicating
>> "multi-page" actions.
>>
>> - My question 1: Does Tesseract already support the OCR-ing of
>> multi-page PDFs?
>> - My question 2: If the answer is no: are there initiatives to integrate
>> this into Tesseract?
>>
>> I would appreciate it if Tesseract's "pdf" option also worked for
>> multi-page PDFs.
>>
>>
>> Remark:
>>
>> This is how I process multi-page PDFs currently:
>>
>> At the moment I have a script (using pdftk/PDFToolkit) that splits a PDF
>> into single image files, which I then convert one by one via Tesseract's
>> "pdf" option. Another script then collates the single-page outputs into
>> the final single mixed-mode output PDF file.
>>
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/f85d93e3-ea49-47bc-aab9-5af9b4a268b1%40googlegroups.com
>> .
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>