Re: [tesseract-ocr] [Clarification question] Are there initiatives to make Tesseract's 3.03+ new "pdf" OCR option multi-page capable?

Tom Tue, 05 Aug 2014 00:41:30 -0700


On Tuesday, 5 August 2014 09:25:35 UTC+2, zdenop wrote:
>
> Hello,
>
> if you are referring to some code ("inspecting the code I think I found 
> some pieces...") please provide a reference/link to it.
>
> Tesseract is able to OCR everything that Leptonica is able to open, or 
> everything that you or a programmer can convert to a Leptonica PIX structure 
> ;-)
>
> I did not have a chance to test Leptonica 1.71, but 1.70 was not able to 
> open PDF. So the answer to your question 1 is no: Leptonica/Tesseract do 
> not support OCR-ing multi-page PDFs, nor single-page PDFs. But they do support 
> multi-page TIFF.
>


I have tried multi-page TIFF input twice, but it failed (as far as I remember, 
I got the messages described at 
http://stackoverflow.com/questions/5083492/problem-with-tesseract-and-tiff-format
 
). I will investigate why it failed (or what I did wrong) and, if the problem 
persists, file a regular bug report. At the moment I am unsure what really 
happened.


> Regarding your question 2: I am not aware of any such initiative. 
> Tesseract OCRs images, and PDF is not an image format but a document format 
> (i.e. a request to OCR PDF is the same as a request to OCR odt, doc, docx, 
> html, etc.).
>
Uh, yes, I fully overlooked this; you are right!

According to the documentation and to what you said, Tesseract can OCR 
multi-page TIFF, and it can also create a dual-layer PDF file containing the 
input image(s) and the OCR-ed text. So the only missing piece is the conversion 
of a multi-page PDF to a multi-page TIFF; this would then enable Tesseract 
to effectively accept multi-page PDFs as input. My investigation so far showed 
that Leptonica cannot convert a multi-page PDF to a multi-page TIFF.
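
A minimal sketch of that missing conversion step, assuming Ghostscript (gs) 
is installed and is used as the external PDF rasterizer in place of Leptonica. 
The function names, the choice of the tiff24nc device, and the 300 dpi default 
are illustrative assumptions, not part of Tesseract. Whether the "pdf" output 
then contains every page depends on the Tesseract build and has not been 
verified here.

```python
import subprocess
from pathlib import Path


def pdf_to_tiff_cmd(pdf: Path, tiff: Path, dpi: int = 300) -> list[str]:
    """Build a Ghostscript command that rasterizes all pages of `pdf`
    into a single multi-page TIFF (hypothetical helper)."""
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=tiff24nc",       # 24-bit RGB TIFF; all pages go into one file
        f"-r{dpi}",                # rasterization resolution (assumed default)
        f"-sOutputFile={tiff}",
        str(pdf),
    ]


def tiff_to_pdf_cmd(tiff: Path, output_base: Path) -> list[str]:
    """Build a Tesseract command using the "pdf" config.
    Tesseract appends ".pdf" to the output base name itself."""
    return ["tesseract", str(tiff), str(output_base), "pdf"]


def ocr_pdf(pdf: Path, workdir: Path) -> Path:
    """Run both steps and return the expected path of the OCR-ed PDF."""
    tiff = workdir / (pdf.stem + ".tif")
    base = workdir / (pdf.stem + "_ocr")
    subprocess.run(pdf_to_tiff_cmd(pdf, tiff), check=True)
    subprocess.run(tiff_to_pdf_cmd(tiff, base), check=True)
    return Path(str(base) + ".pdf")
```

Calling ocr_pdf(Path("scan.pdf"), Path("/tmp")) would, under these assumptions, 
leave /tmp/scan.tif and /tmp/scan_ocr.pdf behind. Keeping the command builders 
separate from the subprocess calls makes it easy to inspect or log the exact 
command lines before anything is executed.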


>
> Zdenko
>
>
> On Tue, Aug 5, 2014 at 7:52 AM, Tom <[email protected]> wrote:
>
>> I am heavily using the new "pdf" option for ocr-ing single PDF pages (or 
>> their image equivalents), which works very well. Thanks for the new option 
>> in Tesseract svn trunk.
>>
>> When inspecting the code, I think I found some pieces indicating 
>> "multi-page" actions. 
>>
>>    - My question 1: Does Tesseract already support OCR-ing 
>>    multi-page PDFs?
>>    - My question 2: If the answer is no: are there initiatives to integrate 
>>    this into Tesseract?
>>
>> I would appreciate it if Tesseract's "pdf" option also worked for 
>> multi-page PDFs.
>>
>>
>> Remark:
>>
>> This is how I currently process multi-page PDFs:
>>
>> At the moment I have a script (using pdftk/PDFToolkit) that splits a PDF 
>> into single image files, which I then convert one by one with Tesseract's 
>> "pdf" option. Another script then collates the single-page outputs 
>> into the final single mixed-mode output PDF file. 
>>
>>
>>  -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/f85d93e3-ea49-47bc-aab9-5af9b4a268b1%40googlegroups.com
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/a9ddcb3e-ea37-4a62-839d-ee5c2e32cd20%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
