Re: [tesseract-ocr] [Clarification question] Are there initiatives to make Tesseract's 3.03+ new "pdf" OCR option multi-page capable ?

zdenko podobny Tue, 05 Aug 2014 02:32:35 -0700

On Tue, Aug 5, 2014 at 9:40 AM, Tom <[email protected]> wrote:

>
>
> On Tuesday, 5 August 2014 09:25:35 UTC+2, zdenop wrote:
>
>> Hello,
>>
>> If you are referring to some code ("inspecting the code I think found
>> some pieces..."), please include a reference or link to it.
>>
>> Tesseract can OCR anything that Leptonica can open, or anything that
>> you or a programmer can convert to a Leptonica PIX structure ;-)
>>
>> I have not had a chance to test Leptonica 1.71, but 1.70 was not able to
>> open PDF files. So the answer to your question 1 is no: Leptonica and
>> Tesseract do not support OCR-ing PDFs, whether multi-page or single-page.
>> They do support multi-page TIFF, though.
>>
>
> I have tried this twice, but the approach failed (as far as I remember, I
> got these messages:
> http://stackoverflow.com/questions/5083492/problem-with-tesseract-and-tiff-format
> ). I will try to investigate why (or what I did wrong) and, if the
> problem persists, file a regular bug report. Currently, I am unsure
> what really happened.
>
1. That would be a Leptonica issue, not a Tesseract issue.
2. There are already solutions for this, so converting PDF to TIFF should
not be a problem.
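
The conversion in point 2 could be sketched roughly as follows. This is only an illustration under some assumptions: it uses Ghostscript (`gs`) as the converter, which nobody in this thread named, and the helper names `pdf_to_tiff_cmd`, `ocr_cmd` and `ocr_pdf` are made up for the example. It renders all PDF pages into one multi-page TIFF and then hands that TIFF to Tesseract with the "pdf" config.

```python
import shutil
import subprocess
from pathlib import Path


def pdf_to_tiff_cmd(pdf, tiff, dpi=300):
    """Build a Ghostscript command that renders all pages of `pdf` into one TIFF.

    Assumption: Ghostscript is the converter; any PDF-to-TIFF tool would do.
    """
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=tiffgray",      # 8-bit grayscale TIFF output device
        f"-r{dpi}",               # rendering resolution in DPI
        f"-sOutputFile={tiff}",   # no %d in the name, so pages share one file
        str(pdf),
    ]


def ocr_cmd(tiff, out_base):
    """Build the Tesseract command; the trailing `pdf` config asks for PDF output."""
    return ["tesseract", str(tiff), str(out_base), "pdf"]


def ocr_pdf(pdf, out_base, dpi=300):
    """Run both steps in sequence; needs `gs` and `tesseract` on PATH."""
    for tool in ("gs", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    tiff = Path(str(out_base) + ".tif")  # intermediate multi-page TIFF
    subprocess.run(pdf_to_tiff_cmd(pdf, tiff, dpi), check=True)
    subprocess.run(ocr_cmd(tiff, out_base), check=True)
    return Path(str(out_base) + ".pdf")
```

A call such as `ocr_pdf("scan.pdf", "scan_ocr")` would then be expected to leave the result in `scan_ocr.pdf`, provided the installed Tesseract build accepts multi-page TIFF input together with the "pdf" config.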


>
>> Regarding your question 2: I am not aware of any such initiative.
>> Tesseract OCRs images, and PDF is not an image format but a document
>> format (in other words, a request to OCR PDF is the same as a request to
>> OCR odt, doc, docx, html, etc.).
>>
> Uh, yes, I fully overlooked this, you are right!
>
> According to the documentation and what you said, Tesseract can OCR
> multi-page TIFF, and it can also create a (dual-layer) PDF file containing
> the input image(s) and the OCR-ed text. So the only missing piece is the
> conversion of a multi-page PDF to a multi-page TIFF, which would then
> enable Tesseract to accept multi-page PDFs [sic] as input. My current
> investigation showed that Leptonica cannot convert a multi-page PDF to a
> multi-page TIFF.
>
>
>>
>> Zdenko
>>
>>
>> On Tue, Aug 5, 2014 at 7:52 AM, Tom <[email protected]> wrote:
>>
>>>  I am heavily using the new "pdf" option for ocr-ing single PDF pages
>>> (or their image equivalents), which works very well. Thanks for the new
>>> option in Tesseract svn trunk.
>>>
>>> While inspecting the code, I think I found some pieces indicating
>>> "multi-page" actions.
>>>
>>>    - My question 1: Does Tesseract already support OCR-ing
>>>    multi-page PDFs?
>>>    - My question 2: If the answer is no: are there initiatives to
>>>    integrate this into Tesseract?
>>>
>>> I would appreciate it if Tesseract's "pdf" option also worked for
>>> multi-page PDFs.
>>>
>>>
>>> Remark:
>>>
>>> This is how I currently process multi-page PDFs:
>>>
>>> At the moment I have a script (using pdftk/PDFToolkit) that splits a PDF
>>> into single image files, which I then convert one by one via Tesseract's
>>> "pdf" option. I then have to collate the single-page outputs again with
>>> another script into the final single mixed-mode output PDF file.
>>>
>>>
>>>  --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>>
>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit https://groups.google.com/d/
>>> msgid/tesseract-ocr/f85d93e3-ea49-47bc-aab9-5af9b4a268b1%
>>> 40googlegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/f85d93e3-ea49-47bc-aab9-5af9b4a268b1%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
