Hi,

PDF is a document format (like odt, doc, docx, rtf), while tesseract
processes images.
You did not mention which programming language(s) you plan to use, but
there are plenty of tools for PDF text extraction, e.g. textract (Python) [1].
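
A minimal sketch of text extraction with textract [1], assuming the package
is installed (pip install textract). As far as I know its default PDF
backend calls pdftotext, so that tool has to be available too. The helper
looks_like_scanned and its 20-character threshold are illustrative
assumptions, not part of textract.

```python
def looks_like_scanned(text: str, min_chars: int = 20) -> bool:
    """Heuristic (an assumption of this sketch): almost no extracted text
    suggests the PDF holds only scanned images and needs OCR instead."""
    return len(text.strip()) < min_chars


def extract_text(pdf_path: str) -> str:
    """Return the text layer of a PDF using textract."""
    import textract  # third-party, imported lazily: pip install textract

    # textract.process returns bytes, so decode them to str
    return textract.process(pdf_path).decode("utf-8")
```

Calling extract_text("salary_slip.pdf") (a hypothetical file name) and then
looks_like_scanned(...) on the result tells you whether plain extraction is
enough or whether the file has to go through OCR.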

If you have a "stupid pdf" (somebody just embedded scanned images in a
PDF), extract the images from the PDF and then you can use them in
tesseract.
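
One way to pull the embedded images out, sketched with PyMuPDF (the Python
binding for mupdf). This assumes a recent PyMuPDF where the snake_case
names get_images and extract_image are available. The function names,
output directory and file naming scheme are made up for illustration.

```python
from pathlib import Path


def image_name(page_no: int, img_no: int, ext: str) -> str:
    """Build a sortable file name (naming scheme is arbitrary)."""
    return f"page{page_no:03d}_img{img_no:02d}.{ext}"


def extract_images(pdf_path: str, out_dir: str) -> list:
    """Save every image embedded in the PDF and return the written paths."""
    import fitz  # PyMuPDF, imported lazily: pip install pymupdf

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    doc = fitz.open(pdf_path)
    for page_no, page in enumerate(doc, start=1):
        for img_no, img in enumerate(page.get_images(full=True), start=1):
            # img[0] is the xref of the image object inside the PDF
            info = doc.extract_image(img[0])
            target = out / image_name(page_no, img_no, info["ext"])
            target.write_bytes(info["image"])
            saved.append(target)
    doc.close()
    return saved
```

Each saved file can then be passed to tesseract, for example
"tesseract page001_img01.png page001 -l eng".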

Another option is to convert the PDF to images (so you can process them
with tesseract). I have had very good experience with mupdf, but people
also use ghostscript. There are plenty of examples of how to do it on the
internet (e.g. in Python [2]).
A few days ago I found tesseract-ocr-wrapper [3], which focuses on OCRing
"stupid pdfs". So maybe this can help you.

Just use the already available tools.
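
The page-to-image route mentioned above (render each page, then OCR it)
could look roughly like this. It is only a sketch: it assumes a recent
PyMuPDF and pytesseract are installed and that the tesseract binary is on
PATH. The 300 dpi default, the function names and the file naming are
assumptions for illustration.

```python
from pathlib import Path


def zoom_for_dpi(dpi: int) -> float:
    """PDF user space is 72 units per inch, so scale relative to 72."""
    return dpi / 72.0


def ocr_pdf(pdf_path: str, work_dir: str, dpi: int = 300,
            lang: str = "eng") -> list:
    """Render every page to PNG and return the OCR text of each page."""
    import fitz  # PyMuPDF, imported lazily: pip install pymupdf
    import pytesseract  # pip install pytesseract

    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    zoom = zoom_for_dpi(dpi)
    texts = []
    doc = fitz.open(pdf_path)
    for page_no, page in enumerate(doc, start=1):
        # Render the page as a raster image at the requested resolution
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
        png = work / f"page{page_no:03d}.png"
        pix.save(str(png))
        # pytesseract accepts a file path as well as a PIL image
        texts.append(pytesseract.image_to_string(str(png), lang=lang))
    doc.close()
    return texts
```

Rendering at a higher resolution than the 72 dpi default usually helps
recognition of small print, at the cost of larger images.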

[1] https://textract.readthedocs.io/en/latest/
[2]
https://bucket401.blogspot.com/2021/03/pdf-to-imagemultipage-in-python.html
[3] https://github.com/Altabeh/tesseract-ocr-wrapper

Zdenko


On Sat, 24 Apr 2021 at 9:19, Mohammad Waqas Shoukat Ali <[email protected]>
wrote:

> Hi Zdenko,
>
> My input is a set of different PDF documents, such as salary slips and
> other financial documents. We want to use tesseract to extract fields
> such as name, email address, and amounts from the documents.
>
> On Sat, Apr 24, 2021 at 2:50 PM Zdenko Podobny <[email protected]> wrote:
>
>> Please be more specific: provide an example of what your input is and
>> what you want to achieve.
>>
>> Zdenko
>>
>>
>> On Sat, 24 Apr 2021 at 7:58, Mohammad Waqas Shoukat Ali <[email protected]>
>> wrote:
>>
>>> Hi team,
>>>
>>> I want to understand how I can train my tesseract model for different
>>> file formats.
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/83783733-7696-410f-9400-54b3608da396n%40googlegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/83783733-7696-410f-9400-54b3608da396n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>

