Hi,

PDF is a document format (like ODT, DOC, DOCX, or RTF), whereas Tesseract processes images. You did not mention which programming language(s) you plan to use, but there are plenty of tools for PDF text extraction, e.g. textract (Python) [1].
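To make the text-extraction route concrete, here is a minimal sketch in Python. It assumes the PDF already contains a text layer and that the textract package is installed. The helper names (pdf_to_text, find_fields), the "Name:"/"Employee:" labels, and the regular expressions are only illustrative assumptions, not part of textract or Tesseract, so they would need adjusting to the real salary slips.

```python
import re

# Illustrative patterns only (assumptions): tune them to the real documents.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
AMOUNT_RE = re.compile(r"\b\d[\d,]*\.\d{2}\b")  # e.g. 1,234.50 or 210.00
NAME_RE = re.compile(r"^(?:Name|Employee)\s*:\s*(.+)$", re.MULTILINE)


def pdf_to_text(path):
    """Return the text of a PDF that has a text layer, using textract."""
    # Imported here so the field parsing below also works without textract.
    import textract

    # textract.process returns bytes, so decode them to str.
    return textract.process(path).decode("utf-8")


def find_fields(text):
    """Pick out a name, email addresses and amounts from plain text."""
    name = NAME_RE.search(text)
    return {
        "name": name.group(1).strip() if name else None,
        "emails": EMAIL_RE.findall(text),
        "amounts": AMOUNT_RE.findall(text),
    }
```

Typical use would be find_fields(pdf_to_text("slip.pdf")), where "slip.pdf" is a placeholder file name. No Tesseract training is involved in this route; the field extraction is ordinary pattern matching on the extracted text.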
If you have a "stupid PDF" (one where somebody just embedded scanned images in a PDF), extract the images from the PDF and feed them to Tesseract. Another option is to convert the PDF pages to images so that Tesseract can process them. I have had very good experience with MuPDF, but people also use Ghostscript. There are plenty of examples of how to do this on the internet (e.g. in Python [2]). A few days ago I found tesseract-ocr-wrapper [3], which focuses on OCRing "stupid PDFs", so maybe it can help you.

Just use the tools that are already available.

[1] https://textract.readthedocs.io/en/latest/
[2] https://bucket401.blogspot.com/2021/03/pdf-to-imagemultipage-in-python.html
[3] https://github.com/Altabeh/tesseract-ocr-wrapper

Zdenko

On Sat, 24 Apr 2021 at 9:19, Mohammad Waqas Shoukat Ali <[email protected]> wrote:

> Hi Zdenko,
>
> My input is different pdf documents that contain things like salary slips
> and some other financial documents. We want to use tesseract feature to
> extract the name,email address,amounts type of fields from documents.
>
> On Sat, Apr 24, 2021 at 2:50 PM Zdenko Podobny <[email protected]> wrote:
>
>> Please be more specific: provide an example of what your input is and
>> what you want to achieve.
>>
>> Zdenko
>>
>>
>> On Sat, 24 Apr 2021 at 7:58, Mohammad Waqas Shoukat Ali <[email protected]>
>> wrote:
>>
>>> hi team,
>>>
>>> i want to understand how i can teach my tesseract model for different
>>> files format.
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/83783733-7696-410f-9400-54b3608da396n%40googlegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/83783733-7696-410f-9400-54b3608da396n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8ytOYQ1jC%2Bb%2B9s0wBsL3AF0WTv4Yby3XHykHNA0CMr6%2Bw%40mail.gmail.com.

