Tesseract performs OCR on images, not on documents (PDF, DOCX, ODT, etc.).
If you need multipage support, use the TIFF image format instead of PDF when
scanning.

Zdenko


On Sat, 28 Mar 2020 at 20:42, Essam Zaky <[email protected]> wrote:

> What do you mean by "scan a pdf"?
> If you mean recognizing a PDF file: you cannot do that directly, because
> PDF is not an input format supported by Leptonica.
> See the documentation:
> https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc
>
>
> The workaround is to find a tool that can extract the PDF pages to images,
> then write the paths of the extracted images to a single text file.
> For example, for test.pdf the list file would be
> test.txt
>      ../image/path/1.png
>      ../image/path/2.png
>      ../image/path/3.png
>
> Then call tesseract as follows:
> tesseract test.txt path/to/output -l eng
>
>
> The resulting output.txt will contain the recognition results for all the
> files listed in test.txt.
>
>
> Best Regards
> Essam
> On Saturday, 28 March 2020 at 8:48:20 PM UTC+2, Teo wrote:
>>
>> Is there an option to directly scan a PDF document containing multiple
>> pages instead of a single PNG image?
>>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/ffd9e7c7-8fdd-4ced-8707-eb6ceaf61b68%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/ffd9e7c7-8fdd-4ced-8707-eb6ceaf61b68%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>
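
The list-file workaround quoted above can be sketched in Python roughly as
follows. This is only an illustration, not an official recipe: the helper
names (write_list_file, build_command, ocr_pages) are made up for this
example, and it assumes the page images already exist (for instance,
extracted from the PDF with some external converter) and that the tesseract
executable is on PATH.

```python
from pathlib import Path
import shutil
import subprocess


def write_list_file(image_paths, list_path):
    """Write one image path per line, as in the test.txt example above."""
    list_path = Path(list_path)
    list_path.write_text("".join(f"{p}\n" for p in image_paths), encoding="utf-8")
    return list_path


def build_command(list_path, output_base, lang="eng"):
    """Return the arguments for: tesseract <list file> <output base> -l <lang>."""
    return ["tesseract", str(list_path), str(output_base), "-l", lang]


def ocr_pages(image_paths, list_path, output_base, lang="eng"):
    """Hypothetical wrapper: write the list file, then run tesseract on it."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    cmd = build_command(write_list_file(image_paths, list_path), output_base, lang)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    # tesseract appends ".txt" to the output base for plain-text output
    return Path(f"{output_base}.txt")
```

The caller is expected to pass the page images in reading order, since the
results are written in the order the paths appear in the list file.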
