[tesseract-ocr] Problems with pdf out put from tesseract

che Tue, 24 Mar 2020 07:20:08 -0700

Hello,

I am using the following versions of the software:


 tesseract 4.0.0
 leptonica-1.76.0
 libjpeg 9c : libpng 1.6.37 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.1 : libopenjp2 2.3.0
 Found AVX512BW
 Found AVX512F
 Found AVX2
 Found AVX
 Found SSE

I am trying to convert a .tif into a PDF from within a Python script:

pdf = pytesseract.image_to_pdf_or_hocr(result, lang='deu+tur+kur', extension='pdf', config='--psm 6')

The text layer "underneath" the image is the following (output of
pdftotext -layout xyz.pdf):

...
Nach langer Abstinenz ist Apple fulminan              t  au f  de n  Mo ni  
to rm ar  kt   zu rü ck ge ke hr t:  Da s  Pr o
...

If I use "pure" text conversion instead:

text = pytesseract.image_to_string(result, lang='deu+tur+kur', config='--psm 6')

The output is correct (it matches the .tif):

...
Nach langer Abstinenz ist Apple fulminant auf den Monitormarkt 
zurückgekehrt: Das Pro
...
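
One way to check that the two outputs really differ only in spacing is to compare them with all whitespace removed. The following is a minimal sketch, not part of the original script: the two strings are copied from the excerpts above, and strip_ws is a hypothetical helper name introduced only for this illustration.

```python
# Text copied from the two excerpts above (line breaks as in the mail).
pdf_layer = (
    "Nach langer Abstinenz ist Apple fulminan              t  au f  de n  Mo ni  \n"
    "to rm ar  kt   zu rü ck ge ke hr t:  Da s  Pr o"
)
plain_ocr = (
    "Nach langer Abstinenz ist Apple fulminant auf den Monitormarkt \n"
    "zurückgekehrt: Das Pro"
)


def strip_ws(s):
    """Hypothetical helper: drop every whitespace character from s."""
    return "".join(s.split())


# True when the recognized characters are identical and only spacing differs.
same_chars = strip_ws(pdf_layer) == strip_ws(plain_ocr)
print(same_chars)
```

For these two excerpts the comparison is true, which suggests the recognition itself is the same in both cases and only the spacing in the extracted PDF text differs.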

The text is needed for search operations, so the extra whitespace is
quite annoying.
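
As a stopgap for the search use case, the query could be matched while ignoring whitespace in the extracted text. This is only a sketch of a possible workaround under that assumption, not a fix for the PDF output itself; loose_pattern is a hypothetical helper name, and the sample string is taken from the pdftotext excerpt above.

```python
import re


def loose_pattern(query):
    """Hypothetical helper: regex matching query with optional whitespace
    allowed between any two of its non-whitespace characters."""
    chars = [re.escape(c) for c in query if not c.isspace()]
    return re.compile(r"\s*".join(chars))


# Fragment of the pdftotext output shown above.
pdf_layer = "fulminan              t  au f  de n  Mo ni  \nto rm ar  kt   zu rü ck"

match = loose_pattern("Monitormarkt").search(pdf_layer)
if match:
    # Character offsets of the hit inside the spaced-out text.
    print(match.start(), match.end())
```

The trade-off is that such a pattern also ignores genuine word boundaries, so a query like "auf den" would match "aufden" as well.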

Is this a fault in tesseract, or did I do something wrong?

Thanks in advance!

