Hello, I am using the following version of the software:

tesseract 4.0.0
 leptonica-1.76.0
  libjpeg 9c : libpng 1.6.37 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.1 : libopenjp2 2.3.0
 Found AVX512BW
 Found AVX512F
 Found AVX2
 Found AVX
 Found SSE

I am trying to convert a .tif into a PDF from within a Python script:

    pdf = pytesseract.image_to_pdf_or_hocr(result, lang='deu+tur+kur', extension='pdf', config='--psm 6')

The text layer underneath the picture then reads as follows (extracted with pdftotext -layout xyz.pdf):

    ... Nach langer Abstinenz ist Apple fulminan t au f de n Mo ni to rm ar kt zu rü ck ge ke hr t: Da s Pr o ...

If I use plain text conversion instead:

    text = pytesseract.image_to_string(result, lang='deu+tur+kur', config='--psm 6')

the output is correct (as on the .tif):

    ... Nach langer Abstinenz ist Apple fulminant auf den Monitormarkt zurückgekehrt: Das Pro ...

The text is needed for search operations, so the added whitespace is quite annoying. Is this a fault in Tesseract, or did I do something wrong?

Thanks in advance!
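
To make the symptom easier to reproduce, here is a minimal sketch that compares the two outputs quoted above. It only shows that the PDF text layer and the plain-text result differ in whitespace and nothing else, and it offers a whitespace-insensitive search as a possible stopgap. The helper names (strip_whitespace, differs_only_in_whitespace, contains_ignoring_whitespace) are hypothetical and not part of pytesseract or Tesseract; the two sample strings are copied from the outputs above.

```python
import re


def strip_whitespace(text: str) -> str:
    """Remove every whitespace character (spaces, tabs, newlines)."""
    return re.sub(r"\s+", "", text)


def differs_only_in_whitespace(a: str, b: str) -> bool:
    """True if both strings are identical once all whitespace is removed."""
    return strip_whitespace(a) == strip_whitespace(b)


def contains_ignoring_whitespace(haystack: str, needle: str) -> bool:
    """Substring search that ignores whitespace in both arguments.

    Stopgap only: it also matches across real word boundaries,
    so it can report hits that a normal search would not.
    """
    return strip_whitespace(needle) in strip_whitespace(haystack)


# Text layer of the PDF as reported by pdftotext -layout
pdf_text = (
    "Nach langer Abstinenz ist Apple fulminan t au f de n "
    "Mo ni to rm ar kt zu rü ck ge ke hr t: Da s Pr o"
)

# Result of pytesseract.image_to_string for the same image
plain_text = (
    "Nach langer Abstinenz ist Apple fulminant auf den "
    "Monitormarkt zurückgekehrt: Das Pro"
)

if __name__ == "__main__":
    print(differs_only_in_whitespace(pdf_text, plain_text))
    print(contains_ignoring_whitespace(pdf_text, "Monitormarkt"))
```

This does not repair the PDF itself; it only confirms that the recognized characters are the same in both outputs, so the problem appears to lie in how the spaces end up in the PDF text layer rather than in the recognition.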