The attached PDF was OCRed by ocrmypdf using tesseract 4.00.00alpha on
Linux 4.13.0-32-generic #35~16.04.1-Ubuntu SMP x86_64.

In some PDF viewers (Evince, Chrome, Opera) everything is fine, but in others
(Firefox, Alfresco Share, pdfjs) the spaces between words are lost.

For example, the text "Test PDF from LibreOffice" comes out as one big word,
"TestPDFfromLibreOffice", after copy/paste.
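
For anyone who wants to check a viewer's copy/paste output mechanically, here
is a minimal sketch in Python. The function name spaces_lost is hypothetical,
and it only compares two strings that you paste in yourself; it does not read
the PDF and does not call tesseract, ocrmypdf or pdfjs.

```python
def spaces_lost(expected: str, extracted: str) -> bool:
    """Return True if `extracted` has the same characters as `expected`
    but fewer spaces between words (hypothetical helper for this report)."""
    # Collapse runs of whitespace so line breaks and double spaces do not matter.
    expected_norm = " ".join(expected.split())
    extracted_norm = " ".join(extracted.split())
    # Same non-space characters in the same order...
    same_chars = expected_norm.replace(" ", "") == extracted_norm.replace(" ", "")
    # ...but the extracted text is shorter, so at least one space is missing.
    return same_chars and len(extracted_norm) < len(expected_norm)


if __name__ == "__main__":
    source_text = "Test PDF from LibreOffice"
    # Text as pasted from a viewer that drops the spaces.
    print(spaces_lost(source_text, "TestPDFfromLibreOffice"))
    # Text as pasted from a viewer that keeps the spaces.
    print(spaces_lost(source_text, "Test PDF from LibreOffice"))
```

Running the same check on text pasted from each viewer gives a simple yes/no
per viewer that can be included in a bug report.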

You can load the PDF into the pdfjs demo here: 

If I use some other, commercial OCR engines on the source PDF, I get an OCRed
PDF with normal spaces in all PDF viewers (including pdfjs).

So this is a two-sided problem: the tesseract devs say it is a pdfjs problem,
and the pdfjs devs say it is a tesseract problem.

Is it possible to solve this "spaces" problem with some tesseract (or
ocrmypdf) option that forces space recognition, as other OCR engines do?
If not, could someone help me understand the root cause, so that I can give
the pdfjs devs more information?

Attachment: Testpdfsandwich.pdf
Description: Adobe PDF document
