[tesseract-ocr] Lost spaces in some pdf renderers

DJArty Mon, 19 Feb 2018 06:03:28 -0800

The attached PDF was OCRed by ocrmypdf using tesseract 4.00.00alpha on
Linux 4.13.0-32-generic #35~16.04.1-Ubuntu SMP x86_64 x86_64 x86_64
GNU/Linux.


In some PDF viewers (Evince, Chrome, Opera) everything is fine, but in others
(Firefox, Alfresco Share, pdfjs) the spaces between words are lost.

So the text "Test PDF from LibreOffice" comes out as one long word,
"TestPDFfromLibreOffice", after copy/paste.
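
For anyone who wants to check copied text for this symptom, here is a minimal Python sketch. It only compares two strings; the helper name spaces_lost is hypothetical and is not part of tesseract, ocrmypdf or pdfjs.

```python
def spaces_lost(expected: str, extracted: str) -> bool:
    """Return True if `extracted` is `expected` with all whitespace removed."""
    collapsed = "".join(expected.split())
    # Spaces count as "lost" only if the expected text had some to begin with
    # and the extracted text matches the whitespace-free form exactly.
    return collapsed != expected and extracted.strip() == collapsed


expected = "Test PDF from LibreOffice"
print(spaces_lost(expected, "TestPDFfromLibreOffice"))     # text pasted without spaces
print(spaces_lost(expected, "Test PDF from LibreOffice"))  # text pasted with spaces
```

Pasting the text copied from each viewer as the second argument shows which viewers drop the spaces.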

You can load the PDF into the pdfjs demo here:
https://mozilla.github.io/pdf.js/web/viewer.html 

If I run other commercial OCR engines on the source PDF, the OCRed PDF has
normal spaces in all PDF viewers (including pdfjs).

So this is a two-sided problem: the tesseract devs say it is a pdfjs problem,
and the pdfjs devs say it is a tesseract problem.

Is it possible to solve this "spaces" problem with some options for tesseract
(ocrmypdf) that force space recognition (as in other OCR engines)?
Or can someone help find the root cause, so that I can give the pdfjs devs more
information?


Testpdfsandwich.pdf
Description: Adobe PDF document
