You now need to check the coordinates returned by Tesseract. Use hOCR 
output and check whether the word coordinates are correct. If they are, 
the bug is in the PDF generation.

If the coordinates are wrong, it is a bug in Tesseract.
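
A rough sketch of how the word coordinates could be pulled out of the hOCR 
file for inspection, using only the Python standard library. It assumes the 
hOCR was produced with something like `tesseract input.png output hocr` and 
that words are marked up as `ocrx_word` spans with a `bbox x0 y0 x1 y1` entry 
in the `title` attribute. The names `HocrWordParser`, `extract_word_boxes` and 
`find_suspect_boxes`, and the sample markup, are made up for this 
illustration; they are not part of Tesseract.

```python
import re
from html.parser import HTMLParser

# hOCR stores the box as "bbox x0 y0 x1 y1" inside the title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collects (text, bbox) pairs from ocrx_word spans (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.words = []    # finished (text, (x0, y0, x1, y1)) pairs
        self._bbox = None  # bbox of the word span currently open, if any
        self._depth = 0    # nesting depth of tags inside the open word span
        self._text = []    # text fragments of the open word span

    def handle_starttag(self, tag, attrs):
        if self._bbox is not None:
            # A tag nested inside a word, e.g. <em> or <strong>.
            self._depth += 1
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._depth = 0
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._bbox is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        # This end tag closes the word span itself.
        self.words.append(("".join(self._text).strip(), self._bbox))
        self._bbox = None


def extract_word_boxes(hocr_text):
    """Return a list of (word, (x0, y0, x1, y1)) from hOCR markup."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


def find_suspect_boxes(words, page_width, page_height):
    """Return words whose box is empty, inverted, or outside the page."""
    return [
        (text, box)
        for text, box in words
        if not (0 <= box[0] < box[2] <= page_width
                and 0 <= box[1] < box[3] <= page_height)
    ]


# Made-up sample in the shape of Tesseract's hOCR output.
SAMPLE = """<div class='ocr_page' title='bbox 0 0 200 100'>
<span class='ocr_line' title='bbox 10 10 190 40'>
<span class='ocrx_word' id='word_1_1' title='bbox 10 10 60 40; x_wconf 96'>the</span>
<span class='ocrx_word' id='word_1_2' title='bbox 70 10 190 40; x_wconf 91'><em>quick</em></span>
</span></div>"""

if __name__ == "__main__":
    for text, box in extract_word_boxes(SAMPLE):
        print(text, box)
```

The boxes are in image pixels with the origin at the top-left corner, so they 
can be compared directly against the word positions in the source image. If 
they look right there but the text layer in the PDF is shifted, that points at 
the PDF generation step.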

In the past I used a library called iTextSharp to generate searchable PDFs. 
It is a port of the Java iText library, and it gives good PDF output.


On Wednesday, 25 March 2020 at 11:25:46 AM UTC+2, Teo wrote:
>
> OK, I think it's the PDF generation module, because the text is almost 
> the same, except for some instances of "the" that Tesseract reads as "thè".
>
> On Wednesday, 25 March 2020 at 07:25:11 UTC+1, Essam Zaky wrote:
>>
>> You need to know which to improve: the Tesseract engine or the PDF 
>> generation.
>>
>> So compare the text files from ABBYY and Tesseract. 
>> If the results are very different, you need to improve the image quality 
>> or improve the LSTM model. 
>>
>> If the Tesseract result is good, you need to enhance the PDF 
>> generation module.
>>
>> On Wednesday, 25 March 2020 at 7:04:14 AM UTC+2, Teo wrote:
>>>
>>> The quality is already very good, but it is lower than ABBYY FineReader. 
>>> The attachment has a comparison between ABBYY and gImageReader OCR, where 
>>> you can see the difference. How can we improve it?
>>>
>>>
>>>
>>>
