You now need to check the coordinates returned by Tesseract. Use the hOCR output and check whether the word coordinates are returned correctly. If they are, the bug is in the PDF generation.
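One way to do that check is sketched below. It assumes the hOCR file was produced with something like `tesseract page.png page hocr` (the file names are placeholders) and that each word is a `span` of class `ocrx_word` whose `title` attribute carries `bbox x0 y0 x1 y1`, as in the usual Tesseract hOCR output. The helper names and the sample markup are made up for illustration, and the page size has to be supplied by you.

```python
import re
from html.parser import HTMLParser

# hOCR stores the box in the title attribute, e.g. "bbox 36 92 96 116; x_wconf 95"
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span.

    Assumption: word spans contain no nested spans (no per-character boxes).
    """

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # box of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def parse_hocr_words(hocr):
    """Return a list of (word, bbox) pairs found in an hOCR string."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


def suspicious_words(words, page_width, page_height):
    """Return words whose box is empty, inverted, or outside the page."""
    bad = []
    for text, (x0, y0, x1, y1) in words:
        inside = x0 >= 0 and y0 >= 0 and x1 <= page_width and y1 <= page_height
        if not (x0 < x1 and y0 < y1 and inside):
            bad.append((text, (x0, y0, x1, y1)))
    return bad


# Hypothetical two-word sample, only to show the expected markup shape.
SAMPLE_HOCR = """
<div class='ocr_page' id='page_1' title='bbox 0 0 640 480'>
 <span class='ocr_line' id='line_1_1' title='bbox 36 92 220 116'>
  <span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 95'>the</span>
  <span class='ocrx_word' id='word_1_2' title='bbox 110 92 220 116; x_wconf 91'>quality</span>
 </span>
</div>
"""

if __name__ == "__main__":
    for word, bbox in parse_hocr_words(SAMPLE_HOCR):
        print(word, bbox)
```

This only catches boxes that are obviously broken. To judge whether the boxes really sit on the words, you still need to compare them with the image, for example by drawing them over the page.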
If instead the coordinates are wrong, the bug is in Tesseract itself. For my part, I previously used a library called iTextSharp to generate searchable PDFs. It is a port of the Java iText library, and it gives good PDF output.

On Wednesday, 25 March 2020 at 11:25:46 AM UTC+2, Teo wrote:
> OK, I think the problem is in the PDF generation module, because the text output is almost the same, except for some occurrences of "the" that Tesseract reads as "thè".
>
> On Wednesday, 25 March 2020 at 07:25:11 UTC+1, Essam Zaky wrote:
>> You need to know which to improve: the Tesseract engine or the PDF generation.
>>
>> So compare the text files from ABBYY and Tesseract. If the results are very different, you need to improve the image quality or improve the LSTM model.
>>
>> If the Tesseract result is good, you need to enhance the PDF generation module.
>>
>> On Wednesday, 25 March 2020 at 7:04:14 AM UTC+2, Teo wrote:
>>> The quality is already very good, but it is lower than ABBYY FineReader. The attachment has a comparison between ABBYY and gImageReader OCR, and you can see the difference. How can we improve it?