So I think the error is in the PDF generation module. You have the following options:
- try to fix the bug yourself
- raise an issue in the Tesseract issue tracker, but first check that it has not already been reported
- use another external library to create a searchable PDF from the hOCR output
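
For the hOCR route, the word boxes first have to be read out of the hOCR file before any PDF library can place text. Below is a minimal sketch using only the Python standard library. It assumes Tesseract's usual hOCR layout, where each word is a `span` with class `ocrx_word` and a `title` attribute containing `bbox x0 y0 x1 y1`, and that word spans do not contain nested spans. The names `HocrWordParser` and `hocr_words` and the sample string are illustrative only, not part of Tesseract.

```python
import re
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" part of an hOCR title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside an open word span.
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumption: word spans contain no nested <span>, so the first
        # closing </span> after a word start closes that word.
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def hocr_words(hocr_text):
    """Return a list of (word, bbox) pairs found in an hOCR document."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


# Hand-written sample in the assumed hOCR shape (not real Tesseract output).
sample = ("<span class='ocrx_word' id='word_1_1' "
          "title='bbox 105 66 178 97; x_wconf 96'>The</span>")
print(hocr_words(sample))
```

To use it on real output, read the hOCR file as UTF-8 text and pass the string to `hocr_words`. The resulting boxes are pixel coordinates in the source image, so they can be checked against the image by eye or passed on to whatever PDF library you choose.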

Before Tesseract added PDF generation, I used a library called iTextSharp to generate the PDF, and the result was very good for me.

On Thursday, 26 March 2020 at 10:54:50 PM UTC+2, Teo wrote:
>
> OK, the coordinates seem correct.
>
> On Thursday, 26 March 2020 at 19:13:52 UTC+1, Essam Zaky wrote:
>>
>> Read this document:
>> https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage
>>
>> The following command returns the coordinates:
>>
>> tesseract testing/eurotext.png testing/eurotext-eng -l eng hocr
>>
>> The hOCR output contains each word as text together with its coordinates.
>> You can open the image in any image editor, such as MS Paint, and check that
>> the returned coordinates match the word in the image.
>>
>> Best regards
>>
>> On Thursday, 26 March 2020 at 1:10:22 PM UTC+2, Teo wrote:
>>>
>>> Thanks for your help. How can I get the coordinates, and how do I check
>>> if they are correct?
>>>
>>> On Wednesday, 25 March 2020 at 10:41:07 UTC+1, Essam Zaky wrote:
>>>>
>>>> You now need to check the coordinates returned by Tesseract. Use the hOCR
>>>> output and check whether the word coordinates are correct. If they are,
>>>> it is a bug in PDF generation.
>>>>
>>>> If the coordinates are wrong, it is a bug in Tesseract.
>>>>
>>>> I previously used a library called iTextSharp to generate searchable
>>>> PDFs. It is a port of the iText Java library and gives good PDF
>>>> output.
>>>>
>>>> On Wednesday, 25 March 2020 at 11:25:46 AM UTC+2, Teo wrote:
>>>>>
>>>>> OK, I think it is the PDF generation module, because the text is
>>>>> almost the same, except for some instances of "the" that Tesseract
>>>>> reads as "thè".
>>>>>
>>>>> On Wednesday, 25 March 2020 at 07:25:11 UTC+1, Essam Zaky wrote:
>>>>>>
>>>>>> You need to find out which to improve: the Tesseract engine or PDF
>>>>>> generation.
>>>>>>
>>>>>> So compare the text files from ABBYY and Tesseract.
>>>>>> If the results differ greatly, you need to improve the image quality
>>>>>> or improve the LSTM model.
>>>>>>
>>>>>> If the Tesseract result is good, you need to improve the PDF
>>>>>> generation module.
>>>>>>
>>>>>> On Wednesday, 25 March 2020 at 7:04:14 AM UTC+2, Teo wrote:
>>>>>>>
>>>>>>> The quality is already very good, but it is lower than ABBYY
>>>>>>> FineReader. Attached is a comparison between ABBYY and gImageReader
>>>>>>> OCR, where you can see the difference. How can we improve it?
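
The text comparison suggested earlier in the thread (ABBYY output versus Tesseract output) can be roughly automated. The sketch below uses Python's standard `difflib` to compare two texts word by word and list the places where they disagree. The function name `word_diff` and the sample strings are illustrative; it assumes both engines' output has already been saved as plain text.

```python
import difflib


def word_diff(reference_text, candidate_text):
    """Compare two OCR texts word by word.

    Returns (similarity, mismatches), where similarity is difflib's ratio
    in [0, 1] and mismatches is a list of (reference_words, candidate_words)
    pairs for every span that is not identical.
    """
    ref = reference_text.split()
    cand = candidate_text.split()
    # autojunk=False keeps frequent words such as "the" from being ignored.
    matcher = difflib.SequenceMatcher(None, ref, cand, autojunk=False)
    mismatches = [
        (" ".join(ref[i1:i2]), " ".join(cand[j1:j2]))
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]
    return matcher.ratio(), mismatches


# Made-up example texts; real input would come from the two engines' text files.
similarity, mismatches = word_diff(
    "the cat sat on the mat",
    "thè cat sat on the mat",
)
print(f"similarity: {similarity:.2f}")
for expected, got in mismatches:
    print(f"{expected!r} -> {got!r}")
```

A high similarity with only a few mismatches like "the" versus "thè" points at PDF generation rather than recognition, which matches the conclusion reached in the thread.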

