It works fine on my machine. It seems to be a problem with your PDF viewer; I used Adobe PDF Reader V9.0.
Some PDF readers fail to read searchable PDFs, so try checking with another reader.

Best Regards
Essam

On Saturday, 28 March 2020 at 7:34:59 PM UTC+2, Teo wrote:
>
> Ok
>
> On Saturday, 28 March 2020 at 18:32:26 UTC+1, Essam Zaky wrote:
>>
>> Please attach the original image so I can check it on my machine.
>>
>> On Saturday, 28 March 2020 at 7:24:07 PM UTC+2, Teo wrote:
>>>
>>> Thanks for the reply.
>>> I just opened an issue on github/Tesseract. Then I tried to create a
>>> PDF with Tesseract alone, without gImageReader, using:
>>> tesseract pho.png pho-eng -l eng pdf
>>> but this is the result...
>>>
>>> On Friday, 27 March 2020 at 03:13:40 UTC+1, Essam Zaky wrote:
>>>>
>>>> So I guess the error is in the PDF generation module.
>>>> You have the following options:
>>>> - try to fix the bug yourself
>>>> - raise an issue in the Tesseract issue tracker, but first check
>>>>   that it is not already in the list of issues
>>>> - use another external library to create the searchable PDF from
>>>>   the hOCR output
>>>>
>>>> Before Tesseract added the PDF generation feature, I used a library
>>>> called iTextSharp to generate the PDF, and the result was very good
>>>> for me.
>>>>
>>>> On Thursday, 26 March 2020 at 10:54:50 PM UTC+2, Teo wrote:
>>>>>
>>>>> OK, the coordinates seem correct.
>>>>>
>>>>> On Thursday, 26 March 2020 at 19:13:52 UTC+1, Essam Zaky wrote:
>>>>>>
>>>>>> Read this document:
>>>>>> https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage
>>>>>>
>>>>>> The following command returns the coordinates:
>>>>>>
>>>>>> tesseract testing/eurotext.png testing/eurotext-eng -l eng hocr
>>>>>>
>>>>>> The hOCR output contains each word as text together with its
>>>>>> coordinates. You can open the image in any image editor, such as
>>>>>> MS Paint, and check that the returned coordinates match the
>>>>>> position of the word in the image.
>>>>>>
>>>>>> Best Regards
>>>>>>
>>>>>> On Thursday, 26 March 2020 at 1:10:22 PM UTC+2, Teo wrote:
>>>>>>>
>>>>>>> Thanks for your help.
>>>>>>> How can I get the coordinates, and how do I check whether they
>>>>>>> are correct?
>>>>>>>
>>>>>>> On Wednesday, 25 March 2020 at 10:41:07 UTC+1, Essam Zaky wrote:
>>>>>>>>
>>>>>>>> Now you need to check the coordinates returned by Tesseract.
>>>>>>>> Use the hOCR output and check whether the word coordinates are
>>>>>>>> returned correctly. If they are, it is a bug in the PDF
>>>>>>>> generation.
>>>>>>>>
>>>>>>>> If the coordinates are wrong, it is a bug in Tesseract.
>>>>>>>>
>>>>>>>> In the past I used a library called iTextSharp to generate
>>>>>>>> searchable PDFs. The library is a port of the Java library
>>>>>>>> iText, and it gives good PDF output.
>>>>>>>>
>>>>>>>> On Wednesday, 25 March 2020 at 11:25:46 AM UTC+2, Teo wrote:
>>>>>>>>>
>>>>>>>>> OK, I think the problem is in the PDF generation module,
>>>>>>>>> because the txt output is almost the same, with the exception
>>>>>>>>> of some "the" which Tesseract sees as "thè".
>>>>>>>>>
>>>>>>>>> On Wednesday, 25 March 2020 at 07:25:11 UTC+1, Essam Zaky wrote:
>>>>>>>>>>
>>>>>>>>>> You need to find out which to improve: the Tesseract engine
>>>>>>>>>> or the PDF generation.
>>>>>>>>>>
>>>>>>>>>> So compare the text files from ABBYY and Tesseract. If the
>>>>>>>>>> results are very different, you need to improve the image
>>>>>>>>>> quality or improve the LSTM model.
>>>>>>>>>>
>>>>>>>>>> If the Tesseract result is good, you need to improve the PDF
>>>>>>>>>> generation module.
>>>>>>>>>>
>>>>>>>>>> On Wednesday, 25 March 2020 at 7:04:14 AM UTC+2, Teo wrote:
>>>>>>>>>>>
>>>>>>>>>>> The quality is already very good, but it is lower than
>>>>>>>>>>> ABBYY FineReader. Attached is a comparison between the ABBYY
>>>>>>>>>>> and gImageReader OCR, and you can see the difference. How
>>>>>>>>>>> can we improve it?

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
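For the step in the thread where the ABBYY and Tesseract text files are compared, here is a rough sketch of how that comparison could be scripted with Python's standard `difflib` module. The function names and the two sample strings are made up for illustration; they are not from the actual files in this thread. It compares word by word, so a result close to 1.0 would suggest the recognition itself is fine and the problem lies elsewhere (e.g. in PDF generation).

```python
import difflib


def similarity(text_a, text_b):
    """Return a 0..1 ratio of how similar two OCR outputs are, word by word."""
    return difflib.SequenceMatcher(None, text_a.split(), text_b.split()).ratio()


def differing_words(text_a, text_b):
    """List (from_a, from_b) pairs for every stretch of words that differs."""
    words_a, words_b = text_a.split(), text_b.split()
    matcher = difflib.SequenceMatcher(None, words_a, words_b)
    return [
        (" ".join(words_a[i1:i2]), " ".join(words_b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical sample strings standing in for the two exported .txt files.
reference = "the quick brown fox jumps over the lazy dog"
tesseract_out = "thè quick brown fox jumps over thè lazy dog"

print(round(similarity(reference, tesseract_out), 3))
print(differing_words(reference, tesseract_out))
```

In practice you would read the two text files from disk instead of using the inline strings.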

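To check the hOCR coordinates without reading the file by eye, something like the following could list each word with its bounding box. This is only a sketch using the Python standard library: it assumes the words are `span` elements with class `ocrx_word` and a `title` attribute that contains `bbox x0 y0 x1 y1`, which is how hOCR output is normally laid out. The class name `HocrWordParser`, the helper `hocr_words`, and the sample markup are made up for illustration.

```python
import re
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word span currently open, if any
        self._text = []     # text pieces seen inside that span
        self._depth = 0     # nested spans inside the word span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._bbox is not None:
            self._depth += 1  # a span nested inside the current word
            return
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or self._bbox is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        self.words.append(("".join(self._text).strip(), self._bbox))
        self._bbox = None


def hocr_words(hocr_text):
    """Parse hOCR markup and return the list of (word, bbox) tuples."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    return parser.words


# Made-up fragment in the usual hOCR shape, not real output from this thread.
sample = """<span class='ocr_line' title='bbox 36 92 580 122'>
<span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 96'>The</span>
<span class='ocrx_word' id='word_1_2' title='bbox 109 92 221 122; x_wconf 95'>(quick)</span>
</span>"""

for word, bbox in hocr_words(sample):
    print(word, bbox)
```

The boxes are pixel coordinates (left, top, right, bottom) in the source image, so they can be compared against the word positions seen in an image editor.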
