With the same command?
tesseract pho.png pho-eng -l eng pdf
On Saturday, 28 March 2020 at 18:48:17 UTC+1, Essam Zaky wrote:
>
> It works fine on my machine.
> It seems to be a problem with your PDF viewer.
> I used Adobe PDF Reader V9.0.
>
> Some PDF readers fail to read searchable PDFs; try checking with
> another reader.
>
> Best Regards
> Essam
>
> On Saturday, 28 March 2020 at 7:34:59 PM UTC+2, Teo wrote:
>>
>> Ok
>>
>> On Saturday, 28 March 2020 at 18:32:26 UTC+1, Essam Zaky wrote:
>>>
>>> Please attach the original image so I can check it on my machine.
>>>
>>> On Saturday, 28 March 2020 at 7:24:07 PM UTC+2, Teo wrote:
>>>>
>>>> Thanks for the reply.
>>>> I just opened an issue on github/Tesseract. Then I tried to create a
>>>> PDF with tesseract alone, without gImageReader, using:
>>>> tesseract pho.png pho-eng -l eng pdf
>>>> but this is the result...
>>>>
>>>> On Friday, 27 March 2020 at 03:13:40 UTC+1, Essam Zaky wrote:
>>>>>
>>>>> So I guess the error is in the PDF generation module.
>>>>> You have the following options:
>>>>> - try to fix the bug yourself
>>>>> - raise an issue in the Tesseract issue tracker, but first check
>>>>>   that it is not already in the list of issues
>>>>> - use another external library to create a searchable PDF from
>>>>>   the hOCR output
>>>>>
>>>>> Before Tesseract added the PDF generation feature, I used a library
>>>>> called iTextSharp to generate the PDF, and the result was very good
>>>>> for me.
>>>>>
>>>>> On Thursday, 26 March 2020 at 10:54:50 PM UTC+2, Teo wrote:
>>>>>>
>>>>>> Ok, the coordinates seem correct.
>>>>>>
>>>>>> On Thursday, 26 March 2020 at 19:13:52 UTC+1, Essam Zaky wrote:
>>>>>>>
>>>>>>> Read this document:
>>>>>>> https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage
>>>>>>>
>>>>>>> The following command returns the coordinates:
>>>>>>>
>>>>>>> tesseract testing/eurotext.png testing/eurotext-eng -l eng hocr
>>>>>>>
>>>>>>> The hOCR output contains each word as text together with its
>>>>>>> coordinates. You can open the image in any image editor, such as
>>>>>>> MS Paint, and check that the returned coordinates match the words
>>>>>>> in the image.
>>>>>>>
>>>>>>> Best Regards
>>>>>>>
>>>>>>> On Thursday, 26 March 2020 at 1:10:22 PM UTC+2, Teo wrote:
>>>>>>>>
>>>>>>>> Thanks for your help. How can I get the coordinates, and how do I
>>>>>>>> check if they are correct?
>>>>>>>>
>>>>>>>> On Wednesday, 25 March 2020 at 10:41:07 UTC+1, Essam Zaky wrote:
>>>>>>>>>
>>>>>>>>> You now need to check the coordinates returned by Tesseract. Use
>>>>>>>>> the hOCR output and check whether the word coordinates are
>>>>>>>>> returned correctly. If they are, it is a bug in PDF generation.
>>>>>>>>>
>>>>>>>>> If the coordinates are wrong, it is a bug in Tesseract.
>>>>>>>>>
>>>>>>>>> In the past I used a library called iTextSharp to generate
>>>>>>>>> searchable PDFs. The library is a port of the iText Java library
>>>>>>>>> and gives good PDF output.
>>>>>>>>>
>>>>>>>>> On Wednesday, 25 March 2020 at 11:25:46 AM UTC+2, Teo wrote:
>>>>>>>>>>
>>>>>>>>>> Ok, I think the problem is in the PDF generation module, because
>>>>>>>>>> the text is almost the same, except for some instances of "the"
>>>>>>>>>> which Tesseract reads as "thè".
>>>>>>>>>>
>>>>>>>>>> On Wednesday, 25 March 2020 at 07:25:11 UTC+1, Essam Zaky wrote:
>>>>>>>>>>>
>>>>>>>>>>> You need to find out which to improve: the Tesseract engine or
>>>>>>>>>>> PDF generation.
>>>>>>>>>>>
>>>>>>>>>>> So compare the text files from ABBYY and Tesseract.
>>>>>>>>>>> If the results differ greatly, you need to improve the image
>>>>>>>>>>> quality or improve the LSTM model.
>>>>>>>>>>>
>>>>>>>>>>> If the Tesseract result is good, you need to improve the
>>>>>>>>>>> PDF generation module.
>>>>>>>>>>>
>>>>>>>>>>> On Wednesday, 25 March 2020 at 7:04:14 AM UTC+2, Teo wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> The quality is already very good, but it is lower than ABBYY
>>>>>>>>>>>> FineReader. Attached is a comparison between ABBYY and
>>>>>>>>>>>> gImageReader OCR, where you can see the difference. How can we
>>>>>>>>>>>> improve it?
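
The hOCR coordinate check discussed in this thread can also be scripted instead of being done by eye in an image editor. What follows is only a minimal Python sketch, not something posted by the participants. It assumes the hOCR file marks each word with a span of class `ocrx_word` whose `title` attribute begins with `bbox left top right bottom`, which is the layout Tesseract's `hocr` output normally uses. The names `Word`, `parse_hocr_words`, `find_out_of_bounds` and the `SAMPLE` fragment are hypothetical and made up for illustration.

```python
import html
import re
from typing import List, NamedTuple


class Word(NamedTuple):
    """One recognized word and its bounding box in image pixels."""
    text: str
    left: int
    top: int
    right: int
    bottom: int


# Assumed shape of a word element in the hOCR file:
# <span class='ocrx_word' id='word_1_1' title='bbox 105 66 178 97; x_wconf 96'>The</span>
WORD_RE = re.compile(
    r"<span[^>]*class=['\"]ocrx_word['\"][^>]*"
    r"title=['\"]bbox (\d+) (\d+) (\d+) (\d+)[^'\"]*['\"][^>]*>(.*?)</span>",
    re.DOTALL,
)
TAG_RE = re.compile(r"<[^>]+>")  # strips inline markup such as <strong> or <em>


def parse_hocr_words(hocr: str) -> List[Word]:
    """Return every ocrx_word found in the hOCR text, in document order."""
    words = []
    for match in WORD_RE.finditer(hocr):
        left, top, right, bottom = (int(g) for g in match.groups()[:4])
        text = html.unescape(TAG_RE.sub("", match.group(5))).strip()
        words.append(Word(text, left, top, right, bottom))
    return words


def find_out_of_bounds(words: List[Word], width: int, height: int) -> List[Word]:
    """Return words whose box is inverted or falls outside a width x height image."""
    return [
        w for w in words
        if w.left < 0 or w.top < 0
        or w.right > width or w.bottom > height
        or w.left >= w.right or w.top >= w.bottom
    ]


# Hypothetical fragment, used only so the sketch runs without any input file.
SAMPLE = """<span class='ocr_line' id='line_1_1' title='bbox 105 66 823 113'>
<span class='ocrx_word' id='word_1_1' title='bbox 105 66 178 97; x_wconf 96'>The</span>
<span class='ocrx_word' id='word_1_2' title='bbox 205 67 298 106; x_wconf 95'>(quick)</span>
</span>"""

if __name__ == "__main__":
    for word in parse_hocr_words(SAMPLE):
        print(word)
```

To use it on real output, read the `.hocr` file produced by the `tesseract ... hocr` command into a string, pass it to `parse_hocr_words`, and compare the boxes with the image size. A box that falls outside the image points at the recognition step, while plausible boxes combined with a misplaced text layer in the PDF point at PDF generation, as suggested in the thread.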

