It works fine on my machine.
It seems to be a problem with your PDF viewer.
I used Adobe PDF Reader V9.0.

Some PDF readers fail to read searchable PDFs, so try checking the file 
in another reader.

Best Regards
Essam
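
For anyone repeating the hOCR coordinate check discussed further down in this thread, here is a minimal Python sketch that lists each recognized word with its bounding box, so the numbers can be compared against the image. It assumes the usual hOCR layout, where each word is a span with class ocrx_word and a title attribute containing "bbox x0 y0 x1 y1". The names parse_hocr and boxes_outside_page and the sample markup are made up for illustration and are not part of Tesseract.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def parse_hocr(hocr_text):
    """Return a list of (word, bbox) pairs found in an hOCR string."""
    parser = HocrWords()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


def boxes_outside_page(words, width, height):
    """Return words whose box is empty or not inside a width x height image."""
    bad = []
    for text, (x0, y0, x1, y1) in words:
        if not (0 <= x0 < x1 <= width and 0 <= y0 < y1 <= height):
            bad.append((text, (x0, y0, x1, y1)))
    return bad


# Hypothetical hOCR fragment, only for demonstration.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 640 480'>
 <span class='ocr_line' title='bbox 36 92 234 116'>
  <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 96'>The</span>
  <span class='ocrx_word' title='bbox 109 92 234 116; x_wconf 93'>quick</span>
 </span>
</div>
"""

for word, bbox in parse_hocr(SAMPLE):
    print(word, bbox)
```

To run it on real output, read the hOCR file written by the tesseract command into a string and pass it to parse_hocr. The printed boxes are pixel coordinates with the origin at the top-left corner of the image, which is what an image editor shows as the cursor position.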

On Saturday, March 28, 2020 at 7:34:59 PM UTC+2, Teo wrote:
>
>
> Ok
> On Saturday, March 28, 2020 at 18:32:26 UTC+1, Essam Zaky wrote:
>>
>> Please attach the original image so I can check it on my machine.
>>
>> On Saturday, March 28, 2020 at 7:24:07 PM UTC+2, Teo wrote:
>>>
>>> Thanks for the reply. 
>>> I just opened an issue on github/Tesseract. Then I tried to create a 
>>> PDF with Tesseract alone, without gImageReader, using: 
>>> tesseract pho.png pho-eng -l eng pdf
>>> but this is the result...
>>>
>>>
>>> On Friday, March 27, 2020 at 03:13:40 UTC+1, Essam Zaky wrote:
>>>>
>>>> So I guess the error is in the PDF generation module.
>>>> You have the following options:
>>>> - try to fix the bug yourself
>>>> - raise an issue in the Tesseract issue tracker, but first check that it 
>>>> is not already in the list of issues
>>>> - use another external library to create the searchable PDF from the hOCR 
>>>> output
>>>>
>>>> Before Tesseract added the PDF generation feature, I used a library called 
>>>> iTextSharp to generate the PDF, and the results were very good for me.
>>>>
>>>> On Thursday, March 26, 2020 at 10:54:50 PM UTC+2, Teo wrote:
>>>>>
>>>>> Ok, the coordinates seem correct.
>>>>>
>>>>> On Thursday, March 26, 2020 at 19:13:52 UTC+1, Essam Zaky wrote:
>>>>>>
>>>>>> Read this document:
>>>>>> https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage
>>>>>>
>>>>>> The following command returns the coordinates:
>>>>>>
>>>>>> tesseract testing/eurotext.png testing/eurotext-eng -l eng hocr
>>>>>>
>>>>>>
>>>>>> The hOCR output contains each word as text together with its coordinates.
>>>>>> You can open the image in any image editor, such as MS Paint, and check 
>>>>>> that the returned coordinates match the position of the word in the image.
>>>>>>
>>>>>> Best Regards
>>>>>>
>>>>>> On Thursday, March 26, 2020 at 1:10:22 PM UTC+2, Teo wrote:
>>>>>>>
>>>>>>> Thanks for your help. How can I get the coordinates, and how do I 
>>>>>>> check if they are correct?
>>>>>>>
>>>>>>> On Wednesday, March 25, 2020 at 10:41:07 UTC+1, Essam Zaky wrote:
>>>>>>>>
>>>>>>>> Now you need to check the coordinates returned by Tesseract. Use the 
>>>>>>>> hOCR output and check whether the word coordinates are correct. If 
>>>>>>>> they are, it is a bug in the PDF generation.
>>>>>>>>
>>>>>>>> If the coordinates are wrong, it's a bug in Tesseract.
>>>>>>>>
>>>>>>>> For my part, I previously used a library called iTextSharp to generate 
>>>>>>>> searchable PDFs. It is a port of the Java iText library and gives 
>>>>>>>> good PDF output.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wednesday, March 25, 2020 at 11:25:46 AM UTC+2, Teo wrote:
>>>>>>>>>
>>>>>>>>> Ok, I think it's the PDF generation module, because the text is 
>>>>>>>>> almost the same, with the exception of some "the" which Tesseract 
>>>>>>>>> sees as "thè".
>>>>>>>>>
>>>>>>>>> On Wednesday, March 25, 2020 at 07:25:11 UTC+1, Essam Zaky wrote:
>>>>>>>>>>
>>>>>>>>>> You need to find out which to improve: the Tesseract engine or the 
>>>>>>>>>> PDF generation.
>>>>>>>>>>
>>>>>>>>>> So compare the text files from ABBYY and Tesseract. 
>>>>>>>>>> If the results are very different, you need to improve the image 
>>>>>>>>>> quality or improve the LSTM model.
>>>>>>>>>>
>>>>>>>>>> If the Tesseract result is good, you need to improve the PDF 
>>>>>>>>>> generation module.
>>>>>>>>>>
>>>>>>>>>> On Wednesday, March 25, 2020 at 7:04:14 AM UTC+2, Teo wrote:
>>>>>>>>>>>
>>>>>>>>>>> The quality is already very good, but it is lower than ABBYY 
>>>>>>>>>>> FineReader. Attached is a comparison between the ABBYY and 
>>>>>>>>>>> gImageReader OCR output, where you can see the difference. How 
>>>>>>>>>>> can we improve it?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
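
The text comparison suggested in the thread (ABBYY output versus Tesseract output) can be roughed out with Python's standard difflib module. This is only a sketch: the function names and the two sample strings are invented for illustration, and a word-level ratio is just one possible way to judge how "highly different" two OCR results are.

```python
import difflib


def word_similarity(reference, candidate):
    """Similarity in [0, 1] between two texts, compared word by word."""
    matcher = difflib.SequenceMatcher(
        None, reference.split(), candidate.split(), autojunk=False
    )
    return matcher.ratio()


def word_differences(reference, candidate):
    """Return (reference_words, candidate_words) pairs where the texts differ."""
    ref, cand = reference.split(), candidate.split()
    matcher = difflib.SequenceMatcher(None, ref, cand, autojunk=False)
    diffs = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            diffs.append((" ".join(ref[i1:i2]), " ".join(cand[j1:j2])))
    return diffs


# Hypothetical sample strings standing in for the two OCR text files.
abbyy_text = "the quick brown fox jumps over the lazy dog"
tesseract_text = "thè quick brown fox jumps over the lazy dog"

print(round(word_similarity(abbyy_text, tesseract_text), 3))
for expected, got in word_differences(abbyy_text, tesseract_text):
    print(repr(expected), "->", repr(got))
```

A ratio close to 1 with only a few isolated differences, such as "the" read as "thè", suggests recognition is fine and the problem lies elsewhere, for example in the PDF generation. A low ratio points toward image quality or the recognition model.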

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/28825e73-ab2c-4941-8a0c-cd10c4bc8e95%40googlegroups.com.
