[tesseract-ocr] Re: How to improve ocr reader?

Teo Sat, 28 Mar 2020 11:38:09 -0700

Ok thanks a lot.

On Saturday, 28 March 2020 at 19:04:25 UTC+1, Essam Zaky wrote:
>
> Yes, with the same command. The result is attached.
>
>
> On Saturday, 28 March 2020 at 7:55:05 PM UTC+2, Teo wrote:
>>
>> With the same command?
>> tesseract pho.png pho-eng -l eng pdf
>>
>>
>>
>> On Saturday, 28 March 2020 at 18:48:17 UTC+1, Essam Zaky wrote:
>>>
>>> It works fine on my machine.
>>> It seems to be a problem with your PDF viewer.
>>> I used Adobe PDF Reader V9.0.
>>>
>>> Some PDF readers fail to read searchable PDFs, so try checking with
>>> another reader.
>>>
>>> Best Regards
>>> Essam
>>>
>>> On Saturday, 28 March 2020 at 7:34:59 PM UTC+2, Teo wrote:
>>>>
>>>>
>>>> Ok
>>>> On Saturday, 28 March 2020 at 18:32:26 UTC+1, Essam Zaky wrote:
>>>>>
>>>>> Please attach the original image so I can check it on my machine.
>>>>>
>>>>> On Saturday, 28 March 2020 at 7:24:07 PM UTC+2, Teo wrote:
>>>>>>
>>>>>> Thanks for the reply.
>>>>>> I just opened an issue on github/Tesseract. Then I tried to create a
>>>>>> PDF with tesseract alone, without gImageReader, using:
>>>>>> tesseract pho.png pho-eng -l eng pdf
>>>>>> but this is the result...
>>>>>>
>>>>>>
>>>>>> On Friday, 27 March 2020 at 03:13:40 UTC+1, Essam Zaky wrote:
>>>>>>>
>>>>>>> So I guess the error is in the PDF generation module.
>>>>>>> You have the following options:
>>>>>>> - try to fix the bug yourself
>>>>>>> - raise an issue in the Tesseract issue tracker, but first check that
>>>>>>> it is not already in the list of issues
>>>>>>> - use another external library to create a searchable PDF from the
>>>>>>> hOCR output
>>>>>>>
>>>>>>> Before Tesseract added the PDF generation feature, I used a library
>>>>>>> called itextsharp to generate the PDF, and the result was very good for me.
>>>>>>>
>>>>>>> On Thursday, 26 March 2020 at 10:54:50 PM UTC+2, Teo wrote:
>>>>>>>>
>>>>>>>> OK, the coordinates seem correct.
>>>>>>>>
>>>>>>>> On Thursday, 26 March 2020 at 19:13:52 UTC+1, Essam Zaky wrote:
>>>>>>>>>
>>>>>>>>> Read this document:
>>>>>>>>> https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage
>>>>>>>>>
>>>>>>>>> The following command returns the coordinates:
>>>>>>>>>
>>>>>>>>> tesseract testing/eurotext.png testing/eurotext-eng -l eng hocr
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The hOCR output contains each word as text together with its coordinates.
>>>>>>>>> You can open the image in any image editor, such as MS Paint, and
>>>>>>>>> check that the returned coordinates match the word's position in the image.
>>>>>>>>>
>>>>>>>>> Best Regards
>>>>>>>>>
>>>>>>>>> On Thursday, 26 March 2020 at 1:10:22 PM UTC+2, Teo wrote:
>>>>>>>>>>
>>>>>>>>>> Thanks for your help. How can I get the coordinates, and how do I
>>>>>>>>>> check whether they are correct?
>>>>>>>>>>
>>>>>>>>>> On Wednesday, 25 March 2020 at 10:41:07 UTC+1, Essam Zaky wrote:
>>>>>>>>>>>
>>>>>>>>>>> Now you need to check the coordinates returned by Tesseract.
>>>>>>>>>>> Use the hOCR output and check whether the word coordinates are
>>>>>>>>>>> returned correctly. If they are, it is a bug in PDF generation.
>>>>>>>>>>>
>>>>>>>>>>> If the coordinates are wrong, it is a bug in Tesseract.
>>>>>>>>>>>
>>>>>>>>>>> In the past I used a library called itextsharp to generate
>>>>>>>>>>> searchable PDFs. The library is a port of the iText Java library,
>>>>>>>>>>> and it gives good PDF output.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wednesday, 25 March 2020 at 11:25:46 AM UTC+2, Teo wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> OK, I think the problem is in the PDF generation module, because
>>>>>>>>>>>> the txt output is almost the same, except for some instances of
>>>>>>>>>>>> "the" that Tesseract reads as "thè".
>>>>>>>>>>>>
>>>>>>>>>>>> On Wednesday, 25 March 2020 at 07:25:11 UTC+1, Essam Zaky wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> You need to find out which part to improve: the Tesseract engine
>>>>>>>>>>>>> or the PDF generation.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So compare the text files from ABBYY and Tesseract.
>>>>>>>>>>>>> If the results differ greatly, you need to improve the image
>>>>>>>>>>>>> quality or improve the LSTM model.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If the Tesseract result is good, you need to enhance the
>>>>>>>>>>>>> PDF generation module.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wednesday, 25 March 2020 at 7:04:14 AM UTC+2, Teo wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The quality is already very good, but it is lower than ABBYY
>>>>>>>>>>>>>> FineReader. Attached is a comparison between the ABBYY and
>>>>>>>>>>>>>> gImageReader OCR, where you can see the difference. How can we
>>>>>>>>>>>>>> improve it?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
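
The coordinate check suggested in the thread (run tesseract with the hocr
output, then compare each word's coordinates with the image) could be sketched
in Python roughly as follows. This is a minimal sketch under stated
assumptions: it assumes each word is a span with class "ocrx_word" whose title
attribute contains "bbox x0 y0 x1 y1" in pixels, measured from the top-left
corner of the image. The names HocrWords and hocr_words and the sample markup
are illustrative only.

```python
import re
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" entry inside an hOCR title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collects (text, (x0, y0, x1, y1)) for every ocrx_word span.

    Assumption: word spans do not contain nested <span> elements.
    """

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(g) for g in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def hocr_words(hocr_text):
    """Return a list of (word, bbox) pairs found in an hOCR document."""
    parser = HocrWords()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


# Illustrative sample, not taken from the thread's attachments.
SAMPLE = """<span class='ocr_line' title='bbox 105 66 823 113'>
<span class='ocrx_word' id='word_1_1' title='bbox 105 66 178 97; x_wconf 96'>The</span>
<span class='ocrx_word' id='word_1_2' title='bbox 205 60 320 97; x_wconf 95'>quick</span>
</span>"""

for word, bbox in hocr_words(SAMPLE):
    print(word, bbox)
```

To try it on real output, read the file written by the hocr command in the
thread (for that command it would be testing/eurotext-eng.hocr) and pass its
contents to hocr_words, then compare a few boxes with the pixel positions
shown in an image editor.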

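The first diagnostic step in the thread, comparing the plain-text output of
the two OCR engines, can also be done programmatically instead of by eye. The
sketch below uses Python's standard difflib module to compute a word-level
similarity ratio and list the words that differ. The function name
compare_ocr_text and the sample strings are hypothetical, and what counts as
"highly different" is a judgment call that this sketch does not decide.

```python
import difflib


def compare_ocr_text(reference, candidate):
    """Compare two OCR text outputs word by word.

    Returns (ratio, diffs): ratio is difflib's similarity score between 0 and 1,
    and diffs is a list of (reference_words, candidate_words) pairs for every
    region where the two outputs disagree.
    """
    ref_words = reference.split()
    cand_words = candidate.split()
    matcher = difflib.SequenceMatcher(None, ref_words, cand_words, autojunk=False)
    diffs = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            diffs.append((" ".join(ref_words[i1:i2]), " ".join(cand_words[j1:j2])))
    return matcher.ratio(), diffs


# Illustrative strings only; in practice these would be the contents of the
# two .txt files produced by the engines being compared.
ratio, diffs = compare_ocr_text("the quick brown fox", "thè quick brown fox")
print(ratio)
print(diffs)
```

A ratio close to 1 with only a few small differences, such as "the" read as
"thè", would point away from recognition and toward the PDF generation step,
which matches the conclusion reached in the thread.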

