[tesseract-ocr] Re: How to improve ocr reader?

Essam Zaky Thu, 26 Mar 2020 19:14:21 -0700

So I guess the error is in the PDF generation module.
You have the following options:
- try to fix the bug yourself
- raise an issue in the Tesseract issue tracker, but first check that it is 
not already in the list of issues
- use another external library to create a searchable PDF from the hOCR output
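
For the third option, the word coordinates first have to be read out of the hOCR file. The following is only a minimal sketch, assuming the hOCR marks each word as a span with class ocrx_word and a title attribute containing "bbox x0 y0 x1 y1". The names HocrWordParser and hocr_words are made up for this example, and the sample coordinates are invented.

```python
import re
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" part of an hOCR title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collects (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []    # text pieces seen inside the open word span
        self._depth = 0    # nesting depth of spans inside the open word span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._bbox is not None:
            self._depth += 1  # a span nested inside a word
            return
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or self._bbox is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        self.words.append(("".join(self._text).strip(), self._bbox))
        self._bbox = None


def hocr_words(hocr_text):
    """Return a list of (word, bbox) pairs found in an hOCR string."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


# Invented sample in the assumed hOCR shape, not real Tesseract output.
sample = (
    "<span class='ocr_line' title='bbox 105 66 823 113'>"
    "<span class='ocrx_word' title='bbox 105 66 178 97; x_wconf 96'>The</span> "
    "<span class='ocrx_word' title='bbox 205 67 298 100; x_wconf 95'>quick</span>"
    "</span>"
)
for text, bbox in hocr_words(sample):
    print(text, bbox)
```

The boxes are in image pixels, so they can be compared directly with the positions shown in an image editor, or handed to whichever PDF library places the invisible text layer.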


Before Tesseract added the PDF generation feature, I used a library called 
iTextSharp to generate the PDF, and the result was very good for me.
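
The first check suggested earlier in this thread, comparing the text output of the two engines, can also be automated. This is a rough sketch using Python's difflib, not the method anyone in the thread actually used. The function name word_diff and the sample strings are made up for illustration.

```python
import difflib


def word_diff(reference_text, candidate_text):
    """Return (similarity ratio, list of (reference, candidate) word runs that differ)."""
    ref_words = reference_text.split()
    cand_words = candidate_text.split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=cand_words, autojunk=False)
    changes = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            # Keep the differing run of words from each side.
            changes.append((" ".join(ref_words[i1:i2]), " ".join(cand_words[j1:j2])))
    return matcher.ratio(), changes


# Invented example strings standing in for the two OCR text outputs.
ratio, changes = word_diff("the cat sat on the mat", "thè cat sat on the mat")
print(round(ratio, 3), changes)
```

A ratio close to 1 with only a few substitutions such as "the" / "thè" points away from the recognition engine and toward the PDF generation step.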

On Thursday, 26 March 2020 at 10:54:50 PM UTC+2, Teo wrote:
>
> OK, the coordinates seem correct.
>
> On Thursday, 26 March 2020 at 19:13:52 UTC+1, Essam Zaky wrote:
>>
>> Read this document:
>> https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage
>>
>> The following command returns the coordinates:
>>
>> tesseract testing/eurotext.png testing/eurotext-eng -l eng hocr
>>
>>
>> The hOCR output contains each word as text together with its coordinates.
>> You can open the image in any image editor, such as MS Paint, and check that 
>> the returned coordinates match the position of the word in the image.
>>
>> Best Regards
>>
>> On Thursday, 26 March 2020 at 1:10:22 PM UTC+2, Teo wrote:
>>>
>>> Thanks for your help. How can I get the coordinates, and how do I check 
>>> if they are correct?
>>>
>>> On Wednesday, 25 March 2020 at 10:41:07 UTC+1, Essam Zaky wrote:
>>>>
>>>> You now need to check the coordinates returned by Tesseract. Use the hOCR 
>>>> output and check whether the word coordinates are returned correctly. If 
>>>> they are, it is a bug in the PDF generation.
>>>>
>>>> If the coordinates are wrong, it is a bug in Tesseract.
>>>>
>>>> In the past I used a library called iTextSharp to generate searchable 
>>>> PDFs. It is a port of the iText Java library, and it gives good PDF 
>>>> output.
>>>>
>>>>
>>>> On Wednesday, 25 March 2020 at 11:25:46 AM UTC+2, Teo wrote:
>>>>>
>>>>> OK, I think the problem is in the PDF generation module, because the 
>>>>> text output is almost the same, except for some "the" which Tesseract 
>>>>> reads as "thè".
>>>>>
>>>>> On Wednesday, 25 March 2020 at 07:25:11 UTC+1, Essam Zaky wrote:
>>>>>>
>>>>>> You need to find out which one to improve: the Tesseract engine or the 
>>>>>> PDF generation.
>>>>>>
>>>>>> So compare the text files from ABBYY and Tesseract. 
>>>>>> If the results differ greatly, you need to improve the image quality 
>>>>>> or improve the LSTM model.
>>>>>>
>>>>>> If the Tesseract result is good, you need to fix the PDF generation 
>>>>>> module.
>>>>>>
>>>>>> On Wednesday, 25 March 2020 at 7:04:14 AM UTC+2, Teo wrote:
>>>>>>>
>>>>>>> The quality is already very good, but it is lower than ABBYY FineReader. 
>>>>>>> The attachment shows a comparison between the ABBYY and gImageReader OCR 
>>>>>>> results, so you can see the difference. How can we improve it?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>

