Re: [tesseract-ocr] Re: How to improve ocr reader?

Teo Sat, 28 Mar 2020 11:39:10 -0700

Ok thanks, I'll keep this.

On Saturday, 28 March 2020 at 19:24:12 UTC+1, Lorenzo Blz wrote:
>
> If you'd like to improve the OCR accuracy too, a simple contrast 
> enhancement (with a simple S-shaped curve) and a little sharpening help 
> with the left border. See the attached file.
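
A minimal sketch of the two preprocessing steps mentioned above, assuming an 8-bit grayscale image held as a list of rows. The sigmoid curve, the `gain` value, and the 3x3 kernel are illustrative assumptions, not the settings used for the attached file; all function names are hypothetical.

```python
import math

def s_curve_lut(gain=8.0):
    """Build a 256-entry lookup table for a sigmoid (S-shaped) tone curve.

    gain is an assumed steepness parameter; larger values add more contrast.
    """
    lo = 1.0 / (1.0 + math.exp(gain * 0.5))   # sigmoid value at input 0
    hi = 1.0 / (1.0 + math.exp(-gain * 0.5))  # sigmoid value at input 1
    lut = []
    for v in range(256):
        x = v / 255.0
        s = 1.0 / (1.0 + math.exp(-gain * (x - 0.5)))
        # Rescale so that 0 stays 0 and 255 stays 255.
        lut.append(round(255.0 * (s - lo) / (hi - lo)))
    return lut

def apply_lut(image, lut):
    """Map every pixel of a grayscale image (list of rows) through the table."""
    return [[lut[p] for p in row] for row in image]

def sharpen(image):
    """Apply the kernel [[0,-1,0],[-1,5,-1],[0,-1,0]]; border pixels are left unchanged."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            v = (5 * image[y][x]
                 - image[y - 1][x] - image[y + 1][x]
                 - image[y][x - 1] - image[y][x + 1])
            out[y][x] = max(0, min(255, v))  # clamp to the 8-bit range
    return out
```

The curve pushes dark pixels darker and light pixels lighter while keeping pure black and pure white fixed, which is the effect an S-shaped curve in an image editor would have. In practice an image library would do both steps much faster.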
>
>
>
> Lorenzo
>
> On Sat, 28 Mar 2020 at 19:04, Essam Zaky <[email protected]> wrote:
>
>> Yes, with the same command. The result is attached.
>>
>>
>> On Saturday, 28 March 2020 at 7:55:05 PM UTC+2, Teo wrote:
>>>
>>> With the same command?
>>> tesseract pho.png pho-eng -l eng pdf
>>>
>>>
>>>
>>> On Saturday, 28 March 2020 at 18:48:17 UTC+1, Essam Zaky wrote:
>>>>
>>>> It works fine on my machine.
>>>> It seems to be a problem with your PDF viewer.
>>>> I used Adobe PDF Reader V9.0.
>>>>
>>>> Some PDF readers fail to read searchable PDFs; try checking with 
>>>> another reader.
>>>>
>>>> Best Regards
>>>> Essam
>>>>
>>>> On Saturday, 28 March 2020 at 7:34:59 PM UTC+2, Teo wrote:
>>>>>
>>>>>
>>>>> Ok
>>>>> On Saturday, 28 March 2020 at 18:32:26 UTC+1, Essam Zaky wrote:
>>>>>>
>>>>>> Please attach the original image so I can check it on my machine.
>>>>>>
>>>>>> On Saturday, 28 March 2020 at 7:24:07 PM UTC+2, Teo wrote:
>>>>>>>
>>>>>>> Thanks for the reply. 
>>>>>>> I just opened an issue on GitHub for Tesseract. Then I tried to create 
>>>>>>> a PDF with Tesseract alone, without gImageReader, using: 
>>>>>>> tesseract pho.png pho-eng -l eng pdf
>>>>>>> but this is the result...
>>>>>>>
>>>>>>>
>>>>>>> On Friday, 27 March 2020 at 03:13:40 UTC+1, Essam Zaky wrote:
>>>>>>>>
>>>>>>>> So I guess the error is in the PDF generation module.
>>>>>>>> You have the following options:
>>>>>>>> - try to fix the bug yourself
>>>>>>>> - raise an issue in the Tesseract issue tracker, but first check that 
>>>>>>>> the issue is not already listed
>>>>>>>> - use another external library to create the searchable PDF from the 
>>>>>>>> hOCR output
>>>>>>>>
>>>>>>>> Before Tesseract added the PDF generation feature, I used a library 
>>>>>>>> called iTextSharp to generate the PDF, and the result was very good 
>>>>>>>> for me.
>>>>>>>>
>>>>>>>> On Thursday, 26 March 2020 at 10:54:50 PM UTC+2, Teo wrote:
>>>>>>>>>
>>>>>>>>> OK, the coordinates seem correct.
>>>>>>>>>
>>>>>>>>> On Thursday, 26 March 2020 at 19:13:52 UTC+1, Essam Zaky wrote:
>>>>>>>>>>
>>>>>>>>>> Read this document:
>>>>>>>>>> https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage
>>>>>>>>>>
>>>>>>>>>> The following command returns the coordinates:
>>>>>>>>>>
>>>>>>>>>> tesseract testing/eurotext.png testing/eurotext-eng -l eng hocr
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The hOCR output contains each word's text and coordinates.
>>>>>>>>>> You can open the image in any image editor, such as MS Paint, and 
>>>>>>>>>> check that the returned coordinates match the word in the image.
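
As a complement to checking by eye, the word boxes can be pulled out of the hOCR file with a short script. This is a rough sketch that assumes Tesseract-style hOCR markup, where each word is a `span` with class `ocrx_word` and a `title` attribute holding `bbox x0 y0 x1 y1`; the function names are hypothetical, and a regex is used only to keep the example short.

```python
import html
import re

# Assumed word markup (class attribute before title attribute), for example:
#   <span class='ocrx_word' id='word_1_1' title='bbox 105 66 178 97; x_wconf 96'>The</span>
WORD_RE = re.compile(
    r"<span[^>]*class=['\"]ocrx_word['\"][^>]*"
    r"title=['\"][^'\"]*?bbox (\d+) (\d+) (\d+) (\d+)[^'\"]*['\"][^>]*>"
    r"(.*?)</span>",
    re.DOTALL,
)
TAG_RE = re.compile(r"<[^>]+>")

def parse_hocr_words(hocr_text):
    """Return a list of (word, (x0, y0, x1, y1)) tuples found in hOCR text."""
    words = []
    for m in WORD_RE.finditer(hocr_text):
        box = tuple(int(m.group(i)) for i in range(1, 5))
        # Drop inline markup such as <strong> and decode HTML entities.
        text = html.unescape(TAG_RE.sub("", m.group(5))).strip()
        words.append((text, box))
    return words

def suspicious_boxes(words, image_width, image_height):
    """Return the words whose box is empty, inverted, or outside the image."""
    bad = []
    for text, (x0, y0, x1, y1) in words:
        inside = 0 <= x0 < x1 <= image_width and 0 <= y0 < y1 <= image_height
        if not inside:
            bad.append((text, (x0, y0, x1, y1)))
    return bad
```

Reading the `.hocr` file produced by the command above and passing its contents to `parse_hocr_words` gives the pixel boxes to compare against the image. `suspicious_boxes` only catches gross errors; it does not confirm that a box really covers its word.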
>>>>>>>>>>
>>>>>>>>>> Best Regards
>>>>>>>>>>
>>>>>>>>>> On Thursday, 26 March 2020 at 1:10:22 PM UTC+2, Teo wrote:
>>>>>>>>>>>
>>>>>>>>>>> Thanks for your help. How can I get the coordinates, and how do 
>>>>>>>>>>> I check whether they are correct?
>>>>>>>>>>>
>>>>>>>>>>> On Wednesday, 25 March 2020 at 10:41:07 UTC+1, Essam Zaky wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> You now need to check the coordinates returned by Tesseract. 
>>>>>>>>>>>> Use the hOCR output and check whether the word coordinates are 
>>>>>>>>>>>> returned correctly. If they are, it is a bug in PDF generation.
>>>>>>>>>>>>
>>>>>>>>>>>> If the coordinates are wrong, it is a bug in Tesseract.
>>>>>>>>>>>>
>>>>>>>>>>>> In the past I used a library called iTextSharp to generate 
>>>>>>>>>>>> searchable PDFs. The library is a port of the iText Java library, 
>>>>>>>>>>>> and it gives good PDF output.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wednesday, 25 March 2020 at 11:25:46 AM UTC+2, Teo wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> OK, I think it's the PDF generation module, because the text 
>>>>>>>>>>>>> is almost the same, except for some instances of "the" that 
>>>>>>>>>>>>> Tesseract reads as "thè".
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wednesday, 25 March 2020 at 07:25:11 UTC+1, Essam Zaky wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> You need to find out which to improve: the Tesseract engine or 
>>>>>>>>>>>>>> the PDF generation.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So compare the text files from ABBYY and Tesseract. 
>>>>>>>>>>>>>> If the results differ greatly, you need to improve the image 
>>>>>>>>>>>>>> quality or the LSTM model.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If the Tesseract result is good, you need to improve the 
>>>>>>>>>>>>>> PDF generation module.
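
One way to make the comparison above concrete is to diff the two plain-text outputs word by word. The sketch below uses Python's standard `difflib`; the function names are hypothetical, and what counts as "highly different" is left to the reader since no threshold is given in this thread.

```python
import difflib

def text_similarity(text_a, text_b):
    """Return a 0..1 similarity ratio between two OCR outputs, compared word by word."""
    return difflib.SequenceMatcher(None, text_a.split(), text_b.split()).ratio()

def word_differences(text_a, text_b):
    """Return (words_in_a, words_in_b) pairs for every span where the outputs differ."""
    words_a, words_b = text_a.split(), text_b.split()
    matcher = difflib.SequenceMatcher(None, words_a, words_b)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((" ".join(words_a[i1:i2]), " ".join(words_b[j1:j2])))
    return diffs
```

Calling `word_differences` on the ABBYY text and the Tesseract text lists only the words that disagree, which makes isolated misreads easy to spot. A ratio close to 1 from `text_similarity` would suggest the recognition itself is fine and the problem lies elsewhere.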
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wednesday, 25 March 2020 at 7:04:14 AM UTC+2, Teo wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The quality is already very good, but it is lower than ABBYY 
>>>>>>>>>>>>>>> FineReader. Attached is a comparison between ABBYY and 
>>>>>>>>>>>>>>> gImageReader OCR, and you can see the difference. How can we 
>>>>>>>>>>>>>>> improve it?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/7b8148d7-6075-4bed-9edb-99480001204b%40googlegroups.com.
>>
>


