Re: [tesseract-ocr] My data looks clean, why is it not recognised properly

Soul Green Mon, 26 Apr 2021 19:24:28 -0700

Thanks. I am now getting a 97% success rate with tesseract.exe :D
However, I was intending to use tesseract.js by naptha, and I can't get 
even close to the same results with it.
I will keep trying different pre-processing, but do you think it is worth 
making another post here, or should that be dealt with somewhere else 
because it is a different program?
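
On the regex idea from the original post (quoted below), where the yellow sample came back as "Ss30" and the wanted value was 30: a minimal sketch of that post-processing step, assuming the OCR result is already available as a plain string. The helper name extractNumber is hypothetical and is not part of tesseract.js or tesseract.exe.

```typescript
// Hypothetical helper (not a tesseract.js API): returns the first run of
// digits found in the OCR output, or null when the text contains no digits.
function extractNumber(ocrText: string): string | null {
  const match = ocrText.match(/\d+/);
  return match ? match[0] : null;
}

// The three results reported in the original post.
console.log(extractNumber("Ss30")); // prints 30
console.log(extractNumber("moh"));  // prints null
console.log(extractNumber("a"));    // prints null
```

This only rescues results that already contain the right digits; it does nothing for the red and blue samples, where no digits were recognised at all.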


On Tuesday, 20 April 2021 at 7:10:28 pm UTC+10 zdenop wrote:

> Tesseract is an OCR engine, so try to eliminate graphic elements 
> yourself and send only the text areas to OCR.
>
> Zdenko
>
>
> On Tue, 20 Apr 2021 at 10:40, Soul Green <[email protected]> wrote:
>
>> Omg thanks.
>> I hadn't thought about checking *that* documentation. I've been using 
>> tesseract.js with node, so I completely forgot that it was based on 
>> something else. How amateur.
>> I also didn't know that tesseract did its own processing as well.
>> Thanks again, I'll try everything there.
>> On Tuesday, 20 April 2021 at 5:14:56 pm UTC+10 zdenop wrote:
>>
>>> Hint: read the documentation and stop guessing. You can start here 
>>> https://github.com/tesseract-ocr/tessdoc/blob/master/ImproveQuality.md
>>>
>>> Zdenko
>>>
>>>
>>> On Tue, 20 Apr 2021 at 9:11, Soul Green <[email protected]> wrote:
>>>
>>>> I am very new to coding so forgive me.
>>>>
>>>> I have been having an extremely low success rate with tesseract.
>>>> Here are 3 examples, each shown both pre- and post-processing:
>>>>
>>>> [image: red1.jpg] [image: croppedred1.jpg]
>>>> [image: yellow1.jpg] [image: croppedyellow1.jpg]
>>>> [image: blue1.jpg] [image: croppedblue1.jpg]
>>>> These were recognised as "a", "Ss30", and "moh" respectively.
>>>> I consider the yellow one a success, as I can just regex the 30 out of 
>>>> the result, but I still don't understand how it could be so far off for 
>>>> the rest.
>>>>
>>>> I've tried different traineddata files, including one that I trained 
>>>> myself on over 200 examples.
>>>>
>>>> I have three theories as to why I couldn't train it:
>>>> 1. The different colours are processed differently, causing differently 
>>>> shaped characters. (Red looks bold and yellow looks thin)
>>>> 2. The different sizes of the images cause the characters to be 
>>>> slightly differently shaped when cropped.
>>>> 3. Tesseract assumes that the two lines of text are one, and reads them 
>>>> together.
>>>>  
>>>> Could someone please give me a hint on what to try? I don't want to 
>>>> spend another day training it on just blue ones (for example) only to find 
>>>> that colour isn't the problem.
>>>> Thanks
>>>>
>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to [email protected].
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/tesseract-ocr/9d819bc5-cf07-4c28-91a6-61b142ccc324n%40googlegroups.com
>>>>  
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/9d819bc5-cf07-4c28-91a6-61b142ccc324n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>>

