Thanks. I'm now getting a 97% success rate with tesseract.exe :D However, I was intending to use tesseract.js by naptha, and I can't get anywhere close to the same results with it. I will keep trying different pre-processing, but do you think it is worth making another post here, or should that be dealt with somewhere else because it is a different program?

On Tuesday, 20 April 2021 at 7:10:28 pm UTC+10 zdenop wrote:

> Tesseract is an OCR engine, so try to eliminate graphics elements by
> yourself/send only text areas to OCR.
>
> Zdenko
>
> Tue 20. 4. 2021 at 10:40 Soul Green <[email protected]> wrote:
>
>> Omg thanks.
>> I hadn't thought about checking *that* documentation. I've been using
>> tesseract.js with node so I completely forgot that it was based on
>> something else. How amateur.
>> I also didn't know that tesseract did its own processing as well.
>> Thanks again I'll try everything there
>>
>> On Tuesday, 20 April 2021 at 5:14:56 pm UTC+10 zdenop wrote:
>>
>>> Hint: read documentation, stop guessing. You can start here
>>> https://github.com/tesseract-ocr/tessdoc/blob/master/ImproveQuality.md
>>>
>>> Zdenko
>>>
>>> Tue 20. 4. 2021 at 9:11 Soul Green <[email protected]> wrote:
>>>
>>>> I am very new to coding so forgive me.
>>>>
>>>> I have been having an extremely low success rate with tesseract.
>>>> Here are 3 examples both pre- and post- processing:
>>>>
>>>> [image: red1.jpg][image: croppedred1.jpg]
>>>> [image: yellow1.jpg][image: croppedyellow1.jpg]
>>>> [image: blue1.jpg][image: croppedblue1.jpg]
>>>>
>>>> These were scanned as "a", "Ss30", and "moh" respectively.
>>>> I consider the yellow one a success, as I can just regex the 30 out of
>>>> the result, but I still don't understand how it could be so off for the
>>>> rest.
>>>>
>>>> I've tried different traineddatas, even including one that I trained
>>>> myself on over 200 data examples.
>>>>
>>>> I have three theories as to why I couldn't train it:
>>>> 1. The different colours are processed differently, causing differently
>>>>    shaped characters. (Red looks bold and yellow looks thin)
>>>> 2. The different sizes of the images cause the characters to be
>>>>    slightly differently shaped when cropped.
>>>> 3. Tesseract assumes that the two lines of text are one, and reads them
>>>>    together.
>>>>
>>>> Could someone please give me a hint on what to try? I don't want to
>>>> spend another day training it on just blue ones (for example) only to find
>>>> that colour isn't the problem.
>>>> Thanks
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/9d819bc5-cf07-4c28-91a6-61b142ccc324n%40googlegroups.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/0de2b470-cfa9-4693-968a-be90fbf9ae89n%40googlegroups.com.
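
For the "Ss30" case from the original post, where the number is read correctly but comes back with stray characters around it, the regex step could look roughly like the sketch below. It is plain JavaScript for node and makes no tesseract.js calls; `extractNumber` is a hypothetical helper name, and taking the first run of digits is an assumption about what the labels contain.

```javascript
// Hypothetical helper (not part of tesseract.js): pull the first run of
// digits out of a noisy OCR result string.
// Assumption: each label contains exactly one number we care about.
function extractNumber(ocrText) {
  const match = /\d+/.exec(ocrText);
  // No digits at all means the OCR result is unusable for this purpose.
  return match === null ? null : Number(match[0]);
}

// The three results quoted in the thread:
console.log(extractNumber("a"));    // null
console.log(extractNumber("Ss30")); // 30
console.log(extractNumber("moh"));  // null
```

A `null` return is a cheap way to flag images that need different pre-processing, rather than silently accepting a wrong reading.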

