I tried using tesstrain but am not getting good accuracy, so any help with
what I'm doing wrong or misunderstanding would be greatly appreciated.

Specifically, here is what I did, given my 20K check images and the data from
my x9.37 file.  For each check, I
1. cropped the image so that it included only the bottom of the check
with the MICR line,
2. generated the gt.txt file from the MICR-line values for that check in the
x9.37 file, and
3. ran "make training MODEL_NAME=micr_e13b" until it terminated.  The BCER
ended up at about 34%.
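In code, steps 1 and 2 looked roughly like the sketch below (using Pillow; the
20% bottom band is specific to my scans, and make_gt_pair is just my name for
this helper):

```python
# Sketch of steps 1-2: crop the MICR band off the bottom of a check image
# and write the tesstrain ground-truth pair (<name>.tif + <name>.gt.txt).
# Assumes Pillow; the 0.20 bottom-band fraction is a guess per scan layout.
from pathlib import Path

from PIL import Image


def make_gt_pair(check_path: str, micr_text: str, out_dir: str, band: float = 0.20) -> None:
    img = Image.open(check_path)
    w, h = img.size
    # Keep only the bottom `band` fraction of the check, where the MICR line sits.
    crop = img.crop((0, int(h * (1 - band)), w, h))
    name = Path(check_path).stem
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    crop.save(out / f"{name}.tif")
    # tesstrain pairs each image with a matching .gt.txt transcription.
    (out / f"{name}.gt.txt").write_text(micr_text + "\n", encoding="utf-8")
```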

I then used the resulting micr_e13b.traineddata file, but it yielded dismal
results.  So I looked at the box files that were generated, and in each of
them every character had the same coordinates, covering the entire image
area.  Looking at the generate_line_box.py script, that appears to be exactly
what it is coded to do:
https://github.com/tesseract-ocr/tesstrain/blob/main/generate_line_box.py#L26
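From what I can tell, the script emits one entry per character, each with the
full image extent, followed by a tab entry marking the end of the line.
Roughly equivalent to this sketch (my paraphrase, not the actual script):

```python
# Rough equivalent of what generate_line_box.py appears to do: every
# character in the ground-truth line gets the full image extent
# (left=0, bottom=0, right=width, top=height, page=0), and a final
# tab entry marks the end of the line.
def line_box_entries(gt_text: str, width: int, height: int) -> list:
    entries = [f"{ch} 0 0 {width} {height} 0" for ch in gt_text]
    entries.append(f"\t 0 0 {width} {height} 0")
    return entries
```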

Shouldn't the box file coordinates be different for each character?

Thanks,
Keith

On Fri, Oct 13, 2023 at 10:59 AM Keith Smith <[email protected]>
wrote:

> Thanks Shree for the clarification.  I'll give it a try.  I was following
> https://github.com/tesseract-ocr/tessdoc/blob/main/tess5/TrainingTesseract-5.md
> and obviously misunderstood.
>
> On Fri, Oct 13, 2023 at 7:54 AM Shree Devi Kumar <[email protected]>
> wrote:
>
>> See also
>>
>> https://github.com/tesseract-ocr/tesstrain/wiki
>>
>> It has details about training using the makefile.
>>
>> On Fri, Oct 13, 2023, 3:43 PM Keith Smith <[email protected]>
>> wrote:
>>
>>> Yes, I have.  I am asking how to automate the generation of the
>>> ground truth images and box files, because from what I understand,
>>> tesseract requires on the order of 10K images and box files to train on.
>>> However, unless I am missing something, what I read at
>>> https://github.com/tesseract-ocr/tesstrain assumes the ground truth
>>> (images + box files) already exist.
>>>
>>> On Fri, Oct 13, 2023 at 1:00 AM Shree Devi Kumar <[email protected]>
>>> wrote:
>>>
>>>> Have you looked at
>>>>
>>>> https://github.com/tesseract-ocr/tesstrain
>>>>
>>>>
>>>>
>>>> On Thu, Oct 12, 2023, 11:45 PM Keith Smith <[email protected]>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I am trying to use tesseract to OCR the MICR line of checks (i.e. the
>>>>> MICR E13B font).  The training data that I found at
>>>>> https://github.com/BigPino67/Tesseract-MICR-OCR/blob/master/Tessdata/mcr.traineddata
>>>>> does not produce accurate results on my data set.
>>>>>
>>>>> I have a set of over 20K check images along with the MICR text for
>>>>> those images; however, I do not have box files for them.
>>>>>
>>>>> So I started generating box files and manually correcting them via
>>>>> JTessBoxEditor, but I soon learned that it would take a LONG time to do
>>>>> this for enough checks to properly train tesseract.  So I just started
>>>>> generating synthetic images using tesseract's text2image; however, the
>>>>> images generated are perfect (i.e. no blur, skew, etc.), so I doubt
>>>>> this will teach tesseract to handle my less-than-perfect
>>>>> check images.
>>>>>
>>>>> Does anyone have suggestions for the best methodology to use?  Is
>>>>> there a way to get text2image (or another tool) to generate
>>>>> less-than-perfect images?  Or can someone suggest a less labor-intensive
>>>>> way of using real check images to train tesseract?
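>>>>> For example, I suppose I could rough up text2image's clean renders
>>>>> myself.  A minimal sketch using Pillow (the parameter ranges here are
>>>>> guesses that would need tuning against real check scans):

```python
# Sketch: degrade a clean synthetic grayscale ("L" mode) render with a
# slight skew, mild blur, and speckle noise, to look more like a scan.
# All parameter ranges are guesses -- tune them against real check images.
import random

from PIL import Image, ImageFilter


def degrade(img: Image.Image, seed=None) -> Image.Image:
    rng = random.Random(seed)
    # Slight random skew, like a crooked scan (white fill for "L" mode).
    img = img.rotate(rng.uniform(-1.5, 1.5), expand=True, fillcolor=255)
    # Mild blur to soften the perfect edges.
    img = img.filter(ImageFilter.GaussianBlur(radius=rng.uniform(0.3, 1.0)))
    # Speckle noise: flip ~0.5% of pixels to black or white.
    px = img.load()
    w, h = img.size
    for _ in range(int(w * h * 0.005)):
        px[rng.randrange(w), rng.randrange(h)] = rng.choice((0, 255))
    return img
```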
>>>>>
>>>>> Thanks in advance,
>>>>> Keith
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/b92d2ab9-3da1-4ef8-bafe-5217821c5601n%40googlegroups.com
>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/b92d2ab9-3da1-4ef8-bafe-5217821c5601n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>>
>
