I am not sure whether you are supposed to use those box files for training 
purposes. All the guides and manuals I have read use either the text2image 
script or the manual method (which is presumably outdated). 

On Wednesday, October 18, 2023 at 6:27:58 PM UTC+3 Keith Smith wrote:

> I tried using tesstrain but am now getting 0% accuracy, so any help on 
> what I'm doing wrong or misunderstanding would be greatly appreciated.
>
> Specifically, here is what I did given my 20K check images and data from 
> my x9.37 file.  For each check, I
> 1. cropped the image so that it included only the bottom of the check 
> with the MICR line
> 2. generated the gt.txt file based on the values for the check from the 
> x9.37 file associated with the MICR line
> 3. ran "make training MODEL_NAME=micr_e13b" until it terminated.  The BCER 
> was at about 34%.
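
In case it helps, step 2 can be sketched roughly as below. As far as I know, tesstrain looks for one single-line NAME.gt.txt file next to each line image in data/MODEL_NAME-ground-truth. The function name write_ground_truth and the sample path and text in the usage note are made up for illustration, and how the x9.37 fields map to MICR characters is left out entirely.

```python
from pathlib import Path


def write_ground_truth(image_path, micr_text):
    """Write the one-line transcription for a cropped line image.

    Hypothetical helper: check_0001.png -> check_0001.gt.txt in the same folder.
    """
    image_path = Path(image_path)
    text = micr_text.strip()
    if not text:
        raise ValueError(f"empty transcription for {image_path}")
    if "\n" in text:
        # Each ground-truth file should describe exactly one text line.
        raise ValueError(f"transcription for {image_path} spans several lines")
    gt_path = image_path.with_name(image_path.stem + ".gt.txt")
    gt_path.write_text(text + "\n", encoding="utf-8")
    return gt_path
```

For example, write_ground_truth("data/micr_e13b-ground-truth/check_0001.png", "123456789 0001234") would create check_0001.gt.txt beside the image. Checking a handful of these by eye against the cropped images is probably worthwhile before a long training run.
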
>
> I then used the resulting micr_e13b.traineddata file but it yielded dismal 
> results.  So I looked at the box files that were generated, and each of 
> them had the same coordinates for each character which covered the entire 
> image area.
> So I looked at the generate_line_box.py script and it seems that is what 
> it is coded to do from looking at 
> https://github.com/tesseract-ocr/tesstrain/blob/main/generate_line_box.py#L26
>
> Shouldn't the box file coordinates be different for each character?
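
As far as I understand it, this is expected: the LSTM trainer only needs a box that covers the whole text line, so tesstrain writes the same line-sized box for every character and then a tab entry to mark the end of the line. A simplified illustration of that idea follows. It is not the actual generate_line_box.py (which, as far as I can tell, also handles combining characters), and line_box_lines is a made-up name.

```python
def line_box_lines(text, width, height):
    """Return box-file lines where every character shares the whole-line box."""
    # Box format: <symbol> <left> <bottom> <right> <top> <page>
    lines = [f"{ch} 0 0 {width} {height} 0" for ch in text]
    # A tab entry with the same box marks the end of the text line.
    lines.append(f"\t 0 0 {width} {height} 0")
    return lines


# For a 300x40 pixel line image containing "12":
# line_box_lines("12", 300, 40)
# -> ["1 0 0 300 40 0", "2 0 0 300 40 0", "\t 0 0 300 40 0"]
```

So identical coordinates alone would not explain the poor results; the ground truth text and the image cropping seem like the more likely places to look.
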
>
> Thanks,
> Keith
>
> On Fri, Oct 13, 2023 at 10:59 AM Keith Smith <[email protected]> wrote:
>
>> Thanks Shree for the clarification.  I'll give it a try.  I was following 
>> https://github.com/tesseract-ocr/tessdoc/blob/main/tess5/TrainingTesseract-5.md
>>  
>> and obviously misunderstood.
>>
>> On Fri, Oct 13, 2023 at 7:54 AM Shree Devi Kumar <[email protected]> 
>> wrote:
>>
>>> See also
>>>
>>> https://github.com/tesseract-ocr/tesstrain/wiki
>>>
>>> It has details about training using the makefile.
>>>
>>> On Fri, Oct 13, 2023, 3:43 PM Keith Smith <[email protected]> wrote:
>>>
>>>> Yes I have.  I am asking about how to automate the generation of the 
>>>> ground truth images and box files, because from what I understand, 
>>>> tesseract requires on the order of 10K images and box files to train on.  
>>>> However, unless I am missing something, what I read at 
>>>> https://github.com/tesseract-ocr/tesstrain assumes the ground truth 
>>>> (images + box files) already exist.  
>>>>
>>>> On Fri, Oct 13, 2023 at 1:00 AM Shree Devi Kumar <[email protected]> 
>>>> wrote:
>>>>
>>>>> Have you looked at 
>>>>>
>>>>> https://github.com/tesseract-ocr/tesstrain
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Oct 12, 2023, 11:45 PM Keith Smith <[email protected]> 
>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I am trying to use tesseract to OCR the MICR line of checks (i.e. the 
>>>>>> micr-e13b font).  The training data that I found at 
>>>>>> https://github.com/BigPino67/Tesseract-MICR-OCR/blob/master/Tessdata/mcr.traineddata
>>>>>>  
>>>>>> does not produce accurate results on my data set.
>>>>>>
>>>>>> I have a set of over 20K check images along with the MICR text for 
>>>>>> those images; however, I do not have box files for them.
>>>>>>
>>>>>> So I started generating box files and manually correcting them via 
>>>>>> JTessBoxEditor, but I soon learned that it would take a LONG time to do 
>>>>>> this for enough checks to properly train tesseract.  So I have just 
>>>>>> started generating synthetic images using tesseract's text2image; 
>>>>>> however, the images generated are perfect (i.e. no blur, skew, etc.), so 
>>>>>> I doubt that this will train tesseract to handle my less-than-perfect 
>>>>>> check images.
>>>>>>
>>>>>> Does anyone have suggestions for the best methodology to use?  Is 
>>>>>> there a way to get text2image (or another tool) to generate 
>>>>>> less-than-perfect images?  Or can someone suggest a less labor intensive 
>>>>>> way of using real check images to train tesseract?
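
One possible approach to the "less-than-perfect" part is to degrade the clean synthetic images yourself before training. Below is a minimal, dependency-free sketch of that idea on a grayscale image held as a list of pixel rows (0 = black, 255 = white). The function name degrade and all parameter values are my own assumptions, not anything text2image provides; with real files you would load and save the pixels through an imaging library.

```python
import random


def degrade(pixels, shear=0.1, noise=20, seed=0, background=255):
    """Return a sheared, blurred, noisy copy of a grayscale image.

    `pixels` is a list of equal-length rows of ints in 0..255.
    """
    rng = random.Random(seed)
    height, width = len(pixels), len(pixels[0])

    # 1. Horizontal shear: shift each row sideways in proportion to its
    #    distance from the top row, a crude stand-in for skew.
    sheared = []
    for y, row in enumerate(pixels):
        shift = round(shear * y)
        sheared.append([
            row[x - shift] if 0 <= x - shift < width else background
            for x in range(width)
        ])

    # 2. 3x3 box blur, averaging only over neighbours inside the image.
    blurred = []
    for y in range(height):
        out_row = []
        for x in range(width):
            window = [
                sheared[j][i]
                for j in range(max(0, y - 1), min(height, y + 2))
                for i in range(max(0, x - 1), min(width, x + 2))
            ]
            out_row.append(sum(window) / len(window))
        blurred.append(out_row)

    # 3. Uniform noise, clamped back to the valid 0..255 range.
    return [
        [min(255, max(0, round(v + rng.uniform(-noise, noise)))) for v in row]
        for row in blurred
    ]
```

Varying shear, noise and seed per image would give a spread of distortions from one clean rendering. Whether that spread resembles real scanned checks is something only a comparison against your own images can tell.
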
>>>>>>
>>>>>> Thanks in advance,
>>>>>> Keith
>>>>>>
>>>>>> -- 
>>>>>> You received this message because you are subscribed to the Google 
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>> send an email to [email protected].
>>>>>> To view this discussion on the web visit 
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/b92d2ab9-3da1-4ef8-bafe-5217821c5601n%40googlegroups.com
>>>>>>  
>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/b92d2ab9-3da1-4ef8-bafe-5217821c5601n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>> .
>>>>>>
>>

