I am not sure whether you are supposed to use those box files for training purposes. All the guides and manuals I have read use either the text2image script or the manual method (which is presumably outdated).
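For what it's worth, my understanding (an assumption on my part, not something I have verified in the tesstrain code) is that LSTM training works on whole text lines, so a box file in which every character shares the coordinates of the full line image may be intended. Below is a minimal sketch of that layout. The helper name `line_box_entries`, the 600x40 image size, and the sample text are made up for illustration, and the coordinates of the end-of-line tab entry are a guess; this is not the actual generate_line_box.py.

```python
def line_box_entries(text, width, height):
    """Build box-file lines for a single text-line image.

    Every character is given the same box (the whole image), and a
    final tab entry marks the end of the text line.
    """
    # Format per entry: <symbol> <left> <bottom> <right> <top> <page>
    entries = [f"{ch} 0 0 {width} {height} 0" for ch in text]
    # End-of-line marker; the coordinates chosen here are an assumption.
    entries.append(f"\t {width} {height} {width + 1} {height + 1} 0")
    return entries


if __name__ == "__main__":
    # Hypothetical 600x40 crop of a MICR line whose gt.txt reads "0123".
    for entry in line_box_entries("0123", 600, 40):
        print(entry)
```

If that is right, identical coordinates for each character would not by themselves explain the poor results.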
On Wednesday, October 18, 2023 at 6:27:58 PM UTC+3 Keith Smith wrote:

> I tried using tesstrain but am not getting 0% accuracy, so any help on
> what I'm doing wrong or misunderstanding would be greatly appreciated.
>
> Specifically, here is what I did given my 20K check images and data from
> my x9.37 file. For each check, I
> 1. cropped the image so that it included only the bottom of the check
> with the MICR line
> 2. generated the gt.txt file based on the values for the check from the
> x9.37 file associated with the MICR line
> 3. ran "make training MODEL_NAME=micr_e13b" until it terminated. The BCER
> was at about 34%.
>
> I then used the resulting micr_e13b.traineddata file but it yielded dismal
> results. So I looked at the box files that were generated, and each of
> them had the same coordinates for each character which covered the entire
> image area.
> So I looked at the generate_line_box.py script and it seems that is what
> it is coded to do from looking at
> https://github.com/tesseract-ocr/tesstrain/blob/main/generate_line_box.py#L26
>
> Shouldn't the box file coordinates be different for each character?
>
> Thanks,
> Keith
>
> On Fri, Oct 13, 2023 at 10:59 AM Keith Smith <[email protected]> wrote:
>
>> Thanks Shree for the clarification. I'll give it a try. I was following
>> https://github.com/tesseract-ocr/tessdoc/blob/main/tess5/TrainingTesseract-5.md
>>
>> and obviously misunderstood.
>>
>> On Fri, Oct 13, 2023 at 7:54 AM Shree Devi Kumar <[email protected]>
>> wrote:
>>
>>> See also
>>>
>>> https://github.com/tesseract-ocr/tesstrain/wiki
>>>
>>> It has details about training using the makefile.
>>>
>>> On Fri, Oct 13, 2023, 3:43 PM Keith Smith <[email protected]> wrote:
>>>
>>>> Yes I have. I am asking about how to automate the generation of the
>>>> ground truth images and box files, because from what I understand,
>>>> tesseract requires on the order of 10K images and box files to train on.
>>>> However, unless I am missing something, what I read at
>>>> https://github.com/tesseract-ocr/tesstrain assumes the ground truth
>>>> (images + box files) already exists.
>>>>
>>>> On Fri, Oct 13, 2023 at 1:00 AM Shree Devi Kumar <[email protected]>
>>>> wrote:
>>>>
>>>>> Have you looked at
>>>>>
>>>>> https://github.com/tesseract-ocr/tesstrain
>>>>>
>>>>> On Thu, Oct 12, 2023, 11:45 PM Keith Smith <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I am trying to use tesseract to OCR the MICR line of checks (i.e. the
>>>>>> micr-e13b font). The training data that I found at
>>>>>> https://github.com/BigPino67/Tesseract-MICR-OCR/blob/master/Tessdata/mcr.traineddata
>>>>>>
>>>>>> does not produce accurate results on my data set.
>>>>>>
>>>>>> I have a set of over 20K check images along with the MICR text for
>>>>>> those images; however, I do not have box files for them.
>>>>>>
>>>>>> So I started generating box files and manually correcting them via
>>>>>> JTessBoxEditor, but I soon learned that it would take a LONG time to do
>>>>>> this for enough checks to properly train tesseract. So I have just
>>>>>> started generating synthetic images using tesseract's text2image;
>>>>>> however, the images generated are perfect (i.e. no blur, skew, etc.),
>>>>>> so I doubt that this will result in training tesseract to handle my
>>>>>> less-than-perfect check images.
>>>>>>
>>>>>> Does anyone have suggestions for the best methodology to use? Is
>>>>>> there a way to get text2image (or another tool) to generate
>>>>>> less-than-perfect images? Or can someone suggest a less labor-intensive
>>>>>> way of using real check images to train tesseract?
>>>>>>
>>>>>> Thanks in advance,
>>>>>> Keith
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to [email protected].
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/b92d2ab9-3da1-4ef8-bafe-5217821c5601n%40googlegroups.com

