So I did both: I renamed the file and added a link on the wiki page.

On Saturday, August 10, 2019 at 0:35:14 UTC+9, shree wrote:

I suggest renaming the traineddata file from eng. to e13b, or another
similarly descriptive name, and also adding a link to it on the data file
contributions wiki page.

On Fri, 9 Aug 2019, 20:08 'Mamadou' via tesseract-ocr <[email protected]> wrote:

On Friday, August 9, 2019 at 10:40:15 AM UTC+2, ElGato ElMago wrote:

I added eng.traineddata and LICENSE. I used my account name in the license
file. I don't know whether that's appropriate; please tell me if it's not.

It's OK.
Thanks. I'll share our dataset (real-life samples) in the coming days.

On Friday, August 9, 2019 at 16:17:41 UTC+9, Mamadou wrote:

On Friday, August 9, 2019 at 7:31:03 AM UTC+2, ElGato ElMago wrote:

Here's what I shared on GitHub. I hope it's of use to somebody.

https://github.com/ElMagoElGato/tess_e13b_training

Thanks for sharing your experience with us.
Is it possible to share your Tesseract model (xxx.traineddata)?
We're building a dataset from real-life images, like what we have already
done for MRZ
(https://github.com/DoubangoTelecom/tesseractMRZ/tree/master/dataset).
Your model would help us automate the annotation and will speed up our
development. Of course, we'll have to correct the annotations manually, but
it will still be faster for us.
Also, please add a license to your repo so that we know whether we have the
right to use it.

On Thursday, August 8, 2019 at 9:35:17 UTC+9, ElGato ElMago wrote:

OK, I'll do so. I need to reorganize the naming and so on a little bit.
It will be out there soon.

On Wednesday, August 7, 2019 at 21:11:01 UTC+9, Mamadou wrote:

On Wednesday, August 7, 2019 at 2:36:52 AM UTC+2, ElGato ElMago wrote:

Hi,

I'm thinking of sharing it, of course.
What is the best way to do it? After all this, my contribution is only how
I prepared the training text, and even that consists of Shree's text and
mine. The instructions and tools I used already exist.

If you have a GitHub account, just create a repo and publish the data and
instructions.

ElMagoElGato

On Wednesday, August 7, 2019 at 8:20:02 UTC+9, Mamadou wrote:

Hello,
Are you planning to release the dataset or models?
I'm working on the same subject and planning to share both under BSD terms.

On Tuesday, August 6, 2019 at 10:11:40 AM UTC+2, ElGato ElMago wrote:

Hi,

FWIW, I got to the point where I can feel happy with the accuracy. As the
images in the previous post show, the symbols, especially the on-us symbol
and the amount symbol, were being confused with each other or with other
characters. I added many more symbols to the training text and formed words
that start with a symbol. One example is as follows:

9;:;=;<;< <0<1<3<4;6;8;9;:;=;

I randomly made 8,000 lines like this. When fine-tuning from eng, 5,000
iterations was almost enough. The amount symbol is still confused a little
when it's followed by 0. Fine-tuning tends to be dragged around by small
details. I'll have to think of something to make further improvements.

Training from scratch produced slightly more stable traineddata. It doesn't
confuse the symbols as often, but it tends to generate extra spaces. By
10,000 iterations those spaces were gone and recognition became very solid.
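For what it's worth, lines like the example above can be generated mechanically. A minimal sketch, assuming the ASCII stand-ins : ; < = for the four E13B symbols; the word lengths, word counts, and file name here are made up for illustration, not the script actually used in this thread:

```python
import random

# ASCII stand-ins for the E13B set as used in this thread:
# digits 0-9 plus the four symbols written as : ; < =
E13B_CHARS = "0123456789:;<="
SYMBOLS = ":;<="

def random_word(max_len=8):
    # Start every word with a symbol so the model sees plenty of
    # symbol-initial contexts, as described above.
    length = random.randint(2, max_len)
    return random.choice(SYMBOLS) + "".join(
        random.choice(E13B_CHARS) for _ in range(length - 1))

def random_line(words=4):
    return " ".join(random_word() for _ in range(words))

# Write 8,000 random lines as a training text fragment.
with open("e13b.training_text", "w") as f:
    for _ in range(8000):
        f.write(random_line() + "\n")
```
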
I thought I might have to do image and box file training, but I guess it's
not needed this time.

ElMagoElGato

On Friday, July 26, 2019 at 14:08:06 UTC+9, ElGato ElMago wrote:

Hi,

Well, I read the description of ScrollView
(https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging) and it
says:

To show the characters, deselect DISPLAY/Bounding Boxes, select
DISPLAY/Polygonal Approx and then select OTHER/Uniform display.

It basically works, but for some reason it doesn't work on my e13b image
and ends up with a blue screen. Anyway, it shows each box separately when a
character consists of multiple boxes. I'd like to show the box for the
whole character. ScrollView doesn't do that, at least not yet. I'll do it
on my own.

ElMagoElGato

On Wednesday, July 24, 2019 at 14:10:46 UTC+9, ElGato ElMago wrote:

Hi,

I got this result from hocr. This is where one of the phantom characters
comes from:

<span class='ocrx_cinfo' title='x_bboxes 1259 902 1262 933; x_conf 98.864532'><</span>
<span class='ocrx_cinfo' title='x_bboxes 1259 904 1281 933; x_conf 99.018097'>;</span>

The first character is the phantom. It starts at the same x position where
the second, real character starts, and it is only 3 points wide. I attach
ScrollView screenshots that visualize this.
[image: 2019-07-24-132643_854x707_scrot.png][image: 2019-07-24-132800_854x707_scrot.png]

There seem to be some more cases that cause phantom characters. I'll look
into them. But I have a trivial question now: I made ScrollView show these
displays by accidentally clicking the Display->Blamer menu. There is a
Bounding Boxes menu below it, but it ends up showing a blue screen, though
it briefly shows the boxes on the way. Can I use this menu at all? It would
be very useful.

[image: 2019-07-24-140739_854x707_scrot.png]

On Tuesday, July 23, 2019 at 17:10:36 UTC+9, ElGato ElMago wrote:

It's great! Perfect! Thanks a lot!

On Tuesday, July 23, 2019 at 10:56:58 UTC+9, shree wrote:

See https://github.com/tesseract-ocr/tesseract/issues/2580

On Tue, 23 Jul 2019, 06:23 ElGato ElMago <[email protected]> wrote:

Hi,

I read the output of hocr with lstm_choice_mode = 4, as in pull request
2554. It shows the candidates for each character, but it doesn't show the
bounding box of each character; it only shows the box for a whole word.

I see bounding boxes for each character in the comments on pull request
2576. How can I do that? Do I have to look into the source code and
produce such output on my own?

On Friday, July 19, 2019 at 18:40:49 UTC+9, ElGato ElMago wrote:

Lorenzo,

I haven't been checking psm too much.
I will turn to those options after I see how it goes with the bounding
boxes.

Shree,

I see the merges in the git log and also see that the new option
lstm_choice_amount works now. I guess my executable is up to date, though I
still see the phantom character. Hocr makes huge, complex output; I'll take
some time to read it.

On Friday, July 19, 2019 at 18:20:55 UTC+9, Claudiu wrote:

Is there any way to pass bounding boxes to the LSTM to use? We have an
algorithm that cleanly gets bounding boxes of MRZ characters. However, the
results using psm 10 are worse than passing the whole line in. Yet when we
pass the whole line in, we get these phantom characters.

Should PSM 10 mode work? It often returns "no character" where there
clearly is one. I can supply a test case if it is expected to work well.

On Fri, Jul 19, 2019 at 11:06 AM ElGato ElMago <[email protected]> wrote:

Lorenzo,

We both have the same case. It seems a solution to this problem would save
a lot of people.

Shree,

I pulled the current head of the master branch, but it doesn't seem to
contain the merges you pointed to, which were merged 3 to 4 days ago. How
can I get them?
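Back on the hocr output: the per-character ocrx_cinfo spans quoted earlier in the thread can be pulled apart mechanically to surface suspiciously narrow glyphs. A rough sketch, with the regex matched to that exact span format (real hocr output may vary):

```python
import re

# Parse ocrx_cinfo spans (as quoted earlier in this thread) into
# (character, box coordinates, confidence). The attribute layout is
# assumed to match that snippet exactly.
CINFO = re.compile(
    r"title='x_bboxes (\d+) (\d+) (\d+) (\d+); x_conf ([\d.]+)'>(.*?)</span>")

hocr = """
<span class='ocrx_cinfo' title='x_bboxes 1259 902 1262 933; x_conf 98.864532'><</span>
<span class='ocrx_cinfo' title='x_bboxes 1259 904 1281 933; x_conf 99.018097'>;</span>
"""

for m in CINFO.finditer(hocr):
    x1, y1, x2, y2 = (int(g) for g in m.group(1, 2, 3, 4))
    conf, ch = float(m.group(5)), m.group(6)
    # The phantom '<' shows up here as a 3-pixel-wide box nested
    # inside the following character's box.
    print(repr(ch), "width:", x2 - x1, "conf:", conf)
```
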
ElMagoElGato

On Friday, July 19, 2019 at 17:02:53 UTC+9, Lorenzo Blz wrote:

PSM 7 was a partial solution for my specific case: it improved the
situation but did not solve it. I also could not use it in some other
cases.

The proper solution is very likely more training with more data; some data
augmentation would probably help if data is scarce. Less training might
also help, if the training is not being done correctly.

There are also similar issues on GitHub:

https://github.com/tesseract-ocr/tesseract/issues/1465
...

The LSTM engine works like this: it scans the image and for each "pixel
column" produces something like:

M M M M N M M M [BLANK] F F F F

(here I report only the highest-probability characters)

In the example above an M is partially seen as an N. This is normal, and
another step of the algorithm (beam search, I think) tries to aggregate the
correct characters back together.

I think cases like this:

M M M N N N M M

are what give the phantom characters. More training should reduce the
source of the problem, or a painful analysis of the bounding boxes might
fix some cases.
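That aggregation step can be illustrated with a toy best-path decoder (merge repeated labels, then drop blanks). This is a naive stand-in for Tesseract's actual beam search, which also weighs the per-column probabilities, but it shows how a run of stray labels survives as a phantom character:

```python
# CTC-style greedy collapse of Lorenzo's per-column examples:
# merge consecutive repeats, then drop the blank label.
BLANK = "[BLANK]"

def collapse(columns):
    out = []
    prev = None
    for c in columns:
        if c != prev and c != BLANK:
            out.append(c)
        prev = c
    return "".join(out)

# Greedy collapse still keeps the lone N; beam search, using the full
# probabilities, can usually recover "MF" here.
print(collapse("M M M M N M M M [BLANK] F F F F".split()))  # MNMF
# A longer run of N columns survives as a phantom character.
print(collapse("M M M N N N M M".split()))  # MNM
```
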
I used the attached script for the boxes.

Lorenzo

On Fri, 19 Jul 2019 at 07:25, ElGato ElMago <[email protected]> wrote:

Hi,

Let's call them phantom characters, then.

Was psm 7 the solution for issue 1778? None of the psm options solved my
problem, though I do see different output.

I use tesseract 5.0-alpha mostly, but 4.1 showed the same results anyway.
How did you get the bounding box for each character? Alto and lstmbox only
show a bbox for a group of characters.

ElMagoElGato

On Wednesday, July 17, 2019 at 18:58:31 UTC+9, Lorenzo Blz wrote:

Phantom characters here for me too:

https://github.com/tesseract-ocr/tesseract/issues/1778

Are you using 4.1? Bounding boxes were fixed in 4.1; maybe this was also
improved.

I wrote some code that uses the symbol iterator to discard symbols that are
clearly duplicated: too small, overlapping, etc. But it was not easy to
make it work decently, and it is not 100% reliable, with both false
negatives and false positives. I cannot share the code, and it is quite
ugly anyway.
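A post-filter of that kind might look roughly like the following, working over (character, bbox, confidence) tuples taken from the symbol iterator or hocr output. The thresholds are invented for illustration, and, as noted above, such a filter is not 100% reliable:

```python
def filter_phantoms(symbols, min_width=5, max_overlap=0.6):
    """symbols: list of (char, (x1, y1, x2, y2), conf) in reading order.
    Drop glyphs that are suspiciously narrow, or whose box is mostly
    shared with the next glyph's box."""
    kept = []
    for i, (ch, (x1, y1, x2, y2), conf) in enumerate(symbols):
        width = x2 - x1
        if width < min_width:
            continue  # too thin to be a real character
        if i + 1 < len(symbols):
            nx1, _, nx2, _ = symbols[i + 1][1]
            overlap = max(0, min(x2, nx2) - max(x1, nx1))
            if overlap / max(width, 1) > max_overlap:
                continue  # mostly the same horizontal span as the next symbol
        kept.append((ch, (x1, y1, x2, y2), conf))
    return kept

# The phantom '<' from the hocr snippet earlier in the thread: only
# 3 px wide and inside the following ';' box, so it gets dropped.
syms = [("<", (1259, 902, 1262, 933), 98.86),
        (";", (1259, 904, 1281, 933), 99.02)]
print("".join(ch for ch, _, _ in filter_phantoms(syms)))
```
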
Here is another MRZ model, with training data:

https://github.com/DoubangoTelecom/tesseractMRZ

Lorenzo

On Wed, 17 Jul 2019 at 11:26, Claudiu <[email protected]> wrote:

I'm getting the "phantom character" issue as well, using the OCRB model
that Shree trained on MRZ lines. For example, for a 0 it will sometimes add
both a 0 and an O to the output, thus outputting 45 characters total
instead of 44. I haven't looked at the bounding-box output yet, but I
suspect a phantom thin character is added somewhere that I can discard, or
maybe two characters will have the same bounding box. If anyone else has
fixed this issue further up (e.g. so the output doesn't contain the phantom
characters in the first place), I'd be interested.

On Wed, Jul 17, 2019 at 10:01 AM ElGato ElMago <[email protected]> wrote:

Hi,

I'll go back to more training later. Before doing so, I'd like to
investigate the results a little. The hocr and lstmbox options give some
details of the positions of characters.
The results show positions that perfectly correspond to the letters in the
image, but the text output contains a character that obviously does not
exist.

Then I found a config file, 'lstmdebug', that generates far more
information. I hope it explains what happened with each character. I have
yet to read the debug output, but I'd appreciate it if someone could tell
me how to read it, because it's really complex.

Regards,
ElMagoElGato

On Friday, June 14, 2019 at 19:58:49 UTC+9, shree wrote:

See https://github.com/Shreeshrii/tessdata_MICR

I have uploaded my files there.

https://github.com/Shreeshrii/tessdata_MICR/blob/master/MICR.sh is the bash
script that runs the training.

You can modify it as needed. Please note this is for legacy/base tesseract
--oem 0.

On Fri, Jun 14, 2019 at 1:26 PM ElGato ElMago <[email protected]> wrote:

Thanks a lot, shree. It seems you know everything.

I tried the MICR0.traineddata and the first two mcr.traineddata files. The
last one was blocked by the browser. Each of the traineddata files gave
mixed results.
All of them get the symbols fairly well, but they insert spaces randomly
and read some numbers wrong.

MICR0 seems the best among them. Did you mean that you'd be able to update
it? It very often produces a triple D where there's only one, and so on.

Also, I tried to fine-tune from MICR0, but I found that I need to change
language-specific.sh, which specifies some parameters for each language. Do
you have any guidance for it?

On Friday, June 14, 2019 at 1:48:40 UTC+9, shree wrote:

See
http://www.devscope.net/Content/ocrchecks.aspx
https://github.com/BigPino67/Tesseract-MICR-OCR
https://groups.google.com/d/msg/tesseract-ocr/obWI4cz8rXg/6l82hEySgOgJ

On Mon, Jun 10, 2019 at 11:21 AM ElGato ElMago <[email protected]> wrote:

It would be nice if there were traineddata out there, but I didn't find
any. I see free fonts and commercial OCR software, but no traineddata. The
tessdata repository obviously doesn't have one, either.

On Saturday, June 8, 2019 at 1:52:10 UTC+9, shree wrote:

Please also search for existing MICR traineddata files.
On Thu, Jun 6, 2019 at 1:09 PM ElGato ElMago <[email protected]> wrote:

So I did several tests from scratch. In the last attempt, I made a training
text with 4,000 lines in the following format:

110004310510< <02 :4002=0181:801= 0008752 <00039 ;0000001000;

and combined it with eng.digits.training_text, in which the symbols are
converted to E13B symbols. That makes about 12,000 lines of training text.
It's amazing that this generates a decent reader out of nowhere, but it is
still not very good. For example:

<01 :1901=1386:021= 1111001<10001< ;0000090134;

is the result on the attached image. It's close, but the last '<' in the
result text doesn't exist in the image. It's a small failure, but it causes
greater trouble in parsing.

What would you suggest from here to increase accuracy?
- Increase the number of lines in the training text
- Mix more variations into the training text
- Increase the number of iterations
- Investigate the wrong reads one by one
- Or something else?

Also, I referred to engrestrict*.* and could generate a similar result with
the fine-tuning-from-full method. It seems a bit faster to get to the same
level, but it also stops at a merely 'good' level. I can go either way if
it takes me to the bright future.

Regards,
ElMagoElGato

On Thursday, May 30, 2019 at 15:56:02 UTC+9, ElGato ElMago wrote:

Thanks a lot, Shree. I'll look into it.
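One more option when ground truth is available for a test image: inserted phantoms can be located mechanically by diffing the OCR output against the expected line. A small sketch using the stray '<' example from earlier in the thread:

```python
import difflib

def inserted_chars(expected, actual):
    """Return (position, text) for runs present in `actual` but not in
    `expected` -- i.e. likely phantom characters."""
    sm = difflib.SequenceMatcher(a=expected, b=actual, autojunk=False)
    return [(j1, actual[j1:j2])
            for op, i1, i2, j1, j2 in sm.get_opcodes()
            if op == "insert"]

expected = "<01 :1901=1386:021= 1111001<10001 ;0000090134;"
actual   = "<01 :1901=1386:021= 1111001<10001< ;0000090134;"
print(inserted_chars(expected, actual))  # the extra '<' and its position
```
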
On Thursday, May 30, 2019 at 14:39:52 UTC+9, shree wrote:

See https://github.com/Shreeshrii/tessdata_shreetest

Look at the files engrestrict*.* and also
https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.training_text

Create a training text of about 100 lines and finetune for 400 lines.

On Thu, May 30, 2019 at 9:38 AM ElGato ElMago <[email protected]> wrote:

I had about 14 lines, as attached. How many lines would you recommend?

Fine-tuning gives much better results, but it tends to pick characters
other than those in E13B, which has only 14 characters: 0 through 9 and 4
symbols. I thought training from scratch would eliminate such confusion.

On Thursday, May 30, 2019 at 10:43:08 UTC+9, shree wrote:

For training from scratch, a large training text and hundreds of thousands
of iterations are recommended.
If you are just fine-tuning for a font, try to follow the instructions for
training for impact, with your font.

On Thu, 30 May 2019, 06:05 ElGato ElMago <[email protected]> wrote:

Thanks, Shree.

Yes, I saw the instructions. The steps I took are as follows.

Using tesstrain.sh:

src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata \
  --fontlist "E13Bnsd" --output_dir ~/tesstutorial/e13beval \
  --training_text ../langdata/eng/eng.training_e13b_text

Training from scratch:

mkdir -p ~/tesstutorial/e13boutput
src/training/lstmtraining --debug_interval 100 \
  --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
  --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
  --model_output
~/tesstutorial/e13boutput/base \
  --learning_rate 20e-4 \
  --train_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
  --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
  --max_iterations 5000 &>~/tesstutorial/e13boutput/basetrain.log

Test with base_checkpoint:

src/training/lstmeval --model ~/tesstutorial/e13boutput/base_checkpoint \
  --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
  --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt

Combining the output files:

src/training/lstmtraining --stop_training \
  --continue_from ~/tesstutorial/e13boutput/base_checkpoint \
  --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
  --model_output ~/tesstutorial/e13boutput/eng.traineddata

Test with eng.traineddata:

tesseract e13b.png out --tessdata-dir /home/koichi/tesstutorial/e13boutput
The training from scratch ended with:

At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, char train=0%,
word train=0%, skip ratio=0%, New best char error = 0 wrote best
model:/home/koichi/tesstutorial/e13

--
You received this message because you are subscribed to the Google Groups
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/tesseract-ocr/77754ce0-ecac-4ec1-9d35-3acaac29508d%40googlegroups.com.

