So I did both: I renamed the file and added a link on the wiki page.

On Saturday, August 10, 2019 at 0:35:14 UTC+9, shree wrote:

I suggest renaming the traineddata file from eng. to e13b, or another
similarly descriptive name, and also adding a link to it on the data file
contributions wiki page.

On Fri, 9 Aug 2019, 20:08 'Mamadou' via tesseract-ocr <[email protected]> wrote:

On Friday, August 9, 2019 at 10:40:15 AM UTC+2, ElGato ElMago wrote:

I added eng.traineddata and LICENSE. I used my account name in the license
file. I don't know whether that's appropriate; please tell me if it's not.

It's OK.
Thanks. I'll share our dataset (real-life samples) in the coming days.

On Friday, August 9, 2019 at 16:17:41 UTC+9, Mamadou wrote:

On Friday, August 9, 2019 at 7:31:03 AM UTC+2, ElGato ElMago wrote:

Here's what I shared on GitHub. I hope it's of use to somebody.

https://github.com/ElMagoElGato/tess_e13b_training

Thanks for sharing your experience with us.
Is it possible to share your Tesseract model (xxx.traineddata)?
We're building a dataset from real-life images, like what we have already
done for MRZ
(https://github.com/DoubangoTelecom/tesseractMRZ/tree/master/dataset).
Your model would help us automate the annotation and will speed up our
development. Of course, we'll have to correct the annotations manually, but
it will still be faster for us.
Also, please add a license to your repo so that we know whether we have the
right to use it.

On Thursday, August 8, 2019 at 9:35:17 UTC+9, ElGato ElMago wrote:

OK, I'll do so. I need to reorganize the naming and so on a little bit.
It will be out there soon.

On Wednesday, August 7, 2019 at 21:11:01 UTC+9, Mamadou wrote:

On Wednesday, August 7, 2019 at 2:36:52 AM UTC+2, ElGato ElMago wrote:

Hi,

I'm thinking of sharing it, of course.
What is the best way to do it? After all this, my contribution is only how
I prepared the training text, and even that consists of Shree's text and
mine. The instructions and tools I used already exist.

If you have a GitHub account, just create a repo and publish the data and
instructions.

ElMagoElGato

On Wednesday, August 7, 2019 at 8:20:02 UTC+9, Mamadou wrote:

Hello,
Are you planning to release the dataset or models?
I'm working on the same subject and planning to share both under BSD terms.

On Tuesday, August 6, 2019 at 10:11:40 AM UTC+2, ElGato ElMago wrote:

Hi,

FWIW, I got to the point where I can feel happy with the accuracy. As the
images in the previous post show, the symbols, especially the on-us symbol
and the amount symbol, were being confused with each other or with other
characters. I added many more symbols to the training text and formed words
that start with a symbol. One example is as follows:

9;:;=;<;< <0<1<3<4;6;8;9;:;=;

I randomly made 8,000 lines like this. When fine-tuning from eng, 5,000
iterations was almost enough. The amount symbol is still confused a little
when it's followed by 0. Fine-tuning tends to be dragged around by small
details. I'll have to think of something to make further improvements.

Training from scratch produced slightly more stable traineddata. It doesn't
confuse the symbols as often, but it tends to generate extra spaces. By
10,000 iterations those spaces were gone and recognition became very solid.
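For what it's worth, lines like the example above can be generated mechanically. A minimal sketch, assuming the ASCII stand-ins : ; < = for the four E13B symbols; the word lengths, word counts, and file name here are made up for illustration, not the script actually used in this thread:

```python
import random

# ASCII stand-ins for the E13B set as used in this thread:
# digits 0-9 plus the four symbols written as : ; < =
E13B_CHARS = "0123456789:;<="
SYMBOLS = ":;<="

def random_word(max_len=8):
    # Start every word with a symbol so the model sees plenty of
    # symbol-initial contexts, as described above.
    length = random.randint(2, max_len)
    return random.choice(SYMBOLS) + "".join(
        random.choice(E13B_CHARS) for _ in range(length - 1))

def random_line(words=4):
    return " ".join(random_word() for _ in range(words))

# Write 8,000 random lines as a training text fragment.
with open("e13b.training_text", "w") as f:
    for _ in range(8000):
        f.write(random_line() + "\n")
```
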
I thought I might have to do image and box file training, but I guess it's
not needed this time.

ElMagoElGato

On Friday, July 26, 2019 at 14:08:06 UTC+9, ElGato ElMago wrote:

Hi,

Well, I read the description of ScrollView
(https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging) and it
says:

To show the characters, deselect DISPLAY/Bounding Boxes, select
DISPLAY/Polygonal Approx and then select OTHER/Uniform display.

It basically works, but for some reason it doesn't work on my e13b image
and ends up with a blue screen. Anyway, it shows each box separately when a
character consists of multiple boxes. I'd like to show the box for the
whole character. ScrollView doesn't do that, at least not yet. I'll do it
on my own.

ElMagoElGato

On Wednesday, July 24, 2019 at 14:10:46 UTC+9, ElGato ElMago wrote:

Hi,

I got this result from hocr. This is where one of the phantom characters
comes from:

<span class='ocrx_cinfo' title='x_bboxes 1259 902 1262 933; x_conf 98.864532'><</span>
<span class='ocrx_cinfo' title='x_bboxes 1259 904 1281 933; x_conf 99.018097'>;</span>

The first character is the phantom. It starts at the same x position where
the second, real character starts, and it is only 3 points wide. I attach
ScrollView screenshots that visualize this.
[image: 2019-07-24-132643_854x707_scrot.png][image: 2019-07-24-132800_854x707_scrot.png]

There seem to be some more cases that cause phantom characters. I'll look
into them. But I have a trivial question now: I made ScrollView show these
displays by accidentally clicking the Display->Blamer menu. There is a
Bounding Boxes menu below it, but it ends up showing a blue screen, though
it briefly shows the boxes on the way. Can I use this menu at all? It would
be very useful.

[image: 2019-07-24-140739_854x707_scrot.png]

On Tuesday, July 23, 2019 at 17:10:36 UTC+9, ElGato ElMago wrote:

It's great! Perfect! Thanks a lot!

On Tuesday, July 23, 2019 at 10:56:58 UTC+9, shree wrote:

See https://github.com/tesseract-ocr/tesseract/issues/2580

On Tue, 23 Jul 2019, 06:23 ElGato ElMago <[email protected]> wrote:

Hi,

I read the output of hocr with lstm_choice_mode = 4, as in pull request
2554. It shows the candidates for each character, but it doesn't show the
bounding box of each character; it only shows the box for a whole word.

I see bounding boxes for each character in the comments on pull request
2576. How can I do that? Do I have to look into the source code and
produce such output on my own?

On Friday, July 19, 2019 at 18:40:49 UTC+9, ElGato ElMago wrote:

Lorenzo,

I haven't been checking psm too much.
I will turn to those options after I see how it goes with the bounding
boxes.

Shree,

I see the merges in the git log and also see that the new option
lstm_choice_amount works now. I guess my executable is up to date, though I
still see the phantom character. Hocr makes huge, complex output; I'll take
some time to read it.

On Friday, July 19, 2019 at 18:20:55 UTC+9, Claudiu wrote:

Is there any way to pass bounding boxes to the LSTM to use? We have an
algorithm that cleanly gets bounding boxes of MRZ characters. However, the
results using psm 10 are worse than passing the whole line in. Yet when we
pass the whole line in, we get these phantom characters.

Should PSM 10 mode work? It often returns "no character" where there
clearly is one. I can supply a test case if it is expected to work well.

On Fri, Jul 19, 2019 at 11:06 AM ElGato ElMago <[email protected]> wrote:

Lorenzo,

We both have the same case. It seems a solution to this problem would save
a lot of people.

Shree,

I pulled the current head of the master branch, but it doesn't seem to
contain the merges you pointed to, which were merged 3 to 4 days ago. How
can I get them?
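Back on the hocr output: the per-character ocrx_cinfo spans quoted earlier in the thread can be pulled apart mechanically to surface suspiciously narrow glyphs. A rough sketch, with the regex matched to that exact span format (real hocr output may vary):

```python
import re

# Parse ocrx_cinfo spans (as quoted earlier in this thread) into
# (character, box coordinates, confidence). The attribute layout is
# assumed to match that snippet exactly.
CINFO = re.compile(
    r"title='x_bboxes (\d+) (\d+) (\d+) (\d+); x_conf ([\d.]+)'>(.*?)</span>")

hocr = """
<span class='ocrx_cinfo' title='x_bboxes 1259 902 1262 933; x_conf 98.864532'><</span>
<span class='ocrx_cinfo' title='x_bboxes 1259 904 1281 933; x_conf 99.018097'>;</span>
"""

for m in CINFO.finditer(hocr):
    x1, y1, x2, y2 = (int(g) for g in m.group(1, 2, 3, 4))
    conf, ch = float(m.group(5)), m.group(6)
    # The phantom '<' shows up here as a 3-pixel-wide box nested
    # inside the following character's box.
    print(repr(ch), "width:", x2 - x1, "conf:", conf)
```
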
ElMagoElGato

On Friday, July 19, 2019 at 17:02:53 UTC+9, Lorenzo Blz wrote:

PSM 7 was a partial solution for my specific case: it improved the
situation but did not solve it. I also could not use it in some other
cases.

The proper solution is very likely more training with more data; some data
augmentation would probably help if data is scarce. Less training might
also help, if the training is not being done correctly.

There are also similar issues on GitHub:

https://github.com/tesseract-ocr/tesseract/issues/1465
...

The LSTM engine works like this: it scans the image and for each "pixel
column" produces something like:

M M M M N M M M [BLANK] F F F F

(here I report only the highest-probability characters)

In the example above an M is partially seen as an N. This is normal, and
another step of the algorithm (beam search, I think) tries to aggregate the
correct characters back together.

I think cases like this:

M M M N N N M M

are what give the phantom characters. More training should reduce the
source of the problem, or a painful analysis of the bounding boxes might
fix some cases.
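That aggregation step can be illustrated with a toy best-path decoder (merge repeated labels, then drop blanks). This is a naive stand-in for Tesseract's actual beam search, which also weighs the per-column probabilities, but it shows how a run of stray labels survives as a phantom character:

```python
# CTC-style greedy collapse of Lorenzo's per-column examples:
# merge consecutive repeats, then drop the blank label.
BLANK = "[BLANK]"

def collapse(columns):
    out = []
    prev = None
    for c in columns:
        if c != prev and c != BLANK:
            out.append(c)
        prev = c
    return "".join(out)

# Greedy collapse still keeps the lone N; beam search, using the full
# probabilities, can usually recover "MF" here.
print(collapse("M M M M N M M M [BLANK] F F F F".split()))  # MNMF
# A longer run of N columns survives as a phantom character.
print(collapse("M M M N N N M M".split()))  # MNM
```
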
I used the attached script for the boxes.

Lorenzo

On Fri, 19 Jul 2019 at 07:25, ElGato ElMago <[email protected]> wrote:

Hi,

Let's call them phantom characters, then.

Was psm 7 the solution for issue 1778? None of the psm options solved my
problem, though I do see different output.

I use tesseract 5.0-alpha mostly, but 4.1 showed the same results anyway.
How did you get the bounding box for each character? Alto and lstmbox only
show a bbox for a group of characters.

ElMagoElGato

On Wednesday, July 17, 2019 at 18:58:31 UTC+9, Lorenzo Blz wrote:

Phantom characters here for me too:

https://github.com/tesseract-ocr/tesseract/issues/1778

Are you using 4.1? Bounding boxes were fixed in 4.1; maybe this was also
improved.

I wrote some code that uses the symbol iterator to discard symbols that are
clearly duplicated: too small, overlapping, etc. But it was not easy to
make it work decently, and it is not 100% reliable, with both false
negatives and false positives. I cannot share the code, and it is quite
ugly anyway.
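A post-filter of that kind might look roughly like the following, working over (character, bbox, confidence) tuples taken from the symbol iterator or hocr output. The thresholds are invented for illustration, and, as noted above, such a filter is not 100% reliable:

```python
def filter_phantoms(symbols, min_width=5, max_overlap=0.6):
    """symbols: list of (char, (x1, y1, x2, y2), conf) in reading order.
    Drop glyphs that are suspiciously narrow, or whose box is mostly
    shared with the next glyph's box."""
    kept = []
    for i, (ch, (x1, y1, x2, y2), conf) in enumerate(symbols):
        width = x2 - x1
        if width < min_width:
            continue  # too thin to be a real character
        if i + 1 < len(symbols):
            nx1, _, nx2, _ = symbols[i + 1][1]
            overlap = max(0, min(x2, nx2) - max(x1, nx1))
            if overlap / max(width, 1) > max_overlap:
                continue  # mostly the same horizontal span as the next symbol
        kept.append((ch, (x1, y1, x2, y2), conf))
    return kept

# The phantom '<' from the hocr snippet earlier in the thread: only
# 3 px wide and inside the following ';' box, so it gets dropped.
syms = [("<", (1259, 902, 1262, 933), 98.86),
        (";", (1259, 904, 1281, 933), 99.02)]
print("".join(ch for ch, _, _ in filter_phantoms(syms)))
```
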
Here is another MRZ model, with training data:

https://github.com/DoubangoTelecom/tesseractMRZ

Lorenzo

On Wed, 17 Jul 2019 at 11:26, Claudiu <[email protected]> wrote:

I'm getting the "phantom character" issue as well, using the OCRB model
that Shree trained on MRZ lines. For example, for a 0 it will sometimes add
both a 0 and an O to the output, thus outputting 45 characters total
instead of 44. I haven't looked at the bounding-box output yet, but I
suspect a phantom thin character is added somewhere that I can discard, or
maybe two characters will have the same bounding box. If anyone else has
fixed this issue further up (e.g. so the output doesn't contain the phantom
characters in the first place), I'd be interested.

On Wed, Jul 17, 2019 at 10:01 AM ElGato ElMago <[email protected]> wrote:

Hi,

I'll go back to more training later. Before doing so, I'd like to
investigate the results a little. The hocr and lstmbox options give some
details of the positions of characters.
The results show positions that perfectly correspond to the letters in the
image, but the text output contains a character that obviously does not
exist.

Then I found a config file, 'lstmdebug', that generates far more
information. I hope it explains what happened with each character. I have
yet to read the debug output, but I'd appreciate it if someone could tell
me how to read it, because it's really complex.

Regards,
ElMagoElGato

On Friday, June 14, 2019 at 19:58:49 UTC+9, shree wrote:

See https://github.com/Shreeshrii/tessdata_MICR

I have uploaded my files there.

https://github.com/Shreeshrii/tessdata_MICR/blob/master/MICR.sh is the bash
script that runs the training.

You can modify it as needed. Please note this is for legacy/base tesseract
--oem 0.

On Fri, Jun 14, 2019 at 1:26 PM ElGato ElMago <[email protected]> wrote:

Thanks a lot, shree. It seems you know everything.

I tried the MICR0.traineddata and the first two mcr.traineddata files. The
last one was blocked by the browser. Each of the traineddata files gave
mixed results.
All of them get the symbols fairly well, but they insert spaces randomly
and read some numbers wrong.

MICR0 seems the best among them. Did you mean that you'd be able to update
it? It very often produces a triple D where there's only one, and so on.

Also, I tried to fine-tune from MICR0, but I found that I need to change
language-specific.sh, which specifies some parameters for each language. Do
you have any guidance for it?

On Friday, June 14, 2019 at 1:48:40 UTC+9, shree wrote:

See
http://www.devscope.net/Content/ocrchecks.aspx
https://github.com/BigPino67/Tesseract-MICR-OCR
https://groups.google.com/d/msg/tesseract-ocr/obWI4cz8rXg/6l82hEySgOgJ

On Mon, Jun 10, 2019 at 11:21 AM ElGato ElMago <[email protected]> wrote:

It would be nice if there were traineddata out there, but I didn't find
any. I see free fonts and commercial OCR software, but no traineddata. The
tessdata repository obviously doesn't have one, either.

On Saturday, June 8, 2019 at 1:52:10 UTC+9, shree wrote:

Please also search for existing MICR traineddata files.
On Thu, Jun 6, 2019 at 1:09 PM ElGato ElMago <[email protected]> wrote:

So I did several tests from scratch. In the last attempt, I made a training
text with 4,000 lines in the following format:

110004310510< <02 :4002=0181:801= 0008752 <00039 ;0000001000;

and combined it with eng.digits.training_text, in which the symbols are
converted to E13B symbols. That makes about 12,000 lines of training text.
It's amazing that this generates a decent reader out of nowhere, but it is
still not very good. For example:

<01 :1901=1386:021= 1111001<10001< ;0000090134;

is the result on the attached image. It's close, but the last '<' in the
result text doesn't exist in the image. It's a small failure, but it causes
greater trouble in parsing.

What would you suggest from here to increase accuracy?
- Increase the number of lines in the training text
- Mix more variations into the training text
- Increase the number of iterations
- Investigate the wrong reads one by one
- Or something else?

Also, I referred to engrestrict*.* and could generate a similar result with
the fine-tuning-from-full method. It seems a bit faster to get to the same
level, but it also stops at a merely 'good' level. I can go either way if
it takes me to the bright future.

Regards,
ElMagoElGato

On Thursday, May 30, 2019 at 15:56:02 UTC+9, ElGato ElMago wrote:

Thanks a lot, Shree. I'll look into it.
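One more option when ground truth is available for a test image: inserted phantoms can be located mechanically by diffing the OCR output against the expected line. A small sketch using the stray '<' example from earlier in the thread:

```python
import difflib

def inserted_chars(expected, actual):
    """Return (position, text) for runs present in `actual` but not in
    `expected` -- i.e. likely phantom characters."""
    sm = difflib.SequenceMatcher(a=expected, b=actual, autojunk=False)
    return [(j1, actual[j1:j2])
            for op, i1, i2, j1, j2 in sm.get_opcodes()
            if op == "insert"]

expected = "<01 :1901=1386:021= 1111001<10001 ;0000090134;"
actual   = "<01 :1901=1386:021= 1111001<10001< ;0000090134;"
print(inserted_chars(expected, actual))  # the extra '<' and its position
```
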
On Thursday, May 30, 2019 at 14:39:52 UTC+9, shree wrote:

See https://github.com/Shreeshrii/tessdata_shreetest

Look at the files engrestrict*.* and also
https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.training_text

Create a training text of about 100 lines and finetune for 400 lines.

On Thu, May 30, 2019 at 9:38 AM ElGato ElMago <[email protected]> wrote:

I had about 14 lines, as attached. How many lines would you recommend?

Fine-tuning gives much better results, but it tends to pick characters
other than those in E13B, which has only 14 characters: 0 through 9 and 4
symbols. I thought training from scratch would eliminate such confusion.

On Thursday, May 30, 2019 at 10:43:08 UTC+9, shree wrote:

For training from scratch, a large training text and hundreds of thousands
of iterations are recommended.
If you are just fine-tuning for a font, try to follow the instructions for
training for impact, with your font.

On Thu, 30 May 2019, 06:05 ElGato ElMago <[email protected]> wrote:

Thanks, Shree.

Yes, I saw the instructions. The steps I took are as follows.

Using tesstrain.sh:

src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata \
  --fontlist "E13Bnsd" --output_dir ~/tesstutorial/e13beval \
  --training_text ../langdata/eng/eng.training_e13b_text

Training from scratch:

mkdir -p ~/tesstutorial/e13boutput
src/training/lstmtraining --debug_interval 100 \
  --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
  --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
  --model_output
~/tesstutorial/e13boutput/base \
  --learning_rate 20e-4 \
  --train_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
  --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
  --max_iterations 5000 &>~/tesstutorial/e13boutput/basetrain.log

Test with base_checkpoint:

src/training/lstmeval --model ~/tesstutorial/e13boutput/base_checkpoint \
  --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
  --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt

Combining the output files:

src/training/lstmtraining --stop_training \
  --continue_from ~/tesstutorial/e13boutput/base_checkpoint \
  --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
  --model_output ~/tesstutorial/e13boutput/eng.traineddata

Test with eng.traineddata:

tesseract e13b.png out --tessdata-dir /home/koichi/tesstutorial/e13boutput
The training from scratch ended with:

At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, char train=0%,
word train=0%, skip ratio=0%, New best char error = 0 wrote best
model:/home/koichi/tesstutorial/e13

--
You received this message because you are subscribed to the Google Groups
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/tesseract-ocr/77754ce0-ecac-4ec1-9d35-3acaac29508d%40googlegroups.com.

