Traineddata size depends on many things, not just the number of images. If your unicharset and number of fonts haven't changed, then the size may be similar.
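To see where the bytes actually go, the offsets that combine_tessdata printed in your log can be turned into per-component sizes. Below is a minimal Python sketch, not anything Tesseract ships: it assumes components are stored back to back in offset order, and the type-index names in `TYPE_NAMES` are my assumption about the 3.02 ordering (worth checking against tessdatamanager.h in your source tree). The file size is the 318022 bytes you reported.

```python
# Offsets copied from the combine_tessdata output in the quoted log.
# -1 means the component is absent from the combined file.
OFFSETS = {
    0: -1, 1: 140, 2: -1, 3: 2559, 4: 309717, 5: 309988,
    6: -1, 7: -1, 8: -1, 9: -1, 10: -1, 11: -1, 12: -1,
    13: 317370, 14: -1, 15: -1, 16: -1,
}
FILE_SIZE = 318022  # size of the new traineddata, as reported in the email

# ASSUMPTION: names for the type indices, recalled from Tesseract 3.02.
# Verify against tessdatamanager.h before relying on them.
TYPE_NAMES = {
    1: "unicharset", 3: "inttemp", 4: "pffmtable", 5: "normproto",
    7: "word-dawg", 9: "freq-dawg", 13: "shapetable",
}


def component_sizes(offsets, file_size):
    """Return {type_index: size_in_bytes} for every component that is present.

    Assumes each component runs from its offset up to the next present
    offset, and the last one runs to the end of the file.
    """
    present = sorted((off, idx) for idx, off in offsets.items() if off >= 0)
    sizes = {}
    for pos, (off, idx) in enumerate(present):
        end = present[pos + 1][0] if pos + 1 < len(present) else file_size
        sizes[idx] = end - off
    return sizes


sizes = component_sizes(OFFSETS, FILE_SIZE)
header_size = min(off for off in OFFSETS.values() if off >= 0)

for idx in sorted(sizes):
    name = TYPE_NAMES.get(idx, "type %d" % idx)
    print("%-12s %8d bytes" % (name, sizes[idx]))
print("%-12s %8d bytes" % ("header", header_size))

# Components we have a (hypothetical) name for but that are missing.
missing = [TYPE_NAMES[i] for i in sorted(TYPE_NAMES) if OFFSETS[i] < 0]
print("missing:", ", ".join(missing))
```

If the name mapping is right, inttemp accounts for almost all of the file (307158 of 318022 bytes), and types 7 and 9 at -1 would mean neither DAWG ended up in the traineddata. That may be because the batch file writes them as patentesar.freq-dawg and patentesar.word-dawg, without the patentesar.normal.exp0. prefix passed to combine_tessdata, and never moves them into tessdata.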
The traineddata file also contains the wordlists, so if you are using a smaller wordlist than the one in the original eng.traineddata, the size may be smaller.

You can also try the latest version from https://github.com/UB-Mannheim/tesseract/wiki

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, Jun 14, 2017 at 11:39 PM, Andres <[email protected]> wrote:

> Dear all,
>
> I've been training tesseract with a multipage tiff file with 5 pages and
> approx 12000 boxes.
>
> Now I increased the samples in the tiff file, I have 12 pages and 29241
> boxes.
>
> My concern is that my previous traineddata file size is 321817 bytes and
> the new one is 318022 bytes. I don't know if it should be bigger, as I have
> no idea about the file format, but I downloaded one version
> of eng.traineddata from the tesseract repository and I see that its size is
> 21876572 bytes. Could it be that perhaps it is computing just the results
> of the first page? I see in the log that at least, at the beginning of the
> process, it is processing all the pages.
>
> I am using Tesseract 3.02 on Windows.
>
> I will paste my log here, and below that, my batch file, the one that I
> use for training.
>
> Log:
>
> A:\training>tesseract.exe patentesar.normal.exp0.tif patentesar.normal.exp0 nobatch box.train.stderr
> Tesseract Open Source OCR Engine v3.02 with Leptonica
> Page 1 of 12
> row xheight=88.6667, but median xheight = 59.6
> row xheight=81.8333, but median xheight = 59.6
> row xheight=75, but median xheight = 59.6
> row xheight=71.1875, but median xheight = 59.6
> row xheight=71.1875, but median xheight = 59.6
> row xheight=71.1875, but median xheight = 59.6
> row xheight=68.5333, but median xheight = 59.6
> row xheight=67.3333, but median xheight = 59.6
> APPLY_BOXES:
> Boxes read from boxfile: 1671
> Found 1671 good blobs.
> TRAINING ...
> Font name = normal
> Generated training data for 52 words
> Page 2 of 12
> APPLY_BOXES:
> Boxes read from boxfile: 2003
> Found 2003 good blobs.
> Generated training data for 58 words
> Page 3 of 12
> FAIL!
> APPLY_BOXES: boxfile line 358/0 ((383,4901),(428,4980)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 529/D ((146,4401),(187,4480)): FAILURE! Couldn't find a matching blob
> APPLY_BOXES:
> Boxes read from boxfile: 2128
> Boxes failed resegmentation: 2
> Found 2126 good blobs.
> Generated training data for 60 words
> Page 4 of 12
> APPLY_BOXES:
> Boxes read from boxfile: 2257
> Found 2257 good blobs.
> Generated training data for 62 words
> Page 5 of 12
> APPLY_BOXES:
> Boxes read from boxfile: 2381
> Found 2381 good blobs.
> Generated training data for 64 words
> Page 6 of 12
> FAIL!
> APPLY_BOXES: boxfile line 2070/D ((2141,967),(2182,1037)): FAILURE! Couldn't find a matching blob
> APPLY_BOXES:
> Boxes read from boxfile: 2460
> Boxes failed resegmentation: 1
> Found 2459 good blobs.
> Generated training data for 65 words
> Page 7 of 12
> FAIL!
> APPLY_BOXES: boxfile line 2082/B ((867,1084),(910,1151)): FAILURE! Couldn't find a matching blob
> APPLY_BOXES:
> Boxes read from boxfile: 2568
> Boxes failed resegmentation: 1
> Found 2567 good blobs.
> Generated training data for 67 words
> Page 8 of 12
> APPLY_BOXES:
> Boxes read from boxfile: 2680
> Found 2680 good blobs.
> Generated training data for 68 words
> Page 9 of 12
> FAIL!
> APPLY_BOXES: boxfile line 2391/D ((1184,910),(1220,973)): FAILURE! Couldn't find a matching blob
> APPLY_BOXES:
> Boxes read from boxfile: 2818
> Boxes failed resegmentation: 1
> Found 2817 good blobs.
> Generated training data for 70 words
> Page 10 of 12
> FAIL!
> APPLY_BOXES: boxfile line 1248/0 ((1468,3440),(1502,3501)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 2211/0 ((342,1491),(382,1550)): FAILURE!
> Couldn't find a matching blob
> APPLY_BOXES:
> Boxes read from boxfile: 3000
> Boxes failed resegmentation: 2
> Found 2998 good blobs.
> Generated training data for 73 words
> Page 11 of 12
> FAIL!
> APPLY_BOXES: boxfile line 1280/6 ((2054,3645),(2087,3702)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 2750/0 ((496,1051),(528,1105)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 3098/D ((2229,530),(2254,583)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 3347/Q ((1167,90),(1197,142)): FAILURE! Couldn't find a matching blob
> APPLY_BOXES:
> Boxes read from boxfile: 3370
> Boxes failed resegmentation: 4
> Found 3366 good blobs.
> Generated training data for 77 words
> Page 12 of 12
> row xheight=28.6667, but median xheight = 33.5161
> row xheight=28.0889, but median xheight = 33.5161
> row xheight=27.1, but median xheight = 33.5161
> row xheight=29, but median xheight = 33.5161
> row xheight=29, but median xheight = 33.5161
> row xheight=29, but median xheight = 33.5161
> FAIL!
> APPLY_BOXES: boxfile line 0/P ((20,5928),(52,5980)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 1/7 ((73,5928),(89,5980)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 2/4 ((110,5928),(141,5980)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 3/1 ((162,5928),(189,5980)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 44/M ((20,5855),(48,5907)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 45/M ((69,5855),(96,5907)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 46/B ((117,5855),(148,5907)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 47/O ((169,5855),(198,5907)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 90/D ((20,5783),(50,5834)): FAILURE!
> Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 91/P ((71,5783),(102,5834)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 92/O ((123,5783),(148,5834)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 93/N ((169,5783),(202,5834)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 136/6 ((20,5711),(46,5762)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 137/P ((67,5711),(103,5762)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 138/X ((124,5711),(146,5762)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 139/M ((167,5711),(190,5762)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 183/M ((20,5639),(51,5690)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 184/1 ((72,5639),(92,5690)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 185/G ((113,5639),(144,5690)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 186/6 ((165,5639),(189,5690)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 229/1 ((20,5567),(44,5618)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 230/T ((65,5567),(89,5618)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 231/N ((110,5567),(141,5618)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 232/O ((162,5567),(196,5618)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 276/T ((20,5496),(44,5546)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 277/F ((65,5496),(91,5546)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 278/G ((112,5496),(140,5546)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 279/5 ((161,5496),(191,5546)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 322/8 ((20,5425),(45,5475)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 323/W ((66,5425),(94,5475)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 324/R ((115,5425),(145,5475)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 325/G ((166,5425),(192,5475)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 370/W ((20,5354),(52,5404)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 371/0 ((73,5354),(102,5404)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 372/G ((123,5354),(155,5404)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 373/H ((176,5354),(201,5404)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 416/2 ((20,5283),(43,5333)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 417/I ((64,5283),(89,5333)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 418/1 ((110,5283),(137,5333)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 419/D ((158,5283),(186,5333)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 463/I ((20,5212),(45,5262)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 464/Q ((66,5212),(92,5262)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 465/K ((113,5212),(144,5262)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 466/E ((165,5212),(186,5262)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 511/G ((20,5142),(48,5191)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 512/Q ((69,5142),(97,5191)): FAILURE!
> Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 513/T ((118,5142),(140,5191)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 514/D ((161,5142),(189,5191)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 517/D ((305,5142),(328,5191)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 558/M ((20,5072),(45,5121)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 559/E ((66,5072),(95,5121)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 560/E ((116,5072),(140,5121)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 561/H ((161,5072),(191,5121)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 606/5 ((20,5002),(51,5051)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 607/I ((72,5002),(102,5051)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 608/M ((123,5002),(149,5051)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 609/I ((170,5002),(192,5051)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 653/0 ((20,4932),(50,4981)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 654/0 ((71,4932),(102,4981)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 655/O ((123,4932),(151,4981)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 656/8 ((172,4932),(199,4981)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 700/0 ((20,4862),(49,4911)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 701/W ((70,4862),(93,4911)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 702/0 ((114,4862),(144,4911)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 703/G ((165,4862),(193,4911)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 747/M ((20,4793),(51,4841)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 748/T ((72,4793),(94,4841)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 749/0 ((115,4793),(150,4841)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 750/R ((171,4793),(198,4841)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 795/C ((20,4724),(46,4772)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 796/7 ((67,4724),(96,4772)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 797/1 ((117,4724),(147,4772)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 843/H ((20,4655),(47,4703)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 844/8 ((68,4655),(95,4703)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 1903/0 ((1824,3398),(1823,3397)): FAILURE! Couldn't find a matching blob
> FAIL!
> APPLY_BOXES: boxfile line 1904/0 ((1844,3398),(1843,3397)): FAILURE! Couldn't find a matching blob
> APPLY_BOXES:
> Boxes read from boxfile: 1905
> Boxes failed resegmentation: 76
> Found 1829 good blobs.
> Generated training data for 48 words
>
> A:\training>unicharset_extractor patentesar.normal.exp0.box
> Extracting unicharset from patentesar.normal.exp0.box
> Wrote unicharset file ./unicharset.
> Press any key to continue . . .
>
> A:\training>mftraining -F font_properties -U unicharset patentesar.normal.exp0.tr
> Read shape table shapetable of 36 shapes
> Reading patentesar.normal.exp0.tr ...
> Warning: no protos/configs for g in CreateIntTemplates()
> Done!
>
> A:\training>mftraining -F font_properties -U unicharset -O patentesar.normal.exp0.unicharset patentesar.normal.exp0.tr
> Read shape table shapetable of 36 shapes
> Reading patentesar.normal.exp0.tr ...
> Warning: no protos/configs for g in CreateIntTemplates()
> Done!
> Press any key to continue . . .
>
> A:\training>cntraining patentesar.normal.exp0.tr
> Reading patentesar.normal.exp0.tr ...
> Clustering ...
>
> Writing normproto ...
> Press any key to continue . . .
>
> A:\training>wordlist2dawg frequent_words_list patentesar.freq-dawg unicharset
> Loading unicharset from 'unicharset'
> Reading word list from 'frequent_words_list'
> Reducing Trie to SquishedDawg
> Writing squished DAWG to 'patentesar.freq-dawg'
> Press any key to continue . . .
>
> A:\training>wordlist2dawg words_list patentesar.word-dawg unicharset
> Loading unicharset from 'unicharset'
> Reading word list from 'words_list'
> Reducing Trie to SquishedDawg
> Writing squished DAWG to 'patentesar.word-dawg'
> Press any key to continue . . .
>
> A:\training>copy /Y normproto patentesar.normal.exp0.normproto
> 1 file(s) copied.
>
> A:\training>copy /Y inttemp patentesar.normal.exp0.inttemp
> 1 file(s) copied.
>
> A:\training>copy /Y pffmtable patentesar.normal.exp0.pffmtable
> 1 file(s) copied.
>
> A:\training>copy /Y Microfeat patentesar.normal.exp0.Microfeat
> The system cannot find the file specified.
>
> A:\training>copy /Y shapetable patentesar.normal.exp0.shapetable
> 1 file(s) copied.
>
> A:\training>copy /Y unicharset patentesar.normal.exp0
> 1 file(s) copied.
>
> A:\training>copy /Y patentesar.normal.exp0.unicharset patentesar.normal.exp0
> 1 file(s) copied.
>
> A:\training>move /Y patentesar.normal.exp0.normproto tessdata
> 1 file(s) moved.
>
> A:\training>move /Y patentesar.normal.exp0.inttemp tessdata
> 1 file(s) moved.
>
> A:\training>move /Y patentesar.normal.exp0.pffmtable tessdata
> 1 file(s) moved.
>
> A:\training>move /Y patentesar.normal.exp0.Microfeat tessdata
> The system cannot find the file specified.
>
> A:\training>move /Y patentesar.normal.exp0.shapetable tessdata
> 1 file(s) moved.
>
> A:\training>move /Y unicharset tessdata
> 1 file(s) moved.
>
> A:\training>move /Y patentesar.normal.exp0.unicharset tessdata
> 1 file(s) moved.
> Press any key to continue . . .
>
> A:\training>combine_tessdata tessdata/patentesar.normal.exp0.
> Combining tessdata files
> TessdataManager combined tesseract data files.
> Offset for type 0 is -1
> Offset for type 1 is 140
> Offset for type 2 is -1
> Offset for type 3 is 2559
> Offset for type 4 is 309717
> Offset for type 5 is 309988
> Offset for type 6 is -1
> Offset for type 7 is -1
> Offset for type 8 is -1
> Offset for type 9 is -1
> Offset for type 10 is -1
> Offset for type 11 is -1
> Offset for type 12 is -1
> Offset for type 13 is 317370
> Offset for type 14 is -1
> Offset for type 15 is -1
> Offset for type 16 is -1
> Press any key to continue . . .
>
> Batch file:
>
> @rem #############################
> @call set_environment.cmd
> @SET PATH="%TESSDATA_PREFIX%";%PATH%
>
> tesseract.exe patentesar.normal.exp0.tif patentesar.normal.exp0 nobatch box.train.stderr
> @pause
>
> unicharset_extractor patentesar.normal.exp0.box
> @pause
>
> mftraining -F font_properties -U unicharset patentesar.normal.exp0.tr
> mftraining -F font_properties -U unicharset -O patentesar.normal.exp0.unicharset patentesar.normal.exp0.tr
> @pause
>
> cntraining patentesar.normal.exp0.tr
> @pause
>
> wordlist2dawg frequent_words_list patentesar.freq-dawg unicharset
> @pause
>
> wordlist2dawg words_list patentesar.word-dawg unicharset
> @pause
> copy /Y normproto patentesar.normal.exp0.normproto
> copy /Y inttemp patentesar.normal.exp0.inttemp
> copy /Y pffmtable patentesar.normal.exp0.pffmtable
> copy /Y Microfeat patentesar.normal.exp0.Microfeat
> copy /Y shapetable patentesar.normal.exp0.shapetable
> copy /Y unicharset patentesar.normal.exp0
> copy /Y patentesar.normal.exp0.unicharset patentesar.normal.exp0
> move /Y patentesar.normal.exp0.normproto tessdata
> move /Y patentesar.normal.exp0.inttemp tessdata
> move /Y patentesar.normal.exp0.pffmtable tessdata
> move /Y patentesar.normal.exp0.Microfeat tessdata
> move /Y patentesar.normal.exp0.shapetable tessdata
> move /Y unicharset tessdata
> move /Y patentesar.normal.exp0.unicharset tessdata
>
> @pause
> combine_tessdata tessdata/patentesar.normal.exp0.
> @pause
> copy tessdata\patentesar.normal.exp0.traineddata "%TESSDATA_PREFIX%"\tessdata"
> @pause
> tesseract patentesar.normal.exp0.tif output -l patentesar.normal.exp0
> type output.txt
>
> Best regards and thank you,
>
> Andres
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/CALk3cjShXCkVdOz87_Oyscxy-qTVrZuwc1cUm%3DBy1MKH1hQfQg%40mail.gmail.com.
> For more options, visit https://groups.google.com/d/optout.

