Re: [tesseract-ocr] traineddata file size too small, error clue ?
Thank you very much for your answer, Shree.

One strange thing is that it prints things like "Generated training data for 67 words", but my words_list file has just 36 words (one for each letter and one for each digit). Could it be because I have the same words repeated in frequent_words_list, so there are 72 words in total?

On Thursday, June 15, 2017, 0:31:27 (UTC-3), shree wrote:
> Traineddata size depends on many things, not just the number of images.
>
> If your unicharset and number of fonts haven't changed, then the size may be
> similar.
>
> The traineddata file also contains the wordlists, so if you are using a
> smaller wordlist than the one in the original eng.traineddata, the size
> may be smaller.
>
> You can also try the latest version from
> https://github.com/UB-Mannheim/tesseract/wiki
>
> ShreeDevi
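The 36-versus-72 question in the reply above can at least be checked on the word-list side by counting entries directly. The following is a minimal sketch in Python; the helper names `read_words` and `compare_lists` and the file paths in the usage note are hypothetical, and it assumes one word per line. Whether the "words" counted in the log are related to these lists at all is not established in this thread.

```python
from pathlib import Path


def read_words(path):
    """Return the non-empty, stripped lines of a one-word-per-line file."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def compare_lists(words, frequent):
    """Count the entries in each list, their sum, and the distinct entries."""
    return {
        "words": len(words),
        "frequent": len(frequent),
        "combined": len(words) + len(frequent),
        "distinct": len(set(words) | set(frequent)),
    }
```

For example, `compare_lists(read_words("words_list"), read_words("frequent_words_list"))` would report a combined count of 72 and a distinct count of 36 if both files hold the same 36 entries. Neither figure equals 67, so the numbers alone do not confirm the hypothesis.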
Re: [tesseract-ocr] traineddata file size too small, error clue ?
Traineddata size depends on many things, not just the number of images.

If your unicharset and number of fonts haven't changed, then the size may be similar.

The traineddata file also contains the wordlists, so if you are using a smaller wordlist than the one in the original eng.traineddata, the size may be smaller.

You can also try the latest version from https://github.com/UB-Mannheim/tesseract/wiki

ShreeDevi
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, Jun 14, 2017 at 11:39 PM, Andres wrote:
> Dear all,
>
> I've been training tesseract with a multipage tiff file with 5 pages and
> approx 12000 boxes.
>
> Now I increased the samples in the tiff file, I have 12 pages and 29241
> boxes.
>
> My concern is that my previous traineddata file size is 321817 bytes and
> the new one is 318022 bytes. I don't know if it should be bigger, as I have
> no idea about the file format, but I downloaded one version
> of eng.traineddata from the tesseract repository and I see that its size is
> 21876572 bytes. Could it be that perhaps it is computing just the results
> of the first page? I see in the log that at least, at the beginning of the
> process, it is processing all the pages.
>
> I am using Tesseract 3.02 on Windows.
[tesseract-ocr] traineddata file size too small, error clue ?
Dear all,

I've been training tesseract with a multipage tiff file with 5 pages and approx 12000 boxes.

Now I increased the samples in the tiff file, I have 12 pages and 29241 boxes.

My concern is that my previous traineddata file size is 321817 bytes and the new one is 318022 bytes. I don't know if it should be bigger, as I have no idea about the file format, but I downloaded one version of eng.traineddata from the tesseract repository and I see that its size is 21876572 bytes. Could it be that perhaps it is computing just the results of the first page? I see in the log that at least, at the beginning of the process, it is processing all the pages.

I am using Tesseract 3.02 on Windows.

I will paste my log here, and below that, my batch file, the one that I use for training.

Log:

A:\training>tesseract.exe patentesar.normal.exp0.tif patentesar.normal.exp0 nobatch box.train.stderr
Tesseract Open Source OCR Engine v3.02 with Leptonica
Page 1 of 12
row xheight=88.6667, but median xheight = 59.6
row xheight=81.8333, but median xheight = 59.6
row xheight=75, but median xheight = 59.6
row xheight=71.1875, but median xheight = 59.6
row xheight=71.1875, but median xheight = 59.6
row xheight=71.1875, but median xheight = 59.6
row xheight=68.5333, but median xheight = 59.6
row xheight=67., but median xheight = 59.6
APPLY_BOXES:
   Boxes read from boxfile:1671
   Found 1671 good blobs.
TRAINING ... Font name = normal
Generated training data for 52 words
Page 2 of 12
APPLY_BOXES:
   Boxes read from boxfile:2003
   Found 2003 good blobs.
Generated training data for 58 words
Page 3 of 12
FAIL!
APPLY_BOXES: boxfile line 358/0 ((383,4901),(428,4980)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 529/D ((146,4401),(187,4480)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
   Boxes read from boxfile:2128
   Boxes failed resegmentation: 2
   Found 2126 good blobs.
Generated training data for 60 words
Page 4 of 12
APPLY_BOXES:
   Boxes read from boxfile:2257
   Found 2257 good blobs.
Generated training data for 62 words
Page 5 of 12
APPLY_BOXES:
   Boxes read from boxfile:2381
   Found 2381 good blobs.
Generated training data for 64 words
Page 6 of 12
FAIL!
APPLY_BOXES: boxfile line 2070/D ((2141,967),(2182,1037)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
   Boxes read from boxfile:2460
   Boxes failed resegmentation: 1
   Found 2459 good blobs.
Generated training data for 65 words
Page 7 of 12
FAIL!
APPLY_BOXES: boxfile line 2082/B ((867,1084),(910,1151)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
   Boxes read from boxfile:2568
   Boxes failed resegmentation: 1
   Found 2567 good blobs.
Generated training data for 67 words
Page 8 of 12
APPLY_BOXES:
   Boxes read from boxfile:2680
   Found 2680 good blobs.
Generated training data for 68 words
Page 9 of 12
FAIL!
APPLY_BOXES: boxfile line 2391/D ((1184,910),(1220,973)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
   Boxes read from boxfile:2818
   Boxes failed resegmentation: 1
   Found 2817 good blobs.
Generated training data for 70 words
Page 10 of 12
FAIL!
APPLY_BOXES: boxfile line 1248/0 ((1468,3440),(1502,3501)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 2211/0 ((342,1491),(382,1550)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
   Boxes read from boxfile:3000
   Boxes failed resegmentation: 2
   Found 2998 good blobs.
Generated training data for 73 words
Page 11 of 12
FAIL!
APPLY_BOXES: boxfile line 1280/6 ((2054,3645),(2087,3702)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 2750/0 ((496,1051),(528,1105)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 3098/D ((2229,530),(2254,583)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 3347/Q ((1167,90),(1197,142)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
   Boxes read from boxfile:3370
   Boxes failed resegmentation: 4
   Found 3366 good blobs.
Generated training data for 77 words
Page 12 of 12
row xheight=28.6667, but median xheight = 33.5161
row xheight=28.0889, but median xheight = 33.5161
row xheight=27.1, but median xheight = 33.5161
row xheight=29, but median xheight = 33.5161
row xheight=29, but median xheight = 33.5161
row xheight=29, but median xheight = 33.5161
FAIL!
APPLY_BOXES: boxfile line 0/P ((20,5928),(52,5980)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 1/7 ((73,5928),(89,5980)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 2/4 ((110,5928),(141,5980)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 3/1 ((162,5928),(189,5980)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 44/M ((20,5855),(48,5907)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 45/M ((69,5855),(96,5907)): FAILURE! Couldn't find a matching blob
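
One way to check whether every page contributed boxes, rather than only the first, is to tally the per-page counters printed in a log like the one above. The following is a minimal sketch in Python, assuming the log text has been captured as a string. The function name `summarize_log` is hypothetical, and the patterns rely only on the message formats visible in this log ("Page N of M", "Boxes read from boxfile:", "Boxes failed resegmentation:", "Found N good blobs").

```python
import re

READ_RE = re.compile(r"Boxes read from boxfile:\s*(\d+)")
FAILED_RE = re.compile(r"Boxes failed resegmentation:\s*(\d+)")
GOOD_RE = re.compile(r"Found (\d+) good blobs")


def _first_int(pattern, text):
    """Return the first number matched by pattern in text, or 0 if absent."""
    match = pattern.search(text)
    return int(match.group(1)) if match else 0


def summarize_log(log_text):
    """Return one dict of box counters per 'Page N of M' section."""
    # re.split with one capturing group yields
    # [preamble, page_no, page_body, page_no, page_body, ...]
    parts = re.split(r"Page (\d+) of \d+", log_text)
    pages = []
    for i in range(1, len(parts), 2):
        body = parts[i + 1]
        pages.append({
            "page": int(parts[i]),
            "read": _first_int(READ_RE, body),
            "failed": _first_int(FAILED_RE, body),  # line is absent when 0
            "good": _first_int(GOOD_RE, body),
        })
    return pages


# Usage on a short excerpt (pages 2 and 3 of the log above).
excerpt = (
    "Page 2 of 12 APPLY_BOXES: Boxes read from boxfile:2003 "
    "Found 2003 good blobs. Generated training data for 58 words "
    "Page 3 of 12 FAIL! APPLY_BOXES: Boxes read from boxfile:2128 "
    "Boxes failed resegmentation: 2 Found 2126 good blobs."
)
for page in summarize_log(excerpt):
    print(page)
```

A page whose section is cut off before its summary lines, as page 12 is here, shows up with all counters at 0, so a zero row means "not seen in the log", not necessarily "no boxes".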