Re: [tesseract-ocr] traineddata file size too small, error clue ?

2017-06-15 Thread Andres
Thank you very much for your answer, Shree.

One strange thing is that it prints things like "Generated training data for 
67 words", but my words_list file has just 36 words (one for each 
alphabetic symbol and one for each numeric symbol). Could it be because I 
have the same words repeated in frequent_words_list, so there are 72 words in total?
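
To check the arithmetic on the two lists, a minimal sketch along these lines could count total, distinct, and shared entries. The function names are hypothetical, and the sketch assumes one word per line in each file. It only inspects the lists themselves; it does not establish whether the count in the log is related to them.

```python
from pathlib import Path


def load_words(path):
    """Return the non-empty, stripped lines of a word list file."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def compare_word_lists(words, frequent_words):
    """Count total entries, distinct entries, and entries present in both lists."""
    words_set, frequent_set = set(words), set(frequent_words)
    return {
        "total": len(words) + len(frequent_words),
        "distinct": len(words_set | frequent_set),
        "shared": len(words_set & frequent_set),
    }
```

For example, `compare_word_lists(load_words("words_list"), load_words("frequent_words_list"))` would report a total of 72 and 36 distinct entries if both files hold the same 36 single-symbol words.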

--

On Thursday, June 15, 2017, 0:31:27 (UTC-3), shree wrote:
>
> Traineddata size will depend on many things, not just the number of images.
>
> If your unicharset and number of fonts haven't changed, then the size may be 
> similar.
>
> The traineddata file also has the wordlists in it, so if you are using a 
> smaller wordlist compared to the one in the original eng.traineddata, the size 
> may be smaller.
>
> You can also try the latest version from 
> https://github.com/UB-Mannheim/tesseract/wiki
>
> ShreeDevi
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>

Re: [tesseract-ocr] traineddata file size too small, error clue ?

2017-06-14 Thread ShreeDevi Kumar
Traineddata size will depend on many things, not just the number of images.

If your unicharset and number of fonts haven't changed, then the size may be
similar.

The traineddata file also has the wordlists in it, so if you are using a
smaller wordlist compared to the one in the original eng.traineddata, the size
may be smaller.
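
One way to see which parts account for a size difference is to compare the component files of the two traineddata files. The sketch below assumes each traineddata file has already been unpacked into files named `<prefix>.<component>` in some directory (for instance with `combine_tessdata -u`, if your build includes it). The function names and prefixes are hypothetical.

```python
from pathlib import Path


def component_sizes(directory, prefix):
    """Map each component suffix (e.g. 'unicharset') to its size in bytes."""
    sizes = {}
    for path in sorted(Path(directory).glob(prefix + ".*")):
        if path.is_file():
            # Strip "<prefix>." so only the component name remains.
            sizes[path.name[len(prefix) + 1:]] = path.stat().st_size
    return sizes


def size_diff(old, new):
    """Return {component: (old_size, new_size)} for components whose size differs."""
    return {
        name: (old.get(name, 0), new.get(name, 0))
        for name in sorted(set(old) | set(new))
        if old.get(name, 0) != new.get(name, 0)
    }
```

Calling `size_diff(component_sizes(d, "old"), component_sizes(d, "new"))` lists only the components that grew or shrank, which should show whether the wordlists or the shape data explain the gap.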

You can also try the latest version from
https://github.com/UB-Mannheim/tesseract/wiki

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


[tesseract-ocr] traineddata file size too small, error clue ?

2017-06-14 Thread Andres
Dear all,

I've been training Tesseract with a multipage TIFF file with 5 pages and
approx. 12000 boxes.

Now I have increased the number of samples in the TIFF file: it has 12 pages
and 29241 boxes.

My concern is that my previous traineddata file is 321817 bytes and
the new one is 318022 bytes. I don't know whether it should be bigger, as I
have no idea about the file format, but I downloaded one version
of eng.traineddata from the Tesseract repository and its size is
21876572 bytes. Could it be that it is only using the results
of the first page? I can see in the log that, at least at the beginning of the
process, it is processing all the pages.

I am using Tesseract 3.02 on Windows.

I will paste my log here and, below it, the batch file that I use
for training.

Log:

A:\training>tesseract.exe patentesar.normal.exp0.tif patentesar.normal.exp0 nobatch box.train.stderr
Tesseract Open Source OCR Engine v3.02 with Leptonica
Page 1 of 12
row xheight=88.6667, but median xheight = 59.6
row xheight=81.8333, but median xheight = 59.6
row xheight=75, but median xheight = 59.6
row xheight=71.1875, but median xheight = 59.6
row xheight=71.1875, but median xheight = 59.6
row xheight=71.1875, but median xheight = 59.6
row xheight=68.5333, but median xheight = 59.6
row xheight=67., but median xheight = 59.6
APPLY_BOXES:
   Boxes read from boxfile:1671
   Found 1671 good blobs.
TRAINING ... Font name = normal
Generated training data for 52 words
Page 2 of 12
APPLY_BOXES:
   Boxes read from boxfile:2003
   Found 2003 good blobs.
Generated training data for 58 words
Page 3 of 12
FAIL!
APPLY_BOXES: boxfile line 358/0 ((383,4901),(428,4980)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 529/D ((146,4401),(187,4480)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
   Boxes read from boxfile:2128
   Boxes failed resegmentation:   2
   Found 2126 good blobs.
Generated training data for 60 words
Page 4 of 12
APPLY_BOXES:
   Boxes read from boxfile:2257
   Found 2257 good blobs.
Generated training data for 62 words
Page 5 of 12
APPLY_BOXES:
   Boxes read from boxfile:2381
   Found 2381 good blobs.
Generated training data for 64 words
Page 6 of 12
FAIL!
APPLY_BOXES: boxfile line 2070/D ((2141,967),(2182,1037)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
   Boxes read from boxfile:2460
   Boxes failed resegmentation:   1
   Found 2459 good blobs.
Generated training data for 65 words
Page 7 of 12
FAIL!
APPLY_BOXES: boxfile line 2082/B ((867,1084),(910,1151)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
   Boxes read from boxfile:2568
   Boxes failed resegmentation:   1
   Found 2567 good blobs.
Generated training data for 67 words
Page 8 of 12
APPLY_BOXES:
   Boxes read from boxfile:2680
   Found 2680 good blobs.
Generated training data for 68 words
Page 9 of 12
FAIL!
APPLY_BOXES: boxfile line 2391/D ((1184,910),(1220,973)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
   Boxes read from boxfile:2818
   Boxes failed resegmentation:   1
   Found 2817 good blobs.
Generated training data for 70 words
Page 10 of 12
FAIL!
APPLY_BOXES: boxfile line 1248/0 ((1468,3440),(1502,3501)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 2211/0 ((342,1491),(382,1550)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
   Boxes read from boxfile:3000
   Boxes failed resegmentation:   2
   Found 2998 good blobs.
Generated training data for 73 words
Page 11 of 12
FAIL!
APPLY_BOXES: boxfile line 1280/6 ((2054,3645),(2087,3702)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 2750/0 ((496,1051),(528,1105)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 3098/D ((2229,530),(2254,583)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 3347/Q ((1167,90),(1197,142)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
   Boxes read from boxfile:3370
   Boxes failed resegmentation:   4
   Found 3366 good blobs.
Generated training data for 77 words
Page 12 of 12
row xheight=28.6667, but median xheight = 33.5161
row xheight=28.0889, but median xheight = 33.5161
row xheight=27.1, but median xheight = 33.5161
row xheight=29, but median xheight = 33.5161
row xheight=29, but median xheight = 33.5161
row xheight=29, but median xheight = 33.5161
FAIL!
APPLY_BOXES: boxfile line 0/P ((20,5928),(52,5980)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 1/7 ((73,5928),(89,5980)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 2/4 ((110,5928),(141,5980)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 3/1 ((162,5928),(189,5980)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 44/M ((20,5855),(48,5907)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 45/M ((69,5855),(96,5907)): FAILURE! Couldn't find a matching blob
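
To check whether every page contributes boxes, a log like the one above can be summarised per page. This is a minimal sketch that assumes the log was saved to a text file and that its lines follow the patterns visible above; the function name `summarize` and the field names are hypothetical.

```python
import re

PAGE = re.compile(r"Page (\d+) of (\d+)")
FIELDS = {
    "boxes_read": re.compile(r"Boxes read from boxfile:\s*(\d+)"),
    "boxes_failed": re.compile(r"Boxes failed resegmentation:\s*(\d+)"),
    "good_blobs": re.compile(r"Found (\d+) good blobs"),
    "words": re.compile(r"Generated training data for (\d+) words"),
}


def summarize(log_text):
    """Return one dict per 'Page N of M' section with the counts found in it."""
    pages = []
    current = None
    for line in log_text.splitlines():
        match = PAGE.search(line)
        if match:
            # A page with no "failed resegmentation" line had zero failures.
            current = {"page": int(match.group(1)), "boxes_failed": 0}
            pages.append(current)
            continue
        if current is None:
            continue  # Skip anything printed before the first page header.
        for key, pattern in FIELDS.items():
            match = pattern.search(line)
            if match:
                current[key] = int(match.group(1))
    return pages
```

Summing `boxes_read` over the returned pages gives a figure to compare with the number of boxes in the box file. For the eleven complete pages shown in the log above, the "Boxes read from boxfile" values add up to 27336 of the 29241 boxes mentioned.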