Re: [tesseract-ocr] unicharset_extractor extracting zero values

ShreeDevi Kumar Mon, 19 Jun 2017 04:59:37 -0700

Do you have the Common and Latin unicharset files in your langdata directory?

See https://github.com/tesseract-ocr/langdata


Try to build the latest 3.05.01 version.
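
For anyone who wants to check this on their own machine, here is a minimal Python sketch that reports which of the two files are missing from a langdata checkout. The file names Common.unicharset and Latin.unicharset and the helper name missing_script_files are assumptions for illustration, so adjust them to whatever your copy of the repository actually contains.

```python
from pathlib import Path
import sys

# Assumed file names: the message only says "the common and latin unicharset".
EXPECTED_FILES = ("Common.unicharset", "Latin.unicharset")


def missing_script_files(langdata_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in langdata_dir."""
    root = Path(langdata_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    # Directory to inspect; defaults to ./langdata (a hypothetical location).
    target = sys.argv[1] if len(sys.argv) > 1 else "langdata"
    missing = missing_script_files(target)
    if missing:
        print("Missing from", target + ":", ", ".join(missing))
    else:
        print("All expected script unicharset files found in", target)
```

As far as I know, in the 3.x training flow these files are read by the set_unicharset_properties step (through its --script_dir option), which fills in the properties that unicharset_extractor leaves at their defaults.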

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, Jun 19, 2017 at 3:23 PM, David Barishev <[email protected]> wrote:

> Hello all!
> I'm trying to train tesseract to recognize a new font in English
> (supercell-magic).
> I have created a .tif file and a matching .box file using jTessBoxEditor
> (eng.supercell-magic.exp0.tif and eng.supercell-magic.exp0.box), and
> trained tesseract with them.
>
> Here is tesseract's output:
> $ tesseract eng.supercell-magic.exp0.tif eng.supercell-magic.exp0 box.train
> Tesseract Open Source OCR Engine v3.04.01 with Leptonica
> Page 1
> row xheight=30, but median xheight = 37.5455
> APPLY_BOXES:
>    Boxes read from boxfile:    1559
>    Found 1559 good blobs.
> Generated training data for 34 words
> Page 2
> APPLY_BOXES:
>    Boxes read from boxfile:    1677
>    Found 1677 good blobs.
> Generated training data for 34 words
> Page 3
> APPLY_BOXES:
>    Boxes read from boxfile:    1362
>    Found 1362 good blobs.
> Generated training data for 28 words
>
>
> The next step is to extract the characters using unicharset_extractor.
> Its output looked normal:
> $ unicharset_extractor eng.supercell-magic.exp0.box
> Extracting unicharset from eng.supercell-magic.exp0.box
> Wrote unicharset file ./unicharset.
>
> But when I view the file, the values are mostly 0 and 255, unlike the
> example in the wiki
> <https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract#an-example-of-the-unicharset-file>:
> An example of the unicharset file
>
> 110
> NULL 0 NULL 0
> N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N
> Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y
> 1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1
> 9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9
> a 3 58,65,186,198,85,164,0,26,97,185 Latin 56 0 5 a
> ...
>
>
> Mine looks more like this:
>
> 74
> NULL 0 NULL 0
> Joined 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0   # Joined [4a 6f 69 6e 65 64 ]
> |Broken|0|1 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0      # Broken
> t 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0        # t [74 ]
> h 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0        # h [68 ]
> a 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0        # a [61 ]
> n 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0        # n [6e ]
> P 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0        # P [50 ]
> o 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0        # o [6f ]
> e 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0        # e [65 ]
> : 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0        # : [3a ]
> r 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0        # r [72 ]
> l 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0        # l [6c ]
> i 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0        # i [69 ]
> 1 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0        # 1 [31 ]
> N 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0        # N [4e ]
>
> Why is that? Thanks in advance.
>
> I'm using Ubuntu 16.04 with this tesseract version:
>
> tesseract 3.04.01
>  leptonica-1.73
>   libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0
>
> I have attached the box and tiff files, the data file, and the unicharset
> file.
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/cd052525-9eb7-4527-b75b-82e1a687997d%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>
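
The difference between the two listings quoted above can also be checked mechanically. Below is a rough Python sketch that flags entries whose metrics and script are still at the values shown in the quoted output (0,255,0,255,0,0,0,0,0,0 and NULL). The function name unset_entries is hypothetical, and the field layout is inferred only from the lines quoted in this thread, so treat it as an illustration and not as a full unicharset parser.

```python
# Placeholder metrics string exactly as it appears in the quoted unicharset.
DEFAULT_METRICS = "0,255,0,255,0,0,0,0,0,0"


def unset_entries(lines):
    """Return the unichars whose metrics and script look unpopulated.

    `lines` is the unicharset file as a list of lines; the first line is
    the entry count and is skipped.
    """
    flagged = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) < 4:
            continue  # blank or truncated line
        # Only the first four fields are used; trailing "# ..." notes are ignored.
        unichar, _properties, metrics, script = fields[:4]
        if metrics == DEFAULT_METRICS and script == "NULL":
            flagged.append(unichar)
    return flagged


if __name__ == "__main__":
    sample = [
        "4",
        "NULL 0 NULL 0",
        "t 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0        # t [74 ]",
        "N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N",
        "1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1",
    ]
    print(unset_entries(sample))
```

If every entry other than the leading NULL line is flagged, the file has only been through the extraction step and its properties have not been filled in yet.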

