Re: [tesseract-ocr] unicharset_extractor extracting zero values

ShreeDevi Kumar Mon, 19 Jun 2017 08:59:18 -0700

I would also suggest that you add spaces between the words in your input text.

ShreeDevi
____________________________________________________________
Bhajan - Kirtan - Aarti @ http://bhajans.ramparivar.com
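
A rough way to see why spacing might matter is to compare the box and word counts in the box.train log quoted further down. This is only a sketch: the log text is copied (abridged) from the original post, and the helper name boxes_per_word is made up for illustration. Reading a high ratio as "the text has almost no spaces" is an assumption, not something the log states.

```python
import re

# Log text copied (abridged) from the box.train run quoted later in this thread.
LOG = """\
Page 1
APPLY_BOXES:
   Boxes read from boxfile:    1559
   Found 1559 good blobs.
Generated training data for 34 words
Page 2
APPLY_BOXES:
   Boxes read from boxfile:    1677
   Found 1677 good blobs.
Generated training data for 34 words
Page 3
APPLY_BOXES:
   Boxes read from boxfile:    1362
   Found 1362 good blobs.
Generated training data for 28 words
"""


def boxes_per_word(log):
    """Return one (boxes, words, boxes / words) tuple per page in the log."""
    boxes = [int(n) for n in re.findall(r"Boxes read from boxfile:\s+(\d+)", log)]
    words = [int(n) for n in re.findall(r"Generated training data for (\d+) words", log)]
    return [(b, w, b / w) for b, w in zip(boxes, words)]


for page, (b, w, ratio) in enumerate(boxes_per_word(LOG), start=1):
    print(f"page {page}: {b} boxes / {w} words = {ratio:.1f} boxes per word")
```

Each page comes out at roughly 46 to 49 boxes per "word", far more than ordinary English words would give, which fits the suggestion to add spaces.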


On Mon, Jun 19, 2017 at 9:19 PM, ShreeDevi Kumar <[email protected]>
wrote:

> You could also try running the training on your Windows PC with 3.05.01
> using tesstrain.sh. Git for Windows provides a shell that can run bash
> scripts.
>
> ShreeDevi
>
> On Mon, Jun 19, 2017 at 9:05 PM, ShreeDevi Kumar <[email protected]>
> wrote:
>
>> Where do you keep your source files for the English langdata?
>>
>> If they are in a directory such as ../langdata/eng/, then put
>> common.unicharset, latin.unicharset, font_properties, etc. in
>> ../langdata
>>
>>
>>
>> ShreeDevi
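
The layout described in this message can be checked with a few lines of Python. This is a minimal sketch: the file names are taken exactly as written above (the names and capitalisation in a real langdata checkout may differ), "../langdata" is only the example path from the message, and missing_shared_files is a hypothetical helper name.

```python
from pathlib import Path

# File names exactly as written in the message above; the names (and their
# capitalisation) in a real langdata checkout may differ, so adjust as needed.
SHARED_FILES = ["common.unicharset", "latin.unicharset", "font_properties"]


def missing_shared_files(langdata_dir, names=SHARED_FILES):
    """Return the names that are not regular files directly inside langdata_dir."""
    root = Path(langdata_dir)
    return [name for name in names if not (root / name).is_file()]


if __name__ == "__main__":
    # "../langdata" is the example path from the message; change it to your own.
    missing = missing_shared_files("../langdata")
    if missing:
        print("missing from ../langdata:", ", ".join(missing))
    else:
        print("all shared files found in ../langdata")
```

The check looks only at the top level of the directory, since the message says the shared files belong in ../langdata itself rather than in ../langdata/eng/.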
>>
>> On Mon, Jun 19, 2017 at 8:34 PM, David Barishev <[email protected]>
>> wrote:
>>
>>> Thanks for the reply.
>>> If you are asking whether I have the latin and common unicharset files
>>> in the tessdata directory (/usr/share/tesseract-ocr/tessdata): I
>>> downloaded them and placed them there, and I still get the same behavior.
>>> I also tried it on my Windows machine, which has version 3.05, and saw
>>> the same behavior.
>>>
>>> On Monday, June 19, 2017 at 2:58:40 PM UTC+3, shree wrote:
>>>>
>>>> Do you have the common and latin unicharset files in your langdata
>>>> directory?
>>>>
>>>> See https://github.com/tesseract-ocr/langdata
>>>>
>>>> Try to build the latest 3.05.01 version.
>>>>
>>>> ShreeDevi
>>>>
>>>> On Mon, Jun 19, 2017 at 3:23 PM, David Barishev <[email protected]>
>>>> wrote:
>>>>
>>>>> Hello all!
>>>>> I'm trying to train tesseract to recognize a new font in English
>>>>> (supercell-magic).
>>>>> I created a .tif file and a matching .box file with jTessBoxEditor
>>>>> (eng.supercell-magic.exp0.tif and eng.supercell-magic.exp0.box)
>>>>> and trained tesseract with them.
>>>>>
>>>>> Here is tesseract's output:
>>>>> $ tesseract eng.supercell-magic.exp0.tif eng.supercell-magic.exp0 box.train
>>>>> Tesseract Open Source OCR Engine v3.04.01 with Leptonica
>>>>> Page 1
>>>>> row xheight=30, but median xheight = 37.5455
>>>>> APPLY_BOXES:
>>>>>    Boxes read from boxfile:    1559
>>>>>    Found 1559 good blobs.
>>>>> Generated training data for 34 words
>>>>> Page 2
>>>>> APPLY_BOXES:
>>>>>    Boxes read from boxfile:    1677
>>>>>    Found 1677 good blobs.
>>>>> Generated training data for 34 words
>>>>> Page 3
>>>>> APPLY_BOXES:
>>>>>    Boxes read from boxfile:    1362
>>>>>    Found 1362 good blobs.
>>>>> Generated training data for 28 words
>>>>>
>>>>>
>>>>> The next step is to extract the characters with unicharset_extractor.
>>>>> Its output looked normal:
>>>>> $ unicharset_extractor eng.supercell-magic.exp0.box
>>>>> Extracting unicharset from eng.supercell-magic.exp0.box
>>>>> Wrote unicharset file ./unicharset.
>>>>>
>>>>> But when I view the file, the values are almost all 0 and 255, which
>>>>> does not look like the example in the wiki
>>>>> <https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract#an-example-of-the-unicharset-file>:
>>>>> An example of the unicharset file
>>>>>
>>>>> 110
>>>>> NULL 0 NULL 0
>>>>> N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N
>>>>> Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y
>>>>> 1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1
>>>>> 9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9
>>>>> a 3 58,65,186,198,85,164,0,26,97,185 Latin 56 0 5 a
>>>>> ...
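
For reading the second column of the wiki example: the Training-Tesseract wiki describes it as a hexadecimal bit mask of character properties (alpha, lower, upper, digit, punctuation). The sketch below decodes it under that description; decode_properties and FLAGS are names invented here, not part of any Tesseract tool.

```python
# Bit values as described on the Training-Tesseract wiki page; the field is
# written in hexadecimal.
FLAGS = [(0x01, "alpha"), (0x02, "lower"), (0x04, "upper"),
         (0x08, "digit"), (0x10, "punctuation")]


def decode_properties(field):
    """Turn the second column of a unicharset line into a list of flag names."""
    mask = int(field, 16)
    return [name for bit, name in FLAGS if mask & bit]


# Three lines copied from the wiki example quoted above.
for line in ["N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N",
             "1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1",
             "a 3 58,65,186,198,85,164,0,26,97,185 Latin 56 0 5 a"]:
    char, props = line.split()[:2]
    print(char, props, decode_properties(props))
```

Read this way, "N 5" is an uppercase letter, "1 8" a digit, and "a 3" a lowercase letter, while a value of 0 carries no property flags at all.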
>>>>>
>>>>>
>>>>> Mine looks more like this:
>>>>>
>>>>> 74
>>>>> NULL 0 NULL 0
>>>>> Joined 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0       # Joined [4a 6f 69 6e 65 64 ]
>>>>> |Broken|0|1 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # Broken
>>>>> t 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0    # t [74 ]
>>>>> h 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0    # h [68 ]
>>>>> a 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0    # a [61 ]
>>>>> n 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0    # n [6e ]
>>>>> P 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0    # P [50 ]
>>>>> o 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0    # o [6f ]
>>>>> e 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0    # e [65 ]
>>>>> : 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0    # : [3a ]
>>>>> r 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0    # r [72 ]
>>>>> l 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0    # l [6c ]
>>>>> i 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0    # i [69 ]
>>>>> 1 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0    # 1 [31 ]
>>>>> N 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0    # N [4e ]
>>>>>
>>>>> Why is that? Thanks in advance.
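
To make the difference between the two listings concrete, here is a small sketch that flags entries carrying only default-looking values. The rule used (property field 0, metrics exactly 0,255,0,255,0,0,0,0,0,0, script NULL) is an assumption drawn from the listing above, not an official definition, and placeholder_entries is a hypothetical helper. The sample text is hand-assembled from lines quoted in this message plus one line from the wiki example.

```python
DEFAULT_METRICS = "0,255,0,255,0,0,0,0,0,0"

# Hand-assembled sample: three lines from the poster's file, one from the wiki.
SAMPLE = """\
5
NULL 0 NULL 0
Joined 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0       # Joined [4a 6f 69 6e 65 64 ]
t 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0    # t [74 ]
: 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0    # : [3a ]
N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N
"""


def placeholder_entries(unicharset_text):
    """Return the characters whose line still carries only default values."""
    flagged = []
    # The first line of the file is the entry count, so skip it.
    for line in unicharset_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 4 or "," not in fields[2]:
            continue  # e.g. "NULL 0 NULL 0", which has no metrics column
        char, props, metrics, script = fields[:4]
        if props == "0" and metrics == DEFAULT_METRICS and script == "NULL":
            flagged.append(char)
    return flagged


print(placeholder_entries(SAMPLE))
```

If the list is non-empty for nearly every character, one possible explanation (an assumption, not something confirmed in this thread) is that the file has not yet been through the later set_unicharset_properties step described on the same wiki page, which reads script data from the langdata directory.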
>>>>>
>>>>> I'm using Ubuntu 16.04 with this tesseract version:
>>>>>
>>>>> tesseract 3.04.01
>>>>>  leptonica-1.73
>>>>>   libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0
>>>>>
>>>>> I have attached the box file, the tiff file, the data file, and the
>>>>> unicharset file.
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To post to this group, send email to [email protected].
>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/cd052525-9eb7-4527-b75b-82e1a687997d%40googlegroups.com.
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>
>>>
>>
>>
>

