Thanks for the reply.
If you mean whether I have the Latin and Common unicharset files in the tessdata
directory (/usr/share/tesseract-ocr/tessdata): I have downloaded them and
placed them in that directory, and I am still getting the same behavior.
I have also tried it from my Windows machine, which has version 3.05,
and saw the same behavior.
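
To make the symptom concrete, here is a small Python sketch that lists the entries of a unicharset file that still carry the default values I am seeing. The field layout (character, properties, glyph metrics, script) is only inferred from the two listings quoted below, and the names DEFAULT_METRICS and find_unset_entries are my own, not part of Tesseract.

```python
# Default glyph-metrics string seen on every line of my generated unicharset.
DEFAULT_METRICS = "0,255,0,255,0,0,0,0,0,0"


def find_unset_entries(text):
    """Return the characters whose properties, metrics and script are all defaults.

    Assumes each entry line starts with: char, properties, metrics, script.
    Lines without a comma-separated metrics field (such as "NULL 0 NULL 0")
    are skipped.
    """
    unset = []
    for line in text.splitlines()[1:]:  # the first line is the entry count
        fields = line.split()
        if len(fields) < 4 or "," not in fields[2]:
            continue
        char, props, metrics, script = fields[:4]
        if props == "0" and metrics == DEFAULT_METRICS and script == "NULL":
            unset.append(char)
    return unset


# Two entries copied from this thread: one from my file, one from the wiki.
sample = """3
NULL 0 NULL 0
t 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # t [74 ]
N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N
"""

print(find_unset_entries(sample))
```

On my file every character would be reported. As far as I understand, newer 3.x releases fill in these fields in a separate step (the set_unicharset_properties tool, which reads the langdata script files), but that is an assumption on my part and not something I have confirmed yet.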

On Monday, June 19, 2017 at 2:58:40 PM UTC+3, shree wrote:
>
> Do you have the Common and Latin unicharset files in your langdata directory?
>
> See https://github.com/tesseract-ocr/langdata
>
> Try to build the latest 3.05.01 version.
>
> ShreeDevi
>
> On Mon, Jun 19, 2017 at 3:23 PM, David Barishev <[email protected]> wrote:
>
>> Hello all!
>> I'm trying to train Tesseract to recognize a new font in English
>> (supercell-magic).
>> I have created a .tif file and a matching .box file using jTessBoxEditor
>> (eng.supercell-magic.exp0.tif and eng.supercell-magic.exp0.box), and
>> trained Tesseract with them.
>>
>> Here is Tesseract's output:
>> $ tesseract eng.supercell-magic.exp0.tif eng.supercell-magic.exp0 box.train
>> Tesseract Open Source OCR Engine v3.04.01 with Leptonica
>> Page 1
>> row xheight=30, but median xheight = 37.5455
>> APPLY_BOXES:
>>    Boxes read from boxfile:    1559
>>    Found 1559 good blobs.
>> Generated training data for 34 words
>> Page 2
>> APPLY_BOXES:
>>    Boxes read from boxfile:    1677
>>    Found 1677 good blobs.
>> Generated training data for 34 words
>> Page 3
>> APPLY_BOXES:
>>    Boxes read from boxfile:    1362
>>    Found 1362 good blobs.
>> Generated training data for 28 words
>>
>>
>> The next step is to extract the characters using unicharset_extractor.
>> Its output looked normal:
>> $ unicharset_extractor eng.supercell-magic.exp0.box
>> Extracting unicharset from eng.supercell-magic.exp0.box
>> Wrote unicharset file ./unicharset.
>>
>> But when I view the file, the values are mostly 0 and 255, unlike the
>> example in the wiki
>> <https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract#an-example-of-the-unicharset-file>:
>>
>> An example of the unicharset file
>>
>> 110
>> NULL 0 NULL 0
>> N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N
>> Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y
>> 1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1
>> 9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9
>> a 3 58,65,186,198,85,164,0,26,97,185 Latin 56 0 5 a
>> ...
>>
>>
>> Mine looks more like this:
>>
>> 74
>> NULL 0 NULL 0
>> Joined 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # Joined [4a 6f 69 6e 65 64 ]
>> |Broken|0|1 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0     # Broken
>> t 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0       # t [74 ]
>> h 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0       # h [68 ]
>> a 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0       # a [61 ]
>> n 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0       # n [6e ]
>> P 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0       # P [50 ]
>> o 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0       # o [6f ]
>> e 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0       # e [65 ]
>> : 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0       # : [3a ]
>> r 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0       # r [72 ]
>> l 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0       # l [6c ]
>> i 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0       # i [69 ]
>> 1 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0       # 1 [31 ]
>> N 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0       # N [4e ]
>>
>> Why is that? Thanks in advance.
>>
>> I'm using Ubuntu 16.04 with this Tesseract version:
>>
>> tesseract 3.04.01
>>  leptonica-1.73
>>   libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0
>>
>> I have attached the box and TIFF files, the data file, and the unicharset
>> file.
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/cd052525-9eb7-4527-b75b-82e1a687997d%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
