Re: [tesseract-ocr] unicharset_extractor extracting zero values

David Barishev Mon, 19 Jun 2017 14:02:55 -0700

Hey, I am now trying to build Tesseract from source. After building 
Leptonica, the Tesseract build fails with this error:


/bin/bash ../libtool  --tag=CXX   --mode=link g++  -g -O2 -std=c++11   -o tesseract tesseract-tesseractmain.o libtesseract.la  -lrt -lpthread 
libtool: link: g++ -g -O2 -std=c++11 -o .libs/tesseract tesseract-tesseractmain.o  ./.libs/libtesseract.so -lrt -lpthread
/usr/bin/ld: tesseract-tesseractmain.o: undefined reference to symbol 'lept_free'
//usr/local/lib/liblept.so.5: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status
Makefile:598: recipe for target 'tesseract' failed
make[2]: *** [tesseract] Error 1
make[2]: Leaving directory '/home/david/project/tesseract-3.05.01/api'
Makefile:489: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/home/david/project/tesseract-3.05.01'
Makefile:398: recipe for target 'all' failed
make: *** [all] Error 2


Any idea why?


On Monday, June 19, 2017 at 6:58:57 PM UTC+3, shree wrote:
>
> I would also suggest that you add spaces between words in your input text.
>
> ShreeDevi
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Mon, Jun 19, 2017 at 9:19 PM, ShreeDevi Kumar <[email protected]> wrote:
>
>> You could also try running the training on your Windows PC with 3.05.01, 
>> using tesstrain.sh under `git for windows`, which provides a shell 
>> for running bash scripts.
>>
>> ShreeDevi
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Mon, Jun 19, 2017 at 9:05 PM, ShreeDevi Kumar <[email protected]> wrote:
>>
>>> Where do you keep your source files for the English langdata?
>>>
>>> If they are in a directory such as ../langdata/eng/, 
>>> then put common.unicharset, latin.unicharset, font_properties, etc. 
>>> in ../langdata
>>>
>>>
>>>
>>> ShreeDevi
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>> On Mon, Jun 19, 2017 at 8:34 PM, David Barishev <[email protected]> wrote:
>>>
>>>> Thanks for the reply.
>>>> If you mean the latin and common unicharset files in the tessdata 
>>>> directory (/usr/share/tesseract-ocr/tessdata): I have downloaded them and 
>>>> placed them in that directory, and I still get the same behavior.
>>>> I have also tried this on my Windows machine, which has version 3.05, 
>>>> and saw the same behavior.
>>>>
>>>> On Monday, June 19, 2017 at 2:58:40 PM UTC+3, shree wrote:
>>>>>
>>>>> Do you have the common and latin unicharset files in your langdata directory?
>>>>>
>>>>> See https://github.com/tesseract-ocr/langdata
>>>>>
>>>>> Try to build the latest 3.05.01 version.
>>>>>
>>>>> ShreeDevi
>>>>> ____________________________________________________________
>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>
>>>>> On Mon, Jun 19, 2017 at 3:23 PM, David Barishev <[email protected]> 
>>>>> wrote:
>>>>>
>>>>>> Hello all!
>>>>>> I'm trying to train Tesseract to recognize a new font in English 
>>>>>> (supercell-magic).
>>>>>> I created a .tif file and a matching .box file using 
>>>>>> jTessBoxEditor (eng.supercell-magic.exp0.tif and 
>>>>>> eng.supercell-magic.exp0.box) and trained Tesseract with them.
>>>>>>
>>>>>> Here is Tesseract's output:
>>>>>> $ tesseract eng.supercell-magic.exp0.tif eng.supercell-magic.exp0 box.train
>>>>>> Tesseract Open Source OCR Engine v3.04.01 with Leptonica
>>>>>> Page 1
>>>>>> row xheight=30, but median xheight = 37.5455
>>>>>> APPLY_BOXES:
>>>>>>    Boxes read from boxfile:    1559
>>>>>>    Found 1559 good blobs.
>>>>>> Generated training data for 34 words
>>>>>> Page 2
>>>>>> APPLY_BOXES:
>>>>>>    Boxes read from boxfile:    1677
>>>>>>    Found 1677 good blobs.
>>>>>> Generated training data for 34 words
>>>>>> Page 3
>>>>>> APPLY_BOXES:
>>>>>>    Boxes read from boxfile:    1362
>>>>>>    Found 1362 good blobs.
>>>>>> Generated training data for 28 words
>>>>>>
>>>>>>
>>>>>> The next step is to extract the characters 
>>>>>> using unicharset_extractor.
>>>>>> Its output looked normal:
>>>>>> $ unicharset_extractor eng.supercell-magic.exp0.box
>>>>>> Extracting unicharset from eng.supercell-magic.exp0.box
>>>>>> Wrote unicharset file ./unicharset.
>>>>>>
>>>>>> But when I view the file, the values are mostly 0 and 255, unlike 
>>>>>> the example in the wiki 
>>>>>> <https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract#an-example-of-the-unicharset-file>:
>>>>>>
>>>>>> An example of the unicharset file
>>>>>>
>>>>>> 110
>>>>>> NULL 0 NULL 0
>>>>>> N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N
>>>>>> Y 5 59,68,216,255,91,205,0,47,91,223 Latin 33 0 2 Y
>>>>>> 1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1
>>>>>> 9 8 18,66,203,255,89,156,0,39,104,173 Common 4 2 4 9
>>>>>> a 3 58,65,186,198,85,164,0,26,97,185 Latin 56 0 5 a
>>>>>> ...
>>>>>>
>>>>>>
>>>>>> Mine looks more like this:
>>>>>>
>>>>>> 74
>>>>>> NULL 0 NULL 0
>>>>>> Joined 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0      # Joined [4a 6f 69 6e 65 64 ]
>>>>>> |Broken|0|1 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0         # Broken
>>>>>> t 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0   # t [74 ]
>>>>>> h 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0   # h [68 ]
>>>>>> a 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0   # a [61 ]
>>>>>> n 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0   # n [6e ]
>>>>>> P 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0   # P [50 ]
>>>>>> o 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0   # o [6f ]
>>>>>> e 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0   # e [65 ]
>>>>>> : 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0   # : [3a ]
>>>>>> r 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0   # r [72 ]
>>>>>> l 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0   # l [6c ]
>>>>>> i 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0   # i [69 ]
>>>>>> 1 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0   # 1 [31 ]
>>>>>> N 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0   # N [4e ]
>>>>>>
>>>>>> Why is that? Thanks in advance.
>>>>>>
>>>>>> I'm using Ubuntu 16.04 with this Tesseract version:
>>>>>>
>>>>>> tesseract 3.04.01
>>>>>>  leptonica-1.73
>>>>>>   libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0
>>>>>>
>>>>>> I have attached the box file, the tiff file, the data file, and the 
>>>>>> unicharset file.
>>>>>>
>>>>>> -- 
>>>>>> You received this message because you are subscribed to the Google 
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>> send an email to [email protected].
>>>>>> To post to this group, send email to [email protected].
>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>> To view this discussion on the web visit 
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/cd052525-9eb7-4527-b75b-82e1a687997d%40googlegroups.com
>>>>>>  
>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/cd052525-9eb7-4527-b75b-82e1a687997d%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>> .
>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>
>>>>>
>>>
>>>
>>
>
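
For anyone comparing their own unicharset against the wiki example quoted 
above, here is a minimal Python sketch that lists the entries still carrying 
the all-default values (properties 0, stats 0,255,0,255,0,0,0,0,0,0, script 
NULL). As far as I understand the training wiki, unicharset_extractor alone 
writes only such defaults and a later property-setting step fills them in 
from the script unicharsets in langdata, which would explain the question 
about the latin and common files; please treat that as an assumption. The 
function name placeholder_entries and the SAMPLE text are made up for 
illustration; the sample lines are copied from the two listings above.

```python
# Stats string written for entries whose properties have not been set yet.
PLACEHOLDER_STATS = "0,255,0,255,0,0,0,0,0,0"

# Illustrative sample built from lines quoted earlier in this thread.
SAMPLE = """\
5
NULL 0 NULL 0
t 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0\t# t [74 ]
h 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0\t# h [68 ]
N 5 59,68,216,255,87,236,0,27,104,227 Latin 11 0 1 N
1 8 59,69,203,255,45,128,0,66,74,173 Common 3 2 3 1
"""


def placeholder_entries(unicharset_text):
    """Return the characters whose first four fields are all defaults."""
    flagged = []
    lines = unicharset_text.splitlines()
    for line in lines[1:]:  # the first line holds the entry count
        fields = line.split()
        if len(fields) < 4:
            continue  # too short to carry properties, stats, and script
        char, props, stats, script = fields[:4]
        if props == "0" and stats == PLACEHOLDER_STATS and script == "NULL":
            flagged.append(char)
    return flagged


print(placeholder_entries(SAMPLE))  # prints ['t', 'h']
```

To check a real file, read it as UTF-8 text and pass the contents to 
placeholder_entries. The leading "NULL 0 NULL 0" line is skipped on purpose, 
because its third field is not a stats string.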

