Hi!
  I've been trying to train Tesseract. After a hard day getting all the 
dependencies downloaded and compiled, I managed to get part of the way 
through the training documentation.

  I'm using Ubuntu 14.04 LTS and I've downloaded and compiled leptonica-1.70.

  I ended up creating a shell script after compiling and installing 
tesseract and tesseract-training...

---- Start of file (called "commands.sh")...

#!/bin/bash

# Get a copy of Tesseract src code...
#   svn checkout http://tesseract-ocr.googlecode.com/svn/trunk/ tesseract-ocr-read-only
#
# Make a folder, let's call it 'training_text'
#   mkdir training_text
#   cd training_text
#
# Create a '1.txt' file containing the training text (try Project Gutenberg).
# Copy 'font_properties' from tesseract-ocr-read-only/training/langdata...
#   cp ../tesseract-ocr-read-only/training/langdata/font_properties .
#
# Run this commands file...
#   ./commands.sh

# Remove any previously generated files (-f keeps rm quiet
# if this is the first run and the files do not exist yet)...

rm -f eng.FreeSans.exp0.box
rm -f eng.FreeSans.exp0.tif
rm -f eng.FreeSans.exp0.tr
rm -f eng.FreeSans.exp0.txt
rm -f shapetable
rm -f unicharset
rm -f unicharset.out

# Try to generate them again...

text2image --text=1.txt --outputbase=eng.FreeSans.exp0 --font='FreeSans' \
  --fonts_dir=/usr/share/fonts/truetype/freefont

tesseract eng.FreeSans.exp0.tif eng.FreeSans.exp0 box.train

unicharset_extractor eng.FreeSans.exp0.box

set_unicharset_properties -U unicharset -O unicharset.out \
  --script_dir=../tesseract-ocr-read-only/training/langdata

shapeclustering -F font_properties -U unicharset eng.FreeSans.exp0.tr
#shapeclustering -F font_properties -U unicharset.out eng.FreeSans.exp0.tr

mftraining -F font_properties -U unicharset -O eng.FreeSans.exp0.tr
#mftraining -F font_properties -U unicharset.out -O eng.FreeSans.exp0.tr

#cntraining eng.FreeSans.exp0.tr

---- End of file

Once I get down to shapeclustering, I can't tell from the documentation 
which unicharset file to use: the first one produced (by 
unicharset_extractor) or the one produced by the 
'set_unicharset_properties' command.

Either way, mftraining usually fails. Sometimes a second attempt at 
running shapeclustering and mftraining outside of this shell file works, 
but almost every time I get the following error...

---- Start of Error (mftraining)

Error: Illegal malloc request size!
"Fatal error encountered!" == NULL:Error:Assert failed:in file globaloc.cpp, line 75
./commands.sh: line 40: 20958 Segmentation fault      (core dumped) mftraining -F font_properties -U unicharset -O eng.FreeSans.exp0.tr

---- End of Error

Even worse, the cntraining command doesn't work at all...

---- Start of Error (cntraining)

Error: Illegal short name for a feature!
"Fatal error encountered!" == NULL:Error:Assert failed:in file globaloc.cpp, line 75
Segmentation fault (core dumped)

---- End of Error
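
Because commands.sh carries on after a command fails, it is hard to tell 
which step breaks first. One way to narrow it down might be to wrap each 
command so the script reports the failing step and its exit status. A 
minimal sketch, where run_step is a hypothetical helper name of my own and 
not part of Tesseract:

```shell
#!/bin/bash

# run_step is a hypothetical helper, not part of Tesseract.
# It echoes the command, runs it, and on failure prints the exit status
# to stderr. It returns the command's own exit status.
run_step() {
  echo "+ $*"
  "$@"
  local status=$?
  if [ "$status" -ne 0 ]; then
    echo "FAILED (exit status $status): $*" >&2
  fi
  return "$status"
}

# Example use inside commands.sh: stop at the first failing step.
# run_step unicharset_extractor eng.FreeSans.exp0.box || exit 1
```

With every training command wrapped this way and followed by `|| exit 1`, 
the last "+ ..." line printed shows the step that failed.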

  What am I doing wrong?
  Any help would be appreciated. Also, I think adding this kind of shell 
script (or equivalent) to a 'fast start' for training could be useful.
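
As a rough idea of what such a 'fast start' script could begin with, here 
is a sketch of a pre-flight check for the tools and input files that 
commands.sh relies on. The helper name missing_prereqs is hypothetical; the 
tool and file names are just the ones used in my script above.

```shell
#!/bin/bash

# missing_prereqs is a hypothetical helper: it prints one line for each
# required tool or input file that cannot be found, and nothing otherwise.
missing_prereqs() {
  local tool file
  # Training tools used by commands.sh; each must be on the PATH.
  for tool in text2image tesseract unicharset_extractor \
              set_unicharset_properties shapeclustering mftraining cntraining; do
    command -v "$tool" >/dev/null 2>&1 || echo "missing tool: $tool"
  done
  # Input files expected in the current (training_text) folder.
  for file in 1.txt font_properties; do
    [ -f "$file" ] || echo "missing file: $file"
  done
}

# Example use at the top of commands.sh:
# problems=$(missing_prereqs)
# if [ -n "$problems" ]; then echo "$problems" >&2; exit 1; fi
```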

Rob
