Hi! I've been trying to train tesseract, and after a hard day getting all the dependencies downloaded and compiled I have managed to get part of the way through the training documentation.
I'm using Ubuntu 14.04 LTS and I've downloaded and compiled leptonica-1.70. After compiling and installing tesseract and tesseract-training, I ended up creating a shell script...

---- Start of file (called "commands.sh")...
#!/bin/bash

# Get a copy of Tesseract src code...
# svn checkout http://tesseract-ocr.googlecode.com/svn/trunk/ tesseract-ocr-read-only
#
# Make a folder, let's call it 'training_text'
# mkdir training_text
# cd training_text
#
# Create a '1.txt' file containing the training text. (Try the Gutenberg project).
# Copy 'font_properties' from tesseract-ocr-read-only/training/langdata...
# cp ../tesseract-ocr-read-only/training/langdata/font_properties .
#
# Run this commands file...
# commands.sh

# Remove any previously generated files (you will get errors
# if this is the first time you run this, but it's OK)...
rm eng.FreeSans.exp0.box
rm eng.FreeSans.exp0.tif
rm eng.FreeSans.exp0.tr
rm eng.FreeSans.exp0.txt
rm shapetable
rm unicharset
rm unicharset.out

# Try to generate them again...
text2image --text=1.txt -outputbase=eng.FreeSans.exp0 --font='FreeSans' --fonts_dir=/usr/share/fonts/truetype/freefont
tesseract eng.FreeSans.exp0.tif eng.FreeSans.exp0 box.train
unicharset_extractor eng.FreeSans.exp0.box
set_unicharset_properties -U unicharset -O unicharset.out --script_dir=../tesseract-ocr-read-only/training/langdata
shapeclustering -F font_properties -U unicharset eng.FreeSans.exp0.tr
#shapeclustering -F font_properties -U unicharset.out eng.FreeSans.exp0.tr
mftraining -F font_properties -U unicharset -O eng.FreeSans.exp0.tr
#mftraining -F font_properties -U unicharset.out -O eng.FreeSans.exp0.tr
#cntraining eng.FreeSans.exp0.tr
---- End of file

Once I get down to shapeclustering, I can't tell from the documentation which unicharset file to use: the first one produced (by unicharset_extractor) or the one produced by the 'set_unicharset_properties' command.
Either way, mftraining usually fails. Sometimes a second attempt at running shapeclustering and mftraining outside of this shell file works, but almost every time I get the following error...

---- Start of Error (mftraining)
Error: Illegal malloc request size!
"Fatal error encountered!" == NULL:Error:Assert failed:in file globaloc.cpp, line 75
./commands.sh: line 40: 20958 Segmentation fault (core dumped) mftraining -F font_properties -U unicharset -O eng.FreeSans.exp0.tr
---- End of Error

And even worse, the cntraining command doesn't work at all...

---- Start of Error (cntraining)
Error: Illegal short name for a feature!
"Fatal error encountered!" == NULL:Error:Assert failed:in file globaloc.cpp, line 75
Segmentation fault (core dumped)
---- End of Error

What am I doing wrong? Any help would be appreciated.

Also, I think adding this kind of shell script (or equivalent) to a 'fast start' for training could be useful.

Rob

