[tesseract-ocr] Help with training

Emil Julius Mon, 08 Dec 2014 10:46:43 -0800

Hey, so I'm training for the adaptive classifier (the default engine in 
Tesseract), but I'm having a bit of trouble with this, as the documentation 
is very fragmented and/or missing.


I'm training on a very small data set to start with; I thought I would just 
start out using Arial Black until I gather more data on my subject.
I would like to recognize labels on, say, cosmetics (in Danish), where the 
text is just a list of comma-separated words. I only need to recognize very 
specific words, in particular:

smør,
ost,
yoghurt,
ymer,
ylette,
fløde,
milkshake,
laktose,
mælkesukker,
animalsk fedtstof,
animalsk olie,
smørolie,
bagermargarine,
margarine,
minarine,
risbagemel,
inddampet mælk,
mælkebestanddele,
mælketørstof,
tørmælk,
mælkepulver,
skummetmælkspulver,
sødmælkspulver,
mælkeprotein,
lactalbumin,
kasein,
kaseinat,
calciumkaseinat,
kaliumkaseinat,
natriumkaseinat,
valle,
valleprotein,
vallepulver,
mælk,

And the same words starting with a capital letter (example: "Vallepulver").
But I keep having trouble figuring out a proper config file for this type 
of morphology. I thought that I should probably use the DAWG system, as 
accuracy and speed are very important.

So far I took the following steps:

   - Used jTessBoxEditor to generate a .box file
   - Converted the .box file to a .tr file with *tesseract imagefile 
   filename.exp0.box nobatch box.train*
   - Extracted the unicharset with *unicharset_extractor 
   filename.exp0.box*
   - Created a font properties file with the following content: *arial 1 0 0 0 0*
   - Clustered the character features with *mftraining* and *cntraining*
   - Renamed all the files to my chosen language name
   - Created a wordlist containing the above list
   - Converted the wordlist to a lang.words.dawg with *wordlist2dawg*
   - And finally combined the data with *combine_tessdata lang.*
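
For reference, the wordlist step could be sketched roughly as below. This is 
only an illustration, not part of the Tesseract tooling: the function names 
and the output path are hypothetical, and splitting multi-word entries such 
as "animalsk fedtstof" into single words is an assumption on my part. It 
writes one word per line in UTF-8, with both the lower-case and the 
capitalised form of each word.

```python
# Hypothetical helpers for building the wordlist; names are illustrative only.

# Shortened sample of the comma-separated list from this message.
RAW_LIST = "smør, ost, yoghurt, animalsk fedtstof, valle, mælk,"


def build_wordlist(raw):
    """Return each word in lower-case and capitalised form, without duplicates."""
    entries = [entry.strip() for entry in raw.split(",") if entry.strip()]
    words = []
    seen = set()
    for entry in entries:
        # Assumption: multi-word entries are split into single words.
        for word in entry.split():
            for form in (word, word[:1].upper() + word[1:]):
                if form not in seen:
                    seen.add(form)
                    words.append(form)
    return words


def write_wordlist(words, path):
    """Write one word per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(words) + "\n")


if __name__ == "__main__":
    # "lang.wordlist.txt" is a placeholder file name.
    write_wordlist(build_wordlist(RAW_LIST), "lang.wordlist.txt")
```

The resulting text file is what I would then feed to *wordlist2dawg*.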

But I'm still experiencing very inaccurate results (I'm using Scan Tailor to 
preprocess the images before feeding them to Tesseract). Here's the image 
(in .tif format) that I'm currently testing Tesseract on:


<https://lh5.googleusercontent.com/-1-94iBkVFoc/VIXw3xMMrbI/AAAAAAAAA9Q/bBSlqTMNERE/s1600/milk_1L.tif>

The system is only supposed to recognize words from the above list (the 
only match between the list and the image would therefore be "milk").
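
To make the goal concrete, here is a minimal sketch of the behaviour I am 
after, written as a filter applied to the recognised text after OCR. It is 
not Tesseract's own dictionary mechanism, and the function name and the 
sample text are made up for illustration. A token is kept only if it appears 
in the list either in lower case or with a capital first letter.

```python
import re


def match_allowed(text, allowed):
    """Return the tokens of `text` that are in `allowed`.

    `allowed` holds the lower-case words; a token also matches when only
    its first letter is upper-case (e.g. "Vallepulver").
    """
    matches = []
    # \w+ is Unicode-aware in Python 3, so æ, ø and å count as letters.
    for token in re.findall(r"\w+", text):
        decapitalised = token[:1].lower() + token[1:]
        if token in allowed or decapitalised in allowed:
            matches.append(token)
    return matches


if __name__ == "__main__":
    allowed_words = {"mælk", "valle", "kasein"}
    # Made-up sample text standing in for OCR output.
    print(match_allowed("Ingredienser: Mælk, sukker, valle", allowed_words))
```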

Any suggestions as to what I could be doing wrong or could improve 
(especially in my nonexistent config) would be very much appreciated, as I 
have been struggling with this for quite a while now.

Sincerely, a desperate fellow nerd.



