[tesseract-ocr] What is difference between "unicharset file" and "lstm-unicharset file"

이경준 Thu, 01 Mar 2018 16:53:37 -0800

Hi, and thank you for reading my questions.

1. What is the difference between 'unicharset' and 'lstm-unicharset'?


I know how to make a 'unicharset' with the command line "$ tesseract 
(lang).(filename).exp(num).tif (lang).(filename).exp(num).box".

But I don't know how to make an 'lstm-unicharset'.

For comparison, to go from .tr to .lstmf I changed the command line

"$ tesseract (lang).(filename).exp(num).tif 
(lang).(filename).exp(num) nobatch *box.train*"

to

"$ tesseract (lang).(filename).exp(num).tif 
(lang).(filename).exp(num) nobatch *lstm.train*"

2. Is this usage right?

Can the same approach be applied to get from 'unicharset' to 'lstm-unicharset'?
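
To make the comparison concrete, here is a minimal sketch of the two invocations I am comparing. It only builds the command lines and the file name I expect each one to write; it does not run Tesseract. The helper name `training_command`, the table `TRAINING_OUTPUT_EXT`, and the example values "eng" and "arial" are my own placeholders, and keeping "nobatch" in the lstm.train command is exactly the part I am unsure about.

```python
# Sketch only: builds the command lines from this post, does not run Tesseract.
# `training_command` and `TRAINING_OUTPUT_EXT` are hypothetical names.

# Training config -> extension of the training file it produces
# (.tr for the old trainer, .lstmf for LSTM, as stated in the wiki passage).
TRAINING_OUTPUT_EXT = {"box.train": ".tr", "lstm.train": ".lstmf"}


def training_command(lang, fontname, num, config):
    """Return (argv, expected_output_file) for one training page."""
    if config not in TRAINING_OUTPUT_EXT:
        raise ValueError(f"unknown training config: {config}")
    base = f"{lang}.{fontname}.exp{num}"
    # Mirrors the commands quoted in this post, including 'nobatch';
    # whether that is correct for lstm.train is part of my question.
    argv = ["tesseract", f"{base}.tif", base, "nobatch", config]
    return argv, base + TRAINING_OUTPUT_EXT[config]


if __name__ == "__main__":
    for config in ("box.train", "lstm.train"):
        argv, output = training_command("eng", "arial", 0, config)
        print(" ".join(argv), "->", output)
```

The only difference between the two commands is the last argument, which is why I am asking whether 'unicharset' and 'lstm-unicharset' are related in the same simple way.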



3. In the GitHub wiki passage:

Overview of Training Process

The overall training process is similar to training 3.04 
<https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract>.

Conceptually the same:

   1. Prepare training text. 
   <https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951>
   2. Render text to image + box file. (Or create hand-made box files for 
   existing image data.)
   3. Make unicharset file. (Can be partially specified, ie created 
   manually).
   4. Make a starter traineddata from the unicharset and optional 
   dictionary data. 
   <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#creating-starter-traineddata>
   5. Run tesseract to process image + box file to make training data set.
   6. Run training on training data set.
   7. Combine data files.

The key differences are:

   - The boxes only need to be at the *textline level.* It is thus *far 
   easier* to make training data from existing image data.
   - The .tr files are replaced by .lstmf data files.
   - Fonts *can and should be mixed freely* instead of being separate.
   - The clustering steps (mftraining, cntraining, shapeclustering) are 
   replaced with a single slow lstmtraining step.


I think one more sentence should be added to the "key differences" section: 
"'unicharset' is replaced by 'lstm-unicharset'".

Am I wrong?



I look forward to everyone's answers.

Thank you. Have a nice day!

