Hi,

I have a question about creating a traineddata file (Tesseract 4.0 LSTM).

Tutorial Guide to lstmtraining: Creating Starter Traineddata
<https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#creating-starter-traineddata>

NOTE: This is a new step!

Instead of a unicharset and script_dir, lstmtraining now takes a traineddata
file on its command-line, to obtain all the information it needs on the
language to be learned. The traineddata *must* contain at least an
lstm-unicharset and lstm-recoder component, and may also contain the three
dawg files: lstm-punc-dawg, lstm-word-dawg, lstm-number-dawg. A config file
is also optional. The other components, if present, will be ignored and
unused.

There is no tool to create the lstm-recoder directly. Instead there is a
new tool, combine_lang_model, which takes as input an input_unicharset and
script_dir (script_dir points to the langdata directory) and optional word
list files. It creates the lstm-recoder from the input_unicharset and
creates all the dawgs, if wordlists are provided, putting everything
together into a traineddata file.
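
As far as I understand the passage above, the combine_lang_model call could
be put together roughly like this. It is only a sketch: the flag names are
the ones I believe the tool accepts (please check them against
`combine_lang_model --help`), and the helper name and all paths are made up
for illustration. The function only builds the argument list; it does not
run anything.

```python
import shlex
from pathlib import Path
from typing import List, Optional


def build_combine_lang_model_cmd(
    lang: str,
    input_unicharset: Path,
    script_dir: Path,
    output_dir: Path,
    words: Optional[Path] = None,
    puncs: Optional[Path] = None,
    numbers: Optional[Path] = None,
) -> List[str]:
    """Assemble the argument list for combine_lang_model (hypothetical helper)."""
    cmd = [
        "combine_lang_model",
        "--input_unicharset", str(input_unicharset),
        "--script_dir", str(script_dir),  # points to the langdata directory
        "--output_dir", str(output_dir),
        "--lang", lang,
    ]
    # The word lists are optional; only pass the ones that were supplied.
    for flag, path in (("--words", words), ("--puncs", puncs), ("--numbers", numbers)):
        if path is not None:
            cmd += [flag, str(path)]
    return cmd


if __name__ == "__main__":
    # All paths below are placeholders, not real files.
    cmd = build_combine_lang_model_cmd(
        lang="eng",
        input_unicharset=Path("langdata/eng/eng.unicharset"),
        script_dir=Path("langdata"),
        output_dir=Path("starter"),
        words=Path("langdata/eng/eng.wordlist"),
    )
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

If I read the wiki correctly, the resulting traineddata file should land
under the output directory, but I have not confirmed the exact location.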




In the passage above, I could not find how to create the 'lstm-unicharset',
so I have no idea how to proceed.


I also have a question about this passage from
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00


NOTE Tesseract 4.00 will now run happily with a traineddata file that 
contains *just* lang.lstm, lang.lstm-unicharset and lang.lstm-recoder. The 
lstm-*-dawgs are optional, and *none of the other components are required 
or used with OEM_LSTM_ONLY as the OCR engine mode.* No bigrams, unichar 
ambigs or any of the other components are needed or even have any effect if 
present. The only other component that does anything is the lang.config, 
which can affect layout analysis, and sub-languages.

If added to an existing Tesseract traineddata file, the lstm-unicharset doesn't 
have to match the Tesseract unicharset, but the same unicharset must be 
used to train the LSTM and build the lstm-*-dawgs files.




According to this wiki passage, a traineddata file is composed of
'lang.lstm, lang.lstm-unicharset, lang.lstm-recoder' (mandatory).

But the first passage, 'Creating Starter Traineddata', says that a
traineddata file is composed of 'lstm-unicharset, lstm-recoder' (mandatory).



Which sentence is right?

Please help me.


