[tesseract-ocr] I have a question about making a traineddata (tesseract 4.0 LSTM)

이경준 Wed, 28 Feb 2018 20:02:25 -0800

Hi,

I have a question about making a traineddata file (Tesseract 4.0 LSTM).

The tutorial guide to lstmtraining has a section "Creating Starter Traineddata"
<https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#creating-starter-traineddata>:

NOTE: This is a new step!

Instead of a unicharset and script_dir, lstmtraining now takes a traineddata
file on its command-line, to obtain all the information it needs on the language
to be learned. The traineddata *must* contain at least an lstm-unicharset
and lstm-recoder component, and may also contain the three dawg files:
lstm-punc-dawg, lstm-word-dawg, lstm-number-dawg. A config file is also
optional. The other components, if present, will be ignored and unused.

There is no tool to create the lstm-recoder directly. Instead there is a
new tool, combine_lang_model, which takes as input an input_unicharset and
script_dir (script_dir points to the langdata directory) and optional word
list files. It creates the lstm-recoder from the input_unicharset and
creates all the dawgs, if wordlists are provided, putting everything
together into a traineddata file.
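
The combine_lang_model step described in the quoted passage might be invoked
roughly as follows. This is only a sketch: it builds the command line without
running it, the paths and the language code "eng" are placeholders, the helper
name build_combine_lang_model_cmd is made up for illustration, and the flags
other than input_unicharset and script_dir (--output_dir, --lang, --words,
--puncs, --numbers) are assumed to match the tool's documented options.

```python
import shlex


def build_combine_lang_model_cmd(input_unicharset, script_dir, output_dir, lang,
                                 words=None, puncs=None, numbers=None):
    """Assemble a combine_lang_model argument list (hypothetical helper).

    The command is only built, not executed.
    """
    cmd = [
        "combine_lang_model",
        "--input_unicharset", input_unicharset,  # unicharset to build the recoder from
        "--script_dir", script_dir,              # points to the langdata directory
        "--output_dir", output_dir,              # assumed flag: where the traineddata goes
        "--lang", lang,                          # assumed flag: language code
    ]
    # Word lists are optional; dawgs are only created when they are provided.
    for flag, path in (("--words", words), ("--puncs", puncs), ("--numbers", numbers)):
        if path:
            cmd += [flag, path]
    return cmd


# Placeholder paths and language code, for illustration only.
cmd = build_combine_lang_model_cmd("data/unicharset", "../langdata", "data", "eng")
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Passing words, puncs or numbers appends the corresponding word-list arguments;
leaving them out gives a command with the unicharset and langdata inputs only.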

In the passage quoted above, I could not find how to create the
'lstm-unicharset', so I have no idea how to proceed.

I also have a second question, about this note from
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00:

NOTE Tesseract 4.00 will now run happily with a traineddata file that
contains *just* lang.lstm, lang.lstm-unicharset and lang.lstm-recoder. The
lstm-*-dawgs are optional, and *none of the other components are required
or used with OEM_LSTM_ONLY as the OCR engine mode.* No bigrams, unichar
ambigs or any of the other components are needed or even have any effect if
present. The only other component that does anything is the lang.config,
which can affect layout analysis, and sub-languages.

If added to an existing Tesseract traineddata file, the lstm-unicharset doesn't
have to match the Tesseract unicharset, but the same unicharset must be
used to train the LSTM and build the lstm-*-dawgs files.

According to the note at the end of this wiki page, a traineddata file is
composed of lang.lstm, lang.lstm-unicharset and lang.lstm-recoder (mandatory).

But the earlier "Creating Starter Traineddata" passage says that a traineddata
file is composed of lstm-unicharset and lstm-recoder (mandatory).

Which statement is right?
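
To make the discrepancy concrete, the mandatory components named in the two
quoted passages can be written out as sets and compared. This is just a small
sketch: the variable names are mine, and "lstm" stands for the lang.lstm
component with the language prefix dropped.

```python
# Mandatory components as listed in the two quoted wiki passages.
starter_passage = {"lstm-unicharset", "lstm-recoder"}
note_passage = {"lstm", "lstm-unicharset", "lstm-recoder"}

# Components named as mandatory in one passage but not in the other.
only_in_note = sorted(note_passage - starter_passage)
only_in_starter = sorted(starter_passage - note_passage)

print("only in the NOTE passage:", only_in_note)
print("only in the starter passage:", only_in_starter)
```

So the two lists differ only in the lstm component itself.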

Please help me.
