Please see inline replies:

On Sun, May 28, 2017 at 4:53 PM, Akira Hayakawa <[email protected]> wrote:
> I am new to tesseract. My aim is to use this software to analyze Japanese
> doc. The idea in my mind is to start from existing model and fine-tune it
> by new words that weren't correctly recognized.
>
> I am reading the Wiki and have some questions.
>
> 1)
>
> In https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Finetune
>
> you add training_text to tesstrain.sh
>
> training/tesstrain.sh \
>> --fonts_dir /usr/share/fonts \
>> --training_text ../langdata/ara/ara.training_text \
>> --langdata_dir ../langdata \
>> --tessdata_dir ./tessdata \
>> --lang ara \
>> --linedata_only \
>> --noextract_font_properties \
>> --exposures "0" \
>> --fontlist "Arial" \
>> --output_dir ~/tesstutorial/aratest
>
> but
>
> In https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>
> You don't. Why?
>
> training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
> --noextract_font_properties --langdata_dir ../langdata \
> --tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain
>
> My understanding is
>
> 1. tesstrain.sh uses text2image command internally to generate images
> which are in various fonts and reshaped.
> 2. --linedata_only splits the training text into lines and makes images for
> each line.
> 3. langdata_dir is essential but training_text isn't. If training_text
> isn't found, it uses the default $lang/$lang.training_text.
>
> Am I correct?
>

Yes, you are correct.

> 2)
>
> In the above example, I couldn't have an idea why it should take
> --tessdata_dir because it seems irrelevant to making training data.
>

tesseract needs the eng and osd traineddata files during initialization. The location can also be specified via the TESSDATA_PREFIX environment variable.

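If initialization fails, it may be worth checking that both files are really present in the directory you pass as --tessdata_dir. A minimal sketch, assuming the files use the usual eng.traineddata and osd.traineddata names; missing_traineddata is a hypothetical helper written for this reply, not something shipped with tesseract:

```shell
# Hypothetical helper (not part of tesseract): print the name of each
# traineddata file needed at initialization that is missing from a directory.
missing_traineddata() {
    dir="$1"
    for lang in eng osd; do
        # Assumption: files are named <lang>.traineddata directly inside $dir.
        if [ ! -f "$dir/$lang.traineddata" ]; then
            echo "$lang.traineddata"
        fi
    done
}
```

Running `missing_traineddata ./tessdata` from the tesseract directory prints nothing when both files are in place; otherwise it lists the ones you still need to download.
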
> 3)
>
> In https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Finetune
>
> It says the reader should place each project like this
>
> ./langdata
>> ./langdata/eng
>> ./langdata/ara
>> ./tessdata
>> ./tesseract
>> ./tesseract/tessdata
>> ./tesseract/tessdata/configs/
>> ./tesseract/training
>> etc
>

That will be the directory structure if you were to clone the tesseract, langdata and tessdata repositories. It is not recommended to clone the whole tessdata repo (over 1 GB); you can download the traineddata files for just the languages you need.

> and all the following examples are run under tesseract directory. Then I
> think the examples should take ../tessdata as --tessdata_dir but
> ./tessdata. I mean the examples should be fixed.
>

./tessdata (in the tesseract repo) does not have any traineddata files to begin with. You can change the directories to match your directory configuration.

> 4)
>
> In https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00---Finetune
>
> combine_tessdata -e ../tessdata/ara.traineddata \
>> ~/tesstutorial/aratuned_from_ara/ara.lstm
>
> This is explained as it extracts the existing LSTM model for Arabic from
> tessdata but how come?
> The combine_tessdata command extracts the LSTM model because the extension of
> the second parameter is .lstm?
>

Yes.

> Another question here is why LSTM model is mixed in the traineddata? I
> think the traineddata file mixes legacy trained model and LSTM model and I
> am wondering why they aren't separated? Even if the user only uses LSTM
> both trained model are read? (is it light-weight? then it might be ok)
>

The 4.0 code is in the alpha stage of testing and supports both the legacy engine and the new LSTM engine, so the traineddata file contains both models. You can use combine_tessdata to keep only the LSTM model in the traineddata.

