An additional question about this command:

combine_tessdata -u kor.traineddata

What does the "-u" mean? I cannot find that option (flag) on the wiki or the GitHub page. Could you give me an explanation?

On Wednesday, February 28, 2018 at 4:21:17 PM UTC+9, 이경준 wrote:
>
> Hi, I'm studying this passage, but I cannot understand what the flag
> "--noextract_font_properties" means, so I looked at the file
> /tesseract/training/tesstrain.sh
>
> But I cannot find "--noextract_font_properties" there.
>
> Here is the usage:
>
> # USAGE:
> #
> # tesstrain.sh
> #    --fontlist FONTS           # A list of fontnames to train on.
> #    --fonts_dir FONTS_PATH     # Path to font files.
> #    --lang LANG_CODE           # ISO 639 code.
> #    --langdata_dir DATADIR     # Path to tesseract/training/langdata directory.
> #    --output_dir OUTPUTDIR     # Location of output traineddata file.
> #    --overwrite                # Safe to overwrite files in output_dir.
> #    --linedata_only            # Only generate training data for lstmtraining.
> #    --run_shape_clustering     # Run shape clustering (use for Indic langs).
> #    --exposures EXPOSURES      # A list of exposure levels to use (e.g. "-1 0 1").
> #
> # OPTIONAL flags for input data. If unspecified we will look for them in
> # the langdata_dir directory.
> #    --training_text TEXTFILE   # Text to render and use for training.
> #    --wordlist WORDFILE        # Word list for the language ordered by
> #                               # decreasing frequency.
> #
> # OPTIONAL flag to specify location of existing traineddata files, required
> # during feature extraction. If unspecified will use TESSDATA_PREFIX defined in
> # the current environment.
> #    --tessdata_dir TESSDATADIR # Path to tesseract/tessdata directory.
> #
> # NOTE:
> # The font names specified in --fontlist need to be recognizable by Pango using
> # fontconfig. An easy way to list the canonical names of all fonts available on
> # your system is to run text2image with --list_available_fonts and the
> # appropriate --fonts_dir path.
>
> Using tesstrain
>
> The setup for running tesstrain.sh
> <https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-%E2%80%93-tesstrain.sh>
> is the same as for base Tesseract. Use the --linedata_only option for LSTM
> training. Note that it is beneficial to have more training text and make
> more pages though, as neural nets don't generalize as well and need to
> train on something similar to what they will be running on. If the target
> domain is severely limited, then all the dire warnings about needing a lot
> of training data may not apply, but the network specification may need to
> be changed.
>
> Training data is created using tesstrain.sh
> <https://github.com/tesseract-ocr/tesseract/blob/master/training/tesstrain.sh>
> as follows. Note that your fonts location may vary.
>
> training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
>   --noextract_font_properties --langdata_dir ../langdata \
>   --tessdata_dir ./tessdata --output_dir ~/tesstutorial/engtrain
>
> Thank you very much. I would welcome a reply from anybody.

