Dear Oleg Tikhonov,

Thanks, your answer was very helpful. Can I ask a few more questions?
Do the currently supported languages need training? And do you know anything about the spec? For example: supported CPUs, CPU usage (MIPS), ROM usage (generally, DB size), RAM usage, input file format, output file format, minimum text size in pixels, etc. Again, any details you can share would help a lot.

Have a good day!

2011/7/19 Oleg Tikhonov <[email protected]>

> Hello Junil,
>
> >>1.
> Tesseract is shipped with the following list of trained languages:
>
> - Arabic
> - Bulgarian
> - Catalan
> - Czech
> - Chinese simplified
> - Chinese traditional
> - Danish
> - German
> - Greek
> - English
> - Finnish
> - French
> - Hebrew
> - Hindi
> - Croatian
> - Hungarian
> - Indonesian
> - Italian
> - Japanese
> - Korean
> - Latvian
> - Lithuanian
> - Dutch
> - Norwegian
> - And even more
>
> >>2. Well described here:
> http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
>
> >>3. I am not sure what you mean by tesseract spec, however it has the
> following utilities:
>
> tesseract - extracts text or characters from the image.
>
> cntraining - generates a normproto and pffmtable. Reads in a text file
> consisting of feature samples from a training page in the following format:
> FontName CharName NumberOfFeatureTypes(N). It then appends these samples
> into a separate file for each character.
>
> combine_tessdata - creates a unified traineddata file from the different
> files produced by the training process.
>
> Usage / Description
>
> language_data_path_prefix (e.g. tessdata/eng.)
> Combines all individual tessdata components (unicharset, DAWGs,
> classifier templates, ambiguities, language configs). The result will be a
> combined tessdata file lang_code.traineddata
>
> -e
> Extracts individual components from a combined trained data file.
> For instance, combine_tessdata -e tessdata/ell.traineddata
>
> -o
> Overwrites individual components of the given lang_code.traineddata file.
> Example:
> combine_tessdata -o tessdata/ell.traineddata
>
> -u
> Unpacks all the components to the specified path. For instance,
> combine_tessdata -u tessdata/ell.traineddata /home/$USER/temp/ell
>
> mftraining - Separates training pages into files for each character.
> Strips from the files only the features and their parameters of the feature
> type mf. Reads in a text file consisting of feature samples from a training
> page in the following format: FontName CharName NumberOfFeatureTypes(N). The
> result is a binary file used by the OCR engine.
>
> unicharset_extractor - Extracts a character/ligature set. Given a list of
> box files on the command line, generates a file containing a unicharset, a
> list of all the characters. The file contains the size of the set on the
> first line, and then one unichar per line.
> Usage: unicharset_extractor [-D DIRECTORY] FILE...
>
> wordlist2dawg - Generates a DAWG from a word list file. Given a file that
> contains a list of words (one word per line), generates the corresponding
> squished DAWG file.
> Usage: wordlist2dawg [-t | -l min_len max_len] word_list_file dawg_file
> unicharset_file
>
> It also has a C++ API for integration with your software; it is located
> under ../api and is called baseapi.
>
> Hope it helps. Anyway, before doing anything, please read the Tesseract
> README and wiki.
>
> Best regards,
> Oleg
>
> On Mon, Jul 18, 2011 at 10:22 AM, 준일 최 <[email protected]> wrote:
>
>> Hi, my name is Junil.
>> I am going to try developing with Tesseract.
>> I have a few questions about Tesseract.
>>
>> 1. Supported languages.
>> 2. How to add a language.
>> 3. Approximate specification of Tesseract.
>>
>> Answers would be appreciated.
>>
>> --
>> You received this message because you are subscribed to the Google
>> Groups "tesseract-ocr" group.
>> To post to this group, send email to [email protected]
>> To unsubscribe from this group, send email to
>> [email protected]
>> For more options, visit this group at
>> http://groups.google.com/group/tesseract-ocr?hl=en
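
For reference, the unicharset layout mentioned in the quoted unicharset_extractor description (set size on the first line, then one unichar per line) could be read with something like the Python sketch below. This is only an illustration under that stated layout, not Tesseract's own parser: the function name read_unicharset and the sample data are made up, and the assumption that any extra fields after the first space on a line can be ignored is mine.

```python
import io


def read_unicharset(stream):
    """Return the unichars listed in a unicharset-style text stream.

    Assumed layout (taken from the description in the thread, not from
    Tesseract's source): the first line is the number of entries, and
    each following line starts with one unichar. Anything after the
    first space on a line is ignored in this sketch.
    """
    declared = int(stream.readline().strip())
    unichars = []
    for line in stream:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        unichars.append(line.split(" ", 1)[0])
    if len(unichars) != declared:
        raise ValueError(
            f"header says {declared} entries, found {len(unichars)}"
        )
    return unichars


# Hypothetical sample: three entries, the last one a ligature.
sample = io.StringIO("3\na\nb\nfi\n")
chars = read_unicharset(sample)
print(chars)
```

To try it on a real file, open the file with encoding="utf-8" and pass the file object in place of the StringIO sample.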

