Hello Junil,

>>1.
Tesseract ships with trained data for the following languages:
- Arabic
- Bulgarian
- Catalan
- Czech
- Chinese simplified
- Chinese traditional
- Danish
- German
- Greek
- English
- Finnish
- French
- Hebrew
- Hindi
- Croatian
- Hungarian
- Indonesian
- Italian
- Japanese
- Korean
- Latvian
- Lithuanian
- Dutch
- Norwegian
- and more

>>2.
This is well described here: http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3

>>3.
I am not sure what you mean by "tesseract spec", but it comes with the following utilities:

tesseract – Extracts text or characters from an image.

cntraining – Generates a normproto and pffmtable. Reads in a text file consisting of feature samples from a training page in the following format: FontName CharName NumberOfFeatureTypes(N). It then appends these samples into a separate file for each character.

combine_tessdata – Creates a unified traineddata file from the different files produced by the training process. Usage:

  language_data_path_prefix (e.g. tessdata/eng.)
      Combines all individual tessdata components (unicharset, DAWGs, classifier templates, ambiguities, language configs). The result is a combined tessdata file, lang_code.traineddata.
  -e
      Extracts individual components from a combined traineddata file. For instance: combine_tessdata -e tessdata/ell.traineddata
  -o
      Overwrites individual components of the given lang_code.traineddata file. Example: combine_tessdata -o tessdata/ell.traineddata
  -u
      Unpacks all the components to the specified path. For instance: combine_tessdata -u tessdata/ell.traineddata /home/$USER/temp/ell

mftraining – Separates training pages into files for each character. Strips from the files only the features, and their parameters, of the feature type mf. Reads in a text file consisting of feature samples from a training page in the following format: FontName CharName NumberOfFeatureTypes(N). The result is a binary file used by the OCR engine.

unicharset_extractor – Extracts a character/ligature set.
Given a list of box files on the command line, it generates a unicharset file, which is a list of all the characters. The file contains the size of the set on the first line, and then one unichar per line.
  Usage: unicharset_extractor [-D DIRECTORY] FILE...

wordlist2dawg – Generates a DAWG from a word list file. Given a file that contains a list of words (one word per line), it generates the corresponding squished DAWG file.
  Usage: wordlist2dawg [-t | -l min_len max_len] word_list_file dawg_file unicharset_file

It also has a C++ API for integrating with your software; it is located under ../api and is called baseapi.

Hope it helps. Anyway, before doing anything, please read the Tesseract README and wiki.

Best regards,
Oleg

On Mon, Jul 18, 2011 at 10:22 AM, 준일 최 <[email protected]> wrote:
> Hi. my name is junil.
> Should try to develop using tesseract.
> There are a few questions for tesseract.
>
> 1. support language.
> 2. How to add language.
> 3. approximate specification of tesseract.
>
> Answers would be appreciated.
>
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
>
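
For illustration only: the unicharset layout described above (the size of the set on the first line, then one unichar per line) could be read with a small Python sketch like the one below. The function name parse_unicharset and the sample data are hypothetical, not part of Tesseract. Real unicharset files may carry extra per-character fields after the unichar; this sketch assumes they are whitespace-separated and simply ignores them.

```python
from typing import List


def parse_unicharset(text: str) -> List[str]:
    """Return the unichars listed in a unicharset-style text.

    Assumed layout (from the description above): the set size on the
    first line, then one unichar per line. Anything after the first
    whitespace on an entry line is ignored.
    """
    lines = text.splitlines()
    if not lines:
        raise ValueError("empty unicharset text")
    expected = int(lines[0].strip())
    # Skip blank lines, keep only the first token of each entry line.
    entries = [line for line in lines[1:] if line.strip()]
    unichars = [line.split()[0] for line in entries]
    if len(unichars) != expected:
        raise ValueError(
            f"header says {expected} entries, found {len(unichars)}"
        )
    return unichars


# Hypothetical sample: three entries, one of them a ligature.
sample = "3\na\nb\nfi\n"
print(parse_unicharset(sample))
```

The count check is only a sanity test that the header and the body agree; it does not validate any other property of the file.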

