Re: questions of tesseract

Oleg Tikhonov Mon, 18 Jul 2011 10:08:42 -0700

Hello Junil,

>>1.

Tesseract ships with trained data for the following languages:

   - Arabic
   - Bulgarian
   - Catalan
   - Czech
   - Chinese simplified
   - Chinese traditional
   - Danish
   - German
   - Greek
   - English
   - Finnish
   - French
   - Hebrew
   - Hindi
   - Croatian
   - Hungarian
   - Indonesian
   - Italian
   - Japanese
   - Korean
   - Latvian
   - Lithuanian
   - Dutch
   - Norwegian
   - and more

>>2. Adding a language is well described here:
http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3

>>3. I am not sure what you mean by the Tesseract spec, but it comes with the
following utilities:

tesseract – extracts text or characters from an image.

cntraining – generates a normproto and pffmtable. Reads in a text file
consisting of feature samples from a training page in the following format:
FontName CharName NumberOfFeatureTypes(N). It then appends these samples
into a separate file for each character.

combine_tessdata – creates a unified traineddata file from the different files
produced by the training process.

   Usage and description:

   language_data_path_prefix (e.g. tessdata/eng.)

      Combines all individual tessdata components (unicharset, DAWGs,
      classifier templates, ambiguities, language configs). The result is a
      combined tessdata file, lang_code.traineddata.

   -e

      Extracts individual components from a combined traineddata file. For
      instance: combine_tessdata -e tessdata/ell.traineddata

   -o

      Overwrites individual components of the given lang_code.traineddata
      file. For instance: combine_tessdata -o tessdata/ell.traineddata

   -u

      Unpacks all the components to the specified path. For instance:
      combine_tessdata -u tessdata/ell.traineddata /home/$USER/temp/ell
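
If you want to drive combine_tessdata from a script, here is a minimal Python
sketch. It only assembles the argument lists for the invocations quoted above;
the function name, the mode names and the pass-through of extra paths are my
own assumptions, not part of Tesseract, and the tool itself may expect more
arguments than this message shows.

```python
def combine_tessdata_argv(mode, target, *extra_paths):
    """Build an argument list for combine_tessdata.

    mode is a label used only by this sketch:
      "combine"   -> combine_tessdata <language_data_path_prefix>
      "extract"   -> combine_tessdata -e <traineddata>
      "overwrite" -> combine_tessdata -o <traineddata>
      "unpack"    -> combine_tessdata -u <traineddata> <output path>
    Any extra paths are appended unchanged.
    """
    flags = {"combine": [], "extract": ["-e"], "overwrite": ["-o"], "unpack": ["-u"]}
    if mode not in flags:
        raise ValueError(f"unknown mode: {mode}")
    if mode == "unpack" and not extra_paths:
        raise ValueError("unpack needs an output path")
    return ["combine_tessdata", *flags[mode], target, *extra_paths]
```

Passing the returned list to subprocess.run(argv, check=True) would run the
tool, assuming combine_tessdata is on your PATH.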

mftraining – separates training pages into files for each character. Extracts
from the files only the features of feature type mf and their parameters.
Reads in a text file consisting of feature samples from a training page in
the following format: FontName CharName NumberOfFeatureTypes(N). The result
is a binary file used by the OCR engine.

unicharset_extractor – extracts a character/ligature set. Given a list of
box files on the command line, it generates a file containing a unicharset, a
list of all the characters. The file contains the size of the set on the
first line, and then one unichar per line.
Usage: unicharset_extractor [-D DIRECTORY] FILE...
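
To illustrate the file layout just described (size on the first line, then one
unichar per line), here is a small Python sketch that reads such text. The
function name is hypothetical. I am also assuming that a line may carry extra
space-separated fields after the unichar, so the sketch keeps only the first
token of each line.

```python
def parse_unicharset(text):
    """Return the list of unichars from unicharset-style text.

    Assumes: first line is the set size, each following non-empty line
    starts with one unichar. Anything after the first space is ignored.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty unicharset text")
    size = int(lines[0].strip())
    unichars = [line.split(" ", 1)[0] for line in lines[1:]]
    if len(unichars) != size:
        raise ValueError(f"header says {size} entries, found {len(unichars)}")
    return unichars
```

For a real file you would pass in open(path, encoding="utf-8").read().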

wordlist2dawg – generates a DAWG from a word list file. Given a file that
contains a list of words (one word per line), it generates the corresponding
squished DAWG file.
Usage: wordlist2dawg [-t | -l min_len max_len] word_list_file dawg_file unicharset_file
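
In case the term is new: a DAWG (directed acyclic word graph) stores a word
list compactly by sharing common prefixes and common suffixes. The Python
sketch below shows the general idea only. It builds a plain trie and then
merges identical subtrees. It is not Tesseract's implementation or its file
format, and all names in it are made up for illustration.

```python
class Node:
    def __init__(self):
        self.edges = {}     # character -> child Node
        self.final = False  # True if a word ends at this node


def build_trie(words):
    """Insert every word into a trie, sharing common prefixes."""
    root = Node()
    for word in words:
        node = root
        for ch in word:
            if ch not in node.edges:
                node.edges[ch] = Node()
            node = node.edges[ch]
        node.final = True
    return root


def minimize(node, registry):
    """Merge equivalent subtrees bottom-up, which shares common suffixes.

    registry maps a node signature to its canonical Node and keeps the
    canonical nodes alive, so their id() values stay valid as keys.
    """
    for ch, child in list(node.edges.items()):
        node.edges[ch] = minimize(child, registry)
    signature = (node.final,
                 tuple(sorted((ch, id(child)) for ch, child in node.edges.items())))
    return registry.setdefault(signature, node)


def contains(root, word):
    """Follow the edges for each character; the word must end on a final node."""
    node = root
    for ch in word:
        node = node.edges.get(ch)
        if node is None:
            return False
    return node.final


def count_nodes(root):
    """Count distinct nodes reachable from root."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if id(node) not in seen:
            seen.add(id(node))
            stack.extend(node.edges.values())
    return len(seen)
```

For the words cat, cats, bat and bats the trie needs 9 nodes, while the
minimized graph needs 5, because the "at"/"ats" endings are stored once.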

It also has a C++ API for integrating with your software; it is located
under ../api and is called baseapi.

Hope it helps. Anyway, before you start, please read the Tesseract README
and wiki.

Best regards,
Oleg

On Mon, Jul 18, 2011 at 10:22 AM, 준일 최 <[email protected]> wrote:

> Hi. my name is junil.
> Should try to develop using tesseract.
> There are a few questions for tesseract.
>
> 1. support language.
> 2. How to add language.
> 3. approximate specification of tesseract.
>
> Answers would be appreciated.
>
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
>

