Re: questions of tesseract

Oleg Tikhonov Wed, 20 Jul 2011 18:49:16 -0700

Hey,

>> The supported languages generally do not need to be trained. Train only if
you are not satisfied with the recognition quality, or if the shipped data
does not cover the fonts in your input image(s).


>> I cannot give you hardware requirements; I have no figures. My
workstations run Linux with 4 GB to 8 GB of RAM (server-class machines). I
have also tested it on Windows 7 Home Edition with 4 GB of RAM. I believe it
also works on Android-based devices with fewer resources.

>>input file format,  output file format, minimum text pixels, etc....
Please refer to the README and WIKI.

Cheers,
Oleg

On Wed, Jul 20, 2011 at 5:34 AM, 최준일 <[email protected]> wrote:

> Dear Oleg Tikhonov,
>
> Thanks. Your answer was very helpful.
> Can I ask a few more questions?
>
> Do the currently supported languages need training?
> And do you know anything about the specifications?
>   e.g. supported CPUs, CPU usage (MIPS), ROM usage (generally, DB size),
>   RAM usage, input file format, output file format, minimum text pixels,
>   etc.
>
> Again, your answer would help a lot.
> Have a good day~~^^
>
> 2011/7/19 Oleg Tikhonov <[email protected]>
>
>> Hello Junil,
>>
>> >>1.
>>
>>  Tesseract ships with the following list of trained languages:
>>
>>    - Arabic
>>    - Bulgarian
>>    - Catalan
>>    - Czech
>>    - Chinese simplified
>>    - Chinese traditional
>>    - Danish
>>    - German
>>    - Greek
>>    - English
>>    - Finnish
>>    - French
>>    - Hebrew
>>    - Hindi
>>    - Croatian
>>    - Hungarian
>>    - Indonesian
>>    - Italian
>>    - Japanese
>>    - Korean
>>    - Latvian
>>    - Lithuanian
>>    - Dutch
>>    - Norwegian
>>    - And even more
>>
>> >>2. Well described here:
>> http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
>>
>> >>3. I am not sure what you mean by tesseract spec, however it has the
>> following utilities:
>>
>> tesseract – extracts text or characters from the image.
>>
>> cntraining – generates a normproto and pffmtable. Reads in a text file
>> consisting of feature samples from a training page in the following format:
>> FontName CharName NumberOfFeatureTypes(N). It then appends these samples
>> into a separate file for each character.
>>
>> combine_tessdata – creates a unified traineddata file from the different
>> files produced by the training process.
>>
>>    Usage and description:
>>
>>    language_data_path_prefix (e.g. tessdata/eng.)
>>        Combines all individual tessdata components (unicharset, DAWGs,
>>        classifier templates, ambiguities, language configs). The result
>>        is a combined tessdata file, lang_code.traineddata.
>>
>>    -e
>>        Extracts individual components from a combined traineddata file.
>>        For instance: combine_tessdata -e tessdata/ell.traineddata
>>
>>    -o
>>        Overwrites individual components of the given
>>        lang_code.traineddata file.
>>        Example: combine_tessdata -o tessdata/ell.traineddata
>>
>>    -u
>>        Unpacks all the components to the specified path. For instance:
>>        combine_tessdata -u tessdata/ell.traineddata /home/$USER/temp/ell
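To make the naming convention above concrete, here is a minimal Python sketch. It only derives the combined file name from a prefix and builds the argument list for the -u form shown above; it does not run Tesseract. The function names are hypothetical helpers, not part of Tesseract.

```python
from typing import List


def combined_traineddata_path(language_data_path_prefix: str) -> str:
    """Map a prefix such as 'tessdata/eng.' to 'tessdata/eng.traineddata'.

    Assumption: the prefix ends with a dot, as in the example 'tessdata/eng.'.
    """
    if not language_data_path_prefix.endswith("."):
        raise ValueError("prefix is expected to end with '.', e.g. 'tessdata/eng.'")
    return language_data_path_prefix + "traineddata"


def unpack_command(traineddata_file: str, output_prefix: str) -> List[str]:
    """Build the argv for: combine_tessdata -u <traineddata_file> <output_prefix>."""
    return ["combine_tessdata", "-u", traineddata_file, output_prefix]


if __name__ == "__main__":
    print(combined_traineddata_path("tessdata/eng."))
    print(" ".join(unpack_command("tessdata/ell.traineddata", "/tmp/ell")))
```

The returned list can be handed to subprocess.run on a machine where combine_tessdata is installed.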
>>
>> mftraining – separates training pages into files for each character.
>> Extracts from the files only the features of feature type mf and their
>> parameters. Reads in a text file consisting of feature samples from a
>> training page in the following format: FontName CharName
>> NumberOfFeatureTypes(N). The result is a binary file used by the OCR
>> engine.
>>
>> unicharset_extractor – extracts a character/ligature set. Given a list of
>> box files on the command line, it generates a file containing a
>> unicharset, a list of all the characters. The file contains the size of
>> the set on the first line, and then one unichar per line.
>> Usage: unicharset_extractor [-D DIRECTORY] FILE...
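The file layout just described (set size on the first line, then one unichar per line) can be read with a few lines of Python. This is a sketch under that description only. As an assumption, it takes the first whitespace-separated token of each line as the unichar, in case a line carries extra fields; parse_unicharset is a hypothetical helper name.

```python
from typing import List


def parse_unicharset(text: str) -> List[str]:
    """Parse unicharset text: first line is the set size, then one unichar per line."""
    lines = text.splitlines()
    if not lines:
        raise ValueError("empty unicharset")
    declared_size = int(lines[0].strip())
    unichars = []
    for line in lines[1:]:
        if not line.strip():
            continue  # skip blank lines
        # Assumption: the unichar is the first token; any other fields are ignored.
        unichars.append(line.split()[0])
    if len(unichars) != declared_size:
        raise ValueError(
            "declared size %d but found %d entries" % (declared_size, len(unichars))
        )
    return unichars


if __name__ == "__main__":
    print(parse_unicharset("3\na\nb\nfi\n"))
```

The size check is a cheap way to notice a truncated or hand-edited file before it goes into training.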
>>
>> wordlist2dawg – generates a DAWG from a word list file. Given a file that
>> contains a list of words (one word per line), it generates the
>> corresponding squished DAWG file.
>> Usage: wordlist2dawg [-t | -l min_len max_len] word_list_file dawg_file
>> unicharset_file
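Since the input is expected to hold one word per line, a small Python sketch for preparing such a file may help. It only normalizes the word list (strips whitespace, drops blanks and duplicates) and does not call wordlist2dawg. normalize_word_list and the file names in the comment are hypothetical.

```python
from typing import Iterable, List


def normalize_word_list(words: Iterable[str]) -> str:
    """Return file content with one word per line, without blanks or duplicates."""
    seen = set()
    kept: List[str] = []
    for word in words:
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            kept.append(word)  # keep the first occurrence, preserve input order
    return "".join(w + "\n" for w in kept)


if __name__ == "__main__":
    content = normalize_word_list(["cat", " dog ", "", "cat", "bird"])
    print(content, end="")
    # The content would be written to word_list_file and then passed on, e.g.:
    #   wordlist2dawg words.txt words.dawg lang.unicharset
```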
>>
>> It also has a C++ API for integration with your software; it is located
>> under ../api and is called baseapi.
>>
>> Hope it helps. Anyway, before doing anything, please read the tesseract
>> README and wiki.
>>
>> Best regards,
>> Oleg
>>
>>
>>
>>
>>
>> On Mon, Jul 18, 2011 at 10:22 AM, 준일 최 <[email protected]> wrote:
>>
>>> Hi, my name is Junil.
>>> I am going to try developing with tesseract,
>>> and I have a few questions about it.
>>>
>>> 1. Supported languages.
>>> 2. How to add a language.
>>> 3. Approximate specifications of tesseract.
>>>
>>> Answers would be appreciated.
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To post to this group, send email to [email protected]
>>> To unsubscribe from this group, send email to
>>> [email protected]
>>> For more options, visit this group at
>>> http://groups.google.com/group/tesseract-ocr?hl=en
>>>

