Re: [tesseract-ocr] Re: Explanation for training_text and wordlist files

Shree Devi Kumar Fri, 06 Jul 2018 10:23:01 -0700

See Ray's comment at the following link on how the training data was built:


https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951

On Fri 6 Jul, 2018, 10:38 PM James Q, <[email protected]> wrote:

> I can't think of a tool for this. What I would do is open the file in a
> text editor that handles large files (such as EmEditor) and remove the
> duplicate words. You could do this by replacing all spaces with newlines,
> then sorting and removing duplicates. After that you can randomize the
> unique list of words, add an appropriate distribution of punctuation
> characters, and re-edit to create a block of text wrapped at, say, 100
> characters. There are online tools for the randomizing and wrapping.
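
The steps in the paragraph above could be scripted instead of done by hand
in an editor. What follows is only a minimal Python sketch, not anything
shipped with Tesseract: the function name build_training_text, the
punctuation set, and the 15% punctuation rate are all assumptions chosen
for illustration. It also assumes words are separated by whitespace, which
does not hold for scripts such as Japanese.

```python
import random
import textwrap


def build_training_text(corpus, width=100, punct=".,;:!?",
                        punct_rate=0.15, seed=0):
    """Turn raw text into a shuffled, de-duplicated, wrapped block of words.

    Hypothetical helper: the name and default values are illustrative only.
    """
    rng = random.Random(seed)  # fixed seed so the output is reproducible

    # Split on any whitespace, then sort and remove duplicates.
    words = sorted(set(corpus.split()))

    # Randomize the order of the unique words.
    rng.shuffle(words)

    # Append a punctuation mark to a fraction of the words (the rate is a guess).
    decorated = [
        w + rng.choice(punct) if rng.random() < punct_rate else w
        for w in words
    ]

    # Re-flow into lines of about `width` characters; a single word longer
    # than `width` is kept whole rather than split.
    return textwrap.fill(" ".join(decorated), width=width,
                         break_long_words=False, break_on_hyphens=False)
```

Calling build_training_text(open("corpus.txt", encoding="utf-8").read())
on a hypothetical corpus.txt returns the wrapped block as one string. Words
that already carry punctuation in the input ("the" and "the,") count as
different words here, so the input may need cleaning first.
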
>
> Having said this, I don't know how valuable it is to have training text
> containing specific words. I have been struggling to train on specific
> word lists myself, without much success. I think training text is mainly
> about a representative distribution of characters. Please let me know if
> you have any insights on the wordlists in langdata, as I'm a bit hazy there.
>
> Thanks
> James
>
>
>
> On Wednesday, July 4, 2018 at 9:02:13 AM UTC+1, Dd U wrote:
>>
>> Hello guys.
>>
>>
>> I want to add a new language script to Tesseract OCR and am researching
>> training data.
>>
>>
>> I would like to know the following:
>>
>>    1. Is there an automatic tool that makes the langdata training_text and
>>    wordlist files from a massive text?
>>    2. Is there any documentation on preparing the text data, with an
>>    explanation of the text data files? I looked at the directory
>>    langdata/jpn/ and saw some files there, but I have no idea what these
>>    files are or how to create files like them. What rules should I follow
>>    to create langdata files?
>>
>> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/ccc8505c-216f-450a-9627-d85b2c9e21a9%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/ccc8505c-216f-450a-9627-d85b2c9e21a9%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>

