No tool I can think of. What I would do is edit the file in a large-file text 
editor (such as EmEditor) to remove duplicate words. You could do this by 
replacing all spaces with newlines, then sorting and removing duplicates. 
After that you can randomize the unique list of words, add an appropriate 
distribution of punctuation characters, and re-edit to create a block of 
text wrapped at, say, 100 characters. There are online tools to do the 
randomizing and wrapping.
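If you'd rather script it than do it by hand, the steps above (split into 
words, deduplicate, shuffle, wrap) can be sketched in a few lines of Python. 
The function name and the 100-character wrap width are just illustrative, 
and this doesn't attempt the punctuation-distribution step:

```python
import random
import textwrap

def make_training_text(raw_text, width=100, seed=42):
    """Deduplicate the words of raw_text, shuffle them, and wrap the
    result into lines no longer than `width` characters."""
    # Split on whitespace and keep one copy of each word (sorted first
    # so the shuffle is reproducible for a given seed).
    unique_words = sorted(set(raw_text.split()))
    random.Random(seed).shuffle(unique_words)
    # Join back into one string and wrap at the requested width.
    return textwrap.fill(" ".join(unique_words), width=width)

sample = "the quick brown fox jumps over the lazy dog the fox the dog"
print(make_training_text(sample))
```

You would then feed the wrapped output to the usual training-text pipeline; 
for a real corpus you'd read the input from a file instead of a string.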

Having said this, I don't know how valuable it is to have training text 
containing specific words. I have been struggling myself to train on 
specific word lists without much success. I think training text is mostly 
about a representative distribution of characters. Please let me know if 
you have any insights on the wordlists in langdata, as I'm a bit hazy there.

Thanks
James



On Wednesday, July 4, 2018 at 9:02:13 AM UTC+1, Dd U wrote:
>
> Hello guys.
>
>
> I want to add a new language script to Tesseract OCR and am researching 
> how to prepare training data.
>
>
> I would like to know the following things:
>
>    1. Is there any automatic tool that makes langdata training_text and 
>    wordlist files from a massive text corpus?
>    2. Is there any documentation about preparing text data, with an 
>    explanation of the text data files? I have looked at the directory 
>    langdata/jpn/, which contains some files, but I have no idea what 
>    these files are or how to create files like them. What rules should I 
>    follow to create langdata files?
>
>

