See the following comment by Ray regarding building training data:
https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951

On Fri 6 Jul, 2018, 10:38 PM James Q, <james.quitten...@taina.tech> wrote:

> No tool I can think of. What I would do is edit the file in a large text
> file editor (such as EmEditor) to remove duplicate words. You could do this
> by replacing all spaces with newlines, then sorting and removing duplicates.
> After that you can randomize the unique list of words, add an appropriate
> distribution of punctuation characters and re-edit to create a block of
> text wrapped at, say, 100 characters. There are online tools to do the
> randomizing and wrapping.
>
> Having said this, I don't know how valuable it is to have training text
> containing specific words. I have been struggling myself to train on
> specific word lists without much success. I think training text is just
> about a representative distribution of characters. Please let me know if
> you have any insights on the wordlists in langdata, as I'm a bit hazy there.
>
> Thanks
> James
>
> On Wednesday, July 4, 2018 at 9:02:13 AM UTC+1, Dd U wrote:
>>
>> Hello guys.
>>
>> I want to add a new language script to Tesseract OCR and am researching
>> training data.
>>
>> I would like to know the following:
>>
>> 1. Is there any automatic tool that makes the langdata training_text and
>>    wordlist files from massive text?
>> 2. Is there any documentation about preparing text data and an
>>    explanation of the text data files? I just saw the directory
>>    langdata/jpn/ and there are some files, but I have no idea about
>>    these files or how to create files like them. What rules should I use
>>    to create langdata files?
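
The manual editor steps James describes (split into words, sort and remove duplicates, randomize, add punctuation, wrap at about 100 characters) could also be scripted. The following is only a rough Python sketch of that idea, not an official Tesseract tool. The function name `build_training_text`, the punctuation set, and the 10% punctuation rate are assumptions chosen for illustration; what counts as an "appropriate distribution" of punctuation depends on the language. It also assumes words are separated by whitespace, which does not hold for scripts such as Japanese.

```python
import random
import textwrap


def build_training_text(raw_text, width=100, punct=".,;:!?", punct_rate=0.1, seed=0):
    """Turn raw text into a shuffled, wrapped block of unique words."""
    # Splitting on whitespace plays the role of "replace spaces with newlines".
    # Punctuation already attached to a word is kept as part of that word.
    words = sorted(set(raw_text.split()))  # sort and remove duplicates
    rng = random.Random(seed)              # seeded so runs are repeatable
    rng.shuffle(words)                     # randomize the unique word list
    # Attach one punctuation character to a fraction of the words.
    # Both the character set and the rate are assumed values.
    decorated = [
        w + rng.choice(punct) if rng.random() < punct_rate else w
        for w in words
    ]
    # Wrap to the target width without splitting words.
    return textwrap.fill(
        " ".join(decorated),
        width=width,
        break_long_words=False,
        break_on_hyphens=False,
    )
```

To try it, read the source text into a string, pass it to `build_training_text`, and write the result to a file. Whether such a word-shuffled text trains better than running text is exactly the open question raised in the reply above.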