Details like matching parentheses do not matter, but natural examples give context for the different symbols. Words matter, unless you override that feature, and the word list / DAWG does provide a significant increase in accuracy.
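One way to reconcile real words with the goal of seeing every character often enough is to pick the training text from an actual word list and stop once each character has reached a minimum count. The following is only a rough sketch of that idea, not part of Tesseract or its training tools: the function name `pick_training_words`, its parameters, and the default of 10 occurrences (taken from the 10-20 range in the question) are all assumptions made for illustration.

```python
from collections import Counter


def pick_training_words(words, alphabet, min_count=10):
    """Greedily choose real words until every character in `alphabet`
    occurs at least `min_count` times, or until no unused word helps.

    Hypothetical helper for illustration; not a Tesseract API.
    Returns (selected_words, characters_still_below_min_count).
    """
    counts = Counter()
    selected = []
    remaining = list(words)

    while remaining:
        # Characters that have not yet reached the target count.
        needed = {ch for ch in alphabet if counts[ch] < min_count}
        if not needed:
            break

        # Prefer the word that contributes the most still-needed characters.
        best = max(remaining, key=lambda w: sum(ch in needed for ch in w))
        if not any(ch in needed for ch in best):
            break  # none of the unused words contains a needed character

        selected.append(best)
        counts.update(best)
        remaining.remove(best)  # each word is used at most once

    still_short = sorted(ch for ch in alphabet if counts[ch] < min_count)
    return selected, still_short
```

Called with a real word list for the script and its full character set, the second return value shows which characters are still under-represented, so extra natural sentences containing them can be added by hand rather than falling back to gibberish.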
https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3

Tesseract has already been trained for quite a few languages and scripts, so consider reading the archives to see if your work is already done or some tips have been given already.

https://groups.google.com/forum/?fromgroups#!forum/tesseract-ocr

Good luck!
--Sven

On Wed, Nov 28, 2012 at 9:34 PM, Joe Carter <[email protected]> wrote:
> Hello,
>
> I'm trying to Train Tesseract to recognize a script with over 200 letters.
>
> Is it ok to train Tesseract with gibberish text? Or does the training
> method rely on a probable distribution of characters i.e. Actual writing?
> I'd like to train it with a random distribution of characters where each
> character appears 10-20 times depending on how common it is.
>
> When it comes to punctuation, does the same apply? I know the training
> guide says to make sure that the punctuation is not grouped together, but
> do the examples of punctuation have to be plausible? For example,
> do parentheses have to be properly matched? e.g. *The (quick brown] fox
> jump over the lazy dog.*
>
> Thanks.
>
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en

--
"All that is gold does not glitter, not all those who wander are lost;
the old that is strong does not wither, deep roots are not reached by the frost.
From the ashes a fire shall be woken, a light from the shadows shall spring;
renewed shall be blade that was broken, the crownless again shall be king."

