Details such as matching parentheses do not matter, but natural examples
give context for the different symbols. Words matter, unless you disable
that feature, and the word list / DAWG does provide a significant increase
in accuracy.
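
Whichever text you train on, it may help to check that every character
in the script appears often enough. Below is a minimal sketch of such a
check in Python. The function name coverage_report, the sample text, and
the threshold of 10 are illustrative assumptions (the threshold is taken
from the 10-20 range mentioned in the question); none of this is part of
Tesseract or its training tools.

```python
from collections import Counter


def coverage_report(text, alphabet, min_count=10):
    """Return {char: count} for characters of `alphabet` that occur
    fewer than `min_count` times in `text` (hypothetical helper)."""
    # Count every non-whitespace character in the training text.
    counts = Counter(ch for ch in text if not ch.isspace())
    # Counter returns 0 for characters that never appear.
    return {ch: counts[ch] for ch in alphabet if counts[ch] < min_count}


# Illustrative data only: a repeated English pangram and a small alphabet.
sample = "the quick brown fox jumps over the lazy dog. " * 5
alphabet = "abcdefghijklmnopqrstuvwxyz.()"

short = coverage_report(sample, alphabet, min_count=10)
for ch, n in sorted(short.items()):
    print(f"{ch!r}: {n} occurrence(s), below the threshold")
```

Characters listed by the report are candidates for extra natural
sentences in the training text, rather than random filler.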

https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3

Tesseract has already been trained for quite a few languages and scripts,
so consider searching the list archives to see whether your work has
already been done or tips have already been posted.

https://groups.google.com/forum/?fromgroups#!forum/tesseract-ocr

Good luck!
--Sven


On Wed, Nov 28, 2012 at 9:34 PM, Joe Carter <[email protected]> wrote:

> Hello,
>
> I'm trying to train Tesseract to recognize a script with over 200 letters.
>
> Is it ok to train Tesseract with gibberish text? Or does the training
> method rely on a probable distribution of characters, i.e. actual writing?
> I'd like to train it with a random distribution of characters where each
> character appears 10-20 times depending on how common it is.
>
> When it comes to punctuation, does the same apply? I know the training
> guide says to make sure that the punctuation is not grouped together, but
> do the examples of punctuation have to be plausible? For example,
> do parentheses have to be properly matched? e.g. *The (quick brown] fox
> jump over the lazy dog.*
>
> Thanks.
>
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
>



-- 
``All that is gold does not glitter,
  not all those who wander are lost;
the old that is strong does not wither,
  deep roots are not reached by the frost.
From the ashes a fire shall be woken,
  a light from the shadows shall spring;
renewed shall be blade that was broken,
  the crownless again shall be king.''

