Cropping off the page borders, and possibly adding a plain white margin
around the text afterwards, might help.
Filtering out the text that bleeds through from the other side of the page
would also help.
The print itself is clear, so you might not need to retrain.
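
As a rough illustration of those three clean-up steps, here is a minimal
sketch in plain Python. It assumes the page has already been loaded as a
grid of 8-bit grayscale values (0 = black, 255 = white). The function names,
the margin, and the threshold of 128 are all made up for illustration and
would need tuning per scan; a real pipeline would more likely use an image
library than nested lists.

```python
def crop_border(pixels, margin):
    """Drop `margin` pixels from every edge of a 2D grayscale grid."""
    if margin <= 0:
        return [row[:] for row in pixels]
    return [row[margin:-margin] for row in pixels[margin:-margin]]


def suppress_bleed_through(pixels, threshold=128):
    """Set every pixel at or above `threshold` to pure white.

    Assumption: bleed-through is fainter (lighter) than the real ink,
    so a single global threshold separates the two.
    """
    return [[255 if value >= threshold else value for value in row]
            for row in pixels]


def pad_white(pixels, border):
    """Surround the grid with `border` pixels of white on every side."""
    width = len(pixels[0]) + 2 * border
    blank_rows = lambda: [[255] * width for _ in range(border)]
    middle = [[255] * border + row + [255] * border for row in pixels]
    return blank_rows() + middle + blank_rows()


def preprocess(pixels, margin=1, threshold=128, border=2):
    """Crop the frame, whiten faint marks, then add a clean white margin."""
    cropped = crop_border(pixels, margin)
    cleaned = suppress_bleed_through(cropped, threshold)
    return pad_white(cleaned, border)


# Tiny example: a black frame around dark ink (30, 20) and faint marks (200, 180).
page = [
    [0, 0, 0, 0],
    [0, 30, 200, 0],
    [0, 180, 20, 0],
    [0, 0, 0, 0],
]
result = preprocess(page, margin=1, threshold=128, border=1)
```

A single global threshold is the simplest choice; unevenly lit or stained
pages may need a local (adaptive) threshold instead.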
--Sven

On Fri, Nov 2, 2012 at 10:32 AM, Devin Bean <[email protected]> wrote:
> Hi,
>
> Apologies for the noob questions. Trying to get the hang of Tesseract.
>
> I have a number of images of Chinese genealogies that I'd love to be able to
> run OCR on. Most of them are similar to the two images linked below:
> wood-block fairly standard print, or, for newer images, actually printed
> standard font.
>
> Wood block print: http://www.flickr.com/photos/63588871@N05/8138563082/
> Standard font print: http://www.flickr.com/photos/63588871@N05/8147864815/
>
> Questions
> - What options do I use to tell Tesseract to read top-to-bottom,
> left-to-right? (I'm using Tesseract 3.02)
> - I expect that Tesseract will need to be trained for the wood block texts at
> least. I can edit these images so that just the central text portion remains
> and so that the contrast is greater between the background and the
> characters. I can also generate text files with the characters in the image.
> How do I construct training files that use images where the lines are
> top-to-bottom and left-to-right?
>
> If you have any other advice for processing images like these, I'd really
> appreciate it.
>
> Thanks for your help!
>
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en



-- 
“All that is gold does not glitter,
  not all those who wander are lost;
the old that is strong does not wither,
  deep roots are not reached by the frost.
From the ashes a fire shall be woken,
  a light from the shadows shall spring;
renewed shall be blade that was broken,
  the crownless again shall be king.”

