There is no way to achieve this other than helping Tesseract with
segmentation and feeding it the image in pieces. Many segmentation
approaches exist, but which one you should choose depends on the
specifics of your images: how long the text lines are, whether the
layout is multicolumn, how skewed or warped (non-flat) the page is,
and more.
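
As one possible illustration for the simplest case (single column, no
skew), here is a minimal sketch of splitting a page into horizontal
strips without cutting through a text line. It is not a
Tesseract feature. It assumes you have already computed, for example on
a downscaled and binarized copy of the poster, the number of dark
pixels in each pixel row; the names row_ink, max_height and min_gap are
hypothetical. Cuts are placed only in the middle of blank gaps between
lines.

```python
from typing import List, Tuple


def blank_runs(row_ink: List[int], ink_threshold: int = 0) -> List[Tuple[int, int]]:
    """Return half-open (start, end) ranges of consecutive rows with no ink."""
    runs = []
    start = None
    for y, ink in enumerate(row_ink):
        if ink <= ink_threshold:
            if start is None:
                start = y          # a blank run begins here
        elif start is not None:
            runs.append((start, y))  # the blank run ended on the previous row
            start = None
    if start is not None:
        runs.append((start, len(row_ink)))
    return runs


def split_into_strips(row_ink: List[int], max_height: int,
                      min_gap: int = 2, ink_threshold: int = 0) -> List[Tuple[int, int]]:
    """Split rows [0, len(row_ink)) into strips of roughly max_height rows.

    Cuts are made only at midpoints of blank gaps at least min_gap rows
    tall, so no text line is cut. A strip may exceed max_height when no
    suitable gap is available (assumption: the caller handles that case).
    """
    height = len(row_ink)
    cuts = [(s + e) // 2
            for s, e in blank_runs(row_ink, ink_threshold)
            if e - s >= min_gap]
    cuts.append(height)  # sentinel: bottom edge of the image

    strips = []
    top = 0   # first row of the strip being built
    prev = 0  # most recent safe cut position seen so far
    for cut in cuts:
        # If extending to this cut would make the strip too tall,
        # close the strip at the previous safe cut instead.
        if cut - top > max_height and prev > top:
            strips.append((top, prev))
            top = prev
        prev = cut
    if height > top:
        strips.append((top, height))
    return strips


if __name__ == "__main__":
    # Toy profile: three 3-row text lines separated by 2-row blank gaps.
    profile = [0, 0, 5, 5, 5, 0, 0, 5, 5, 5, 0, 0, 5, 5, 5, 0, 0]
    print(split_into_strips(profile, max_height=6))
```

If the profile was computed on a downscaled copy, the strip boundaries
would need to be multiplied by the scale factor before cropping the
full-resolution image, and each crop can then be recognized separately.
Multicolumn or skewed pages would need a different approach.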

Send some sample images to get more practical advice.

Warm regards,
Dmitri Silaev
www.CustomOCR.com




On Fri, Nov 18, 2011 at 12:59 AM, walter23 <[email protected]> wrote:
> I'm trying to come up with a method to OCR very large images (poster
> sized) with lots of regular sized text... for example 40" wide with 12
> point font.  One big limitation I have is that memory is easily
> exhausted with images that take up half a gigabyte or more of RAM
> (40x30" @ 300DPI is pretty big).
>
> I am trying to find out a smart method of automatically reducing the
> image to continuous regions of text so that I do not chop text lines
> in half (either horizontally or vertically).
>
> One idea was to maybe use page segmentation on a lower resolution
> image and use this page layout to split the image up, but looking at
> the layout results I see some problems with this.
>
> Has anybody tackled this kind of problem before?  Suggestions for
> approaches to take?
>
> Many thanks
>
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
>

