Hi Dmitri,

Thanks for your response. I figured some kind of custom segmentation was going to be required. Any suggestions you can make would be appreciated. I was thinking I might use some tools from OpenCV or similar, but I'm not really sure where to read up on segmentation approaches.
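To make the goal concrete, one simple technique that never cuts through a text line is a horizontal projection profile. The sketch below is only an illustration. It assumes a binarized, roughly deskewed, single-column image where 1 marks ink, which may not hold for a real poster layout. It counts ink per row, finds blank bands between text lines, and cuts strips only in the middle of those bands. The function names (`row_ink_profile`, `blank_runs`, `safe_strips`, `scale_strips`) and the thresholds (`max_ink`, `min_gap`) are made up for this example. The profile could be computed on a downscaled copy to save memory, with the cut rows scaled back up to the full-resolution image.

```python
from typing import List, Sequence, Tuple

Strip = Tuple[int, int]  # half-open row range: (top, bottom)


def row_ink_profile(binary_rows: Sequence[Sequence[int]]) -> List[int]:
    """Count ink pixels (value 1) in each row of a binarized image."""
    return [sum(row) for row in binary_rows]


def blank_runs(profile: Sequence[int], max_ink: int = 0,
               min_gap: int = 3) -> List[Strip]:
    """Return half-open (start, end) ranges of consecutive near-empty rows.

    A row counts as blank when its ink count is <= max_ink (hypothetical
    noise tolerance). Runs shorter than min_gap rows are ignored.
    """
    runs: List[Strip] = []
    start = None
    for y, ink in enumerate(profile):
        if ink <= max_ink:
            if start is None:
                start = y
        else:
            if start is not None and y - start >= min_gap:
                runs.append((start, y))
            start = None
    if start is not None and len(profile) - start >= min_gap:
        runs.append((start, len(profile)))
    return runs


def safe_strips(profile: Sequence[int], max_rows: int, max_ink: int = 0,
                min_gap: int = 3) -> List[Strip]:
    """Split the rows into strips, cutting only in the middle of blank runs.

    Strips are kept at or below max_rows where a blank run allows it; a text
    block taller than max_rows with no internal gap stays in one piece.
    """
    height = len(profile)
    cuts = [(s + e) // 2 for s, e in blank_runs(profile, max_ink, min_gap)]
    cuts = [c for c in cuts if 0 < c < height]

    strips: List[Strip] = []
    top = 0
    prev = None  # last candidate cut seen below the current strip top
    for c in cuts + [height]:
        if c - top > max_rows and prev is not None:
            strips.append((top, prev))
            top = prev
        prev = c
    strips.append((top, height))
    return strips


def scale_strips(strips: Sequence[Strip], factor: int) -> List[Strip]:
    """Map strips found on a downscaled image back to full-resolution rows."""
    return [(top * factor, bottom * factor) for top, bottom in strips]
```

Each resulting strip could then be cropped from the full image and passed to Tesseract on its own. The same idea applied to columns (a vertical profile within each strip) would handle splitting in the other direction, though multicolumn or skewed layouts would need something more robust than this.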
Here's a sample image: http://i.imgur.com/6he8V.jpg

This is not actually an image I have worked with. It's just a representative sample pulled at random from a web image search, since the actual image contains proprietary information that I can't share. Actual resolution is in the 14,000 x 10,000 range.

-Walter

On Nov 17, 10:04 pm, Dmitri Silaev <[email protected]> wrote:
> There's no other way to achieve this except helping Tesseract with
> segmentation and feeding it with chopped image pieces. Many segmentation
> approaches exist, but which you should choose depends on your image
> specifics: how long text lines are, whether it is a multicolumn layout
> or not, possible skewness and plainness of the whole image and many
> more.
>
> Send your sample images to get more practical advice.
>
> Warm regards,
> Dmitri Silaev
> www.CustomOCR.com
>
> On Fri, Nov 18, 2011 at 12:59 AM, walter23 <[email protected]> wrote:
> > I'm trying to come up with a method to OCR very large images (poster
> > sized) with lots of regular sized text... for example 40" wide with 12
> > point font. One big limitation I have is that memory is easily
> > exhausted with images that take up half a gigabyte or more of RAM
> > (40x30" @ 300DPI is pretty big).
> >
> > I am trying to find out a smart method of automatically reducing the
> > image to continuous regions of text so that I do not chop text lines
> > in half (either horizontally or vertically).
> >
> > One idea was to maybe use page segmentation on a lower resolution
> > image and use this page layout to split the image up, but looking at
> > the layout results I see some problems with this.
> >
> > Has anybody tackled this kind of problem before? Suggestions for
> > approaches to take?
> >
> > Many thanks
> >
> > --
> > You received this message because you are subscribed to the Google
> > Groups "tesseract-ocr" group.
> > To post to this group, send email to [email protected]
> > To unsubscribe from this group, send email to
> > [email protected]
> > For more options, visit this group at
> > http://groups.google.com/group/tesseract-ocr?hl=en

