I'm trying to come up with a method to OCR very large (poster-sized) images containing lots of regular-sized text, for example a 40" wide poster set in a 12-point font. One big limitation is memory: it is easily exhausted by images that take up half a gigabyte or more of RAM (40x30" at 300 DPI is pretty big).
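To put rough numbers on that, here is a small back-of-the-envelope calculation. It only counts the uncompressed pixel buffer, and the bytes-per-pixel values are assumptions about how the image might be held in memory; `raw_image_bytes` is just an illustrative helper. A 40x30" scan at 300 DPI is 12000 x 9000 pixels, so about 108 MB at one byte per pixel and about 432 MB at four.

```python
def raw_image_bytes(width_in, height_in, dpi, bytes_per_pixel):
    """Size of the uncompressed pixel buffer, ignoring any library overhead."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bytes_per_pixel


# Assumed in-memory formats; the real one depends on the scanner and the OCR library.
for label, bpp in [("8-bit gray", 1), ("24-bit RGB", 3), ("32-bit RGBA", 4)]:
    size = raw_image_bytes(40, 30, 300, bpp)
    print(f"{label}: {size / 1e6:.0f} MB")
```

Any working copies made during processing come on top of this.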
I am trying to find a smart way of automatically splitting the image into contiguous regions of text, so that I do not chop text lines in half (either horizontally or vertically). One idea was to run page segmentation on a lower-resolution copy of the image and use the resulting page layout to split the full-resolution image, but looking at the layout results I see some problems with this. Has anybody tackled this kind of problem before? Any suggestions for approaches to take?

Many thanks
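For concreteness, one possible variant of the low-resolution idea is sketched below. It is not page segmentation proper but a simpler projection-profile approach, named here as a swapped-in technique: count the dark pixels in each row of the downsampled image, treat runs of blank rows as safe places to cut, and pick cuts so that each strip stays under a height limit. The function names, the thresholds, and the assumption that `row_ink` has already been computed from the small image are all hypothetical.

```python
def blank_gap_centers(row_ink, ink_threshold=0, min_gap=2):
    """Return the center row of every run of blank rows at least min_gap tall.

    row_ink[y] is the number of dark pixels in row y of the low-res image.
    """
    centers = []
    run_start = None
    # A sentinel "inked" row at the end closes a blank run that reaches the bottom.
    for y, ink in enumerate(list(row_ink) + [ink_threshold + 1]):
        if ink <= ink_threshold:
            if run_start is None:
                run_start = y
        elif run_start is not None:
            if y - run_start >= min_gap:
                centers.append((run_start + y - 1) // 2)
            run_start = None
    return centers


def choose_cuts(centers, height, max_strip):
    """Greedily pick cut rows so each strip is as tall as possible.

    Strips stay within max_strip rows whenever a blank gap allows it; if no
    gap is available, a strip is left oversized rather than cut through text.
    """
    cuts, top, last_ok = [], 0, None
    for c in sorted(centers) + [height]:
        while c - top > max_strip and last_ok is not None:
            cuts.append(last_ok)
            top, last_ok = last_ok, None
        if top < c < height:
            last_ok = c
    return cuts


# Toy profile: three text lines separated by blank rows.
row_ink = [0, 0, 5, 7, 6, 0, 0, 0, 4, 8, 0, 0, 9, 9, 0]
centers = blank_gap_centers(row_ink, min_gap=2)
low_res_cuts = choose_cuts(centers, height=len(row_ink), max_strip=8)

scale = 4  # assumed downsampling factor between full-res and low-res images
full_res_cuts = [c * scale for c in low_res_cuts]
print(full_res_cuts)
```

The same two functions could in principle be applied to column profiles inside each strip to get vertical cuts, though multi-column layouts and skewed scans would likely need something more robust than a plain projection profile.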

