Hi Albert,

Tesseract cannot read display formulas; its fundamental model is strictly linear, that is, line-oriented. Unless and until that changes, the best you can hope for is recognizing symbols in running text, and you will have to watch out for problems with superscripts and subscripts.
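To make the point about a linear model concrete, here is a toy sketch in Python. It is not Tesseract code and does not reflect how Tesseract works internally. The `Glyph` class, the coordinates, and both functions are hypothetical, made up purely for illustration. It shows what happens when the glyphs of a displayed fraction are read strictly left to right, compared with grouping them by their position relative to the fraction bar.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float  # horizontal centre of the glyph
    y: float  # vertical centre (larger means lower on the page)


# Hypothetical glyph positions for the display formula \frac{a+b}{c}.
glyphs = [
    Glyph("a", 0.0, 0.0),
    Glyph("+", 1.0, 0.0),
    Glyph("b", 2.0, 0.0),
    Glyph("-", 1.0, 1.0),  # the fraction bar
    Glyph("c", 1.0, 2.0),
]


def linear_read(glyphs):
    """Read strictly left to right, ignoring vertical position."""
    return "".join(g.char for g in sorted(glyphs, key=lambda g: g.x))


def structured_read(glyphs, bar_y=1.0):
    """Group glyphs above and below the bar (assumed to sit at bar_y)."""
    ordered = sorted(glyphs, key=lambda g: g.x)
    numerator = "".join(g.char for g in ordered if g.y < bar_y)
    denominator = "".join(g.char for g in ordered if g.y > bar_y)
    return r"\frac{%s}{%s}" % (numerator, denominator)


print(linear_read(glyphs))      # a+-cb
print(structured_read(glyphs))  # \frac{a+b}{c}
```

The left-to-right reading interleaves numerator, bar, and denominator, which is why symbol recognition alone does not recover the formula. A real formula recognizer would have to infer such spatial relations itself instead of being handed the position of the bar.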
There is a project that claims to read display formulas (InftyReader: http://www.inftyproject.org/en/software.html#InftyReader), but it isn't Free Software and it only runs on Windows machines. I have no personal experience with it. Caveat emptor.

Cheers,
Laird Breyer

On Aug 27, 11:06 pm, Albert Zeyer wrote:
> On 27.08.10 11:53, Jimmy O'Regan wrote:
> > On 26 August 2010 16:27, albert wrote:
> >> Hi,
> >>
> >> I need an open OCR library which is able to scan complex printed math
> >> formulas (for example some formulas which were generated via LaTeX). I
> >> want to get some LaTeX-like output (or just some AST-like data).
> >>
> >> Can Tesseract do this? Is there something like this already? Or are
> >> current OCR techniques just able to parse line-oriented text?
> >
> > Tesseract does not do that. There's an open enhancement request that
> > might have more information:
> > http://code.google.com/p/tesseract-ocr/issues/detail?id=270
>
> Ah, but I am asking for more than just being able to scan math symbols. I
> want to have support to scan full formulas which can be quite complex. A
> combination of \frac, \int, \sum, etc. It must not only detect the
> symbols, it must also see how they belong together (for example the
> numerator and the denominator in a fraction).
>
> Is it possible to extend Tesseract to be able to do this or is some
> heavy redesign of the whole engine needed (and some fundamentally different
> techniques) to do this?

