On 29 August 2010 00:46, Albert Zeyer wrote:
> Jimmy O'Regan wrote:
>> As it is, I pointed you to the enhancement request, which, as you seem
>> to not have read it, has some - admittedly, not much - extra
>> information on the topic.
>
> Ah sorry, I misunderstood the request. The description of it is just about
> the symbols, that is why I thought this request is about symbols. The
> original poster only added in an additional comment that this request may be
> extended to full formula recognition -- which is in my eyes a very different
> request, so it would have fit better into another, separate request.
>
> Also, despite the link to the Inftyproject and a comment that it is not open
> source, the rest of the discussion is just about symbol recognition (and
> esp. about how Detexify works).
>

Well, like I said, not much information, but the issue tracker is a
good place to keep notes, and there is a link from the Infty site to
papers on various aspects of their system.

>>> Is it possible to extend Tesseract to be able to do this or is some heavy
>>> redesign of the whole engine needed (and some fundamental other techniques)
>>> to do this?
>>
>> The only current system available for maths recognition - the link is
>> in the enhancement request - contains its maths recognition as a
>> separate engine. I don't think that's strictly necessary, but maths
>> would need to be processed in an entirely different way, and a formula
>> detection mechanism would be required to ensure it is handled in a
>> different way. At the very least, the formula would need to be
>> segmented into a grid, because relative position and size are much more
>> significant than in text - not just in detecting
>> superscripts/subscripts, but also in determining if pi means pi or
>> product, etc.
>
> Thanks for this evaluation.
>
> I will see what I can do. Maybe I will try to play around with this a bit
> myself. It was anyway just for a small side project for me so I am not sure
> yet how much time I want to invest into this. I will let you know if I have
> something interesting for you.

The only paper I've read on formula detection was rather dated, and
amounted basically to looking for a distribution of numbers, individual
letters, and math-like symbols (and that would erroneously consider
everything that looks like '(1)' to be maths).

Look at the tab detection code - if you have a bunch of math-like
symbols *and* a set of vertically aligned equals signs, then you
probably have a proof; if you have large [] containing aligned
math-likes, you probably have a matrix. That should be more reliable
than checking for a bunch of random numbers and letters, and it turns
one large task into a set of small tasks.
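
To make the aligned-equals idea concrete, here is a rough sketch in Python. It is not Tesseract code and does not use the tab detection code mentioned above; the Symbol record, the MATH_LIKE set, the function names and all thresholds are made up for illustration, and it assumes you already have per-character bounding boxes from somewhere.

```python
from collections import namedtuple

# Hypothetical record: a recognised character plus its bounding box
# (left edge, top edge, width, height), all in pixels.
Symbol = namedtuple("Symbol", "char x y w h")

# Illustrative set of "math-like" characters; a real list would be longer.
MATH_LIKE = set("=+-*/<>^_()[]{}|") | set("∑∏∫√≤≥≠±×÷")


def math_density(symbols):
    """Fraction of symbols that are math-like (the dated heuristic on its own)."""
    if not symbols:
        return 0.0
    return sum(s.char in MATH_LIKE for s in symbols) / len(symbols)


def count_rows(group):
    """Count distinct text rows in a group of symbols, using their top edges."""
    rows = 0
    last_y = None
    for s in sorted(group, key=lambda s: s.y):
        # A new row starts when the top edge moves down by more than one glyph height.
        if last_y is None or s.y - last_y > s.h:
            rows += 1
            last_y = s.y
    return rows


def has_aligned_equals(symbols, x_tol=3, min_rows=3):
    """True if '=' signs on at least min_rows rows share roughly the same left edge."""
    equals = [s for s in symbols if s.char == "="]
    for anchor in equals:
        group = [s for s in equals if abs(s.x - anchor.x) <= x_tol]
        if count_rows(group) >= min_rows:
            return True
    return False


def looks_like_derivation(symbols, min_density=0.2):
    """Require both signals: enough math-like symbols and a column of '=' signs."""
    return math_density(symbols) >= min_density and has_aligned_equals(symbols)
```

Requiring both signals is the point: an equation label like '(1)' has a high math-like density but no column of equals signs, so it is not flagged. A matrix check could follow the same pattern (tall brackets enclosing column-aligned symbols), which is how the big task breaks into small ones.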

--
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.

