On 29 August 2010 00:46, Albert Zeyer <[email protected]> wrote:
> Jimmy O'Regan wrote:
>> As it is, I pointed you to the enhancement request, which, as you seem
>> to not have read it, has some - admittedly, not much - extra
>> information on the topic.
>
> Ah sorry, I misunderstood the request. Its description is only about
> symbols, which is why I thought the request was about symbols. The
> original poster only added in an additional comment that this request may be
> extended to full formula recognition -- which is in my eyes a very different
> request, so it would have fit better into another, separate request.
>
> Also, despite the link to the Inftyproject and a comment that it is not open
> source, the rest of the discussion is just about symbol recognition (and
> esp. about how Detexify works).
>

Well, like I said, not much information, but the issue tracker is a
good place to keep notes, and there is a link from the Infty site to
papers on various aspects of their system.

>>> Is it possible to extend Tesseract to be able to do this or is some heavy
>>> redesign of the whole engine needed (and some fundamentally different
>>> techniques) to do this?
>>
>> The only current system available for maths recognition - the link is
>> in the enhancement request - contains its maths recognition as a
>> separate engine. I don't think that's strictly necessary, but maths
>> would need to be processed in an entirely different way, and a formula
>> detection mechanism would be required to ensure it is handled in a
>> different way. At the very least, the formula would need to be
>> segmented into a grid, because relative position and size are much more
>> significant than in text - not just in detecting
>> superscripts/subscripts, but also in determining if pi means pi or
>> product, etc.
>
> Thanks for this evaluation.
>
> I will see what I can do. Maybe I will try to play around with this a bit
> myself. It was anyway just for a small side project for me so I am not sure
> yet how much time I want to invest into this. I will let you know if I have
> something interesting for you.

The only paper I've read on formula detection was rather dated, and
amounted basically to looking for a distribution of numbers,
individual letters, and math-like symbols (and that would erroneously
consider everything that looks like '(1)' to be maths).
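
To make that concrete, here is a rough sketch of what such a
distribution check might look like. It is not taken from the paper or
from Tesseract; the symbol set, the threshold and the function names
are all made up for illustration.

```python
import re

# Hypothetical symbol set and threshold, chosen only for illustration.
MATH_SYMBOLS = set("+-*/=<>^_()[]{}|")
THRESHOLD = 0.6


def is_math_like(token: str) -> bool:
    """True if a token is a number, a single letter, or only symbols/digits."""
    if re.fullmatch(r"\d+(\.\d+)?", token):
        return True  # plain number
    if len(token) == 1 and token.isalpha():
        return True  # individual letter, e.g. a variable name
    return all(ch in MATH_SYMBOLS or ch.isdigit() for ch in token)


def math_like_ratio(line: str) -> float:
    """Fraction of whitespace-separated tokens that look math-like."""
    tokens = line.split()
    if not tokens:
        return 0.0
    return sum(is_math_like(t) for t in tokens) / len(tokens)


def looks_like_maths(line: str) -> bool:
    """Naive decision: enough of the line is math-like."""
    return math_like_ratio(line) >= THRESHOLD
```

With these assumptions, looks_like_maths("x = y + 2") is true and
looks_like_maths("The quick brown fox") is false, but
looks_like_maths("(1)") is also true, which is exactly the kind of
false positive mentioned above.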

Look at the tab detection code - if you have a bunch of math-like
symbols *and* a set of vertically aligned equals signs, then you
probably have a proof; if you have large [] containing aligned
math-likes, you probably have a matrix. You'd have something that's
surely more reliable than checking for a bunch of random numbers and
letters, plus you can turn a large task into a set of small tasks.
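
As a rough illustration of the aligned-equals idea, here is a
standalone sketch. It does not use Tesseract's tab detection code; the
Symbol type, the pixel tolerance and the minimum count are assumptions
made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Symbol:
    """Hypothetical recognised symbol with its top-left position in pixels."""
    char: str
    x: int
    y: int


def aligned_equals_groups(symbols, x_tolerance=5, min_count=3):
    """Return groups of '=' symbols whose left edges line up vertically.

    A group is kept only if it contains at least min_count distinct
    y values, i.e. the equals signs sit at different heights.
    """
    equals = sorted((s for s in symbols if s.char == "="), key=lambda s: s.x)
    groups, current = [], []
    for s in equals:
        # Start a new group once we drift too far from the group's first x.
        if current and s.x - current[0].x > x_tolerance:
            groups.append(current)
            current = []
        current.append(s)
    if current:
        groups.append(current)
    return [g for g in groups if len({s.y for s in g}) >= min_count]


def probably_proof(symbols) -> bool:
    """Guess 'proof/derivation' if any column of aligned '=' exists."""
    return bool(aligned_equals_groups(symbols))
```

For example, three '=' at x = 100, 102 and 101 on three different rows
form one group, while a lone '=' elsewhere on the page is ignored. A
similar grouping on large brackets could be tried for matrices.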

-- 
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.
