Hi, I came across this conversation regarding formulae. May I ask whether you have made any progress?
I need a solution for this as well, and I am glad to cooperate in testing, etc.

Greetings,
Leo Hamminger

On Tuesday, 16 December 2008 at 23:34:21 UTC+1, Ray Smith wrote:
>
> You would need to cut out most of the code in the textord directory, and
> just run the classifier directly on the blobs, with the baseline correction
> feature disabled.
>
> This means:
> bypass filter_blobs and textord_page in edges_and_textord, making fake
> words and text lines from individual blobs. The code in applybox.cpp might
> give you some idea of how to do this.
> Set IntegerMatcherMultiplier to zero.
>
> Ray.
>
> On Fri, Dec 12, 2008 at 1:12 AM, lab <la...@lbreyer.com> wrote:
>
>>
>> Ray,
>>
>> can you explain what you mean by skipping text line and word finding,
>> i.e. how to enable or disable this correctly in tesseract?
>>
>> I've had mixed results with the standard tesseract 2.03 (debian,
>> default options) on mathematical documents. Most sentences with simple
>> formulas or isolated mathematical symbols can be read reasonably well
>> after training some sample pages, but displayed equations and formulas
>> (i.e. on their own line(s)) are usually completely garbled. Moderately
>> simple symbols with both a superscript and a subscript cannot usually
>> be recognized at all. Also, having both superscripts and subscripts
>> somewhere in a single formula can confuse tesseract so that it thinks
>> the superscript belongs to the previous line or an "extra" line in
>> between. I've also observed that sometimes, the same symbol can be
>> recognized easily when it occurs in a subscript position, but is often
>> mistaken when it occurs in a superscript position.
>>
>> lab.
>>
>> On Dec 12, 8:51 am, "Ray Smith" <theraysm...@gmail.com> wrote:
>> > This problem has not been attempted before with tesseract.
>> > The biggest thing to watch out for is to skip the text line and word
>> > finding.
>> > You might have significant success just running the classifier on
>> > the connected components.
>> > Training might be a bit tricky too, since it relies on the text line
>> > finder.
>> > Ray.
>> >
>> > Sent from my G1 Android Phone.
>> >
>> > On Dec 10, 2008 10:45 PM, "jean" <jean.f...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > I'm interested in developing an OCR to read math formulas using
>> > tesseract as my platform. I have been trying to use tesseract to read
>> > LaTeX image files. I have tried reading the square root of x+2, and
>> > tesseract read it as vx+2. For the sqrt(x+ sqrt(2)), tesseract sees
>> > J@. No big surprise since tesseract wasn't made for understanding the
>> > recursive nature of math formulas.
>> >
>> > So my question is what progress has been made on a tesseract-based
>> > math-OCR? And would there be any things I need to watch out for?
>> >
>> > --Jean

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/6a4a328b-7659-4264-9d72-b45fceeff20e%40googlegroups.com.
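
For anyone wanting to experiment with Ray's suggestion of running a classifier on the connected components instead of on detected text lines and words, here is a rough sketch of the blob-extraction step in plain Python. It is not Tesseract code and calls nothing in Tesseract. The function names (`connected_components`, `crop`, `classify_blobs`) and the `classify` callback are hypothetical placeholders, and the image is assumed to be already binarized into rows of 0/1 values.

```python
from collections import deque


def connected_components(image):
    """Return bounding boxes (left, top, right, bottom) of the 8-connected
    foreground blobs in a binary image given as rows of 0/1 values."""
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not image[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            queue = deque([(x, y)])
            seen[y][x] = True
            left, top, right, bottom = x, y, x, y
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return boxes


def crop(image, box):
    """Cut the bounding box (inclusive coordinates) out of the image."""
    left, top, right, bottom = box
    return [row[left:right + 1] for row in image[top:bottom + 1]]


def classify_blobs(image, classify):
    """Apply a caller-supplied (hypothetical) classifier to every blob,
    with no text-line or word grouping in between."""
    return [(box, classify(crop(image, box)))
            for box in connected_components(image)]
```

Each blob is handed to the classifier on its own, so a superscript or subscript is never assigned to a neighbouring text line. The cost is that symbols made of several disconnected parts (for example "i" or "=") come out as separate blobs and would need to be merged in a later step.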