I'm using the UMLS ontology modeled on 2-grams. I'm also doing a state-space search generated from the unicharambigs file, with goals defined by that set. So I'm right there with you, Tom!
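A minimal sketch of that kind of search, for anyone following along. Everything in it is an assumption made for illustration: the `CONFUSIONS` table is a hand-written stand-in for pairs that could be derived from a unicharambigs file (the file itself is not parsed here), the `GOALS` set is a toy substitute for terms drawn from an ontology such as UMLS, and the names `normalize`, `successors`, and `search` are hypothetical. Whitespace is ignored when comparing, so that "60 mg" and "60mg" count as the same string.

```python
from collections import deque

# Hypothetical look-alike pairs (OCR output -> intended text). A real table
# could be derived from a unicharambigs file, which is not parsed here.
CONFUSIONS = [("O", "0"), ("l", "1"), ("n", "m"), ("rn", "m"), ("S", "5")]

# Toy goal set standing in for dosage strings built from an ontology.
GOALS = {"60 mg", "50 mg", "10 ml"}


def normalize(text):
    """Drop whitespace so that '60 mg' and '60mg' compare equal."""
    return "".join(text.split())


def successors(state):
    """Yield every string reachable by applying one substitution once."""
    for src, dst in CONFUSIONS:
        start = state.find(src)
        while start != -1:
            yield state[:start] + dst + state[start + len(src):]
            start = state.find(src, start + 1)


def search(ocr_text, goals, max_steps=3):
    """Breadth-first search from the OCR string toward any goal string.

    Returns (goal, number_of_substitutions), or None if no goal is
    reachable within max_steps substitutions.
    """
    goal_keys = {normalize(g): g for g in goals}
    start = normalize(ocr_text)
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        state, steps = queue.popleft()
        if state in goal_keys:
            return goal_keys[state], steps
        if steps == max_steps:
            continue
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, steps + 1))
    return None


print(search("6Ong", GOALS))
```

Because the search is breadth-first, the first goal it reaches is one that needs the fewest substitutions, and that count can serve as a rough look-alike distance. Weighting each pair by how often it is actually confused would need a priority queue in place of the plain queue.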
To anyone: I'm assuming Tesseract makes the replacements defined in unicharambigs if they have the mandatory flag, but what about those flagged non-mandatory? I couldn't find in the man pages the criteria for when Tesseract makes these replacements, if at all.

Thanks to anyone who considers this,
John

Sent from Mailbox for iPhone

On Tue, Mar 18, 2014 at 1:47 PM, Tom Morris <[email protected]> wrote:

> On Wednesday, March 12, 2014 7:57:38 AM UTC-4, John Green wrote:
>>
>> *What I'm doing:* As part of a longer pipeline, at one step I am
>> reasoning over very small but highly characteristic strings like drug
>> dosage, "60 mg". Edit distance (Levenshtein or a variation) and n-grams,
>> even unigrams, only do a so-so job. I'd like to calculate probabilities
>> based on look-alikes per above. That is, a not unreasonable case on a poor
>> document is to mistake "60 mg" for "6Ong", which gives a ratio of only 44%,
>> for example. But, if the program knew that 0 and O as well as m and n can
>> be frequently mistaken for the same character ... better matching. I've
>> also considered dumping individual character probabilities into the mix
>> from Tesseract's API, but I'm new to Tesseract, haven't gotten there yet,
>> and I'm not even convinced that this would be a better solution.
>>
> It's not clear from your description if you're already doing this, but you
> might want to consider modeling the target domain that you're matching to
> either in terms of n-gram probabilities or something even stricter.
> There's going to be much less variability in something like a dosage
> string than there is in general text. You could use something like a
> medical term ontology to create a pretty comprehensive list of things like
> units, frequencies, routes, etc.
>
> Tom
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en

