Re: Individual character variation lists

John Green Wed, 02 Apr 2014 06:12:07 -0700

I'm using the UMLS ontology modeled on 2-grams. I'm also doing a state space 
search generated from the unicharambigs, with goals being defined by that set. 
So I'm right there with you, Tom!
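
To make the search idea concrete, here is a minimal sketch, not the actual pipeline: a breadth-first search that starts from the OCR string and applies one look-alike substitution per move until it reaches a string in the goal set. The AMBIGS pairs, the function name and the whitespace-free goal strings are all hypothetical placeholders; real pairs would come from a unicharambigs file and real goals from the ontology.

```python
from collections import deque

# Hypothetical look-alike pairs (source -> target). In practice these would be
# loaded from a unicharambigs file rather than hard-coded.
AMBIGS = [("O", "0"), ("n", "m"), ("rn", "m"), ("l", "1")]


def search_goal(observed, goals, ambigs=AMBIGS, max_depth=3):
    """Breadth-first search from an OCR string to a goal string.

    Each move replaces one occurrence of a source sequence with its target.
    Returns (goal, substitutions_used), or None if no goal is reachable
    within max_depth substitutions.
    """
    goals = set(goals)
    seen = {observed}
    queue = deque([(observed, 0)])
    while queue:
        text, depth = queue.popleft()
        if text in goals:
            return text, depth
        if depth == max_depth:
            continue  # do not expand states beyond the depth limit
        for src, dst in ambigs:
            start = text.find(src)
            while start != -1:
                candidate = text[:start] + dst + text[start + len(src):]
                if candidate not in seen:
                    seen.add(candidate)
                    queue.append((candidate, depth + 1))
                start = text.find(src, start + 1)
    return None


# Goals are assumed to be whitespace-normalized dosage strings.
print(search_goal("6Ong", {"60mg", "5mg"}))
```

Because the search is breadth-first, the first goal found uses the fewest substitutions, which gives a crude ranking when several goals are reachable.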





To anyone: I'm assuming Tesseract makes the replacements defined in 
unicharambigs if they have the mandatory flag, but what about those flagged 
non-mandatory? I couldn't find in the man pages the criteria for when 
Tesseract makes these replacements, if at all.
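
For reference, this is a rough sketch of how I read the flags out of the file. It assumes the tab-separated "v1" layout as I understand it from the training documentation: source count, space-separated source characters, target count, space-separated target characters, then a type field where 1 means mandatory and 0 means non-mandatory. The function name and the sample lines are illustrative only, and the sketch says nothing about when Tesseract applies the non-mandatory entries.

```python
def parse_unicharambigs(text):
    """Parse unicharambigs-style text into (source, target, mandatory) tuples.

    Assumed layout per line (tab-separated):
    <n_src> <src chars, space-separated> <n_dst> <dst chars> <type>
    Lines that do not fit this layout (such as the version header) are skipped.
    """
    entries = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 5:
            continue  # version header or malformed line
        n_src, src, n_dst, dst, kind = fields[:5]
        if not (n_src.isdigit() and n_dst.isdigit()):
            continue
        source = src.split(" ")
        target = dst.split(" ")
        if len(source) != int(n_src) or len(target) != int(n_dst):
            continue  # declared counts do not match the characters listed
        entries.append(("".join(source), "".join(target), kind.strip() == "1"))
    return entries


# Illustrative sample in the assumed layout, not a real language's file.
SAMPLE = "v1\n2\t' '\t1\t\"\t1\n1\tm\t2\tr n\t0\n3\ti i i\t1\tm\t0\n"

for source, target, mandatory in parse_unicharambigs(SAMPLE):
    label = "mandatory" if mandatory else "non-mandatory"
    print(repr(source), "->", repr(target), label)
```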




Thanks to anyone who considers this,

John 


On Tue, Mar 18, 2014 at 1:47 PM, Tom Morris <[email protected]> wrote:

> On Wednesday, March 12, 2014 7:57:38 AM UTC-4, John Green wrote:
>>
>>
>> *What I'm doing: *As part of a longer pipeline, at one step I am 
>> reasoning over very small but highly characteristic strings like drug 
>> dosage, "60 mg". Edit distance (Levenshtein or a variation) and n-grams, 
>> even unigrams, only do a so-so job. I'd like to calculate probabilities 
>> based on look-alikes per above. That is, a not unreasonable case on a poor 
>> document is to mistake "60 mg" for "6Ong", which gives a ratio of only 44%, 
>> for example. But, if the program knew that 0 and O as well as m and n can 
>> be frequently mistaken for the same character ... better matching. I've 
>> also considered dumping individual character probabilities into the mix 
>> from Tesseract's API, but I'm new to Tesseract, haven't gotten there yet, 
>> and I'm not even convinced that this would be a better solution. 
>>
> It's not clear from your description if you're already doing this, but you 
> might want to consider modeling the target domain that you're matching to 
> either in terms of n-gram probabilities or something even stricter. 
>  There's going to be much less variability in something like a dosage 
> string than there is in general text.  You could use something like a 
> medical term ontology to create a pretty comprehensive list of things like 
> units, frequencies, routes, etc.
> Tom 
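
The look-alike idea in the quoted message (0 and O, m and n) can be sketched as an edit distance in which substituting one confusable character for another costs less than an ordinary substitution. This is only an illustration under stated assumptions: the CONFUSABLE pairs and the 0.2 cost are made-up values, not derived from Tesseract or from any measured confusion rates.

```python
# Hypothetical confusable pairs; a real table might come from unicharambigs
# or from confusion counts measured on sample documents.
CONFUSABLE = {frozenset("0O"), frozenset("mn"), frozenset("1l")}


def sub_cost(a, b, cheap=0.2):
    """Substitution cost: 0 for equal, `cheap` for look-alikes, else 1."""
    if a == b:
        return 0.0
    if frozenset((a, b)) in CONFUSABLE:
        return cheap
    return 1.0


def weighted_distance(s, t):
    """Levenshtein-style distance with cheaper look-alike substitutions."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        cur = [float(i)]
        for j, b in enumerate(t, 1):
            cur.append(min(
                prev[j] + 1.0,               # delete a from s
                cur[j - 1] + 1.0,            # insert b into s
                prev[j - 1] + sub_cost(a, b),  # substitute or match
            ))
        prev = cur
    return prev[-1]


print(weighted_distance("60 mg", "6Ong"))
```

With these made-up costs, "60 mg" against "6Ong" comes out at about 1.4 (one deleted space plus two cheap substitutions), where unit costs would give 3.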

-- 
-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en

