Re: [jira] Commented: (UIMA-1033) ConceptMapper--a highly configurable, token-based dictionary lookup UIMA component

Ahmed Abdeen Hamed Wed, 18 Jun 2008 09:34:07 -0700

Thanks for the response. I am still not sure about some aspects of it. I
just found out that the UIMA framework has the following
DictionaryAnnotator feature:
http://svn.apache.org/repos/asf/incubator/uima/sandbox/trunk/DictionaryAnnotator/doc/pdf/DictionaryAnnotatorUserGuide.pdf


This is similar to what the ConceptMapper is doing. Does ConceptMapper have
any advantage over the DictionaryAnnotator?

Thank you!
Ahmed

On Wed, Jun 18, 2008 at 10:23 AM, Michael Tanenblatt <
[EMAIL PROTECTED]> wrote:

> My original message regarding this talks some about the dictionary format.
> I am in the process of writing a paper describing the whole of ConceptMapper,
> but that is not yet done. Here is what I wrote before:
>
>> The structure of the dictionary itself is quite flexible. Entries can have
>> any number of variants (synonyms), and arbitrary features can be associated
>> with dictionary entries. Individual variants inherit features from the
>> parent token (i.e., the canonical form), but can override them or add
>> additional features. In the following sample dictionary entry, there are 5
>> variants of the canonical form, and as described earlier, each inherits the
>> SemClass and POS attributes from the canonical form, with the exception of
>> the variant "mesenteric fibromatosis (c48.1)", which overrides the value of
>> the SemClass attribute (this is somewhat of a contrived example, just to
>> make that point):
>> <token canonical="abdominal fibromatosis" SemClass="Diagnosis" POS="NN">
>>  <variant base="abdominal fibromatosis" />
>>  <variant base="abdominal desmoid" />
>>  <variant base="mesenteric fibromatosis (c48.1)" SemClass="Diagnosis-Site" />
>>  <variant base="mesenteric fibromatosis" />
>>  <variant base="retroperitoneal fibromatosis" />
>> </token>
>>
>
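The inheritance rule described in the quoted passage can be sketched in a few lines of Python. This is only an illustration of the rule as stated there, not ConceptMapper's actual loader: the function name resolve_variants is hypothetical, and the entry is the sample from the message with the wrapped variant line joined.

```python
import xml.etree.ElementTree as ET

# Sample dictionary entry copied from the message above.
ENTRY = """
<token canonical="abdominal fibromatosis" SemClass="Diagnosis" POS="NN">
  <variant base="abdominal fibromatosis" />
  <variant base="abdominal desmoid" />
  <variant base="mesenteric fibromatosis (c48.1)" SemClass="Diagnosis-Site" />
  <variant base="mesenteric fibromatosis" />
  <variant base="retroperitoneal fibromatosis" />
</token>
"""


def resolve_variants(entry_xml):
    """Return {variant base string: effective attributes} for one entry.

    Hypothetical helper: each variant starts from the parent token's
    attributes and then applies its own, which may override or extend them.
    """
    token = ET.fromstring(entry_xml)
    resolved = {}
    for variant in token.findall("variant"):
        attrs = dict(token.attrib)    # inherited from the parent token
        attrs.update(variant.attrib)  # variant values win on conflict
        base = attrs.pop("base")      # the string matched against the text
        resolved[base] = attrs
    return resolved


variants = resolve_variants(ENTRY)
for base, attrs in variants.items():
    print(base, "->", attrs["SemClass"], attrs["POS"])
```

Only the "(c48.1)" variant ends up with SemClass "Diagnosis-Site"; the other four keep the token's "Diagnosis".
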
> So, testDict.xml is just an example. Two key AE descriptor parameters are
> "AttributeList" and "FeatureList", which provide the means to map from the
> XML attributes to the target annotation features. If your target annotation
> were called "DictTerm" and the DictTerm had the features "canonicalForm",
> "semanticClass" and "partOfSpeechTag", using the example dictionary snippet
> shown above, you would set AttributeList to:
>
>        DictCanon
>        SemClass
>        POS
>
> and you would set FeatureList to:
>
>        canonicalForm
>        semanticClass
>        partOfSpeechTag
>
> then, when one of the variants is matched in the text, a new DictTerm would
> be created with its semanticClass set to the value of the SemClass attribute
> and its partOfSpeechTag set to the value of the POS attribute.
>
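As a rough sketch of the pairing described above, AttributeList and FeatureList can be read as two parallel lists, where position i of one names the XML attribute and position i of the other names the annotation feature it fills. The function map_to_features is hypothetical and is not part of ConceptMapper. The message does not say how "DictCanon" relates to the "canonical" attribute in the sample entry, so this sketch assumes nothing about it and simply skips attributes that are absent from the entry.

```python
# Parallel lists as given in the message above.
ATTRIBUTE_LIST = ["DictCanon", "SemClass", "POS"]
FEATURE_LIST = ["canonicalForm", "semanticClass", "partOfSpeechTag"]


def map_to_features(attrs, attribute_list, feature_list):
    """Pair attribute names with feature names by position (hypothetical helper).

    Attributes missing from `attrs` are skipped rather than guessed.
    """
    if len(attribute_list) != len(feature_list):
        raise ValueError("AttributeList and FeatureList must be the same length")
    return {
        feature: attrs[attr]
        for attr, feature in zip(attribute_list, feature_list)
        if attr in attrs
    }


# Effective attributes of a matched variant from the sample entry.
matched = {"canonical": "abdominal fibromatosis", "SemClass": "Diagnosis", "POS": "NN"}
features = map_to_features(matched, ATTRIBUTE_LIST, FEATURE_LIST)
print(features)
```

With the sample entry this fills semanticClass and partOfSpeechTag, which are the two features the message says get set on the new DictTerm.
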
> One important point: matches are performed only against the strings given
> as values of the "variant" tag's "base" attribute. It is common practice
> to give the enclosing element (here "token") a canonical form that is the
> same as one of the variants, but that is not required.
>
> I hope this helps!
>
>
>
> On Jun 18, 2008, at 10:06 AM, Ahmed Abdeen Hamed wrote:
>
>  Thanks, Michael! I only recently joined the list so I missed the early
>> posting. I like this example a lot. I was able to get it to run using the
>> document analyzer from the uimaj-example. I have some questions though:
>> Is the testDict.xml just an arbitrary xml file which means any well-formed
>> xml file should work? How do I get my own xml dictionary files to work
>> without transforming them into the xml format in your testDict.xml file?
>> Is there documentation for this so that I can understand it on my own
>> without bugging the entire list? Thanks!
>> Ahmed
>>
>> On Tue, Jun 17, 2008 at 8:05 PM, Michael Tanenblatt <
>> [EMAIL PROTECTED]>
>> wrote:
>>
>>> As Thilo mentioned in an email from May 19, 2008, I forgot to include the
>>> source for uima.tt.TokenAnnotation, but otherwise the code should be fine.
>>>
>>> Additionally, the problem you are seeing is with OffsetTokenizer, which is
>>> just a sample tokenizer--if you have another, more robust tokenizer, you
>>> don't need this OffsetTokenizer.
>>>
>>>
>>>
>
