Thanks for the response. I am still not sure about some aspects of it. I just found out that the UIMA framework has the following DictionaryAnnotator feature:
http://svn.apache.org/repos/asf/incubator/uima/sandbox/trunk/DictionaryAnnotator/doc/pdf/DictionaryAnnotatorUserGuide.pdf
This is similar to what the ConceptMapper is doing. Does it have any advantage over the DictionaryAnnotator?

Thank you!
Ahmed

On Wed, Jun 18, 2008 at 10:23 AM, Michael Tanenblatt <[EMAIL PROTECTED]> wrote:

> My original message regarding this talks some about the dictionary format.
> I am in the process of writing a paper describing the whole of ConceptMapper,
> but that is not yet done. Here is what I wrote before:
>
>> The structure of the dictionary itself is quite flexible. Entries can have
>> any number of variants (synonyms), and arbitrary features can be associated
>> with dictionary entries. Individual variants inherit features from the parent
>> token (i.e., the canonical form), but can override them or add additional
>> features. In the following sample dictionary entry, there are 5 variants of
>> the canonical form, and as described earlier, each inherits the SemClass
>> and POS attributes from the canonical form, with the exception of the
>> variant "mesenteric fibromatosis (c48.1)", which overrides the value of the
>> SemClass attribute (this is somewhat of a contrived example, just to make
>> that point):
>>
>> <token canonical="abdominal fibromatosis" SemClass="Diagnosis" POS="NN">
>>   <variant base="abdominal fibromatosis" />
>>   <variant base="abdominal desmoid" />
>>   <variant base="mesenteric fibromatosis (c48.1)" SemClass="Diagnosis-Site" />
>>   <variant base="mesenteric fibromatosis" />
>>   <variant base="retroperitoneal fibromatosis" />
>> </token>
>
> So, testDict.xml is just an example. Two key AE descriptor parameters are
> "AttributeList" and "FeatureList", which provide the means to map from the
> XML attributes to the target annotation features.
> If your target annotation were called "DictTerm" and the DictTerm had the
> features "canonicalForm", "semanticClass" and "partOfSpeechTag", using the
> example dictionary snippet shown above, you would set AttributeList to:
>
> DictCanon
> SemClass
> POS
>
> and you would set FeatureList to:
>
> canonicalForm
> semanticClass
> partOfSpeechTag
>
> then, when one of the variants is matched in the text, a new DictTerm would
> be created with its semanticClass set to the value of the SemClass attribute
> and its partOfSpeechTag set to the value of the POS attribute.
>
> One important point: matches are only performed against the strings given
> as the value of the "variant" tag's "base" attribute. It is common practice
> for the "token" element to carry a canonical form that is the same as one
> of the variants, but that is not required.
>
> I hope this helps!
>
> On Jun 18, 2008, at 10:06 AM, Ahmed Abdeen Hamed wrote:
>
>> Thanks, Michael! I only recently joined the list so I missed the earlier
>> posting. I like this example a lot. I was able to get it to run using the
>> document analyzer from the uimaj-example. I have some questions though:
>> Is the testDict.xml just an arbitrary xml file, which means any well-formed
>> xml file should work? How do I get my own xml dictionary files to work
>> without transforming them into the xml format in your testDict.xml file?
>> Is there documentation for this so that I can understand it on my own
>> without bugging the entire list?
>>
>> Thanks!
>> Ahmed
>>
>> On Tue, Jun 17, 2008 at 8:05 PM, Michael Tanenblatt <[EMAIL PROTECTED]> wrote:
>>
>>> As Thilo mentioned in an email from May 19, 2008, I forgot to include the
>>> source for uima.tt.TokenAnnotation, but otherwise the code should be fine.
>>>
>>> Additionally, the problem you are seeing is with OffsetTokenizer, which is
>>> just a sample tokenizer--if you have another, more robust tokenizer, you
>>> don't need this OffsetTokenizer.
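
To check my understanding of the quoted explanation, here is a minimal Python sketch of how I read the attribute inheritance and the AttributeList/FeatureList mapping. It is not ConceptMapper code and does not use any UIMA API. The names build_lookup and annotate are hypothetical, the matching is a naive lowercase substring search rather than token-based lookup, and I assume the attribute name "canonical" from the XML snippet where the quoted message lists "DictCanon".

```python
import xml.etree.ElementTree as ET

# The sample dictionary entry from the quoted message.
SAMPLE_ENTRY = """
<token canonical="abdominal fibromatosis" SemClass="Diagnosis" POS="NN">
  <variant base="abdominal fibromatosis" />
  <variant base="abdominal desmoid" />
  <variant base="mesenteric fibromatosis (c48.1)" SemClass="Diagnosis-Site" />
  <variant base="mesenteric fibromatosis" />
  <variant base="retroperitoneal fibromatosis" />
</token>
"""

# Assumption: "canonical" matches the attribute name used in the XML snippet;
# the quoted message lists "DictCanon" in this position.
ATTRIBUTE_LIST = ["canonical", "SemClass", "POS"]
FEATURE_LIST = ["canonicalForm", "semanticClass", "partOfSpeechTag"]


def build_lookup(entry_xml):
    """Map each variant's base string to its resolved feature values."""
    token = ET.fromstring(entry_xml)
    lookup = {}
    for variant in token.findall("variant"):
        # Start from the parent token's attributes, then let the variant's
        # own attributes override or extend them.
        attrs = dict(token.attrib)
        attrs.update(variant.attrib)
        base = attrs.pop("base")
        # Pair AttributeList and FeatureList position by position.
        lookup[base] = {
            feature: attrs.get(attr)
            for attr, feature in zip(ATTRIBUTE_LIST, FEATURE_LIST)
        }
    return lookup


def annotate(text, lookup):
    """Return DictTerm-like dicts for the first occurrence of each variant."""
    lowered = text.lower()
    found = []
    for base, features in lookup.items():
        start = lowered.find(base)
        if start != -1:
            found.append({"begin": start, "end": start + len(base), **features})
    return sorted(found, key=lambda term: term["begin"])


if __name__ == "__main__":
    lookup = build_lookup(SAMPLE_ENTRY)
    for term in annotate("Patient has an abdominal desmoid.", lookup):
        print(term)
```

If this reading is right, only the "base" strings are ever matched, and the variant "mesenteric fibromatosis (c48.1)" ends up with semanticClass "Diagnosis-Site" while the other four variants keep "Diagnosis".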
