Re: Any interest in this as an open source project?

Michael Tanenblatt Fri, 09 May 2008 09:47:10 -0700

OK, good question. I have never used the project that is in the sandbox, as ConceptMapper has been in development and production for a long time, so my comparisons are based solely on what I gleaned from the documentation. From this cursory knowledge of the DictionaryAnnotator that is already in the sandbox, I think that ConceptMapper provides significantly more functionality and customizability, while seemingly providing all of the functionality of the current DictionaryAnnotator. Here is a comparison, with the caveats stated above regarding my level of familiarity with the current DictionaryAnnotator:

Both annotators allow for the use of the same tokenizer in dictionary tokenization as is used in the processing pipeline, though in slightly different ways (descriptor vs. PEAR file). ConceptMapper has no default tokenizer, though there is a simple one included in the package.

One clear difference is that there is no dictionary creator for ConceptMapper; instead, you must build the XML file by hand. This is due, in part, to the fact that dictionary entries can have arbitrary attributes associated with them. This leads to what I think is a serious advantage of ConceptMapper: these attributes associated with dictionary entries can be copied to the annotations that are created in response to a successful lookup. This is very useful for attaching a code from some coding scheme (e.g., from a medical lexicon or ontology), a reference to the document from which the term was originally extracted, or any number of other features. There is no limit to the number of attributes attached to the dictionary entries, and the mapping from them to the resultant annotations is configurable in the AE descriptor.
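To make the attribute copying concrete, here is a minimal Python sketch. It is not ConceptMapper code: the function name, the dict-based annotation, and the attribute and feature names are all hypothetical, and the values are made up for illustration.

```python
def make_annotation(begin, end, entry_attributes, attribute_to_feature):
    """Build an annotation (a plain dict here), copying only mapped attributes.

    entry_attributes: arbitrary attributes attached to a dictionary entry.
    attribute_to_feature: hypothetical stand-in for the mapping that would be
    configured in the AE descriptor (dictionary attribute -> annotation feature).
    """
    annotation = {"begin": begin, "end": end}
    for attribute, feature in attribute_to_feature.items():
        if attribute in entry_attributes:
            annotation[feature] = entry_attributes[attribute]
    return annotation


# Illustrative entry attributes; the values are invented.
attributes = {"code": "X123", "sourceDocument": "doc-17"}
mapping = {"code": "conceptCode"}
annotation = make_annotation(10, 31, attributes, mapping)
```

Only the attributes named in the mapping end up on the annotation; unmapped ones such as `sourceDocument` are ignored.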

ConceptMapper only has provisions for using one dictionary per instance, though this is probably a relatively simple thing to augment.

ConceptMapper dictionaries are implemented as shared resources. It is not clear if this is the case for the DictionaryAnnotator in the sandbox. One could also create a new implementation of the DictionaryResource interface. This was done in the case of the CompiledDictionaryResource_impl, which operates on a dictionary that has been parsed and then serialized, to allow for quick loading.

In addition to the ability to do case-normalized matching, which both provide, ConceptMapper provides a mechanism to use a stemmer, which is applied to both the dictionary and the input documents.
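The point of applying the stemmer on both sides is that dictionary keys and document tokens pass through the same normalization before comparison. A small Python sketch of that idea follows; `toy_stem` is a deliberately crude stand-in for a real stemmer, and none of these names come from ConceptMapper.

```python
def toy_stem(token):
    """Crude plural stripping; a placeholder for a real stemmer."""
    return token[:-1] if len(token) > 3 and token.endswith("s") else token


def normalize(tokens, stem=toy_stem):
    """Case-normalize and stem a token sequence into a hashable lookup key."""
    return tuple(stem(token.lower()) for token in tokens)


# The same normalization is applied when the dictionary is built ...
dictionary = {normalize(["Heart", "Attacks"]): "entry-1"}

# ... and when document tokens are looked up.
found = dictionary.get(normalize(["heart", "attack"]))
```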

Both systems provide the ability to specify the particular type of annotation to consider in lookups (e.g., uima.tt.TokenAnnotation), as well as an optional feature within that annotation, with both defaulting to the covered text. ConceptMapper also allows an annotation type to be used to bound lookups (e.g., a sentence at a time, or an NP at a time, etc.). Perhaps this was an oversight on my part, but I did not see this in the existing sandbox annotator.
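Bounding lookups by an annotation type amounts to grouping tokens by the enclosing span and running the lookup once per group, so that no match crosses a span boundary. A rough Python sketch, assuming tokens and spans are simple tuples that start with character offsets (a hypothetical representation, not the UIMA CAS API):

```python
def group_tokens_by_span(tokens, spans):
    """Return one token list per span, keeping only tokens fully inside it.

    tokens: (begin, end, text) tuples; spans: (begin, end) tuples.
    """
    return [
        [token for token in tokens if span[0] <= token[0] and token[1] <= span[1]]
        for span in spans
    ]


tokens = [(0, 5, "heart"), (6, 12, "attack"), (14, 18, "pain")]
sentences = [(0, 13), (14, 19)]
groups = group_tokens_by_span(tokens, sentences)
# Each group would then be passed to the dictionary lookup separately.
```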

Token skipping is an option in both systems, though it is implemented differently. ConceptMapper has two methods available: the ability to use a stop-word list to handle the simple case of omitting tokens based on lexical equality, and feature-based include/exclude lists. The latter is not as general as I'd like in its implementation. Perhaps the filter conditions of the current DictionaryAnnotator are better.
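The two skipping methods could be sketched in Python as a single predicate. This is an illustration under assumptions, not ConceptMapper's implementation: tokens are plain dicts, `pos` is a hypothetical feature, and the way the include and exclude lists combine is my own simplification.

```python
def keep_token(token, stop_words=frozenset(), include=None, exclude=frozenset()):
    """Decide whether a token takes part in dictionary lookup.

    token: dict with "text" and a hypothetical "pos" feature.
    stop_words: lowercase words skipped by lexical equality.
    include: if given, only tokens whose feature value is listed are kept.
    exclude: tokens whose feature value is listed are skipped.
    """
    if token["text"].lower() in stop_words:
        return False
    if include is not None and token["pos"] not in include:
        return False
    return token["pos"] not in exclude


tokens = [
    {"text": "The", "pos": "DT"},
    {"text": "heart", "pos": "NN"},
    {"text": "quickly", "pos": "RB"},
]
kept = [t["text"] for t in tokens if keep_token(t, stop_words={"the"}, exclude={"RB"})]
```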

Finally, and again this may be due to an oversight on my part in reading the documentation, it is not clear what the search strategy is for the current DictionaryAnnotator, but I would assume it finds non-overlapping longest matches. While ConceptMapper supports this as a default, there are three parameters in the AE descriptor to control the way the search is done. From my original email:

Dictionary lookup is controlled by three parameters in the descriptor, one of which allows for order-independent lookup (i.e., A B == B A), while another toggles between finding only the longest match vs. finding all possible matches. The final parameter specifies the search strategy, of which there are three. The default search strategy only considers contiguous tokens (not including tokens from the stop word list or otherwise skipped tokens), and then begins the subsequent search after the longest match. The second strategy allows for ignoring non-matching tokens, allowing for disjoint matches, so that a dictionary entry of

    A C

would match against the text

    A B C

As with the default search strategy, the subsequent search begins after the longest match. The final search strategy is identical to the previous, except that subsequent searches begin one token ahead, instead of after the previous match. This enables overlapped matching.
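The three search strategies described above can be sketched in Python as follows. This is a simplified illustration, not ConceptMapper's code: the strategy names and function names are hypothetical, entries are assumed to be non-empty token tuples, only the longest match at each position is kept, and the order-independent and find-all-matches options are left out for brevity.

```python
def match_entry(tokens, start, entry, allow_gaps, stop_words):
    """Return the indices of the tokens matching `entry` from `start`, or None."""
    if tokens[start] != entry[0]:
        return None
    matched = [start]
    pos = start + 1
    for wanted in entry[1:]:
        # Advance to the next occurrence of the wanted token. Without gaps,
        # only stop words may be stepped over on the way.
        while pos < len(tokens) and tokens[pos] != wanted:
            if not allow_gaps and tokens[pos] not in stop_words:
                return None
            pos += 1
        if pos == len(tokens):
            return None
        matched.append(pos)
        pos += 1
    return matched


def find_matches(tokens, entries, strategy="contiguous", stop_words=frozenset()):
    """Find dictionary matches; returns a list of (entry, matched_indices)."""
    allow_gaps = strategy in ("skip_any", "skip_any_overlap")
    results = []
    start = 0
    while start < len(tokens):
        best = None
        for entry in entries:
            matched = match_entry(tokens, start, entry, allow_gaps, stop_words)
            if matched and (best is None or len(matched) > len(best[1])):
                best = (entry, matched)
        if best is None:
            start += 1
            continue
        results.append(best)
        if strategy == "skip_any_overlap":
            start += 1                 # resume one token ahead
        else:
            start = best[1][-1] + 1    # resume after the match
    return results


tokens = ["A", "B", "C"]
entries = [("A", "C"), ("B", "C")]
contiguous = find_matches(tokens, entries, "contiguous")
skip_any = find_matches(tokens, entries, "skip_any")
overlap = find_matches(tokens, entries, "skip_any_overlap")
```

With the text A B C, the contiguous strategy finds only B C, the second strategy finds only A C (and then resumes after C), and the third finds both A C and B C because it resumes at B.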



On May 9, 2008, at 5:13 AM, Thilo Goetz wrote:

Michael A Tanenblatt wrote:

My group would like to offer the following UIMA component, ConceptMapper,
as an open source offering into the UIMA sandbox, assuming there is
interest from the community:

...

Michael,

we already have a dictionary project in the sandbox. Can you
comment on what the differences are, and why you think we need
another one? Another option would be for you to help extend
the existing dictionary implementation to satisfy your needs.

--Thilo
