OK, good question. I have never used the project that is in the sandbox, as ConceptMapper has been in development and production for a long time, so my comparisons are based solely on what I gleaned from the documentation. From this cursory knowledge of the DictionaryAnnotator that is already in the sandbox, I think that ConceptMapper provides significantly more functionality and customizability, while seemingly providing all of the functionality of the current DictionaryAnnotator. Here is a comparison, with the caveats stated above regarding my level of familiarity with the current DictionaryAnnotator:

Both annotators allow for the use of the same tokenizer in dictionary tokenization as is used in the processing pipeline, though in slightly different ways (descriptor vs. PEAR file). ConceptMapper has no default tokenizer, though there is a simple one included in the package.

One clear difference is that there is no dictionary creator for ConceptMapper; instead, you must build the XML file by hand. This is due, in part, to the fact that dictionary entries can have arbitrary attributes associated with them. This leads to what I think is a serious advantage of ConceptMapper: these attributes associated with dictionary entries can be copied to the annotations that are created in response to a successful lookup. This is very useful for attaching a code from some coding scheme (e.g., from a medical lexicon or ontology), a reference to the document from which the term was originally extracted, or any number of other features. There is no limit to the number of attributes attached to the dictionary entries, and the mapping from them to the resultant annotations is configurable in the AE descriptor.
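To make the attribute-copying idea concrete, here is a minimal Python sketch. It is not ConceptMapper code: the dictionary contents, the attribute names (`code`, `source`), the feature name `conceptCode`, and the `annotate` function are all made up for illustration, and a plain dict stands in for the mapping that would live in the AE descriptor.

```python
# Hypothetical dictionary: token sequence -> arbitrary attributes (values are invented).
DICTIONARY = {
    ("heart", "attack"): {"code": "C-001", "source": "doc-17"},
    ("aspirin",): {"code": "C-002", "source": "doc-42"},
}

# Hypothetical attribute-to-feature mapping; only mapped attributes are copied.
ATTRIBUTE_TO_FEATURE = {"code": "conceptCode"}


def annotate(tokens):
    """Return one annotation (a dict) per dictionary hit, with mapped attributes copied in."""
    annotations = []
    for start in range(len(tokens)):
        for entry, attrs in DICTIONARY.items():
            end = start + len(entry)
            if tuple(tokens[start:end]) == entry:
                features = {
                    feature: attrs[attr]
                    for attr, feature in ATTRIBUTE_TO_FEATURE.items()
                    if attr in attrs
                }
                annotations.append({"begin": start, "end": end, **features})
    return annotations
```

Here `annotate(["patient", "had", "heart", "attack"])` yields a single annotation over tokens 2 to 4 carrying `conceptCode`, while the unmapped `source` attribute is left behind.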

ConceptMapper only has provisions for using one dictionary per instance, though this is probably a relatively simple thing to augment.

ConceptMapper dictionaries are implemented as shared resources. It is not clear if this is the case for the DictionaryAnnotator in the sandbox. One could also create a new implementation of the DictionaryResource interface. This was done in the case of the CompiledDictionaryResource_impl, which operates on a dictionary that has been parsed and then serialized, to allow for quick loading.

In addition to the ability to do case-normalized matching, which both provide, ConceptMapper provides a mechanism to use a stemmer, which is applied to both the dictionary and the input documents.
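The key point is that the same normalization has to be applied on both sides, otherwise stemmed text would never meet unstemmed dictionary keys. A small Python sketch of that idea follows; `toy_stem` is a deliberately crude stand-in for a real stemmer, and `normalize` and `build_index` are hypothetical names, not part of either annotator.

```python
def toy_stem(token):
    """Crude suffix stripper used only for illustration; not a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def normalize(tokens, fold_case=True, stemmer=None):
    """Apply optional case folding and stemming to a token sequence."""
    out = []
    for token in tokens:
        if fold_case:
            token = token.lower()
        if stemmer is not None:
            token = stemmer(token)
        out.append(token)
    return tuple(out)


def build_index(entries, **options):
    """Index dictionary entries by their normalized form, keeping the original entry."""
    return {normalize(entry, **options): entry for entry in entries}


# The dictionary and the document text go through the same normalize() call.
index = build_index([("Blood", "Tests")], stemmer=toy_stem)
hit = index.get(normalize(["BLOOD", "testing"], stemmer=toy_stem))
```

With these toy rules both "Tests" and "testing" reduce to "test", so `hit` is the original entry `("Blood", "Tests")`.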

Both systems provide the ability to specify the particular type of annotation to consider in lookups (e.g., uima.tt.TokenAnnotation), as well as an optional feature within that annotation, with both defaulting to the covered text. ConceptMapper also allows an annotation type to be used to bound lookups (e.g. a sentence at a time, or an NP at a time, etc.). Perhaps this was an oversight on my part, but I did not see this in the existing sandbox annotator.
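As a rough illustration of what bounding lookups by a span annotation buys you, the Python sketch below groups tokens by enclosing span and runs the lookup once per group, so a match can never straddle a sentence boundary. The tuple layouts and the function names `tokens_by_span` and `find_entry` are assumptions made for this example only.

```python
def tokens_by_span(tokens, spans):
    """Group (begin, end, text) tokens by the (begin, end) span that covers them."""
    return [
        [t for t in tokens if t[0] >= s_begin and t[1] <= s_end]
        for s_begin, s_end in spans
    ]


def find_entry(tokens, entry):
    """Return (begin, end) character offsets of contiguous matches of `entry`."""
    texts = [t[2] for t in tokens]
    n = len(entry)
    return [
        (tokens[i][0], tokens[i + n - 1][1])
        for i in range(len(texts) - n + 1)
        if tuple(texts[i:i + n]) == entry
    ]


# Tokens for "I like new. York is far." (lowercased, punctuation dropped).
tokens = [(0, 1, "i"), (2, 6, "like"), (7, 10, "new"),
          (12, 16, "york"), (17, 19, "is"), (20, 23, "far")]
sentences = [(0, 11), (12, 24)]

unbounded = find_entry(tokens, ("new", "york"))
bounded = [hit for group in tokens_by_span(tokens, sentences)
           for hit in find_entry(group, ("new", "york"))]
```

Without bounding, "new" and "york" are matched across the sentence break; with sentence-bounded lookup the spurious hit disappears.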

Token skipping is an option in both systems, though it is implemented differently. ConceptMapper has two methods available: the ability to use a stop-word list to handle the simple case of omitting tokens based on lexical equality, and feature-based include/exclude lists. The latter is not as general as I'd like in its implementation. Perhaps the filter conditions of the current DictionaryAnnotator are better.
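The stop-word case can be sketched in a few lines of Python. This only illustrates the general idea (drop stop words before matching, but report the match against the original token positions); the stop-word set and the function name `match_with_stop_words` are invented here, and the feature-based include/exclude lists are not modeled.

```python
STOP_WORDS = {"of", "the"}  # hypothetical stop-word list


def match_with_stop_words(tokens, entry, stop_words=STOP_WORDS):
    """Match `entry` against tokens with stop words removed.

    Returns (begin, end) token index ranges in the ORIGINAL token list,
    so the resulting span still covers the skipped stop words.
    """
    kept = [(i, t) for i, t in enumerate(tokens) if t not in stop_words]
    n = len(entry)
    hits = []
    for k in range(len(kept) - n + 1):
        window = kept[k:k + n]
        if tuple(t for _, t in window) == entry:
            hits.append((window[0][0], window[-1][0] + 1))
    return hits
```

For example, the entry `("cancer", "lung")` matches the text "cancer of the lung" when "of" and "the" are stop words, and does not match when the stop-word set is empty.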

Finally, and again this may be due to an oversight on my part in reading the documentation, it is not clear what the search strategy is for the current DictionaryAnnotator, but I would assume it finds non-overlapping longest matches. While ConceptMapper supports this as a default, there are three parameters in the AE descriptor to control the way the search is done. From my original email:

Dictionary lookup is controlled by three parameters in the descriptor, one of which allows for order-independent lookup (i.e., A B == B A), while another toggles between finding only the longest match vs. finding all possible matches. The final parameter specifies the search strategy, of which there are three. The default search strategy only considers contiguous tokens (not including tokens from the stop word list or otherwise skipped tokens), and then begins the subsequent search after the longest match. The second strategy allows for ignoring non-matching tokens, allowing for disjoint matches, so that a dictionary entry of

    A C

would match against the text

    A B C

As with the default search strategy, the subsequent search begins after the
longest match. The final search strategy is identical to the previous,
except that subsequent searches begin one token ahead, instead of after the
previous match. This enables overlapped matching.
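The three search strategies described above can be approximated in Python as follows. This is a simplified sketch and not ConceptMapper's implementation: the strategy names `"contiguous"`, `"skip"` and `"skip_overlap"` are labels invented for this example, "longest" is taken to mean the entry with the most tokens, gaps are unbounded, and the order-independent and all-matches toggles are left out.

```python
def match_at(tokens, start, entry, allow_gaps):
    """Return the exclusive end index of a match of `entry` beginning at `start`, or None.

    Entries are assumed to be non-empty tuples of tokens.
    """
    if start >= len(tokens) or tokens[start] != entry[0]:
        return None
    pos = start
    for want in entry:
        if allow_gaps:
            # Skip over non-matching tokens to allow disjoint matches.
            while pos < len(tokens) and tokens[pos] != want:
                pos += 1
        if pos >= len(tokens) or tokens[pos] != want:
            return None
        pos += 1
    return pos


def search(tokens, entries, strategy="contiguous"):
    """Return (begin, end, entry) matches under one of three illustrative strategies."""
    allow_gaps = strategy in ("skip", "skip_overlap")
    results = []
    start = 0
    while start < len(tokens):
        best = None
        for entry in entries:
            end = match_at(tokens, start, entry, allow_gaps)
            if end is not None and (best is None or len(entry) > len(best[2])):
                best = (start, end, entry)
        if best is None:
            start += 1
        else:
            results.append(best)
            # Overlap strategy restarts one token ahead; the others restart after the match.
            start = start + 1 if strategy == "skip_overlap" else best[1]
    return results
```

With the text `A B C` and the entry `A C`, the contiguous strategy finds nothing and the skipping strategy finds one match. With the text `A B C D` and the entries `A C` and `B D`, only the overlapping variant reports both.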



On May 9, 2008, at 5:13 AM, Thilo Goetz wrote:

Michael A Tanenblatt wrote:
My group would like to offer the following UIMA component, ConceptMapper,
as an open source offering into the UIMA sandbox, assuming there is
interest from the community:
...

Michael,

we already have a dictionary project in the sandbox.  Can you
comment on what the differences are, why you think we need
another one?  Another option would be for you to help extending
the existing dictionary implementation to satisfy your needs.

--Thilo

