OK, good question. I have never used the project that is in the
sandbox as ConceptMapper has been in development and production for a
long time, so my comparisons are based solely on what I gleaned from
the documentation. From this cursory knowledge of the
DictionaryAnnotator that is already in the sandbox, I think that
ConceptMapper provides significantly more functionality and
customizability, while seemingly providing all of the functionality of
the current DictionaryAnnotator. Here is a comparison, with the
caveats stated above regarding my level of familiarity with the
current DictionaryAnnotator:
Both annotators allow for the use of the same tokenizer in dictionary
tokenization as is used in the processing pipeline, though in slightly
different ways (descriptor vs. PEAR file). ConceptMapper has no
default tokenizer, though there is a simple one included in the package.
One clear difference is that there is no dictionary creator for
ConceptMapper; instead, you must build the XML file by hand. This is
due, in part, to the fact that dictionary entries can have arbitrary
attributes associated with them. This leads to what I think is a
serious advantage of ConceptMapper: these attributes associated with
dictionary entries can be copied to the annotations that are created
in response to a successful lookup. This is very useful for attaching
a code from some coding scheme (e.g., from a medical lexicon or
ontology), a reference to the document from which the term was
originally extracted, or any number of other features. There is no
limit to the number of attributes attached to the dictionary entries,
and the mapping from them to the resultant annotations is configurable
in the AE descriptor.
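
To make the attribute copying concrete, here is a minimal sketch in
plain Java. It is not ConceptMapper's code: the class, the method and
the attribute and feature names are all hypothetical, and the mapping
that would normally come from the AE descriptor is passed in as an
ordinary Map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; these are not ConceptMapper classes.
class AttributeMappingSketch {

    /**
     * Copies dictionary-entry attributes onto annotation features.
     *
     * @param entryAttributes    arbitrary attributes attached to one dictionary entry
     * @param attributeToFeature configured mapping: attribute name -> feature name
     * @return feature name -> value for every mapped attribute the entry has
     */
    static Map<String, String> toFeatures(Map<String, String> entryAttributes,
                                          Map<String, String> attributeToFeature) {
        Map<String, String> features = new LinkedHashMap<>();
        for (Map.Entry<String, String> mapping : attributeToFeature.entrySet()) {
            String value = entryAttributes.get(mapping.getKey());
            if (value != null) {
                // Only attributes named in the mapping reach the annotation.
                features.put(mapping.getValue(), value);
            }
        }
        return features;
    }
}
```

Attributes that are not named in the mapping are simply left behind,
which is one plausible reading of "configurable" here.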
ConceptMapper only has provisions for using one dictionary per
instance, though adding support for more is probably relatively simple.
ConceptMapper dictionaries are implemented as shared resources. It is
not clear if this is the case for the DictionaryAnnotator in the
sandbox. One could also create a new implementation of the
DictionaryResource interface. This was done in the case of the
CompiledDictionaryResource_impl, which operates on a dictionary that
has been parsed and then serialized, to allow for quick loading.
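
The parse-once, serialize, then load-quickly idea can be sketched with
standard Java serialization. This is only an illustration of the
technique: the class and method names are hypothetical, the dictionary
is reduced to a simple HashMap, and nothing here reflects the actual
format used by CompiledDictionaryResource_impl.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.HashMap;

// Hypothetical illustration only; not the ConceptMapper implementation.
class CompiledDictionarySketch {

    /** Serializes an already-parsed dictionary so later loads can skip parsing. */
    static byte[] compile(HashMap<String, String> parsedDictionary) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(parsedDictionary);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Restores the dictionary from its serialized form. */
    @SuppressWarnings("unchecked")
    static HashMap<String, String> load(byte[] compiled) {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(compiled))) {
            return (HashMap<String, String>) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```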
In addition to the ability to do case-normalized matching, which both
provide, ConceptMapper provides a mechanism to use a stemmer, which is
applied to both the dictionary and the input documents.
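
The important point is that the same normalization runs on both sides,
so a dictionary term and a document token compare equal after
processing. A minimal sketch, assuming a toy suffix-stripping function
as a stand-in for a real stemmer (all names here are hypothetical):

```java
import java.util.Locale;
import java.util.function.UnaryOperator;

// Hypothetical illustration only; not ConceptMapper's stemmer interface.
class NormalizationSketch {

    // Toy stand-in for a stemmer: strips one trailing "s" from longer tokens.
    static final UnaryOperator<String> TOY_STEMMER =
        t -> (t.length() > 3 && t.endsWith("s")) ? t.substring(0, t.length() - 1) : t;

    /** Applies optional case folding, then the optional stemmer. */
    static String normalize(String token, boolean caseFold, UnaryOperator<String> stemmer) {
        String t = caseFold ? token.toLowerCase(Locale.ROOT) : token;
        return (stemmer == null) ? t : stemmer.apply(t);
    }

    /** True if a dictionary token and a document token match after normalization. */
    static boolean matches(String dictionaryToken, String documentToken,
                           boolean caseFold, UnaryOperator<String> stemmer) {
        return normalize(dictionaryToken, caseFold, stemmer)
            .equals(normalize(documentToken, caseFold, stemmer));
    }
}
```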
Both systems provide the ability to specify the particular type of
annotation to consider in lookups (e.g., uima.tt.TokenAnnotation), as
well as an optional feature within that annotation, with both
defaulting to the covered text. ConceptMapper also allows an
annotation type to be used to bound lookups (e.g., one sentence or
one NP at a time). Perhaps this was an oversight on my part, but I did
not see this in the existing sandbox annotator.
Token skipping is an option in both systems, though it is implemented
differently. ConceptMapper has two methods available: a stop-word
list, which handles the simple case of omitting tokens based on
lexical equality, and feature-based include/exclude lists. The latter
is not as general as I'd like in its implementation. Perhaps the
filter conditions of the current DictionaryAnnotator are better.
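
The two skipping methods can be sketched as a single filter applied
before lookup. The Token class, the part-of-speech feature and the
exact semantics of the include and exclude lists (an empty include list
meaning "no restriction") are assumptions made for illustration, not
ConceptMapper's actual behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not ConceptMapper's token filtering code.
class TokenSkippingSketch {

    static final class Token {
        final String text;
        final String partOfSpeech; // the feature the include/exclude lists test

        Token(String text, String partOfSpeech) {
            this.text = text;
            this.partOfSpeech = partOfSpeech;
        }
    }

    /** Returns the tokens that take part in lookup; all others are skipped. */
    static List<Token> tokensForLookup(List<Token> tokens,
                                       Set<String> stopWords,
                                       Set<String> includedValues,
                                       Set<String> excludedValues) {
        List<Token> kept = new ArrayList<>();
        for (Token t : tokens) {
            // Method 1: stop words, compared on the (lower-cased) token text.
            if (stopWords.contains(t.text.toLowerCase(Locale.ROOT))) continue;
            // Method 2: feature-based lists. Empty include list = no restriction.
            if (!includedValues.isEmpty() && !includedValues.contains(t.partOfSpeech)) continue;
            if (excludedValues.contains(t.partOfSpeech)) continue;
            kept.add(t);
        }
        return kept;
    }
}
```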
Finally, and again this may be due to an oversight on my part in
reading the documentation, it is not clear what the search strategy is
for the current DictionaryAnnotator, but I would assume it finds
non-overlapping longest matches. While ConceptMapper supports this as
a default, there are three parameters in the AE descriptor to control
the way the search is done. From my original email:
Dictionary lookup is controlled by three parameters in the descriptor.
One allows for order-independent lookup (i.e., A B == B A), and another
toggles between finding only the longest match and finding all
possible matches. The final parameter specifies the search strategy,
of which there are three. The default search strategy only considers
contiguous tokens (not including tokens from the stop word list or
otherwise skipped tokens), and then begins the subsequent search after
the longest match. The second strategy allows for ignoring
non-matching tokens, allowing for disjoint matches, so that a
dictionary entry of
    A C
would match against the text
    A B C
As with the default search strategy, the subsequent search begins
after the longest match. The final search strategy is identical to the
previous, except that subsequent searches begin one token ahead,
instead of after the previous match. This enables overlapped matching.
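
The three search strategies can be sketched roughly as follows. This is
a simplified illustration, not ConceptMapper's implementation: the enum
and method names are hypothetical, "longest" is taken to mean the entry
with the most tokens, "one token ahead" is read as one token past the
start of the previous match, and the order-independent and all-matches
parameters are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; names and details are assumptions.
class SearchStrategySketch {

    enum Strategy { CONTIGUOUS, SKIP_ANY, SKIP_ANY_ALLOW_OVERLAP }

    /**
     * Tries to match entry starting exactly at tokens[start].
     * Returns the exclusive end index of the match, or -1 if there is none.
     */
    static int matchEnd(List<String> tokens, int start, List<String> entry,
                        boolean allowSkips) {
        if (entry.isEmpty()) return -1;
        int pos = start;
        for (int k = 0; k < entry.size(); k++) {
            if (k > 0 && allowSkips) {
                // Ignore non-matching tokens until the next entry token appears.
                while (pos < tokens.size() && !tokens.get(pos).equals(entry.get(k))) pos++;
            }
            if (pos >= tokens.size() || !tokens.get(pos).equals(entry.get(k))) return -1;
            pos++;
        }
        return pos;
    }

    /** Returns the dictionary entries found in tokens, in the order found. */
    static List<List<String>> search(List<String> tokens,
                                     List<List<String>> dictionary,
                                     Strategy strategy) {
        List<List<String>> found = new ArrayList<>();
        boolean allowSkips = strategy != Strategy.CONTIGUOUS;
        int i = 0;
        while (i < tokens.size()) {
            List<String> best = null;
            int bestEnd = -1;
            for (List<String> entry : dictionary) {
                int end = matchEnd(tokens, i, entry, allowSkips);
                if (end >= 0 && (best == null || entry.size() > best.size())) {
                    best = entry;   // keep the longest entry matching at i
                    bestEnd = end;
                }
            }
            if (best != null) found.add(best);
            if (best == null || strategy == Strategy.SKIP_ANY_ALLOW_OVERLAP) {
                i++;            // no match, or overlap allowed: advance one token
            } else {
                i = bestEnd;    // resume after the match
            }
        }
        return found;
    }
}
```

With the text A B C and the single entry A C, CONTIGUOUS finds nothing
and SKIP_ANY finds A C. Adding a second entry B C, only
SKIP_ANY_ALLOW_OVERLAP finds both, because the search for B C starts
inside the span already covered by A C.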
On May 9, 2008, at 5:13 AM, Thilo Goetz wrote:
Michael A Tanenblatt wrote:
My group would like to offer the following UIMA component,
ConceptMapper, as an open source offering into the UIMA sandbox,
assuming there is interest from the community:
...
Michael,
we already have a dictionary project in the sandbox. Can you
comment on what the differences are, why you think we need
another one? Another option would be for you to help extending
the existing dictionary implementation to satisfy your needs.
--Thilo