Re: Any interest in this as an open source project?

Michael Baessler Fri, 09 May 2008 23:55:19 -0700

Hi Michael,

thanks for the detailed comparison. ConceptMapper seems to be very interesting
but I have some additional questions. Please see my comments below:


Michael Tanenblatt wrote:
> OK, good question. I have never used the project that is in the sandbox
> as ConceptMapper has been in development and production for a long time,
> so my comparisons are based solely on what I gleaned from the
> documentation. From this cursory knowledge of the DictionaryAnnotator
> that is already in the sandbox, I think that ConceptMapper provides
> significantly more functionality and customizability, while seemingly
> providing all of the functionality of the current DictionaryAnnotator.
> Here is a comparison, with the caveats claimed earlier regarding my
> level of familiarity with the current DictionaryAnnotator:
> 
> Both annotators allow for the use of the same tokenizer in dictionary
> tokenization as is used in the processing pipeline, though in slightly
> different ways (descriptor vs. Pear file). ConceptMapper has no default
> tokenizer, though there is a simple one included in the package.

I think having a default tokenizer is important for the "ease of use" of the
dictionary component. If users just want to use a simple list of words
(including multi-word entries) for processing, they don't want to set up a
separate tokenizer to create the dictionary. Can you explain in more detail
what a user has to do to tokenize the content?
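
To illustrate what I mean by a default tokenizer, here is a minimal sketch in
Python. It assumes that splitting on word characters and lowercasing is good
enough for a plain word list. The names simple_tokenize and build_dictionary
are made up for this example and are not part of either annotator.

```python
import re

def simple_tokenize(text):
    """Split text into lowercase word tokens; punctuation is dropped."""
    return re.findall(r"\w+", text.lower())

def build_dictionary(entries, tokenize=simple_tokenize):
    """Map each entry's token tuple to the original entry string.

    The tokenizer is a parameter, so the pipeline tokenizer could be
    passed in instead of the default one.
    """
    return {tuple(tokenize(entry)): entry for entry in entries}

# A plain list of single- and multi-word entries, no extra setup needed.
dictionary = build_dictionary(["New York", "heart attack", "UIMA"])
```

With a default like this, a user only supplies the word list; a custom
tokenizer is needed only when the pipeline tokenizes differently.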

> 
> One clear difference is that there is no dictionary creator for
> ConceptMapper; instead, you must build the XML file by hand. This is
> due, in part, to the fact that dictionary entries can have arbitrary
> attributes associated with them. This leads to what I think is a serious
> advantage of ConceptMapper: these attributes associated with dictionary
> entries can be copied to the annotations that are created in response to
> a successful lookup. This is very useful for attaching a code from some
> coding scheme (e.g., from a medical lexicon or ontology) or a reference
> to a document in which the term was originally extracted, or any number
> of other features. There is no limit to the number of attributes
> attached to the dictionary entries, and the mapping from them to the
> resultant annotations is configurable in the AE descriptor.

So if I understand you correctly, the dictionary XML format is not predefined.
The XML tags used to specify the dictionary content are related to the UIMA
type system in use. How do you check for errors in the dictionary definition?

The resulting annotations are specified in the AE descriptor. So I think you
have a mapping from dictionary XML elements/features to UIMA types/features?
Is there a default mapping?

Can the dictionaries also be language specific?
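
To make the mapping question concrete, this is how I picture it, as a small
Python sketch. It is only my reading of the description, not ConceptMapper
code: the attribute names, the feature names, the code value "X123" and the
function make_annotation are all invented for illustration.

```python
def make_annotation(begin, end, entry_attributes, attribute_to_feature):
    """Build an annotation (a plain dict here) for a successful lookup.

    Only the attributes named in the mapping are copied, each under the
    feature name configured for it; all other attributes are ignored.
    """
    annotation = {"begin": begin, "end": end}
    for attribute, feature in attribute_to_feature.items():
        if attribute in entry_attributes:
            annotation[feature] = entry_attributes[attribute]
    return annotation

# Hypothetical dictionary entry with arbitrary attributes.
entry = {"canonical": "myocardial infarction", "code": "X123", "source": "doc-17"}

# Hypothetical mapping, standing in for what the AE descriptor would configure.
mapping = {"code": "conceptCode", "canonical": "preferredTerm"}

annotation = make_annotation(10, 22, entry, mapping)
```

In this picture an attribute without a mapping entry ("source" above) is
silently dropped, which is why I ask about a default mapping and about error
checking.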

> 
> ConceptMapper only has provisions for using one dictionary per instance,
> though this is probably a relatively simple thing to augment.
> 
> ConceptMapper dictionaries are implemented as shared resources. It is
> not clear if this is the case for the DictionaryAnnotator in the
> sandbox. One could also create a new implementation of the
> DictionaryResource interface. This was done in the case of the
> CompiledDictionaryResource_impl, which operates on a dictionary that has
> been parsed and then serialized, to allow for quick loading.

The DictionaryAnnotator cannot share dictionaries, since the dictionaries are
compiled to internal data structures during initialization of the annotator.

> 
> In addition to the ability to do case-normalized matching, which both
> provide, ConceptMapper provides a mechanism to use a stemmer, which is
> applied to both the dictionary and the input documents.

Is the stemmer provided with the ConceptMapper package?
If not, how is it integrated?
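
As I understand the idea, the same stemmer has to run over the dictionary
entries and over the document tokens, so that both sides are compared in
stemmed form. A minimal Python sketch, assuming a toy suffix-stripping
function (toy_stem is invented here and is not a real stemming algorithm such
as Porter's):

```python
def toy_stem(token):
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters of the word after stripping.
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

# Stem the dictionary once, when it is loaded.
dictionary_stems = {toy_stem(term) for term in ["annotators", "matching"]}

# Stem each document token at lookup time and compare the stems.
tokens = ["annotator", "matches", "text"]
hits = [token for token in tokens if toy_stem(token) in dictionary_stems]
```

Here "annotator" and "matches" are found even though the dictionary only
contains "annotators" and "matching".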
> 
> Both systems provide the ability to specify the particular type of
> annotation to consider in lookups (e.g., uima.tt.TokenAnnotation), as
> well as an optional feature within that annotation, with both defaulting
> to the covered text. ConceptMapper also allows an annotation type to be
> used to bound lookups (e.g. a sentence at a time, or an NP at a time,
> etc.). Perhaps this was an oversight on my part, but I did not see this
> in the existing sandbox annotator.
Sorry, I don't understand what you mean by "ConceptMapper also allows an
annotation type to be used to bound lookups". Can you give an example?
> 
> Token skipping is an option in both systems, though it is implemented
> differently. ConceptMapper has two methods available: the
> ability to use a stop-word list to handle the simple case of omitting
> tokens based on lexical equality, and feature-based include/exclude
> lists. The latter is not as general as I'd like in its implementation.
> Perhaps the filter conditions of the current DictionaryAnnotator are better.
> 
> Finally, and again this may be due to an oversight on my part in reading
> the documentation, it is not clear what the search strategy is for the
> current DictionaryAnnotator, but I would assume it finds non-overlapping
> longest matches. While ConceptMapper supports this as a default, there
> are three parameters in the AE descriptor to control the way the search
> is done. 

Right, you cannot configure the matching strategy for the DictionaryAnnotator.
Currently the matching strategy is "first longest match" and no "overlapping"
annotations are created. So you are right: it finds non-overlapping longest
matches.
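
For clarity, this is what I mean by "first longest match" without overlaps,
as a minimal Python sketch. It illustrates the strategy only and is not the
actual DictionaryAnnotator code; the function name longest_matches and the
representation of the dictionary as a set of token tuples are assumptions
made for the example.

```python
def longest_matches(tokens, dictionary):
    """Return (start, end) token spans of non-overlapping longest matches.

    Scans left to right; at each position the longest dictionary entry
    starting there wins, and scanning resumes after that match.
    """
    max_len = max((len(entry) for entry in dictionary), default=0)
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in dictionary:
                matches.append((i, i + n))
                i += n  # skip past the match, so nothing can overlap it
                break
        else:
            i += 1  # no entry starts at this token
    return matches

dictionary = {("new", "york"), ("new", "york", "city"), ("city", "hall")}
spans = longest_matches(["new", "york", "city", "hall"], dictionary)
```

For the input above only "new york city" is annotated: "new york" is shorter,
and "city hall" would overlap the match that was found first.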


Altogether, I see advantages in both systems. I'm not sure if there is a way to
create one dictionary component with the advantages of both, since some of the
base concepts are different, e.g. the dictionary content object. But maybe we
can try :-)

-- Michael
