My comments inline, below:

On May 10, 2008, at 2:56 AM, Michael Baessler wrote:

Hi Michael,

thanks for the detailed comparison. ConceptMapper seems very interesting,
but I have some additional questions. Please see my comments below:

Michael Tanenblatt wrote:
OK, good question. I have never used the project that is in the sandbox as ConceptMapper has been in development and production for a long time,
so my comparisons are based solely on what I gleaned from the
documentation. From this cursory knowledge of the DictionaryAnnotator
that is already in the sandbox, I think that ConceptMapper provides
significantly more functionality and customizability, while seemingly
providing all of the functionality of the current DictionaryAnnotator.
Here is a comparison, with the caveats claimed earlier regarding my
level of familiarity with the current DictionaryAnnotator:

Both annotators allow for the use of the same tokenizer in dictionary
tokenization as is used in the processing pipeline, though in slightly different ways (descriptor vs. Pear file). ConceptMapper has no default
tokenizer, though there is a simple one included in the package.

I think having a default tokenizer is important for the "ease of use" of the dictionary component. If users just want to use a simple list of words (or multi-word terms) for processing, they don't want to set up a separate tokenizer to create the dictionary. Can you explain
in more detail what a user has to do to tokenize the content?

I am not sure I agree with you on this point. The default setup for ConceptMapper is to tokenize the dictionary at load time, which is when the processing pipeline is set up, and there will in any case need to be a tokenizer in the pipeline for processing the input text. So I think it is perfectly reasonable to require the specification of a tokenizer in the ConceptMapper AE descriptor for use as the tokenizer for the dictionary. This enforces the point that the same tokenizer is used for tokenizing the dictionary as for the input data, which is actually something that I believe to be more than reasonable. In fact, I think it *should* be a requirement; otherwise entries that are in the dictionary might not be found, due to a tokenization mismatch.

To simplify setup for naïve users, as I said, there is a simple tokenizer annotator included in the ConceptMapper package, and that could be used for both the dictionary and text processing. It breaks on whitespace, plus any other character specified in a parameter.
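To make the behavior of such a tokenizer concrete, here is a sketch in plain Java of a whitespace-plus-extra-break-characters tokenizer of the kind described. The class and method names are purely illustrative; this is not the actual ConceptMapper code:

```java
import java.util.ArrayList;
import java.util.List;

/** Toy tokenizer: splits on whitespace plus any configured extra break characters. */
public class SimpleTokenizer {
    private final String breakChars;

    /** breakChars: additional characters to break on, e.g. "-,." */
    public SimpleTokenizer(String breakChars) {
        this.breakChars = breakChars;
    }

    public List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isWhitespace(c) || breakChars.indexOf(c) >= 0) {
                if (current.length() > 0) {       // close off the current token
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {               // trailing token
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

The same instance could be run over both the dictionary entries and the document text, which is exactly the consistency point above.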




One clear difference is that there is no dictionary creator for
ConceptMapper; instead, you must build the XML file by hand. This is
due, in part, to the fact that dictionary entries can have arbitrary
attributes associated with them. This leads to what I think is a serious advantage of ConceptMapper: these attributes associated with dictionary entries can be copied to the annotations that are created in response to a successful lookup. This is very useful for attaching a code from some coding scheme (e.g., from a medical lexicon or ontology) or a reference to a document in which the term was originally extracted, or any number
of other features. There is no limit to the number of attributes
attached to the dictionary entries, and the mapping from them to the
resultant annotations is configurable in the AE descriptor.

So if I understand you correctly, the dictionary XML format is not predefined. The XML tags used to specify the dictionary content are related to the UIMA type system in use. How do you
check for errors in the dictionary definition?

The predefined portion of the XML is:

<token>
        <variant base="text string" />
        <variant base="text string2" />
        ...
</token>

which defines an entry with two (or more) variants. Any additional attributes that you might want to add (POS tag, code, etc.) are not predefined, but also not required. The only error checking is that the SAX parser will throw an exception if the above structure is not intact.
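For instance (the attribute names here are made up for illustration, which is exactly the point, since any attributes are allowed), an entry carrying a code and a POS tag might look like:

```xml
<token canonical="myocardial infarction" code="C0027051" pos="NN">
        <variant base="myocardial infarction" />
        <variant base="heart attack" />
</token>
```

Whether and how those attributes end up on the resulting annotations is then controlled by the mapping in the AE descriptor.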



The resulting annotations are specified in the AE descriptor. So I think you have a mapping from dictionary XML elements/features to UIMA types/features? Is there a default mapping?


There is no default mapping. Any identifying attributes that you might want transferred to the resultant annotations can be put into the token element or the individual variant elements, but as I said, that is optional.


Can the dictionaries also be language specific?

Well, I am not sure what this means. If you mean: will ConceptMapper load different dictionaries depending on the language setting, then the answer is no. It currently allows only one dictionary resource to be specified, and it will be loaded if necessary. I would agree that this would be a nice feature to incorporate, though.




ConceptMapper only has provisions for using one dictionary per instance,
though this is probably a relatively simple thing to augment.

ConceptMapper dictionaries are implemented as shared resources. It is
not clear if this is the case for the DictionaryAnnotator in the
sandbox. One could also create a new implementation of the
DictionaryResource interface. This was done in the case of the
CompiledDictionaryResource_impl, which operates on a dictionary that has
been parsed and then serialized, to allow for quick loading.

The DictionaryAnnotator cannot share dictionaries, since the dictionaries are compiled
to internal data structures during initialization of the annotator.

The same is true of ConceptMapper, so I am not sure how useful it is that it is a UIMA resource. Nevertheless, it is one, and other instantiations of ConceptMapper could attach to that resource, if needed.




In addition to the ability to do case-normalized matching, which both
provide, ConceptMapper provides a mechanism to use a stemmer, which is
applied to both the dictionary and the input documents.

Is the stemmer provided with the ConceptMapper package?
If not, how is it integrated?

None is provided. To adapt one for use, it needs to adhere to a simple interface:

public interface Stemmer {
        public String stem(String token);
        public void initialize(String dictionary) throws FileNotFoundException, ParseException;
}

The only method that has to do anything is stem(), which takes a string in and returns a string. Using this, it was quite simple to integrate the open source Snowball stemmer.
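As a concrete illustration of how little is required, here is a toy implementation of that interface (this is not the Snowball integration itself; the suffix-stripping rule is invented for the example):

```java
import java.io.FileNotFoundException;
import java.text.ParseException;

// The Stemmer interface quoted above, repeated here so the example is self-contained.
interface Stemmer {
    String stem(String token);
    void initialize(String dictionary) throws FileNotFoundException, ParseException;
}

/** Toy stemmer: lowercases and strips a trailing "s"; needs no external data. */
class SuffixStemmer implements Stemmer {
    public String stem(String token) {
        String t = token.toLowerCase();
        if (t.length() > 3 && t.endsWith("s")) {
            t = t.substring(0, t.length() - 1);
        }
        return t;
    }

    // initialize() can be a no-op when the stemmer loads no dictionary
    public void initialize(String dictionary) {
    }
}
```

Since ConceptMapper applies the stemmer to both the dictionary and the input text, even a crude stemmer like this keeps the two sides consistent.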




Both systems provide the ability to specify the particular type of
annotation to consider in lookups (e.g., uima.tt.TokenAnnotation), as
well as an optional feature within that annotation, with both defaulting to the covered text. ConceptMapper also allows an annotation type to be
used to bound lookups (e.g. a sentence at a time, or an NP at a time,
etc.). Perhaps this was an oversight on my part, but I did not see this
in the existing sandbox annotator.
Sorry, I don't understand what you mean by "ConceptMapper also allows an
annotation type to be used to bound lookups". Can you give an example?

What I mean is that ConceptMapper works span by span, and that span type is specified in the descriptor. Typically, that span is a sentence, but it could be an NP or even the whole document. Dictionary lookups are limited to tokens that appear within a single span: no crossing of span boundaries is allowed. Does this make sense?
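The restriction can be sketched in plain Java (a simplified model, not ConceptMapper's actual data structures): tokens are grouped by span, and the longest-match search runs within each span independently, so a multi-word entry whose tokens straddle a span boundary is never matched:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Simplified span-bounded lookup: matches never cross span boundaries. */
public class SpanLookup {
    // Finds non-overlapping longest dictionary matches, one span (e.g. sentence) at a time.
    public static List<String> findMatches(List<List<String>> spans,
                                           Set<String> dictionary, int maxLen) {
        List<String> matches = new ArrayList<>();
        for (List<String> span : spans) {              // each span searched independently
            int i = 0;
            while (i < span.size()) {
                int matched = 0;
                // try the longest candidate first ("first longest match")
                for (int len = Math.min(maxLen, span.size() - i); len >= 1; len--) {
                    String candidate = String.join(" ", span.subList(i, i + len));
                    if (dictionary.contains(candidate)) {
                        matches.add(candidate);
                        matched = len;
                        break;
                    }
                }
                i += (matched > 0) ? matched : 1;      // non-overlapping: skip past the match
            }
        }
        return matches;
    }
}
```

With spans ["severe", "heart"] and ["attack", "today"], the entry "heart attack" is not found even though the two words are adjacent in the document, because they lie in different spans.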



Token skipping is an option in both systems, though it is implemented
differently. ConceptMapper has two methods available: the
ability to use a stop-word list to handle the simple case of omitting
tokens based on lexical equality, and feature-based include/exclude
lists. The latter is not as general as I'd like in its implementation; perhaps the filter conditions of the current DictionaryAnnotator are better.

Finally, and again this may be due to an oversight on my part in reading
the documentation, it is not clear what the search strategy is for the current DictionaryAnnotator, but I would assume it finds non-overlapping longest matches. While ConceptMapper supports this as a default, there are three parameters in the AE descriptor to control the way the search
is done.

Right, you cannot configure the matching strategy for the DictionaryAnnotator. Currently the matching strategy is "first longest match" and no "overlapping" annotations are created. So you are right: non-overlapping longest matches.


Altogether, I see advantages for both systems. I'm not sure if there is a way to create one dictionary component with the advantages of both, since some of the base concepts are different, e.g. the dictionary content object. But maybe we can try :-)

-- Michael
