My comments inline, below:
On May 10, 2008, at 2:56 AM, Michael Baessler wrote:
Hi Michael,
thanks for the detailed comparison. ConceptMapper seems to be very interesting, but I have some additional questions. Please see my comments below:
Michael Tanenblatt wrote:
OK, good question. I have never used the project that is in the sandbox, as ConceptMapper has been in development and production for a long time, so my comparisons are based solely on what I gleaned from the documentation. From this cursory knowledge of the DictionaryAnnotator that is already in the sandbox, I think that ConceptMapper provides significantly more functionality and customizability, while seemingly providing all of the functionality of the current DictionaryAnnotator.
Here is a comparison, with the caveats claimed earlier regarding my
level of familiarity with the current DictionaryAnnotator:
Both annotators allow for the use of the same tokenizer in dictionary tokenization as is used in the processing pipeline, though in slightly different ways (descriptor vs. PEAR file). ConceptMapper has no default tokenizer, though there is a simple one included in the package.
I think having a default tokenizer is important for the ease of use of the dictionary component. If users just want to use a simple list of words (or multi-word terms) for processing, they don't want to set up a separate tokenizer to create the dictionary. Can you explain in more detail what a user has to do to tokenize the content?
I am not sure I agree with you on this point. Since the default setup for ConceptMapper is to tokenize the dictionary at load time, which is when the processing pipeline is set up, and since there will need to be a tokenizer in the pipeline for processing the input text, I think that it is perfectly reasonable to require the specification of a tokenizer in the ConceptMapper AE descriptor for use as the tokenizer for the dictionary. This enforces the point that the same tokenizer be used for tokenizing the dictionary as for the input data, which I believe to be more than reasonable. In fact, I think it *should* be a requirement; otherwise, entries that are in the dictionary might not be found due to a tokenization mismatch.
To simplify setup for naïve users, as I said, there is a simple tokenizer annotator included in the ConceptMapper package, and that could be used for both the dictionary and text processing. It breaks on whitespace, plus any other characters specified in a parameter.
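A minimal sketch of what such a tokenizer might look like, given that description (this is an illustration of the stated behavior, not the actual class shipped with ConceptMapper; the class and parameter names are made up):

```java
import java.util.ArrayList;
import java.util.List;

// Toy tokenizer: breaks on whitespace plus any extra break characters
// supplied as a parameter, as the ConceptMapper docs describe.
class SimpleTokenizer {
    private final String breakChars;

    SimpleTokenizer(String breakChars) {
        this.breakChars = breakChars;
    }

    List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isWhitespace(c) || breakChars.indexOf(c) >= 0) {
                // End of a token: emit it if non-empty.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

The same instance would be applied to both the dictionary entries and the document text, which is what guarantees the tokenizations line up.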
One clear difference is that there is no dictionary creator for ConceptMapper; instead, you must build the XML file by hand. This is due, in part, to the fact that dictionary entries can have arbitrary attributes associated with them. This leads to what I think is a serious advantage of ConceptMapper: the attributes associated with dictionary entries can be copied to the annotations that are created in response to a successful lookup. This is very useful for attaching a code from some coding scheme (e.g., from a medical lexicon or ontology), a reference to the document from which the term was originally extracted, or any number of other features. There is no limit to the number of attributes attached to the dictionary entries, and the mapping from them to the resultant annotations is configurable in the AE descriptor.
So if I understand you correctly, the dictionary XML format is not predefined. The XML tags used to specify the dictionary content are related to the UIMA type system used. How do you check for errors in the dictionary definition?
The predefined portion of the XML is:

<token>
  <variant base="text string" />
  <variant base="text string2" />
  ...
</token>

which defines an entry with two variants. It is any additional attributes that you might want to add (a POS tag, a code, etc.) that are not predefined, but also not required. The only error checking is that the SAX parser will throw an exception if the above structure is not intact.
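To make that concrete, an entry with such optional attributes might look like the following sketch (the POSTag and SemCode attribute names and values are hypothetical, since attribute names are entirely up to the dictionary author):

```xml
<token POSTag="NN" SemCode="12345">
  <variant base="renal failure" />
  <variant base="kidney failure" />
</token>
```

The AE descriptor would then map POSTag and SemCode onto features of the annotation type created on a successful lookup.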
The resulting annotations are specified in the AE descriptor. So I think you have a mapping from dictionary XML elements/features to UIMA types/features? Is there a default mapping?
There is no default mapping. Any identifying attributes that you might want transferred to the resultant annotations can be put into the token element or the individual variant elements, but as I said, that is optional.
Can the dictionaries also be language specific?
Well, I am not sure what this means. If you mean: will ConceptMapper
load different dictionaries depending on the language setting, then
the answer is no. It currently allows only one dictionary resource to
be specified, and it will be loaded if necessary. I would agree that
this would be a nice feature to incorporate, though.
ConceptMapper only has provisions for using one dictionary per instance, though this is probably a relatively simple thing to augment. ConceptMapper dictionaries are implemented as shared resources. It is not clear if this is the case for the DictionaryAnnotator in the sandbox. One could also create a new implementation of the DictionaryResource interface. This was done in the case of the CompiledDictionaryResource_impl, which operates on a dictionary that has been parsed and then serialized, to allow for quick loading.
The DictionaryAnnotator cannot share dictionaries, since the
dictionaries are compiled
to internal data structures during initialization of the annotator.
The same is true of ConceptMapper, so I am not sure how useful it is that it is a UIMA resource. Nevertheless, it is one, and other instantiations of ConceptMapper could attach to that resource if needed.
In addition to the ability to do case-normalized matching, which both provide, ConceptMapper provides a mechanism to use a stemmer, which is applied to both the dictionary and the input documents.
Is the stemmer provided with the ConceptMapper package?
If not, how is it integrated?
None is provided. To adapt one for use, it needs to adhere to a simple interface:

public interface Stemmer {
    public String stem(String token);
    public void initialize(String dictionary) throws FileNotFoundException, ParseException;
}

The only method that has to do anything is stem(), which takes a string in and returns a string. Using this, it was quite simple to integrate the open source Snowball stemmer.
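As a self-contained sketch of how one plugs into that interface, here is a toy implementation (deliberately trivial suffix stripping, not Snowball; a real adapter would delegate stem() to the Snowball library instead):

```java
import java.io.FileNotFoundException;
import java.text.ParseException;

// The interface quoted above.
interface Stemmer {
    String stem(String token);
    void initialize(String dictionary) throws FileNotFoundException, ParseException;
}

// Toy stemmer: lower-cases and strips a simple plural "s".
// A Snowball-backed adapter would have the same shape, with stem()
// forwarding to the Snowball stemmer instead.
class SuffixStemmer implements Stemmer {
    public String stem(String token) {
        String t = token.toLowerCase();
        if (t.length() > 3 && t.endsWith("s") && !t.endsWith("ss")) {
            return t.substring(0, t.length() - 1);
        }
        return t;
    }

    public void initialize(String dictionary) {
        // This toy implementation needs no external resources.
    }
}
```

Because the same Stemmer instance is applied to the dictionary at load time and to the input tokens at lookup time, both sides normalize identically.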
Both systems provide the ability to specify the particular type of annotation to consider in lookups (e.g., uima.tt.TokenAnnotation), as well as an optional feature within that annotation, with both defaulting to the covered text. ConceptMapper also allows an annotation type to be used to bound lookups (e.g., a sentence at a time, or an NP at a time, etc.). Perhaps this was an oversight on my part, but I did not see this in the existing sandbox annotator.
Sorry, I don't understand what you mean by "ConceptMapper also allows an annotation type to be used to bound lookups". Can you give an example?
What I mean is that ConceptMapper works span by span, and that span is specified in the descriptor. Typically, that span is a sentence, but it could be an NP or even the whole document. Dictionary lookups are limited to tokens that appear within a single span; no crossing of span boundaries is allowed. Does this make sense?
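A rough sketch of the idea, combining the span bounding described here with the non-overlapping longest-match strategy discussed further down (this is an illustration of the behavior, not ConceptMapper's actual code, and the class name is made up):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy span-bounded dictionary lookup: matches are found only within the
// tokens of one span (e.g., one sentence), longest match first, with no
// overlapping matches.
class SpanLookup {
    private final Set<List<String>> dict;  // entries as token sequences
    private final int maxLen;              // longest entry, in tokens

    SpanLookup(Set<List<String>> dict) {
        this.dict = new HashSet<>(dict);
        int m = 1;
        for (List<String> entry : dict) {
            m = Math.max(m, entry.size());
        }
        this.maxLen = m;
    }

    // Returns non-overlapping longest matches within one span's tokens.
    List<List<String>> lookup(List<String> spanTokens) {
        List<List<String>> matches = new ArrayList<>();
        int i = 0;
        while (i < spanTokens.size()) {
            int best = 0;
            // Try the longest candidate first, shrinking until a match.
            for (int len = Math.min(maxLen, spanTokens.size() - i); len >= 1; len--) {
                if (dict.contains(spanTokens.subList(i, i + len))) {
                    best = len;
                    break;
                }
            }
            if (best > 0) {
                matches.add(new ArrayList<>(spanTokens.subList(i, i + best)));
                i += best;  // skip past the match: no overlaps
            } else {
                i++;
            }
        }
        return matches;
    }
}
```

Calling lookup() once per span is what prevents a dictionary entry from matching across, say, a sentence boundary.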
Token skipping is an option in both systems, though it is implemented differently. ConceptMapper has two methods available: the ability to use a stop-word list to handle the simple case of omitting tokens based on lexical equality, and feature-based include/exclude lists. The latter is not as general as I'd like in its implementation. Perhaps the filter conditions of the current DictionaryAnnotator are better.
Finally, and again this may be due to an oversight on my part in reading the documentation, it is not clear what the search strategy is for the current DictionaryAnnotator, but I would assume it finds non-overlapping longest matches. While ConceptMapper supports this as a default, there are three parameters in the AE descriptor to control the way the search is done.
Right, you cannot configure the matching strategy for the DictionaryAnnotator. Currently the matching strategy is "first longest match" and no "overlapping" annotations are created. So you are right: non-overlapping longest matches.
Altogether, I see advantages for both systems. I'm not sure if there is a way to create one dictionary component with the advantages of both, since some of the base concepts are different, e.g., the dictionary content object. But maybe we can try :-)
-- Michael