On May 13, 2008, at 2:31 AM, Michael Baessler wrote:

Michael Tanenblatt wrote:
My comments inline, below:

On May 10, 2008, at 2:56 AM, Michael Baessler wrote:

Hi Michael,

thanks for the detailed comparison. ConceptMapper seems to be very
interesting
but I have some additional questions. Please see my comments below:

Michael Tanenblatt wrote:
OK, good question. I have never used the project that is in the sandbox as ConceptMapper has been in development and production for a long time,
so my comparisons are based solely on what I gleaned from the
documentation. From this cursory knowledge of the DictionaryAnnotator
that is already in the sandbox, I think that ConceptMapper provides
significantly more functionality and customizability, while seemingly providing all of the functionality of the current DictionaryAnnotator.
Here is a comparison, with the caveats claimed earlier regarding my
level of familiarity with the current DictionaryAnnotator:

Both annotators allow for the use of the same tokenizer in dictionary tokenization as is used in the processing pipeline, though in slightly different ways (descriptor vs. Pear file). ConceptMapper has no default
tokenizer, though there is a simple one included in the package.

I think having a default tokenizer is important for the "ease of use" of the
dictionary component. If users just want to use a simple list of words
(or multi-word terms) for processing, they don't want to set up a separate
tokenizer to create the dictionary. Can you explain in more detail what a
user has to do to tokenize the content?

I am not sure I agree with you on this point. Since the default setup
for ConceptMapper is to tokenize the dictionary at load time, which is when the processing pipeline is set up, and since there will need to be a tokenizer in the pipeline for processing the input text, I think
that it is perfectly reasonable to require the specification of a
tokenizer in the ConceptMapper AE descriptor for use as the tokenizer
for the dictionary. This enforces the point that the same tokenizer is used for tokenizing the dictionary as for the input data, which I believe is more than reasonable. In fact, I think it *should* be a requirement; otherwise, entries that are in the dictionary
might not be found, due to a tokenization mismatch.

To simplify setup for naïve users, as I said, there is a simple
tokenizer annotator included in the ConceptMapper package, and that
could be used for both the dictionary and text processing. It breaks on
whitespace, plus any other character specified in a parameter.
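As a rough sketch of that break-on-whitespace-plus-extra-characters behavior (the class and method names here are illustrative, not ConceptMapper's actual API):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in for ConceptMapper's simple tokenizer: it splits
// on whitespace plus any extra break characters supplied as a parameter.
class SimpleTokenizer {
    private final String extraBreakChars;

    SimpleTokenizer(String extraBreakChars) {
        this.extraBreakChars = extraBreakChars;
    }

    List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            // Break on whitespace or any configured extra character.
            if (Character.isWhitespace(c) || extraBreakChars.indexOf(c) >= 0) {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

With "-," as the extra break characters, "John Doe, M.D." would yield the tokens "John", "Doe", and "M.D." (the period is not a break character here), which shows why the same tokenizer must be applied to both the dictionary and the input text.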

It was not my intention to say that I disagree with the way ConceptMapper does the tokenization of the dictionary content. I was not aware that you use one of the tokenizers of the processing pipeline; I may have missed that in one of your previous mails.

The way ConceptMapper does the tokenization is fine with me.
Are there any special requirements for the tokenizer (multi-threading, resources, ...)? I think you create your own instance of the tokenizer during initialization of ConceptMapper.

The tokenizer is created using UIMAFramework.produceAnalysisEngine(). The resulting analysis engine is run against each dictionary entry's text using its process() method; the tokens created in the CAS as a result are collected, and the CAS is then reset for the next entry to be processed.



Some tokenizers produce different results based on the document language. Is there a setting to specify the language that should be used to tokenize the dictionary content?

There is an AE descriptor parameter that is passed to setDocumentLanguage() before the tokenizer processes the dictionary entries.






One clear difference is that there is no dictionary creator for ConceptMapper; instead, you must build the XML file by hand. This is due, in part, to the fact that dictionary entries can have arbitrary attributes associated with them. This leads to what I think is a serious advantage of ConceptMapper: these attributes associated with dictionary entries can be copied to the annotations that are created in response to a successful lookup. This is very useful for attaching a code from some coding scheme (e.g., from a medical lexicon or ontology), a reference to the document from which the term was originally extracted, or any number of other features. There is no limit to the number of attributes attached to the dictionary entries, and the mapping from them to the resultant annotations is configurable in the AE descriptor.

So if I understand you correctly, the dictionary XML format is not
predefined. The XML tags used to specify the dictionary content are
related to the UIMA type system in use. How do you check for errors
in the dictionary definition?

The predefined portion of the XML is:

<token>
   <variant base="text string" />
   <variant base="text string2" />
   ...
</token>

which defines an entry with two variants. It is only the additional
attributes that you might want to add (POS tag, code, etc.) that are not
predefined, but they are also not required. The only error checking is that
the SAX parser will throw an exception if the above structure is not
intact.



The resulting annotations are specified in the AE descriptor. So I
think you have a mapping from
dictionary XML elements/features to UIMA types/features? Is there a
default mapping?


There is no default mapping. Any identifying attributes that you might
want transferred to the resultant annotations can be put into the token
element or the individual variant elements, but as I said, that is
optional.
OK, so when adding a feature "email" to the token as shown in the example below, there has to be an "email" feature (String-valued) in the type system for the created result annotation.

<token email="[EMAIL PROTECTED]">
  <variant base="John Doe" />
</token>

Let's say you specify that the resultant annotations are going to be of the type "DictTerm", and each has a feature "EMailAddress". You could then specify the mapping from "email" to "EMailAddress" in the descriptor, and then when "John Doe" was found in the text, it would be annotated with a DictTerm whose "EMailAddress" is "[EMAIL PROTECTED]".
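The attribute-to-feature mapping idea can be sketched in isolation as follows; all names here (AttributeMapper, mapFeatures, the example address) are hypothetical illustrations, not ConceptMapper's actual API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: copy attributes parsed off a dictionary <token>
// element onto features of the result annotation, under a mapping that
// would come from the AE descriptor (here represented as a plain Map).
class AttributeMapper {
    private final Map<String, String> attrToFeature;

    AttributeMapper(Map<String, String> attrToFeature) {
        this.attrToFeature = attrToFeature;
    }

    // Returns feature-name -> value pairs for the result annotation;
    // attributes with no configured mapping are simply ignored.
    Map<String, String> mapFeatures(Map<String, String> entryAttributes) {
        Map<String, String> features = new HashMap<>();
        for (Map.Entry<String, String> e : entryAttributes.entrySet()) {
            String feature = attrToFeature.get(e.getKey());
            if (feature != null) {
                features.put(feature, e.getValue());
            }
        }
        return features;
    }
}
```

So a mapping of "email" -> "EMailAddress" would carry the dictionary entry's email attribute onto the DictTerm annotation's EMailAddress feature, while unmapped attributes are dropped.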






Can the dictionaries also be language specific?

Well, I am not sure what this means. If you mean: will ConceptMapper
load different dictionaries depending on the language setting, then the
answer is no. It currently allows only one dictionary resource to be
specified, and it will be loaded if necessary. I would agree that this
would be a nice feature to incorporate, though.

I just want to know if the dictionary can have a language setting. In some cases the dictionary content is language specific, so I want to add a setting to the dictionary indicating that the content should be used for English only.

So your reply answers my question :-)

Right, whichever dictionary you set as the DictionaryFile resource is the one that is used.






ConceptMapper only has provisions for using one dictionary per instance,
though this is probably a relatively simple thing to augment.

ConceptMapper dictionaries are implemented as shared resources. It is
not clear if this is the case for the DictionaryAnnotator in the
sandbox. One could also create a new implementation of the
DictionaryResource interface. This was done in the case of the
CompiledDictionaryResource_impl, which operates on a dictionary that has
been parsed and then serialized, to allow for quick loading.

The DictionaryAnnotator cannot share dictionaries, since the
dictionaries are compiled
to internal data structures during initialization of the annotator.

The same is true of ConceptMapper, so I am not sure how useful it is
that it is a UIMA resource. Nevertheless, it is one, and other
instantiations of ConceptMapper could attach to that resource, if needed.




In addition to the ability to do case-normalized matching, which both provide, ConceptMapper provides a mechanism to use a stemmer, which is
applied to both the dictionary and the input documents.

Is the stemmer provided with the ConceptMapper package?
If not, how is it integrated?

None is provided. To adapt one for use, it needs to adhere to a simple
interface:

public interface Stemmer {
    public String stem(String token);
    public void initialize(String dictionary)
            throws FileNotFoundException, ParseException;
}

The only method that has to do anything is stem(), which takes a string
in and returns a string. Using this, it was quite simple to integrate
the open source Snowball stemmer.
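To make the interface concrete, here is a minimal, self-contained sketch with a deliberately naive suffix-stripping implementation; NaiveSuffixStemmer is my own illustration of how little is needed to plug a stemmer in, and a real integration would wrap something like the Snowball stemmer instead:

```java
import java.io.FileNotFoundException;
import java.text.ParseException;

// The Stemmer interface quoted above, restated so this example compiles
// on its own.
interface Stemmer {
    String stem(String token);
    void initialize(String dictionary)
            throws FileNotFoundException, ParseException;
}

// Hypothetical, deliberately naive stemmer: lowercase the token and
// strip one common English suffix. Only for illustration.
class NaiveSuffixStemmer implements Stemmer {
    private static final String[] SUFFIXES = { "ing", "ed", "es", "s" };

    @Override
    public String stem(String token) {
        String lower = token.toLowerCase();
        for (String suffix : SUFFIXES) {
            // Require a remaining stem of at least 3 characters so
            // short words are left alone.
            if (lower.endsWith(suffix)
                    && lower.length() - suffix.length() >= 3) {
                return lower.substring(0, lower.length() - suffix.length());
            }
        }
        return lower;
    }

    @Override
    public void initialize(String dictionary) {
        // Nothing to load for this rule-based stemmer; a dictionary-backed
        // stemmer would read its data file here.
    }
}
```

Since ConceptMapper applies the same stemmer to both the dictionary and the input text, "matching" in a document and "matches" in the dictionary would both reduce to "match" and therefore align.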




Both systems provide the ability to specify the particular type of
annotation to consider in lookups (e.g., uima.tt.TokenAnnotation), as well as an optional feature within that annotation, with both defaulting to the covered text. ConceptMapper also allows an annotation type to be used to bound lookups (e.g. a sentence at a time, or an NP at a time, etc.). Perhaps this was an oversight on my part, but I did not see this
in the existing sandbox annotator.

Sorry, I don't understand what you mean by "ConceptMapper also allows an
annotation type to be used to bound lookups". Can you give an example?

What I mean is that ConceptMapper works span by span, and that span is
specified in the descriptor. Typically, that span is a sentence, but it
could be an NP or even the whole document. Dictionary lookups are
limited to tokens that appear within a single span; no crossing of span
boundaries is allowed. Does this make sense?

Yes, thanks!




Token skipping is an option in both systems, though it is implemented
differently. ConceptMapper has two methods available: the
ability to use a stop-word list to handle the simple case of omitting
tokens based on lexical equality, and feature-based include/exclude
lists. The latter is not as general as I'd like in its implementation.
Perhaps the filter conditions of the current DictionaryAnnotator are
better.

Finally, and again this may be due to an oversight on my part in reading the documentation, it is not clear what the search strategy is for the current DictionaryAnnotator, but I would assume it finds non-overlapping longest matches. While ConceptMapper supports this as a default, there are three parameters in the AE descriptor to control the way the search
is done.

Right, you cannot configure the matching strategy for the
DictionaryAnnotator. Currently the matching strategy is "first longest
match" and no "overlapping" annotations are created. So you are right:
non-overlapping longest matches.


Altogether, I see advantages for both systems. I'm not sure if there is
a way to create one dictionary component with the advantages of both,
since some of the base concepts are different, e.g. the dictionary
content object. But maybe we can try :-)

-- Michael


-- Michael
