Michael Tanenblatt wrote:
> My comments inline, below:
>
> On May 10, 2008, at 2:56 AM, Michael Baessler wrote:
>
>> Hi Michael,
>>
>> thanks for the detailed comparison. ConceptMapper seems to be very
>> interesting, but I have some additional questions. Please see my comments
>> below:
>>
>> Michael Tanenblatt wrote:
>>> OK, good question. I have never used the project that is in the sandbox
>>> as ConceptMapper has been in development and production for a long time,
>>> so my comparisons are based solely on what I gleaned from the
>>> documentation. From this cursory knowledge of the DictionaryAnnotator
>>> that is already in the sandbox, I think that ConceptMapper provides
>>> significantly more functionality and customizability, while seemingly
>>> providing all of the functionality of the current DictionaryAnnotator.
>>> Here is a comparison, with the caveats claimed earlier regarding my
>>> level of familiarity with the current DictionaryAnnotator:
>>>
>>> Both annotators allow for the use of the same tokenizer in dictionary
>>> tokenization as is used in the processing pipeline, though in slightly
>>> different ways (descriptor vs. Pear file). ConceptMapper has no default
>>> tokenizer, though there is a simple one included in the package.
>>
>> I think having a default tokenizer is important for the "ease of use" of
>> the dictionary component. If users just want to use a simple list of
>> words (including multi-word terms) for processing, they don't want to
>> set up a separate tokenizer to create the dictionary. Can you explain in
>> more detail what a user has to do to tokenize the content?
>
> I am not sure I agree with you on this point. Since the default setup
> for ConceptMapper is to tokenize the dictionary at load time, which is
> when the processing pipeline is set up, and since there will need to be
> a tokenizer in the pipeline for processing the input text, I think it is
> perfectly reasonable to require the specification of a tokenizer in the
> ConceptMapper AE descriptor for use as the dictionary tokenizer. This
> enforces the point that the same tokenizer is used for tokenizing the
> dictionary as for the input data, which I believe is more than
> reasonable. In fact, I think it *should* be a requirement; otherwise
> entries that are in the dictionary might not be found, due to a
> tokenization mismatch.
>
> To simplify setup for naïve users, as I said, there is a simple
> tokenizer annotator included in the ConceptMapper package, and that
> could be used for both the dictionary and text processing. It breaks on
> whitespace, plus any other character specified in a parameter.
It was not my intention to say that I disagree with the way ConceptMapper
does the tokenization of the dictionary content. I was not aware that you use
one of the tokenizers of the processing pipeline; maybe I missed that in one
of your previous mails. The way ConceptMapper does the tokenization is fine
with me.
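Just so I understand the included simple tokenizer: splitting on whitespace
plus a configurable set of extra break characters would be roughly the
following (my own sketch, not your actual code; the class name and parameter
handling are invented):

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of a tokenizer that breaks on whitespace plus any other
    // character supplied via a (hypothetical) descriptor parameter.
    public class SimpleTokenizerSketch {
        private final String extraBreakChars; // e.g. "-/," from the descriptor

        public SimpleTokenizerSketch(String extraBreakChars) {
            this.extraBreakChars = extraBreakChars;
        }

        public List<String> tokenize(String text) {
            List<String> tokens = new ArrayList<String>();
            StringBuilder current = new StringBuilder();
            for (char c : text.toCharArray()) {
                if (Character.isWhitespace(c) || extraBreakChars.indexOf(c) >= 0) {
                    if (current.length() > 0) { // close the pending token
                        tokens.add(current.toString());
                        current.setLength(0);
                    }
                } else {
                    current.append(c);
                }
            }
            if (current.length() > 0) {
                tokens.add(current.toString()); // trailing token
            }
            return tokens;
        }
    }

Is that the right idea?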
Are there any special requirements for the tokenizer (multi-threading,
resources, ...)? I assume you create your own instance of the tokenizer
during the initialization of ConceptMapper.
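I would guess something along these lines happens at initialization (just my
guess, using the standard UIMA factory calls; the descriptor path is a
placeholder):

    import org.apache.uima.UIMAFramework;
    import org.apache.uima.analysis_engine.AnalysisEngine;
    import org.apache.uima.resource.ResourceSpecifier;
    import org.apache.uima.util.XMLInputSource;

    public class TokenizerInitGuess {
        // Parse the tokenizer descriptor and produce a private AE instance.
        public static AnalysisEngine createTokenizer(String descriptorPath)
                throws Exception {
            ResourceSpecifier spec = UIMAFramework.getXMLParser()
                    .parseResourceSpecifier(new XMLInputSource(descriptorPath));
            return UIMAFramework.produceAnalysisEngine(spec);
        }
    }

If each ConceptMapper instance has its own tokenizer instance, then
multi-threading would only be an issue if a tokenizer were shared.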
Some tokenizers produce different results based on the document language. Is
there a setting to specify the language that should be used to tokenize the
dictionary content?
>
>>
>>
>>>
>>> One clear difference is that there is no dictionary creator for
>>> ConceptMapper; instead, you must build the XML file by hand. This is
>>> due, in part, to the fact that dictionary entries can have arbitrary
>>> attributes associated with them. This leads to what I think is a serious
>>> advantage of ConceptMapper: these attributes associated with dictionary
>>> entries can be copied to the annotations that are created in response to
>>> a successful lookup. This is very useful for attaching a code from some
>>> coding scheme (e.g., from a medical lexicon or ontology), or a reference
>>> to the document from which the term was originally extracted, or any number
>>> of other features. There is no limit to the number of attributes
>>> attached to the dictionary entries, and the mapping from them to the
>>> resultant annotations is configurable in the AE descriptor.
>>
>> So if I understand you correctly, the dictionary XML format is not fully
>> predefined. The XML tags used to specify the dictionary content are
>> related to the UIMA type system in use. How do you check for errors in
>> the dictionary definition?
>
> The predefined portion of the XML is:
>
> <token>
>   <variant base="text string" />
>   <variant base="text string2" />
>   ...
> </token>
>
> which defines an entry with two variants. Any additional attributes that
> you might want to add (POS tag, code, etc.) are not predefined, but also
> not required. The only error checking would be that
> the SAX parser would throw an exception if the above structure is not
> intact.
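So an entry with a couple of such optional attributes might look like this
(the attribute names and the code value here are purely my invention):

    <token canonical="myocardial infarction" pos="NN" code="C0027051">
      <variant base="myocardial infarction" />
      <variant base="heart attack" />
    </token>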
>
>>
>>
>> The resulting annotations are specified in the AE descriptor. So I think
>> you have a mapping from dictionary XML elements/features to UIMA
>> types/features? Is there a default mapping?
>>
>
> There is no default mapping. Any identifying attributes that you might
> want transferred to the resultant annotations can be put into the token
> element or the individual variant elements, but as I said, that is
> optional.
OK, so when adding a feature "email" to the token, as shown in the example
below, there has to be an "email" feature (String-valued) in the type system
for the created result annotation:

<token email="[EMAIL PROTECTED]">
  <variant base="John Doe" />
</token>
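For completeness, the corresponding type system entry would then look
something like this (the type name "example.DictTerm" is just an example):

    <typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
      <types>
        <typeDescription>
          <name>example.DictTerm</name>
          <description>Annotation created on a dictionary hit.</description>
          <supertypeName>uima.tcas.Annotation</supertypeName>
          <features>
            <featureDescription>
              <name>email</name>
              <description>Copied from the dictionary entry.</description>
              <rangeTypeName>uima.cas.String</rangeTypeName>
            </featureDescription>
          </features>
        </typeDescription>
      </types>
    </typeSystemDescription>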
>
>
>> Can the dictionaries also be language specific?
>
> Well, I am not sure what this means. If you mean: will ConceptMapper
> load different dictionaries depending on the language setting, then the
> answer is no. It currently allows only one dictionary resource to be
> specified, and it will be loaded if necessary. I would agree that this
> would be a nice feature to incorporate, though.
I just wanted to know whether the dictionary can have a language setting. In
some cases the dictionary content is language-specific, and I would then want
to add a setting to the dictionary saying that the content should be used for
English only.
So your reply answers my question :-)
>
>>
>>
>>>
>>> ConceptMapper only has provisions for using one dictionary per instance,
>>> though this is probably a relatively simple thing to augment.
>>>
>>> ConceptMapper dictionaries are implemented as shared resources. It is
>>> not clear if this is the case for the DictionaryAnnotator in the
>>> sandbox. One could also create a new implementation of the
>>> DictionaryResource interface. This was done in the case of the
>>> CompiledDictionaryResource_impl, which operates on a dictionary that has
>>> been parsed and then serialized, to allow for quick loading.
>>
>> The DictionaryAnnotator cannot share dictionaries, since the dictionaries
>> are compiled to internal data structures during initialization of the
>> annotator.
>
> The same is true of ConceptMapper, so I am not sure how useful it is
> that it is a UIMA resource. Nevertheless, it is one, and other
> instantiations of ConceptMapper could attach to that resource, if needed.
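Understood. Just to make the mechanics concrete, declaring and binding such a
shared resource in a descriptor looks roughly like this (the names here are
placeholders, not ConceptMapper's actual keys):

    <resourceManagerConfiguration>
      <externalResources>
        <externalResource>
          <name>DictionaryFile</name>
          <description>The dictionary, loaded as a shared resource.</description>
          <fileResourceSpecifier>
            <fileUrl>file:dictionary.xml</fileUrl>
          </fileResourceSpecifier>
          <implementationName>example.DictionaryResource_impl</implementationName>
        </externalResource>
      </externalResources>
      <externalResourceBindings>
        <externalResourceBinding>
          <key>DictionaryKey</key>
          <resourceName>DictionaryFile</resourceName>
        </externalResourceBinding>
      </externalResourceBindings>
    </resourceManagerConfiguration>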
>
>>
>>
>>>
>>> In addition to the ability to do case-normalized matching, which both
>>> provide, ConceptMapper provides a mechanism to use a stemmer, which is
>>> applied to both the dictionary and the input documents.
>>
>> Is the stemmer provided with the ConceptMapper package?
>> If not, how is it integrated?
>
> None is provided. To adapt one for use, it needs to adhere to a simple
> interface:
>
> public interface Stemmer {
>     // Maps a token to its stem; the only method that must do real work.
>     public String stem(String token);
>     // Optional setup (e.g., loading data); may be a no-op.
>     public void initialize(String dictionary)
>         throws FileNotFoundException, ParseException;
> }
>
> The only method that has to do anything is stem(), which takes a string
> in and returns a string. Using this, it was quite simple to integrate
> the open source Snowball stemmer.
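Presumably the Snowball integration then looks something like the following
adapter (my sketch, not your actual code; it uses the Snowball Java classes
from tartarus.org):

    import java.io.FileNotFoundException;
    import java.text.ParseException;

    import org.tartarus.snowball.ext.englishStemmer;

    // Adapter that plugs the Snowball English stemmer into the Stemmer
    // interface quoted above.
    public class SnowballStemmerAdapter implements Stemmer {
        private final englishStemmer snowball = new englishStemmer();

        public String stem(String token) {
            snowball.setCurrent(token);   // feed the token in
            snowball.stem();              // run the stemming algorithm
            return snowball.getCurrent(); // fetch the stemmed form
        }

        // Snowball needs no external dictionary file, so this is a no-op.
        public void initialize(String dictionary)
                throws FileNotFoundException, ParseException {
        }
    }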
>
>
>>
>>>
>>> Both systems provide the ability to specify the particular type of
>>> annotation to consider in lookups (e.g., uima.tt.TokenAnnotation), as
>>> well as an optional feature within that annotation, with both defaulting
>>> to the covered text. ConceptMapper also allows an annotation type to be
>>> used to bound lookups (e.g. a sentence at a time, or an NP at a time,
>>> etc.). Perhaps this was an oversight on my part, but I did not see this
>>> in the existing sandbox annotator.
>> Sorry, I don't understand what you mean by "ConceptMapper also allows an
>> annotation type to be used to bound lookups". Can you give an example?
>
> What I mean is that ConceptMapper works span by span, and that span is
> specified in the descriptor. Typically, that span is a sentence, but
> could be an NP or even the whole document. Dictionary lookups are
> limited to tokens that appear within a single span--no crossing of span
> boundaries is allowed. Does this make sense?
Yes, thanks!
>
>>
>>>
>>> Token skipping is an option in both systems, though it is implemented
>>> differently. ConceptMapper has two methods available: the ability to use
>>> a stop-word list to handle the simple case of omitting tokens based on
>>> lexical equality, and feature-based include/exclude lists. The latter is
>>> not as general as I'd like in its implementation. Perhaps the filter
>>> conditions of the current DictionaryAnnotator are better.
>>>
>>> Finally, and again this may be due to an oversight on my part in reading
>>> the documentation, it is not clear what the search strategy is for the
>>> current DictionaryAnnotator, but I would assume it finds non-overlapping
>>> longest matches. While ConceptMapper supports this as a default, there
>>> are three parameters in the AE descriptor to control the way the search
>>> is done.
>>
>> Right, you cannot configure the matching strategy for the
>> DictionaryAnnotator.
>> Currently the matching strategy is "first longest match" and no
>> "overlapping" annotations are created. So you are right: non-overlapping
>> longest matches.
>>
>>
>> Altogether, I see advantages for both systems. I'm not sure if there is a
>> way to create one dictionary component with the advantages of both, since
>> some of the base concepts are different, e.g. the dictionary content
>> object. But maybe we can try :-)
>>
>> -- Michael
>
-- Michael