My comments inline, below:
On May 10, 2008, at 2:56 AM, Michael Baessler wrote:
Hi Michael,
thanks for the detailed comparison. ConceptMapper seems to be very interesting, but I have some additional questions. Please see my comments below:
Michael Tanenblatt wrote:
OK, good question. I have never used the project that is in the sandbox, as ConceptMapper has been in development and production for a long time, so my comparisons are based solely on what I gleaned from the documentation. From this cursory knowledge of the DictionaryAnnotator that is already in the sandbox, I think that ConceptMapper provides significantly more functionality and customizability, while seemingly providing all of the functionality of the current DictionaryAnnotator.
Here is a comparison, with the caveats claimed earlier regarding my
level of familiarity with the current DictionaryAnnotator:
Both annotators allow for the use of the same tokenizer in dictionary tokenization as is used in the processing pipeline, though in slightly different ways (descriptor vs. PEAR file). ConceptMapper has no default tokenizer, though there is a simple one included in the package.
I think having a default tokenizer is important for the ease of use of the dictionary component. If users just want to use a simple list of words (or multi-word terms) for processing, they don't want to set up a separate tokenizer to create the dictionary. Can you explain in more detail what a user has to do to tokenize the content?
I am not sure I agree with you on this point. Since the default setup for ConceptMapper is to tokenize the dictionary at load time, which is when the processing pipeline is set up, and since there will need to be a tokenizer in the pipeline for processing the input text, I think that it is perfectly reasonable to require the specification of a tokenizer in the ConceptMapper AE descriptor for use as the tokenizer for the dictionary. This enforces the point that the same tokenizer be used for tokenizing the dictionary as for the input data, which I believe to be more than reasonable. In fact, I think it *should* be a requirement; otherwise, entries that are in the dictionary might not be found due to a tokenization mismatch.
To simplify setup for naïve users, as I said, there is a simple tokenizer annotator included in the ConceptMapper package, and that could be used for both the dictionary and text processing. It breaks on whitespace, plus any other characters specified in a parameter.
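A minimal sketch of what such a tokenizer might look like, given that description (this is an illustration of the stated behavior, not the actual class shipped with ConceptMapper; the class and parameter names are made up):

```java
import java.util.ArrayList;
import java.util.List;

// Toy tokenizer: breaks on whitespace plus any extra break characters
// supplied as a parameter, as the ConceptMapper docs describe.
class SimpleTokenizer {
    private final String breakChars;

    SimpleTokenizer(String breakChars) {
        this.breakChars = breakChars;
    }

    List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isWhitespace(c) || breakChars.indexOf(c) >= 0) {
                // End of a token: emit it if non-empty.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

The same instance would be applied to both the dictionary entries and the document text, which is what guarantees the tokenizations line up.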
One clear difference is that there is no dictionary creator for ConceptMapper; instead, you must build the XML file by hand. This is due, in part, to the fact that dictionary entries can have arbitrary attributes associated with them. This leads to what I think is a serious advantage of ConceptMapper: the attributes associated with dictionary entries can be copied to the annotations that are created in response to a successful lookup. This is very useful for attaching a code from some coding scheme (e.g., from a medical lexicon or ontology), a reference to the document from which the term was originally extracted, or any number of other features. There is no limit to the number of attributes attached to the dictionary entries, and the mapping from them to the resultant annotations is configurable in the AE descriptor.
So if I understand you correctly, the dictionary XML format is not predefined. The XML tags used to specify the dictionary content are related to the UIMA type system used. How do you check for errors in the dictionary definition?
The predefined portion of the XML is:

<token>
  <variant base="text string" />
  <variant base="text string2" />
  ...
</token>

which defines an entry with two variants. It is any additional attributes that you might want to add (a POS tag, a code, etc.) that are not predefined, but also not required. The only error checking is that the SAX parser will throw an exception if the above structure is not intact.
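To make that concrete, an entry with such optional attributes might look like the following sketch (the POSTag and SemCode attribute names and values are hypothetical, since attribute names are entirely up to the dictionary author):

```xml
<token POSTag="NN" SemCode="12345">
  <variant base="renal failure" />
  <variant base="kidney failure" />
</token>
```

The AE descriptor would then map POSTag and SemCode onto features of the annotation type created on a successful lookup.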
The resulting annotations are specified in the AE descriptor. So I think you have a mapping from dictionary XML elements/features to UIMA types/features? Is there a default mapping?
There is no default mapping. Any identifying attributes that you might want transferred to the resultant annotations can be put into the token element or the individual variant elements, but as I said, that is optional.
Can the dictionaries also be language specific?
Well, I am not sure what this means. If you mean: will ConceptMapper
load different dictionaries depending on the language setting, then
the answer is no. It currently allows only one dictionary resource to
be specified, and it will be loaded if necessary. I would agree that
this would be a nice feature to incorporate, though.
ConceptMapper only has provisions for using one dictionary per instance, though this is probably a relatively simple thing to augment. ConceptMapper dictionaries are implemented as shared resources. It is not clear if this is the case for the DictionaryAnnotator in the sandbox. One could also create a new implementation of the DictionaryResource interface. This was done in the case of the CompiledDictionaryResource_impl, which operates on a dictionary that has been parsed and then serialized, to allow for quick loading.
The DictionaryAnnotator cannot share dictionaries, since the
dictionaries are compiled
to internal data structures during initialization of the annotator.
The same is true of ConceptMapper, so I am not sure how useful it is that it is a UIMA resource. Nevertheless, it is one, and other instantiations of ConceptMapper could attach to that resource if needed.
In addition to the ability to do case-normalized matching, which both provide, ConceptMapper provides a mechanism to use a stemmer, which is applied to both the dictionary and the input documents.
Is the stemmer provided with the ConceptMapper package?
If not, how is it integrated?
None is provided. To adapt one for use, it needs to adhere to a simple interface:

public interface Stemmer {
    public String stem(String token);
    public void initialize(String dictionary) throws FileNotFoundException, ParseException;
}

The only method that has to do anything is stem(), which takes a string in and returns a string. Using this, it was quite simple to integrate the open source Snowball stemmer.
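As a self-contained sketch of how one plugs into that interface, here is a toy implementation (deliberately trivial suffix stripping, not Snowball; a real adapter would delegate stem() to the Snowball library instead):

```java
import java.io.FileNotFoundException;
import java.text.ParseException;

// The interface quoted above.
interface Stemmer {
    String stem(String token);
    void initialize(String dictionary) throws FileNotFoundException, ParseException;
}

// Toy stemmer: lower-cases and strips a simple plural "s".
// A Snowball-backed adapter would have the same shape, with stem()
// forwarding to the Snowball stemmer instead.
class SuffixStemmer implements Stemmer {
    public String stem(String token) {
        String t = token.toLowerCase();
        if (t.length() > 3 && t.endsWith("s") && !t.endsWith("ss")) {
            return t.substring(0, t.length() - 1);
        }
        return t;
    }

    public void initialize(String dictionary) {
        // This toy implementation needs no external resources.
    }
}
```

Because the same Stemmer instance is applied to the dictionary at load time and to the input tokens at lookup time, both sides normalize identically.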
Both systems provide the ability to specify the particular type of annotation to consider in lookups (e.g., uima.tt.TokenAnnotation), as well as an optional feature within that annotation, with both defaulting to the covered text. ConceptMapper also allows an annotation type to be used to bound lookups (e.g., a sentence at a time, or an NP at a time, etc.). Perhaps this was an oversight on my part, but I did not see this in the existing sandbox annotator.
Sorry, I don't understand what you mean by "ConceptMapper also allows an annotation type to be used to bound lookups". Can you give an example?
What I mean is that ConceptMapper works span by span, and that span is specified in the descriptor. Typically, that span is a sentence, but it could be an NP or even the whole document. Dictionary lookups are limited to tokens that appear within a single span; no crossing of span boundaries is allowed. Does this make sense?
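A rough sketch of the idea, combining the span bounding described here with the non-overlapping longest-match strategy discussed further down (this is an illustration of the behavior, not ConceptMapper's actual code, and the class name is made up):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy span-bounded dictionary lookup: matches are found only within the
// tokens of one span (e.g., one sentence), longest match first, with no
// overlapping matches.
class SpanLookup {
    private final Set<List<String>> dict;  // entries as token sequences
    private final int maxLen;              // longest entry, in tokens

    SpanLookup(Set<List<String>> dict) {
        this.dict = new HashSet<>(dict);
        int m = 1;
        for (List<String> entry : dict) {
            m = Math.max(m, entry.size());
        }
        this.maxLen = m;
    }

    // Returns non-overlapping longest matches within one span's tokens.
    List<List<String>> lookup(List<String> spanTokens) {
        List<List<String>> matches = new ArrayList<>();
        int i = 0;
        while (i < spanTokens.size()) {
            int best = 0;
            // Try the longest candidate first, shrinking until a match.
            for (int len = Math.min(maxLen, spanTokens.size() - i); len >= 1; len--) {
                if (dict.contains(spanTokens.subList(i, i + len))) {
                    best = len;
                    break;
                }
            }
            if (best > 0) {
                matches.add(new ArrayList<>(spanTokens.subList(i, i + best)));
                i += best;  // skip past the match: no overlaps
            } else {
                i++;
            }
        }
        return matches;
    }
}
```

Calling lookup() once per span is what prevents a dictionary entry from matching across, say, a sentence boundary.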
Token skipping is an option in both systems, though it is implemented differently. ConceptMapper has two methods available: the ability to use a stop-word list to handle the simple case of omitting tokens based on lexical equality, and feature-based include/exclude lists. The latter is not as general as I'd like in its implementation. Perhaps the filter conditions of the current DictionaryAnnotator are better.
Finally, and again this may be due to an oversight on my part in reading the documentation, it is not clear what the search strategy is for the current DictionaryAnnotator, but I would assume it finds non-overlapping longest matches. While ConceptMapper supports this as a default, there are three parameters in the AE descriptor to control the way the search is done.
Right, you cannot configure the matching strategy for the DictionaryAnnotator. Currently the matching strategy is "first longest match" and no "overlapping" annotations are created. So you are right: non-overlapping longest matches.
Altogether, I see advantages for both systems. I'm not sure if there is a way to create one dictionary component with the advantages of both, since some of the base concepts are different, e.g., the dictionary content object. But maybe we can try :-)
-- Michael