On May 13, 2008, at 2:31 AM, Michael Baessler wrote:

Michael Tanenblatt wrote:
My comments inline, below:

On May 10, 2008, at 2:56 AM, Michael Baessler wrote:

Hi Michael,

thanks for the detailed comparison. ConceptMapper seems to be very
interesting
but I have some additional questions. Please see my comments below:

Michael Tanenblatt wrote:
OK, good question. I have never used the project that is in the sandbox as ConceptMapper has been in development and production for a long time,
so my comparisons are based solely on what I gleaned from the
documentation. From this cursory knowledge of the DictionaryAnnotator
that is already in the sandbox, I think that ConceptMapper provides
significantly more functionality and customizability, while seemingly providing all of the functionality of the current DictionaryAnnotator.
Here is a comparison, with the caveats claimed earlier regarding my
level of familiarity with the current DictionaryAnnotator:

Both annotators allow for the use of the same tokenizer in dictionary tokenization as is used in the processing pipeline, though in slightly different ways (descriptor vs. Pear file). ConceptMapper has no default
tokenizer, though there is a simple one included in the package.

I think having a default tokenizer is important for the "ease of use" of the
dictionary component. If users just want to use a simple list of words
(or multi-word terms) for processing, they don't want to set up a separate
tokenizer to create the dictionary. Can you explain in more detail what a
user has to do to tokenize the content?

I am not sure I agree with you on this point. Since the default setup
for ConceptMapper is to tokenize the dictionary at load time, which is when the processing pipeline is set up, and since there will need to be a tokenizer in the pipeline for processing the input text, I think
that it is perfectly reasonable to require the specification of a
tokenizer in the ConceptMapper AE descriptor for use as the tokenizer
for the dictionary. This enforces the point that the same tokenizer is used for tokenizing the dictionary as for the input data, which I believe is more than reasonable. In fact, I think it *should* be a requirement; otherwise, entries that are in the dictionary
might not be found, due to a tokenization mismatch.

To simplify setup for naïve users, as I said, there is a simple
tokenizer annotator included in the ConceptMapper package, and that
could be used for both the dictionary and text processing. It breaks on
whitespace, plus any other character specified in a parameter.
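As a rough sketch of that break-on-whitespace-plus-extra-characters behavior (the class and method names here are illustrative, not ConceptMapper's actual API):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in for ConceptMapper's simple tokenizer: it splits
// on whitespace plus any extra break characters supplied as a parameter.
class SimpleTokenizer {
    private final String extraBreakChars;

    SimpleTokenizer(String extraBreakChars) {
        this.extraBreakChars = extraBreakChars;
    }

    List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            // Break on whitespace or any configured extra character.
            if (Character.isWhitespace(c) || extraBreakChars.indexOf(c) >= 0) {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

With "-," as the extra break characters, "John Doe, M.D." would yield the tokens "John", "Doe", and "M.D." (the period is not a break character here), which shows why the same tokenizer must be applied to both the dictionary and the input text.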

It was not my intention to say that I disagree with the way ConceptMapper does the tokenization of the dictionary content. I was not aware that you use one of the tokenizers of the processing pipeline; I may have missed that in one of your previous mails.

The way ConceptMapper does the tokenization is fine with me.
Are there any special requirements for the tokenizer (multi-threading, resources, ...)? I think you create your own instance of the tokenizer during initialization of ConceptMapper.

The tokenizer is created using UIMAFramework.produceAnalysisEngine(). The resulting analysis engine is run against each dictionary entry's text using its process() method; the tokens created in the CAS as a result are collected, and the CAS is then reset for the next entry to be processed.



Some tokenizers produce different results based on the document language. Is there a setting to specify the language that should be used to tokenize the dictionary content?

There is an AE descriptor parameter that is passed to setDocumentLanguage() before the tokenizer processes the dictionary entries.






One clear difference is that there is no dictionary creator for ConceptMapper; instead, you must build the XML file by hand. This is due, in part, to the fact that dictionary entries can have arbitrary attributes associated with them. This leads to what I think is a serious advantage of ConceptMapper: these attributes associated with dictionary entries can be copied to the annotations that are created in response to a successful lookup. This is very useful for attaching a code from some coding scheme (e.g., from a medical lexicon or ontology), a reference to the document from which the term was originally extracted, or any number of other features. There is no limit to the number of attributes attached to the dictionary entries, and the mapping from them to the resultant annotations is configurable in the AE descriptor.

So if I understand you correctly, the dictionary XML format is not
predefined. The XML tags used to specify the dictionary content are
related to the UIMA type system in use. How do you check for errors
in the dictionary definition?

The predefined portion of the XML is:

<token>
   <variant base="text string" />
   <variant base="text string2" />
   ...
</token>

which defines an entry with two variants. It is only the additional
attributes that you might want to add (POS tag, code, etc.) that are not
predefined, but they are also not required. The only error checking is that
the SAX parser will throw an exception if the above structure is not
intact.



The resulting annotations are specified in the AE descriptor. So I
think you have a mapping from
dictionary XML elements/features to UIMA types/features? Is there a
default mapping?


There is no default mapping. Any identifying attributes that you might
want transferred to the resultant annotations can be put into the token
element or the individual variant elements, but as I said, that is
optional.
OK, so when adding a feature "email" to the token as shown in the example below, there has to be an "email" feature (String-valued) in the type system for the created result annotation.

<token email="[EMAIL PROTECTED]">
  <variant base="John Doe" />
</token>

Let's say you specify that the resultant annotations are going to be of the type "DictTerm", and each has a feature "EMailAddress". You could then specify the mapping from "email" to "EMailAddress" in the descriptor, and then when "John Doe" was found in the text, it would be annotated with a DictTerm whose "EMailAddress" is "[EMAIL PROTECTED]".
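The attribute-to-feature mapping idea can be sketched in isolation as follows; all names here (AttributeMapper, mapFeatures, the example address) are hypothetical illustrations, not ConceptMapper's actual API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: copy attributes parsed off a dictionary <token>
// element onto features of the result annotation, under a mapping that
// would come from the AE descriptor (here represented as a plain Map).
class AttributeMapper {
    private final Map<String, String> attrToFeature;

    AttributeMapper(Map<String, String> attrToFeature) {
        this.attrToFeature = attrToFeature;
    }

    // Returns feature-name -> value pairs for the result annotation;
    // attributes with no configured mapping are simply ignored.
    Map<String, String> mapFeatures(Map<String, String> entryAttributes) {
        Map<String, String> features = new HashMap<>();
        for (Map.Entry<String, String> e : entryAttributes.entrySet()) {
            String feature = attrToFeature.get(e.getKey());
            if (feature != null) {
                features.put(feature, e.getValue());
            }
        }
        return features;
    }
}
```

So a mapping of "email" -> "EMailAddress" would carry the dictionary entry's email attribute onto the DictTerm annotation's EMailAddress feature, while unmapped attributes are dropped.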






Can the dictionaries also be language specific?

Well, I am not sure what this means. If you mean: will ConceptMapper
load different dictionaries depending on the language setting, then the
answer is no. It currently allows only one dictionary resource to be
specified, and it will be loaded if necessary. I would agree that this
would be a nice feature to incorporate, though.

I just want to know if the dictionary can have a language setting. In some cases the dictionary content is language specific, so I want to add a setting to the dictionary indicating that the content should be used for English only.

So your reply answers my question :-)

Right, whichever dictionary you set as the DictionaryFile resource is the one that is used.






ConceptMapper only has provisions for using one dictionary per instance,
though this is probably a relatively simple thing to augment.

ConceptMapper dictionaries are implemented as shared resources. It is
not clear if this is the case for the DictionaryAnnotator in the
sandbox. One could also create a new implementation of the
DictionaryResource interface. This was done in the case of the
CompiledDictionaryResource_impl, which operates on a dictionary that has
been parsed and then serialized, to allow for quick loading.

The DictionaryAnnotator cannot share dictionaries, since the
dictionaries are compiled
to internal data structures during initialization of the annotator.

The same is true of ConceptMapper, so I am not sure how useful it is
that it is a UIMA resource. Nevertheless, it is one, and other
instantiations of ConceptMapper could attach to that resource, if needed.




In addition to the ability to do case-normalized matching, which both provide, ConceptMapper provides a mechanism to use a stemmer, which is
applied to both the dictionary and the input documents.

Is the stemmer provided with the ConceptMapper package?
If not, how is it integrated?

None is provided. To adapt one for use, it needs to adhere to a simple
interface:

public interface Stemmer {
    public String stem(String token);
    public void initialize(String dictionary)
            throws FileNotFoundException, ParseException;
}

The only method that has to do anything is stem(), which takes a string
in and returns a string. Using this, it was quite simple to integrate
the open source Snowball stemmer.
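To make the interface concrete, here is a minimal, self-contained sketch with a deliberately naive suffix-stripping implementation; NaiveSuffixStemmer is my own illustration of how little is needed to plug a stemmer in, and a real integration would wrap something like the Snowball stemmer instead:

```java
import java.io.FileNotFoundException;
import java.text.ParseException;

// The Stemmer interface quoted above, restated so this example compiles
// on its own.
interface Stemmer {
    String stem(String token);
    void initialize(String dictionary)
            throws FileNotFoundException, ParseException;
}

// Hypothetical, deliberately naive stemmer: lowercase the token and
// strip one common English suffix. Only for illustration.
class NaiveSuffixStemmer implements Stemmer {
    private static final String[] SUFFIXES = { "ing", "ed", "es", "s" };

    @Override
    public String stem(String token) {
        String lower = token.toLowerCase();
        for (String suffix : SUFFIXES) {
            // Require a remaining stem of at least 3 characters so
            // short words are left alone.
            if (lower.endsWith(suffix)
                    && lower.length() - suffix.length() >= 3) {
                return lower.substring(0, lower.length() - suffix.length());
            }
        }
        return lower;
    }

    @Override
    public void initialize(String dictionary) {
        // Nothing to load for this rule-based stemmer; a dictionary-backed
        // stemmer would read its data file here.
    }
}
```

Since ConceptMapper applies the same stemmer to both the dictionary and the input text, "matching" in a document and "matches" in the dictionary would both reduce to "match" and therefore align.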




Both systems provide the ability to specify the particular type of
annotation to consider in lookups (e.g., uima.tt.TokenAnnotation), as well as an optional feature within that annotation, with both defaulting to the covered text. ConceptMapper also allows an annotation type to be used to bound lookups (e.g. a sentence at a time, or an NP at a time, etc.). Perhaps this was an oversight on my part, but I did not see this
in the existing sandbox annotator.

Sorry, I don't understand what you mean by "ConceptMapper also allows an
annotation type to be used to bound lookups". Can you give an example?

What I mean is that ConceptMapper works span by span, and that span is
specified in the descriptor. Typically, that span is a sentence, but it
could be an NP or even the whole document. Dictionary lookups are
limited to tokens that appear within a single span; no crossing of span
boundaries is allowed. Does this make sense?

Yes, thanks!




Token skipping is an option in both systems, though it is implemented
differently. ConceptMapper has two methods available: the
ability to use a stop-word list to handle the simple case of omitting
tokens based on lexical equality, and feature-based include/exclude
lists. The latter is not as general as I'd like in its implementation.
Perhaps the filter conditions of the current DictionaryAnnotator are
better.

Finally, and again this may be due to an oversight on my part in reading the documentation, it is not clear what the search strategy is for the current DictionaryAnnotator, but I would assume it finds non-overlapping longest matches. While ConceptMapper supports this as a default, there are three parameters in the AE descriptor to control the way the search
is done.

Right, you cannot configure the matching strategy for the
DictionaryAnnotator. Currently the matching strategy is "first longest
match" and no "overlapping" annotations are created. So you are right:
non-overlapping longest matches.


Altogether, I see advantages for both systems. I'm not sure if there is
a way to create one dictionary component with the advantages of both,
since some of the base concepts are different, e.g. the dictionary
content object. But maybe we can try :-)

-- Michael


-- Michael
