Altogether sounds good to me, I'm interested :-)

-- Michael

Michael Tanenblatt wrote:
> 
> On May 13, 2008, at 2:31 AM, Michael Baessler wrote:
> 
>> Michael Tanenblatt wrote:
>>> My comments inline, below:
>>>
>>> On May 10, 2008, at 2:56 AM, Michael Baessler wrote:
>>>
>>>> Hi Michael,
>>>>
>>>> thanks for the detailed comparison. ConceptMapper seems to be very
>>>> interesting
>>>> but I have some additional questions. Please see my comments below:
>>>>
>>>> Michael Tanenblatt wrote:
>>>>> OK, good question. I have never used the project that is in the
>>>>> sandbox
>>>>> as ConceptMapper has been in development and production for a long
>>>>> time,
>>>>> so my comparisons are based solely on what I gleaned from the
>>>>> documentation. From this cursory knowledge of the DictionaryAnnotator
>>>>> that is already in the sandbox, I think that ConceptMapper provides
>>>>> significantly more functionality and customizability, while seemingly
>>>>> providing all of the functionality of the current DictionaryAnnotator.
>>>>> Here is a comparison, with the caveats claimed earlier regarding my
>>>>> level of familiarity with the current DictionaryAnnotator:
>>>>>
>>>>> Both annotators allow for the use of the same tokenizer in dictionary
>>>>> tokenization as is used in the processing pipeline, though in slightly
>>>>> different ways (descriptor vs. Pear file). ConceptMapper has no
>>>>> default
>>>>> tokenizer, though there is a simple one included in the package.
>>>>
>>>> I think having a default tokenizer is important for the "ease of use"
>>>> of the dictionary component. If users just want to use a simple list
>>>> of words (or multi-word terms) for processing, they don't want to set
>>>> up a separate tokenizer to create the dictionary. Can you explain in
>>>> more detail what a user has to do to tokenize the content?
>>>
>>> I am not sure I agree with you on this point. Since the default setup
>>> for ConceptMapper is to tokenize the dictionary at load time, which is
>>> when the processing pipeline is set up, and since there will need to be
>>> a tokenizer in the pipeline for processing the input text, I think
>>> that it is perfectly reasonable to require the specification of a
>>> tokenizer in the ConceptMapper AE descriptor for use as the tokenizer
>>> for the dictionary. This enforces the point that the same tokenizer is
>>> used for tokenizing the dictionary as for the input data, which I
>>> believe to be more than reasonable. In fact, I think it *should* be a
>>> requirement; otherwise entries that are in the dictionary might not be
>>> found, due to a tokenization mismatch.
>>>
>>> To simplify setup for naïve users, as I said, there is a simple
>>> tokenizer annotator included in the ConceptMapper package, and that
>>> could be used for both the dictionary and text processing. It breaks on
>>> whitespace, plus any other character specified in a parameter.
>>
>> It was not my intention to say that I disagree with the way
>> ConceptMapper does the tokenization of the dictionary content. I was
>> not aware that you use one of the tokenizers of the processing
>> pipeline. Maybe I missed that in one of your previous mails.
>>
>> The way ConceptMapper does the tokenization is fine with me.
>> Are there any special requirements for the tokenizer (multi-threading,
>> resources, ...)? I think you
>> create your own instance of the tokenizer during initialization of
>> ConceptMapper.
> 
> The tokenizer is created using UIMAFramework.produceAnalysisEngine().
> The resulting analysis engine is run against each dictionary entry's
> text using its process() method; the tokens created in the CAS are
> collected, and then the CAS is reset for the next entry to be processed.
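> 
> In rough outline it looks like the following (a sketch, not the actual
> code: error handling is omitted, and tokenizerDescriptorPath,
> dictionaryEntries, tokenType and storeEntry() are illustrative names):
> 
> import java.util.ArrayList;
> import java.util.List;
> import org.apache.uima.UIMAFramework;
> import org.apache.uima.analysis_engine.AnalysisEngine;
> import org.apache.uima.cas.CAS;
> import org.apache.uima.cas.FSIterator;
> import org.apache.uima.cas.text.AnnotationFS;
> import org.apache.uima.resource.ResourceSpecifier;
> import org.apache.uima.util.XMLInputSource;
> 
> // produce the tokenizer AE from the descriptor named in the
> // ConceptMapper AE descriptor
> XMLInputSource in = new XMLInputSource(tokenizerDescriptorPath);
> ResourceSpecifier spec =
>     UIMAFramework.getXMLParser().parseResourceSpecifier(in);
> AnalysisEngine tokenizer = UIMAFramework.produceAnalysisEngine(spec);
> CAS cas = tokenizer.newCAS();
> 
> for (String entryText : dictionaryEntries) {
>     cas.setDocumentText(entryText);
>     tokenizer.process(cas);
>     // collect the tokens the tokenizer created for this entry
>     List<String> entryTokens = new ArrayList<String>();
>     FSIterator<AnnotationFS> it =
>         cas.getAnnotationIndex(tokenType).iterator();
>     while (it.hasNext()) {
>         entryTokens.add(it.next().getCoveredText());
>     }
>     storeEntry(entryTokens); // index this entry's token sequence
>     cas.reset();             // clear the CAS for the next entry
> }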
> 
>>
>>
>> Some tokenizers produce different results based on the document
>> language. Is there a setting
>> to specify the language that should be used to tokenize the dictionary
>> content?
> 
> There is an AE descriptor parameter that is passed to
> setDocumentLanguage() for the tokenizer processing of the dictionary
> entries.
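> 
> That is, roughly (variable names illustrative):
> 
>     cas.setDocumentText(entryText);
>     cas.setDocumentLanguage(dictionaryLanguage); // value of that parameter
>     tokenizer.process(cas);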
> 
>>
>>>
>>>>
>>>>
>>>>>
>>>>> One clear difference is that there is no dictionary creator for
>>>>> ConceptMapper; instead, you must build the XML file by hand. This is
>>>>> due, in part, to the fact that dictionary entries can have arbitrary
>>>>> attributes associated with them. This leads to what I think is a
>>>>> serious
>>>>> advantage of ConceptMapper: these attributes associated with
>>>>> dictionary
>>>>> entries can be copied to the annotations that are created in
>>>>> response to
>>>>> a successful lookup. This is very useful for attaching a code from
>>>>> some
>>>>> coding scheme (e.g., from a medical lexicon or ontology) or a
>>>>> reference
>>>>> to a document from which the term was originally extracted, or any
>>>>> number
>>>>> of other features. There is no limit to the number of attributes
>>>>> attached to the dictionary entries, and the mapping from them to the
>>>>> resultant annotations is configurable in the AE descriptor.
>>>>
>>>> So if I understand you correctly, the dictionary XML format is not
>>>> predefined. The XML tags used to specify the dictionary content are
>>>> related to the UIMA type system in use. How do you check for errors
>>>> in the dictionary definition?
>>>
>>> The predefined portion of the XML is:
>>>
>>> <token>
>>>    <variant base="text string" />
>>>    <variant base="text string2" />
>>>    ...
>>> </token>
>>>
>>> which defines an entry with two variants. It is any additional
>>> attributes that you might want to add (POS tag, code, etc.) that are
>>> not predefined, but also not required. The only error checking is that
>>> the SAX parser will throw an exception if the above structure is not
>>> intact.
>>>
>>>>
>>>>
>>>> The resulting annotations are specified in the AE descriptor. So I
>>>> think you have a mapping from
>>>> dictionary XML elements/features to UIMA types/features? Is there a
>>>> default mapping?
>>>>
>>>
>>> There is no default mapping. Any identifying attributes that you might
>>> want transferred to the resultant annotations can be put into the token
>>> element or the individual variant elements, but as I said, that is
>>> optional.
>> OK, so when adding a feature "email" to the token as shown in the
>> example below, there has to be an "email" feature (String valued) in
>> the type system for the created result annotation.
>>
>> <token email="[EMAIL PROTECTED]">
>>   <variant base="John Doe" />
>> </token>
> 
> Let's say you specify that the resultant annotations are going to be of
> the type "DictTerm", and that each has a feature "EMailAddress". You could
> then specify the mapping from "email" to "EMailAddress" in the
> descriptor, and then when "John Doe" was found in the text, it would be
> annotated with a DictTerm with an "EMailAddress" of
> "[EMAIL PROTECTED]".
> 
> 
> 
>>
>>>
>>>
>>>> Can the dictionaries also be language specific?
>>>
>>> Well, I am not sure what this means. If you mean: will ConceptMapper
>>> load different dictionaries depending on the language setting, then the
>>> answer is no. It currently allows only one dictionary resource to be
>>> specified, and it will be loaded if necessary. I would agree that this
>>> would be a nice feature to incorporate, though.
>>
>> I just want to know if the dictionary can have a language setting. In
>> some cases the dictionary content is language specific, so I would
>> like to be able to add a setting to the dictionary indicating that
>> the content should be used for English only.
>>
>> So your reply answers my question :-)
> 
> Right, whichever dictionary you set as the DictionaryFile resource is
> the one that is used.
> 
>>
>>>
>>>>
>>>>
>>>>>
>>>>> ConceptMapper only has provisions for using one dictionary per
>>>>> instance,
>>>>> though this is probably a relatively simple thing to augment.
>>>>>
>>>>> ConceptMapper dictionaries are implemented as shared resources. It is
>>>>> not clear if this is the case for the DictionaryAnnotator in the
>>>>> sandbox. One could also create a new implementation of the
>>>>> DictionaryResource interface. This was done in the case of the
>>>>> CompiledDictionaryResource_impl, which operates on a dictionary
>>>>> that has
>>>>> been parsed and then serialized, to allow for quick loading.
>>>>
>>>> The DictionaryAnnotator cannot share dictionaries, since the
>>>> dictionaries are compiled
>>>> to internal data structures during initialization of the annotator.
>>>
>>> The same is true of ConceptMapper, so I am not sure how useful it is
>>> that it is a UIMA resource. Nevertheless, it is one, and other
>>> instantiations of ConceptMapper could attach to that resource, if
>>> needed.
>>>
>>>>
>>>>
>>>>>
>>>>> In addition to the ability to do case-normalized matching, which both
>>>>> provide, ConceptMapper provides a mechanism to use a stemmer, which is
>>>>> applied to both the dictionary and the input documents.
>>>>
>>>> Is the stemmer provided with the ConceptMapper package?
>>>> If not, how is it integrated?
>>>
>>> None is provided. To adapt one for use, it needs to adhere to a simple
>>> interface:
>>>
>>> public interface Stemmer {
>>>    public String stem(String token);
>>>    public void initialize(String dictionary)
>>>        throws FileNotFoundException, ParseException;
>>> }
>>>
>>> The only method that has to do anything is stem(), which takes a string
>>> in and returns a string. Using this, it was quite simple to integrate
>>> the open source Snowball stemmer.
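>>>
>>> For example, a wrapper around Snowball's English stemmer comes down to
>>> a few lines (a sketch, assuming the Snowball Java library; there is
>>> nothing for initialize() to load in this case):
>>>
>>> import org.tartarus.snowball.ext.englishStemmer;
>>>
>>> public class SnowballStemmerWrapper implements Stemmer {
>>>    private final englishStemmer stemmer = new englishStemmer();
>>>
>>>    public String stem(String token) {
>>>        stemmer.setCurrent(token);
>>>        stemmer.stem();
>>>        return stemmer.getCurrent();
>>>    }
>>>
>>>    public void initialize(String dictionary) {
>>>        // no-op: Snowball needs no external dictionary file
>>>    }
>>> }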
>>>
>>>
>>>>
>>>>>
>>>>> Both systems provide the ability to specify the particular type of
>>>>> annotation to consider in lookups (e.g., uima.tt.TokenAnnotation), as
>>>>> well as an optional feature within that annotation, with both
>>>>> defaulting
>>>>> to the covered text. ConceptMapper also allows an annotation type
>>>>> to be
>>>>> used to bound lookups (e.g. a sentence at a time, or an NP at a time,
>>>>> etc.). Perhaps this was an oversight on my part, but I did not see
>>>>> this
>>>>> in the existing sandbox annotator.
>>>> Sorry, I don't understand what you mean by "ConceptMapper also allows
>>>> an annotation type to be used to bound lookups". Can you give an
>>>> example?
>>>
>>> What I mean is that ConceptMapper works span by span, and that span is
>>> specified in the descriptor. Typically, that span is a sentence, but
>>> could be an NP or even the whole document. Dictionary lookups are
>>> limited to tokens that appear within a single span--no crossing of span
>>> boundaries is allowed.
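>>>
>>> In code terms, the lookup loop can be pictured like this (a sketch
>>> using UIMA's subiterator mechanism; spanType and tokenType are
>>> illustrative variables):
>>>
>>> // iterate over the bounding spans (e.g., sentences), then only over
>>> // the tokens inside the current span; matching never crosses a
>>> // span boundary
>>> FSIterator<AnnotationFS> spans =
>>>     cas.getAnnotationIndex(spanType).iterator();
>>> while (spans.hasNext()) {
>>>    AnnotationFS span = spans.next();
>>>    FSIterator<AnnotationFS> tokens =
>>>        cas.getAnnotationIndex(tokenType).subiterator(span);
>>>    while (tokens.hasNext()) {
>>>        AnnotationFS token = tokens.next();
>>>        // dictionary matching is restricted to this span's tokens
>>>    }
>>> }
>>>
>>> Does this make sense?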
>>
>> Yes, thanks!
>>
>>>
>>>>
>>>>>
>>>>> Token skipping is an option in both systems, though it is implemented
>>>>> differently. ConceptMapper has two methods available: the ability to
>>>>> use a stop-word list to handle the simple case of omitting tokens
>>>>> based on lexical equality, and feature-based include/exclude lists.
>>>>> The latter is not as general as I'd like in its implementation.
>>>>> Perhaps the filter conditions of the current DictionaryAnnotator are
>>>>> better.
>>>>>
>>>>> Finally, and again this may be due to an oversight on my part in reading
>>>>> the documentation, it is not clear what the search strategy is for the
>>>>> current DictionaryAnnotator, but I would assume it finds
>>>>> non-overlapping
>>>>> longest matches. While ConceptMapper supports this as a default, there
>>>>> are three parameters in the AE descriptor to control the way the
>>>>> search
>>>>> is done.
>>>>
>>>> Right, you cannot configure the matching strategy for the
>>>> DictionaryAnnotator.
>>>> Currently the matching strategy is "first longest match" and no
>>>> "overlapping"
>>>> annotations are created. So you are right: non-overlapping longest
>>>> matches.
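>>>>
>>>> For illustration, that strategy amounts to a greedy left-to-right
>>>> scan, roughly like this (a sketch; tokens, maxEntryLen, dictionary
>>>> and createAnnotation() are hypothetical helpers):
>>>>
>>>> int i = 0;
>>>> while (i < tokens.size()) {
>>>>     // try the longest candidate starting at i first, then shrink
>>>>     int matchLen = 0;
>>>>     for (int len = Math.min(maxEntryLen, tokens.size() - i);
>>>>          len > 0; len--) {
>>>>         if (dictionary.contains(tokens.subList(i, i + len))) {
>>>>             matchLen = len;
>>>>             break;
>>>>         }
>>>>     }
>>>>     if (matchLen > 0) {
>>>>         createAnnotation(tokens.get(i), tokens.get(i + matchLen - 1));
>>>>         i += matchLen; // skip past the match: no overlapping annotations
>>>>     } else {
>>>>         i++; // no entry starts here; move on
>>>>     }
>>>> }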
>>>>
>>>>
>>>> Altogether, I see advantages for both systems. I'm not sure if there
>>>> is a way to create one dictionary component with the advantages of
>>>> both, since some of the base concepts are different, e.g., the
>>>> dictionary content object. But maybe we can try :-)
>>>>
>>>> -- Michael
>>>
>>
>> -- Michael
> 
