Any tokenizer should work--I only suggested using the whitespace tokenizer as a 
way to see if the tokenizer you are using may be the problem in this case. As 
to the DictionaryAnnotator, I really can't say much, since I haven't used it. 
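
To make the check concrete, here is a minimal standalone sketch of what I mean by comparing tokenizations. It does not use UIMA or ConceptMapper at all and is not how ConceptMapper is implemented; the class name TokenizationCheck and both method names are made up for illustration. It only shows that with whitespace-only tokenization, "1." stays one token, and that "FC" alone can match only if the dictionary entry itself has been reduced to the single token "FC".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of UIMA or ConceptMapper.
final class TokenizationCheck {

    // Split on runs of whitespace only, so punctuation such as "1." stays attached.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // True if all entry tokens occur in the text contiguously and in order, ignoring case.
    static boolean containsEntry(List<String> textTokens, List<String> entryTokens) {
        if (entryTokens.isEmpty()) {
            return false;
        }
        for (int start = 0; start + entryTokens.size() <= textTokens.size(); start++) {
            boolean allMatch = true;
            for (int i = 0; i < entryTokens.size(); i++) {
                String a = textTokens.get(start + i).toLowerCase(Locale.ROOT);
                String b = entryTokens.get(i).toLowerCase(Locale.ROOT);
                if (!a.equals(b)) {
                    allMatch = false;
                    break;
                }
            }
            if (allMatch) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> text = whitespaceTokens("The FC scored today");

        // Whitespace tokenization keeps the "1." prefix as a single token.
        System.out.println(whitespaceTokens("1. FC Union Berlin")); // [1., FC, Union, Berlin]

        // The full two-token entry is not in the text, so it should not match.
        System.out.println(containsEntry(text, whitespaceTokens("FC Barcelona"))); // false

        // If tokenization reduced the entry to just "FC", the text would match.
        System.out.println(containsEntry(text, List.of("FC"))); // true
    }
}
```

If the real tokenizer produces a different token list for a dictionary entry than the whitespace split does, that difference is the first place I would look.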


On Mar 20, 2013, at 10:56 AM, Andreas Niekler 
<[email protected]> wrote:

> I will investigate whether this is the case. I will try using only a
> whitespace tokenizer.
> 
> Is there any information on whether the DictionaryAnnotator would serve me
> better? And if so, is it as fast as the ConceptMapper?
> 
> Thanks for the clarification
> 
> Andreas
> 
> On 20.03.2013 14:16, Michael Tanenblatt wrote:
>> One thing that looks odd to me is that each entry has the "1. " prefix. 
>> Perhaps that is causing a problem, if the tokenizer is putting a sentence 
>> break at that point? Just a guess. 
>> 
>> 
>> On Mar 20, 2013, at 8:38 AM, Andreas Niekler 
>> <[email protected]> wrote:
>> 
>>> This is what my dict looks like:
>>> 
>>> <?xml version="1.0" encoding="UTF-8" ?>
>>> <synonym>
>>> <token canonical="mwu" SemClass="mwu">
>>> <variant base="1. FC Straubing"/>
>>> <variant base="1. FC Styrum"/>
>>> <variant base="1. FC Tatran Presov"/>
>>> <variant base="1. FC Tatran Prešov"/>
>>> <variant base="1. FC Trogen"/>
>>> <variant base="1. FC Union"/>
>>> <variant base="1. FC Union Berlin"/>
>>> <variant base="1. FC Union Solingen"/>
>>> <variant base="1. FC Viersen"/>
>>> <variant base="1. FC Viersen 05"/>
>>> <variant base="1. FC Vöcklabruck"/>
>>> <variant base="1. FC Weißenfels"/>
>>> <variant base="1. FC Wernigerode"/>
>>> <variant base="1. FC Wilmersdorf"/>
>>> <variant base="1. FC Windeck"/>
>>> <variant base="1. FC Wolfsburg"/>
>>> <variant base="1. FC Wunstorf"/>
>>> <variant base="1. FC Zeitz"/>
>>> <variant base="1. FFC"/>
>>> <variant base="1. FFC 08 Niederkirchen"/>
>>> <variant base="1. FFC Fortuna Dresden-Rähnitz"/>
>>> <variant base="1. FFC Frankfurt"/>
>>> <variant base="1. FFC Montabaur"/>
>>> </token>
>>> </synonym>
>>> 
>>> On 20.03.2013 12:26, Michael Tanenblatt wrote:
>>>> I have never seen this issue--under no circumstances should anything less 
>>>> than the full dictionary entry be matched. The only things I can think of 
>>>> are either errors in the dictionary, though that's unlikely, or issues 
>>>> with the tokenizer. Or a bug… My guess is that the dictionary entry, "FC 
>>>> Barcelona" is being tokenized such that only "FC" is annotated, therefore 
>>>> that is the only part that needs to match. You can test if it is a 
>>>> tokenization issue by using the sample whitespace tokenizer that comes 
>>>> with ConceptMapper just to test and see what results you get.
>>>> 
>>>> 
>>>> On Mar 20, 2013, at 7:09 AM, Andreas Niekler 
>>>> <[email protected]> wrote:
>>>> 
>>>>> Hello,
>>>>> 
>>>>> I am trying to use the ConceptMapper to annotate multi-word units in
>>>>> German. The problem is that individual tokens of a dictionary entry are
>>>>> matched on their own. For example:
>>>>> 
>>>>> In the dict -> FC Barcelona
>>>>> 
>>>>> In the text "The FC scored today", "FC" is annotated as a DictEntry.
>>>>> 
>>>>> Why does ConceptMapper annotate this? Here are my parameters:
>>>>> 
>>>>> AnalysisEngineDescription mapper =
>>>>>     AnalysisEngineFactory.createPrimitiveDescription(
>>>>>         ConceptMapper.class,
>>>>>         ts,
>>>>>         ConceptMapper.PARAM_ANNOTATION_NAME, "org.apache.uima.conceptMapper.DictTerm",
>>>>>         ConceptMapper.PARAM_ENCLOSINGSPAN, "enclosingSpan",
>>>>>         ConceptMapper.PARAM_TOKENANNOTATION, "opennlp.uima.Token",
>>>>>         ConceptMapper.PARAM_ATTRIBUTE_LIST, new String[] {"canonical"},
>>>>>         ConceptMapper.PARAM_FEATURE_LIST, new String[] {"DictCanon"},
>>>>>         ConceptMapper.PARAM_MATCHEDFEATURE, "matchedText",
>>>>>         ConceptMapper.PARAM_TOKENIZERDESCRIPTOR, "TokenizerDE.xml",
>>>>>         //ConceptMapper.PARAM_DATA_BLOCK_FS, "uima.tcas.DocumentAnnotation",
>>>>>         ConceptMapper.PARAM_DATA_BLOCK_FS, "opennlp.uima.Sentence",
>>>>>         ConceptMapper.PARAM_SEARCHSTRATEGY, "ContiguousMatch",
>>>>>         ConceptMapper.PARAM_MATCHEDTOKENSFEATURENAME, "matchedTokens",
>>>>>         TokenNormalizer.PARAM_CASE_MATCH, "ignoreall");
>>>>> 
>>>>> Thank you
>>>>> 
>>>>> Andreas
>>>> 
>>>> 
>>> 
>>> -- 
>>> Andreas Niekler, Dipl. Ing. (FH)
>>> NLP Group | Department of Computer Science
>>> University of Leipzig
>>> Johannisgasse 26 | 04103 Leipzig
>>> 
>>> mail: [email protected]
>> 
>> 
> 
