One thing that looks odd to me is that each entry has the "1. " prefix. Perhaps
that is causing the problem, if the tokenizer is putting a sentence break at
that point? Just a guess.
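
To make the guess concrete, here is a minimal sketch in plain Java. It does not
use ConceptMapper or OpenNLP at all: splitNaively is a hypothetical stand-in
that breaks after every period followed by whitespace, which is only an
assumption about how a sentence detector might treat "1. ". If lookups are
confined to each opennlp.uima.Sentence span (my reading of the
PARAM_DATA_BLOCK_FS setting in your parameters), an entry cut in two by such a
break could never be matched as a whole.

```java
import java.util.ArrayList;
import java.util.List;

class SentenceBreakCheck {

    // Hypothetical stand-in for a sentence detector: breaks after every
    // period that is followed by whitespace. NOT OpenNLP's actual algorithm.
    static List<String> splitNaively(String text) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length() - 1; i++) {
            if (text.charAt(i) == '.' && Character.isWhitespace(text.charAt(i + 1))) {
                sentences.add(text.substring(start, i + 1).trim());
                start = i + 1;
            }
        }
        String rest = text.substring(start).trim();
        if (!rest.isEmpty()) {
            sentences.add(rest);
        }
        return sentences;
    }

    // True if the complete entry lies inside a single sentence span.
    static boolean entryFitsInOneSentence(String text, String entry) {
        for (String sentence : splitNaively(text)) {
            if (sentence.contains(entry)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "Heute spielt der 1. FC Union Berlin.";
        System.out.println(splitNaively(text));
        System.out.println(entryFitsInOneSentence(text, "1. FC Union Berlin"));
        System.out.println(entryFitsInOneSentence(text, "FC Union Berlin"));
    }
}
```

With this naive splitter the full entry "1. FC Union Berlin" never sits inside
one span, while "FC Union Berlin" does. One way to check whether something
similar happens in your pipeline might be to switch the data block back to
uima.tcas.DocumentAnnotation (the line you have commented out) and compare the
results.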


On Mar 20, 2013, at 8:38 AM, Andreas Niekler 
<[email protected]> wrote:

> This is what my dict looks like:
> 
> <?xml version="1.0" encoding="UTF-8" ?>
> <synonym>
> <token canonical="mwu" SemClass="mwu">
> <variant base="1. FC Straubing"/>
> <variant base="1. FC Styrum"/>
> <variant base="1. FC Tatran Presov"/>
> <variant base="1. FC Tatran Prešov"/>
> <variant base="1. FC Trogen"/>
> <variant base="1. FC Union"/>
> <variant base="1. FC Union Berlin"/>
> <variant base="1. FC Union Solingen"/>
> <variant base="1. FC Viersen"/>
> <variant base="1. FC Viersen 05"/>
> <variant base="1. FC Vöcklabruck"/>
> <variant base="1. FC Weißenfels"/>
> <variant base="1. FC Wernigerode"/>
> <variant base="1. FC Wilmersdorf"/>
> <variant base="1. FC Windeck"/>
> <variant base="1. FC Wolfsburg"/>
> <variant base="1. FC Wunstorf"/>
> <variant base="1. FC Zeitz"/>
> <variant base="1. FFC"/>
> <variant base="1. FFC 08 Niederkirchen"/>
> <variant base="1. FFC Fortuna Dresden-Rähnitz"/>
> <variant base="1. FFC Frankfurt"/>
> <variant base="1. FFC Montabaur"/>
> </token>
> </synonym>
> 
> On 20.03.2013 at 12:26, Michael Tanenblatt wrote:
>> I have never seen this issue. Under no circumstances should anything less
>> than the full dictionary entry be matched. The only causes I can think of
>> are errors in the dictionary (though that's unlikely), issues with the
>> tokenizer, or a bug. My guess is that the dictionary entry "FC Barcelona"
>> is being tokenized such that only "FC" is annotated, so that is the only
>> part that needs to match. You can check whether it is a tokenization issue
>> by swapping in the sample whitespace tokenizer that comes with
>> ConceptMapper and seeing what results you get.
>> 
>> 
>> On Mar 20, 2013, at 7:09 AM, Andreas Niekler 
>> <[email protected]> wrote:
>> 
>>> Hello,
>>> 
>>> I am trying to use ConceptMapper to annotate multi-word units in German.
>>> The problem is that single tokens from dictionary entries are matched on
>>> their own, like this:
>>> 
>>> In the dict -> FC Barcelona
>>> 
>>> In the text "The FC scored today", "FC" is annotated as DictEntry.
>>> 
>>> Why does ConceptMapper annotate this? Here are my parameters:
>>> 
>>> AnalysisEngineDescription mapper =
>>>     AnalysisEngineFactory.createPrimitiveDescription(
>>>         ConceptMapper.class,
>>>         ts,
>>>         ConceptMapper.PARAM_ANNOTATION_NAME, "org.apache.uima.conceptMapper.DictTerm",
>>>         ConceptMapper.PARAM_ENCLOSINGSPAN, "enclosingSpan",
>>>         ConceptMapper.PARAM_TOKENANNOTATION, "opennlp.uima.Token",
>>>         ConceptMapper.PARAM_ATTRIBUTE_LIST, new String[] {"canonical"},
>>>         ConceptMapper.PARAM_FEATURE_LIST, new String[] {"DictCanon"},
>>>         ConceptMapper.PARAM_MATCHEDFEATURE, "matchedText",
>>>         ConceptMapper.PARAM_TOKENIZERDESCRIPTOR, "TokenizerDE.xml",
>>>         //ConceptMapper.PARAM_DATA_BLOCK_FS, "uima.tcas.DocumentAnnotation",
>>>         ConceptMapper.PARAM_DATA_BLOCK_FS, "opennlp.uima.Sentence",
>>>         ConceptMapper.PARAM_SEARCHSTRATEGY, "ContiguousMatch",
>>>         ConceptMapper.PARAM_MATCHEDTOKENSFEATURENAME, "matchedTokens",
>>>         TokenNormalizer.PARAM_CASE_MATCH, "ignoreall");
>>> 
>>> Thank you
>>> 
>>> Andreas
>> 
>> 
> 
> -- 
> Andreas Niekler, Dipl. Ing. (FH)
> NLP Group | Department of Computer Science
> University of Leipzig
> Johannisgasse 26 | 04103 Leipzig
> 
> mail: [email protected]
