One thing that looks odd to me is that each entry has the "1. " prefix. Perhaps that is causing the problem, if the tokenizer or sentence detector is putting a sentence break at that point? Just a guess.
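
To make the guess concrete, here is a minimal sketch of the failure mode I have in mind. It does not use OpenNLP or ConceptMapper at all: `naiveSentences` is a hypothetical splitter that ends a sentence after every period followed by whitespace, and the class and method names are made up for illustration. Since your configuration sets PARAM_DATA_BLOCK_FS to "opennlp.uima.Sentence", lookups happen inside each sentence, so a break after "1." would mean the full entry never appears contiguously within one block.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SentenceBreakSketch {

    // Hypothetical naive splitter (an assumption, not the OpenNLP detector):
    // ends a sentence after every '.' that is followed by whitespace.
    static List<String> naiveSentences(String text) {
        List<String> sentences = new ArrayList<>();
        for (String s : text.split("(?<=\\.)\\s+")) {
            if (!s.isEmpty()) {
                sentences.add(s);
            }
        }
        return sentences;
    }

    // Plain whitespace tokenization.
    static List<String> whitespaceTokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // True if the entry's tokens occur as one contiguous run in the whole text.
    static boolean matchesInWholeText(String text, String entry) {
        return Collections.indexOfSubList(
                whitespaceTokens(text), whitespaceTokens(entry)) >= 0;
    }

    // True if the entry's tokens occur contiguously inside a single sentence.
    static boolean matchesWithinSentences(String text, String entry) {
        List<String> entryTokens = whitespaceTokens(entry);
        for (String sentence : naiveSentences(text)) {
            if (Collections.indexOfSubList(
                    whitespaceTokens(sentence), entryTokens) >= 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String entry = "1. FC Union Berlin";
        String text = "Yesterday 1. FC Union Berlin won the match.";

        System.out.println("sentences:    " + naiveSentences(text));
        System.out.println("whole text:   " + matchesInWholeText(text, entry));
        System.out.println("per sentence: " + matchesWithinSentences(text, entry));
    }
}
```

With this toy splitter the entry is found when the whole text is one block, but not when matching is restricted to sentences, because "1." ends up in a different sentence from "FC Union Berlin". Whether your real sentence detector behaves this way is something you would have to check; switching PARAM_DATA_BLOCK_FS back to "uima.tcas.DocumentAnnotation" (the line you have commented out) might be a quick way to test it.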

On Mar 20, 2013, at 8:38 AM, Andreas Niekler <[email protected]> wrote:

> This is what my dict looks like:
>
> <?xml version="1.0" encoding="UTF-8" ?>
> <synonym>
>   <token canonical="mwu" SemClass="mwu">
>     <variant base="1. FC Straubing"/>
>     <variant base="1. FC Styrum"/>
>     <variant base="1. FC Tatran Presov"/>
>     <variant base="1. FC Tatran Prešov"/>
>     <variant base="1. FC Trogen"/>
>     <variant base="1. FC Union"/>
>     <variant base="1. FC Union Berlin"/>
>     <variant base="1. FC Union Solingen"/>
>     <variant base="1. FC Viersen"/>
>     <variant base="1. FC Viersen 05"/>
>     <variant base="1. FC Vöcklabruck"/>
>     <variant base="1. FC Weißenfels"/>
>     <variant base="1. FC Wernigerode"/>
>     <variant base="1. FC Wilmersdorf"/>
>     <variant base="1. FC Windeck"/>
>     <variant base="1. FC Wolfsburg"/>
>     <variant base="1. FC Wunstorf"/>
>     <variant base="1. FC Zeitz"/>
>     <variant base="1. FFC"/>
>     <variant base="1. FFC 08 Niederkirchen"/>
>     <variant base="1. FFC Fortuna Dresden-Rähnitz"/>
>     <variant base="1. FFC Frankfurt"/>
>     <variant base="1. FFC Montabaur"/>
>   </token>
> </synonym>
>
> On 20.03.2013 12:26, Michael Tanenblatt wrote:
>> I have never seen this issue--under no circumstances should anything less
>> than the full dictionary entry be matched. The only things I can think of
>> are either errors in the dictionary, though that's unlikely, or issues with
>> the tokenizer. Or a bug… My guess is that the dictionary entry, "FC
>> Barcelona", is being tokenized such that only "FC" is annotated, therefore
>> that is the only part that needs to match. You can test if it is a
>> tokenization issue by using the sample whitespace tokenizer that comes with
>> ConceptMapper just to test and see what results you get.
>>
>>
>> On Mar 20, 2013, at 7:09 AM, Andreas Niekler
>> <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> I am trying to use the ConceptMapper to annotate multi-word units in
>>> German. I face the problem that all the tokens within the dictionary are
>>> matched somehow, like this:
>>>
>>> In the dict -> FC Barcelona
>>>
>>> Annotated in a text "The FC scored today": FC is annotated as DictEntry
>>>
>>> Why does ConceptMapper annotate this? Here are my parameters:
>>>
>>> AnalysisEngineDescription mapper =
>>>     AnalysisEngineFactory.createPrimitiveDescription(
>>>         ConceptMapper.class,
>>>         ts,
>>>         ConceptMapper.PARAM_ANNOTATION_NAME, "org.apache.uima.conceptMapper.DictTerm",
>>>         ConceptMapper.PARAM_ENCLOSINGSPAN, "enclosingSpan",
>>>         ConceptMapper.PARAM_TOKENANNOTATION, "opennlp.uima.Token",
>>>         ConceptMapper.PARAM_ATTRIBUTE_LIST, new String[] {"canonical"},
>>>         ConceptMapper.PARAM_FEATURE_LIST, new String[] {"DictCanon"},
>>>         ConceptMapper.PARAM_MATCHEDFEATURE, "matchedText",
>>>         ConceptMapper.PARAM_TOKENIZERDESCRIPTOR, "TokenizerDE.xml",
>>>         //ConceptMapper.PARAM_DATA_BLOCK_FS, "uima.tcas.DocumentAnnotation",
>>>         ConceptMapper.PARAM_DATA_BLOCK_FS, "opennlp.uima.Sentence",
>>>         ConceptMapper.PARAM_SEARCHSTRATEGY, "ContiguousMatch",
>>>         ConceptMapper.PARAM_MATCHEDTOKENSFEATURENAME, "matchedTokens",
>>>         TokenNormalizer.PARAM_CASE_MATCH, "ignoreall");
>>>
>>> Thank you
>>>
>>> Andreas
>>
>>
>
> --
> Andreas Niekler, Dipl. Ing. (FH)
> NLP Group | Department of Computer Science
> University of Leipzig
> Johannisgasse 26 | 04103 Leipzig
>
> mail: [email protected]
