This is what my dict looks like:

<?xml version="1.0" encoding="UTF-8" ?>
<synonym>
  <token canonical="mwu" SemClass="mwu">
    <variant base="1. FC Straubing"/>
    <variant base="1. FC Styrum"/>
    <variant base="1. FC Tatran Presov"/>
    <variant base="1. FC Tatran Prešov"/>
    <variant base="1. FC Trogen"/>
    <variant base="1. FC Union"/>
    <variant base="1. FC Union Berlin"/>
    <variant base="1. FC Union Solingen"/>
    <variant base="1. FC Viersen"/>
    <variant base="1. FC Viersen 05"/>
    <variant base="1. FC Vöcklabruck"/>
    <variant base="1. FC Weißenfels"/>
    <variant base="1. FC Wernigerode"/>
    <variant base="1. FC Wilmersdorf"/>
    <variant base="1. FC Windeck"/>
    <variant base="1. FC Wolfsburg"/>
    <variant base="1. FC Wunstorf"/>
    <variant base="1. FC Zeitz"/>
    <variant base="1. FFC"/>
    <variant base="1. FFC 08 Niederkirchen"/>
    <variant base="1. FFC Fortuna Dresden-Rähnitz"/>
    <variant base="1. FFC Frankfurt"/>
    <variant base="1. FFC Montabaur"/>
  </token>
</synonym>

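
A rough way to see which token sequences these entries should produce under a plain whitespace tokenizer is sketched below. This is only an approximation for debugging: it does not call ConceptMapper or its tokenizer, the class name VariantTokens and the method variantTokens are hypothetical, and splitting on whitespace is an assumption about how the sample whitespace tokenizer behaves.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of ConceptMapper.
class VariantTokens {

    // For each <variant> element, split its "base" attribute on whitespace.
    // Assumption: this approximates what a whitespace tokenizer would emit.
    static List<List<String>> variantTokens(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList variants = doc.getElementsByTagName("variant");
            List<List<String>> result = new ArrayList<>();
            for (int i = 0; i < variants.getLength(); i++) {
                String base = ((Element) variants.item(i)).getAttribute("base");
                result.add(Arrays.asList(base.trim().split("\\s+")));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse dictionary XML", e);
        }
    }

    public static void main(String[] args) {
        // Two entries taken from the dictionary above.
        String xml = "<synonym><token canonical=\"mwu\" SemClass=\"mwu\">"
                + "<variant base=\"1. FC Union Berlin\"/>"
                + "<variant base=\"1. FFC Frankfurt\"/>"
                + "</token></synonym>";
        for (List<String> tokens : variantTokens(xml)) {
            System.out.println(tokens.size() + " tokens: " + tokens);
        }
    }
}
```

If the tokens that the real tokenizer produces for the same strings differ from these (for example, "1." split into "1" and "."), that difference would be a place to look for the partial matches.
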
On 20.03.2013 12:26, Michael Tanenblatt wrote:
> I have never seen this issue--under no circumstances should anything less
> than the full dictionary entry be matched. The only things I can think of are
> either errors in the dictionary, though that's unlikely, or issues with the
> tokenizer. Or a bug… My guess is that the dictionary entry, "FC Barcelona", is
> being tokenized such that only "FC" is annotated, therefore that is the only
> part that needs to match. You can test if it is a tokenization issue by using
> the sample whitespace tokenizer that comes with ConceptMapper just to test
> and see what results you get.
>
>
> On Mar 20, 2013, at 7:09 AM, Andreas Niekler
> <[email protected]> wrote:
>
>> Hello,
>>
>> I am trying to use the ConceptMapper to annotate multi-word units in German.
>> I face the problem that all the tokens within the dictionary entries are
>> somehow matched individually.
>>
>> In the dict -> FC Barcelona
>>
>> Annotated in a text "The FC scored today": FC is annotated as DictEntry
>>
>> Why does ConceptMapper annotate this? Here are my parameters:
>>
>> AnalysisEngineDescription mapper =
>>     AnalysisEngineFactory.createPrimitiveDescription(
>>         ConceptMapper.class,
>>         ts,
>>         ConceptMapper.PARAM_ANNOTATION_NAME,
>>             "org.apache.uima.conceptMapper.DictTerm",
>>         ConceptMapper.PARAM_ENCLOSINGSPAN, "enclosingSpan",
>>         ConceptMapper.PARAM_TOKENANNOTATION, "opennlp.uima.Token",
>>         ConceptMapper.PARAM_ATTRIBUTE_LIST, new String[] {"canonical"},
>>         ConceptMapper.PARAM_FEATURE_LIST, new String[] {"DictCanon"},
>>         ConceptMapper.PARAM_MATCHEDFEATURE, "matchedText",
>>         ConceptMapper.PARAM_TOKENIZERDESCRIPTOR, "TokenizerDE.xml",
>>         //ConceptMapper.PARAM_DATA_BLOCK_FS, "uima.tcas.DocumentAnnotation",
>>         ConceptMapper.PARAM_DATA_BLOCK_FS, "opennlp.uima.Sentence",
>>         ConceptMapper.PARAM_SEARCHSTRATEGY, "ContiguousMatch",
>>         ConceptMapper.PARAM_MATCHEDTOKENSFEATURENAME, "matchedTokens",
>>         TokenNormalizer.PARAM_CASE_MATCH, "ignoreall");
>>
>> Thank you
>>
>> Andreas
>
>

--
Andreas Niekler, Dipl. Ing. (FH)
NLP Group | Department of Computer Science
University of Leipzig
Johannisgasse 26 | 04103 Leipzig

mail: [email protected]
