Any tokenizer should work--I only suggested using the whitespace tokenizer as a way to see if the tokenizer you are using may be the problem in this case. As to the DictionaryAnnotator, I really can't say much, since I haven't used it.
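
To make the tokenization concern concrete, here is a minimal, self-contained Java sketch. It does not use ConceptMapper, UIMA, or OpenNLP. The class `TokenizationSketch` and both helper methods are hypothetical names invented for illustration, and `naiveSentences` is a deliberately crude stand-in for a sentence detector, not a model of how OpenNLP behaves. It only shows how a dictionary entry such as "1. FC Union Berlin" stays in one piece under whitespace tokenization, but could be cut in two if a sentence break is placed after the "1." prefix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of ConceptMapper, UIMA, or OpenNLP.
final class TokenizationSketch {

    // Split on runs of whitespace only, so punctuation stays attached to its token.
    static List<String> whitespaceTokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Deliberately naive splitter (an assumption for this sketch):
    // break after every period that is followed by whitespace.
    static List<String> naiveSentences(String text) {
        List<String> sentences = new ArrayList<>();
        for (String part : text.trim().split("(?<=\\.)\\s+")) {
            if (!part.isEmpty()) {
                sentences.add(part);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String entry = "1. FC Union Berlin";
        // The whole entry survives as four consecutive tokens.
        System.out.println(whitespaceTokens(entry)); // prints [1., FC, Union, Berlin]
        // A break after "1." leaves the entry spread over two spans.
        System.out.println(naiveSentences(entry));   // prints [1., FC Union Berlin]
    }
}
```

If matching is limited to one sentence at a time, as the `opennlp.uima.Sentence` data block in the configuration quoted below suggests, a break like the second one would mean no single span contains the full entry. Whether that is what actually happens here is still only a guess, and comparing against the whitespace tokenizer is the way to check it.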

On Mar 20, 2013, at 10:56 AM, Andreas Niekler <[email protected]> wrote:

> I will investigate whether this is the case. I will try using only a
> whitespace tokenizer.
>
> Is there any information on whether the DictionaryAnnotator would help me
> more in that case? And if so, is it as fast as the ConceptMapper?
>
> Thanks for the clarification
>
> Andreas
>
> On 20.03.2013 14:16, Michael Tanenblatt wrote:
>> One thing that looks odd to me is that each entry has the "1. " prefix.
>> Perhaps that is causing a problem, if the tokenizer is putting a sentence
>> break at that point? Just a guess.
>>
>>
>> On Mar 20, 2013, at 8:38 AM, Andreas Niekler
>> <[email protected]> wrote:
>>
>>> This is what my dict looks like:
>>>
>>> <?xml version="1.0" encoding="UTF-8" ?>
>>> <synonym>
>>>   <token canonical="mwu" SemClass="mwu">
>>>     <variant base="1. FC Straubing"/>
>>>     <variant base="1. FC Styrum"/>
>>>     <variant base="1. FC Tatran Presov"/>
>>>     <variant base="1. FC Tatran Prešov"/>
>>>     <variant base="1. FC Trogen"/>
>>>     <variant base="1. FC Union"/>
>>>     <variant base="1. FC Union Berlin"/>
>>>     <variant base="1. FC Union Solingen"/>
>>>     <variant base="1. FC Viersen"/>
>>>     <variant base="1. FC Viersen 05"/>
>>>     <variant base="1. FC Vöcklabruck"/>
>>>     <variant base="1. FC Weißenfels"/>
>>>     <variant base="1. FC Wernigerode"/>
>>>     <variant base="1. FC Wilmersdorf"/>
>>>     <variant base="1. FC Windeck"/>
>>>     <variant base="1. FC Wolfsburg"/>
>>>     <variant base="1. FC Wunstorf"/>
>>>     <variant base="1. FC Zeitz"/>
>>>     <variant base="1. FFC"/>
>>>     <variant base="1. FFC 08 Niederkirchen"/>
>>>     <variant base="1. FFC Fortuna Dresden-Rähnitz"/>
>>>     <variant base="1. FFC Frankfurt"/>
>>>     <variant base="1. FFC Montabaur"/>
>>>   </token>
>>> </synonym>
>>>
>>> On 20.03.2013 12:26, Michael Tanenblatt wrote:
>>>> I have never seen this issue--under no circumstances should anything less
>>>> than the full dictionary entry be matched. The only things I can think of
>>>> are either errors in the dictionary, though that's unlikely, or issues
>>>> with the tokenizer. Or a bug… My guess is that the dictionary entry, "FC
>>>> Barcelona", is being tokenized such that only "FC" is annotated, therefore
>>>> that is the only part that needs to match. You can test if it is a
>>>> tokenization issue by using the sample whitespace tokenizer that comes
>>>> with ConceptMapper just to test and see what results you get.
>>>>
>>>>
>>>> On Mar 20, 2013, at 7:09 AM, Andreas Niekler
>>>> <[email protected]> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I am trying to use the ConceptMapper to annotate multi-word units in
>>>>> German. The problem I face is that single tokens from dictionary
>>>>> entries are matched on their own, like this:
>>>>>
>>>>> In the dict -> FC Barcelona
>>>>>
>>>>> Annotated in a text: in "The FC scored today", FC is annotated as
>>>>> DictEntry.
>>>>>
>>>>> Why does ConceptMapper annotate this? Here are my parameters:
>>>>>
>>>>> AnalysisEngineDescription mapper =
>>>>>     AnalysisEngineFactory.createPrimitiveDescription(
>>>>>         ConceptMapper.class,
>>>>>         ts,
>>>>>         ConceptMapper.PARAM_ANNOTATION_NAME, "org.apache.uima.conceptMapper.DictTerm",
>>>>>         ConceptMapper.PARAM_ENCLOSINGSPAN, "enclosingSpan",
>>>>>         ConceptMapper.PARAM_TOKENANNOTATION, "opennlp.uima.Token",
>>>>>         ConceptMapper.PARAM_ATTRIBUTE_LIST, new String[] {"canonical"},
>>>>>         ConceptMapper.PARAM_FEATURE_LIST, new String[] {"DictCanon"},
>>>>>         ConceptMapper.PARAM_MATCHEDFEATURE, "matchedText",
>>>>>         ConceptMapper.PARAM_TOKENIZERDESCRIPTOR, "TokenizerDE.xml",
>>>>>         //ConceptMapper.PARAM_DATA_BLOCK_FS, "uima.tcas.DocumentAnnotation",
>>>>>         ConceptMapper.PARAM_DATA_BLOCK_FS, "opennlp.uima.Sentence",
>>>>>         ConceptMapper.PARAM_SEARCHSTRATEGY, "ContiguousMatch",
>>>>>         ConceptMapper.PARAM_MATCHEDTOKENSFEATURENAME, "matchedTokens",
>>>>>         TokenNormalizer.PARAM_CASE_MATCH, "ignoreall");
>>>>>
>>>>> Thank you
>>>>>
>>>>> Andreas
>>>>
>>>>
>>>
>>> --
>>> Andreas Niekler, Dipl. Ing. (FH)
>>> NLP Group | Department of Computer Science
>>> University of Leipzig
>>> Johannisgasse 26 | 04103 Leipzig
>>>
>>> mail: [email protected]
>>
>>
>
> --
> Andreas Niekler, Dipl. Ing. (FH)
> NLP Group | Department of Computer Science
> University of Leipzig
> Johannisgasse 26 | 04103 Leipzig
>
> mail: [email protected]
