On Wed, Jun 27, 2012 at 3:40 PM, Jörn Kottmann <[email protected]> wrote:
> On 06/27/2012 03:07 PM, Nicolas Hernandez wrote:
>>
>> On Wed, Jun 27, 2012 at 10:03 AM, Jörn Kottmann <[email protected]> wrote:
>>
>>>
>>> That would move the responsibility to detect multi-word units
>>> to the POS Tagger.
>>>
>>> A simple transformation step could convert the output to the
>>> non-multi-word-tags format.
>>>
>>> This has the advantage that a user can detect a multi-word unit.
>>
>> Would the "special POS tags" be for the whole multi-word expression
>> (MWE) or for each word of the MWE?
>>
>> Anyway, in my opinion the POS tagger trainer should not be aware of
>> that, and we must not force users to change their tagset to use the
>> OpenNLP tools.
>> If the data is annotated with POS tags that carry MWE information,
>> then those will be handled like any tags over simple words.
>
>
> Yes, that should just work with our current way of handling the data.
> A user even has the option with the current implementation to customize
> the POS Tagger to handle it differently; even including a custom
> MWE detector model in the POS model package would be possible.
>
> That should be fine as it is.
>
>>>> The CLI should offer a way to specify what the multi-word expressions
>>>> in the data are.
>>>> This can be done by using a parameter to set the token
>>>> separator character.
>>>>
>>>> Models built from the CLI or the API should be the same.
>>>> One way to do that is to use a parameter to set the multi-word
>>>> separator character and to turn it into whitespace before
>>>> training the model.
>>>> For example, with " " as the token separator character, "_" as the
>>>> multi-word separator character, and "/" as the POS tag separator, the
>>>> following sentence
>>>> Nico/NNP wants/VBP to/TO get/VB to/TO bed/NN earlier/RB in_order_to/IN sleep/VB longer/RB
>>>> should be turned into
>>>> String[] sentence = {"Nico", "wants", "to", "get", "to", "bed", "earlier", "in order to", "sleep", "longer"};
>>>> (note "in order to")
>>>> What do you think about that?
>>>>
>>>>
>>> I think that could be problematic if your training or test data contains
>>> the multi-word separator character. In that case you might consider
>>> something a multi-word unit which should not be one.
>>> What do you think about using SGML-style tags as we do in the NER
>>> training format?
>>> For example: <MW>in order to</MW>/IN.
>>
>> I do not like mixing annotation systems: either everything is in XML
>> (<MW pos="IN">in order to</MW>) or nothing is.
>> The multi-word separator can be a string rather than a single character
>> (e.g. "_##_", which is quite rare). The point is that the user should be
>> informed about the problem you mention, and since it is up to him to set
>> the string by parameter, he will make an informed and wise choice.
>> If you mix both annotation systems, then the ambiguity problem also
>> remains for the start and end tags you use.
>>
>>> Would you prefer dealing with multi-word units at the tokenizer level
>>> or at the POS Tagger level?
>>> Or do we need support for both?
>>
>>
>> I think we agree that whichever analyzers we use (POS, Chunk,
>> Parser, NER...), all should have been built on data word-tokenized in
>> the same way.
>
>
> If a token can be an MWE then we need to fix all our formats to
> support the MWE separator char sequence. Currently we use the whitespace
> tokenizer in most places to process our training data.
> That we should change, and use one which is sensitive to MWEs separated
> by a char sequence.
>
> We should specify a default MWE separator which is used when serializing
> data with MWEs, and offer an option to specify it.
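A minimal sketch of the separator-to-whitespace conversion Nicolas proposes above, assuming "/" as the tag separator and a configurable MWE separator string; MweSampleParser and its method are hypothetical names for illustration, not part of the OpenNLP API:

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical helper, not part of the OpenNLP API.
    public class MweSampleParser {

        private final String mweSeparator; // e.g. "_", or a rarer string such as "_##_"

        public MweSampleParser(String mweSeparator) {
            this.mweSeparator = mweSeparator;
        }

        // Turns "... in_order_to/IN sleep/VB ..." into the tokens
        // {"in order to", "sleep"} and the tags {"IN", "VB"}.
        public String[][] parse(String line) {
            List<String> tokens = new ArrayList<>();
            List<String> tags = new ArrayList<>();
            for (String part : line.split(" ")) {
                int sep = part.lastIndexOf('/');
                String token = part.substring(0, sep);
                // Replace the MWE separator with whitespace before training,
                // so "in_order_to" becomes the single token "in order to".
                tokens.add(token.replace(mweSeparator, " "));
                tags.add(part.substring(sep + 1));
            }
            return new String[][] {tokens.toArray(new String[0]),
                                   tags.toArray(new String[0])};
        }
    }

Splitting on the last "/" keeps tokens that themselves contain a slash intact, and choosing a rare separator string such as "_##_" mitigates the ambiguity Jörn raises above.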
>> Personally I do not use the OpenNLP word tokenizer. Actually I used it,
>> but I also use dictionaries and regular expressions, which lead me to a
>> richer concept of "word" (what is called a "lexical unit"). I take
>> them as input of my processing chain.
>>
>> And I also use UIMA, and UIMA processes annotations. If the annotation
>> stands for "lexical unit", some can be MWEs and some simple words;
>> it is transparent for the user.
>>
>> So I would like the OpenNLP POS tagger/trainer CLI to offer me
>> a way to build models I can use with UIMA without pre/post-processing.
>>
>> In my opinion, an OpenNLP labeller/trainer should offer users the
>> possibility of adapting its input/output to their data,
>> and not the opposite.
>
>
> Yes, I see this the same way.
>
> Would you mind opening a JIRA issue to request MWE support in our
> training formats?
>
> Jörn
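And a rough sketch of the "simple transformation step" Jörn mentions at the top of the thread, spreading an MWE's tag over its component words so the tagged output matches the non-multi-word-tags format; MweFlattener is likewise a hypothetical name:

    // Hypothetical post-processing step, not an existing OpenNLP class.
    public class MweFlattener {

        // flatten({"in order to", "sleep"}, {"IN", "VB"})
        // yields "in/IN order/IN to/IN sleep/VB".
        public static String flatten(String[] tokens, String[] tags) {
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < tokens.length; i++) {
                // Every word of a multi-word token inherits the MWE's tag.
                for (String word : tokens[i].split(" ")) {
                    if (out.length() > 0) {
                        out.append(' ');
                    }
                    out.append(word).append('/').append(tags[i]);
                }
            }
            return out.toString();
        }
    }

A variant could use B-/I-style tag prefixes instead, so the MWE boundaries remain recoverable after flattening.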
done https://issues.apache.org/jira/browse/OPENNLP-515
