Hello,

I have similar issues in the Portuguese corpus.

I tried two approaches:
1) I trained an OpenNLP Name Finder model with the MWEs extracted from the corpus;
2) I split the MWEs into separate tokens when training the POS Tagger, for example "por_favor_intj" --> "por_B-intj favor_I-intj"

Both approaches work OK in my system. I evaluated both, but I don't have the figures at hand right now; I can post the values here another day. I prefer the second method because it is one module less, but on the other hand it increases the number of outcomes.
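In case it helps, the conversion for the second approach is just string manipulation before training. A rough sketch in plain Java (the class and method names are mine, this is not part of the OpenNLP API):

    // Turn an underscore-joined MWE with a trailing POS label into
    // BIO-style tokens, e.g. "por_favor_intj" -> "por_B-intj favor_I-intj".
    public class MweSplitter {

        public static String splitMwe(String mwe) {
            int lastSep = mwe.lastIndexOf('_');
            String tag = mwe.substring(lastSep + 1);                // "intj"
            String[] words = mwe.substring(0, lastSep).split("_");  // {"por", "favor"}

            StringBuilder out = new StringBuilder();
            for (int i = 0; i < words.length; i++) {
                if (i > 0) out.append(' ');
                out.append(words[i]).append('_')
                   .append(i == 0 ? "B-" : "I-").append(tag);
            }
            return out.toString();
        }

        public static void main(String[] args) {
            System.out.println(splitMwe("por_favor_intj")); // por_B-intj favor_I-intj
        }
    }

Single-word entries are passed through unchanged, of course.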
William

On Fri, Jun 22, 2012 at 6:13 AM, Nicolas Hernandez <[email protected]> wrote:
> On Thu, Jun 21, 2012 at 5:55 PM, Jörn Kottmann <[email protected]> wrote:
> > To make this reliable the tokenization on new unseen
> > text must be done correctly. For the Spanish data we had
> > a special chunker to put the multi-word units into one token.
> >
> > Do you use something like that?
>
> Kind of.
>
> A POS tagger should always be used with the same tokenizer that was used
> to train it.
> The more you manage to automatically recognize multi-word units, the
> more you should consider them in your training; it makes the subsequent
> syntactic analysis easier.
> (Interaction can exist between the two processes, i.e. recognizing
> multi-word units can itself require POS tagging of simple words, but I
> won't go into that here.)
>
> In the FTB, the notion of multi-word expression (compound) is a bit
> fuzzy, and it is not simple to automatically reproduce the tokenization
> it assumes.
>
> > What do you think about outputting a special pos tag to indicate
> > that it is a multi-word tag?
>
> I do not see the point at the POS analysis level. We should try to
> re-use what already exists to preserve compatibility.
> Here I should mention that [Green:2011:MEI:2145432.2145516] (see
> below) introduced multi-word tags (one for each POS label: mw noun, mw
> verb, mw preposition, ...) at the chunk and syntax levels. This is
> probably a nice idea.
>
> The cli should offer a way to specify which units in the data are
> multi-word expressions.
> This can be done with a parameter that sets the token separator
> character.
>
> Models built from the cli and from the API should be the same.
> One way to achieve that is to use a parameter to set the multi-word
> separator character and to turn this separator into whitespace before
> training the model.
> For example, with " " as the token separator character, "_" as the
> multi-word separator character and "/" as the pos tag separator, the
> following sentence
> Nico/NNP wants/VBP to/TO get/VB to/TO bed/NN earlier/RB in_order_to/IN
> sleep/VB longer/RB
> should be turned into
> String[] sentence = {"Nico", "wants", "to", "get", "to", "bed",
> "earlier", "in order to", "sleep", "longer"};
> (note "in order to")
> What do you think about that?
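To make that concrete, here is a rough sketch of the conversion in plain Java (class and method names are made up; this is not existing OpenNLP code). It parses one training line with the separators described above and keeps each multi-word term as a single token:

    import java.util.Arrays;

    // Parse one tagged training line such as
    //   "Nico/NNP wants/VBP ... in_order_to/IN sleep/VB longer/RB"
    // with " " as token separator, "/" as tag separator and "_" as
    // multi-word separator. "_" is turned into whitespace inside tokens.
    public class MweLineReader {

        public static String[][] readLine(String line) {
            String[] taggedTokens = line.split(" ");
            String[] sentence = new String[taggedTokens.length];
            String[] tags = new String[taggedTokens.length];
            for (int i = 0; i < taggedTokens.length; i++) {
                int sep = taggedTokens[i].lastIndexOf('/');
                // "in_order_to" becomes the single token "in order to"
                sentence[i] = taggedTokens[i].substring(0, sep).replace('_', ' ');
                tags[i] = taggedTokens[i].substring(sep + 1);
            }
            return new String[][] { sentence, tags };
        }

        public static void main(String[] args) {
            String[][] s = readLine("Nico/NNP wants/VBP to/TO get/VB to/TO bed/NN "
                    + "earlier/RB in_order_to/IN sleep/VB longer/RB");
            System.out.println(Arrays.toString(s[0])); // ..., in order to, ...
            System.out.println(Arrays.toString(s[1])); // ..., IN, ...
        }
    }

The resulting sentence/tags arrays are exactly what the API-based training already expects, so with such a convention the cli and the API would produce the same models.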
>
> Best
>
> /Nicolas
>
>
> @inproceedings{Green:2011:MEI:2145432.2145516,
>   author    = {Green, Spence and de Marneffe, Marie-Catherine and Bauer, John and Manning, Christopher D.},
>   title     = {Multiword expression identification with tree substitution grammars: a parsing tour de force with French},
>   booktitle = {Proceedings of the Conference on Empirical Methods in Natural Language Processing},
>   series    = {EMNLP '11},
>   year      = {2011},
>   isbn      = {978-1-937284-11-4},
>   location  = {Edinburgh, United Kingdom},
>   pages     = {725--735},
>   numpages  = {11},
>   url       = {http://dl.acm.org/citation.cfm?id=2145432.2145516},
>   acmid     = {2145516},
>   publisher = {Association for Computational Linguistics},
>   address   = {Stroudsburg, PA, USA},
> }
>
>
> > Jörn
> >
> > On 06/21/2012 04:47 PM, Nicolas Hernandez wrote:
> >>
> >> Hi Jörn
> >>
> >> On Thu, Jun 21, 2012 at 9:50 AM, Jörn Kottmann <[email protected]> wrote:
> >>>
> >>> Hello,
> >>>
> >>> the lexical unit in the POS Tagger is a token. For the
> >>> Spanish POS models multi-token chunks were converted
> >>> into one token, with the words separated by a "_".
> >>>
> >>> To what would you set the lexical unit separator in your case?
> >>
> >> I do the same, but I'm a bit uneasy about doing that because
> >> 1. I do not like to pre- and post-process my data (here, adding/removing an
> >> underscore in the multi-word terms)
> >> 2. A model trained with the API, which allows you not to preprocess
> >> your data, will be different from the model trained with the cli on
> >> the same data
> >> 3. Finally, when you get a model you do not know which segmentation it
> >> assumes and how the multi-word terms are represented
> >>
> >> Since it is often convenient to use the cli, it would be nice to be able
> >> to set the token separator, at least to build the same models as with
> >> the API.
> >>
> >>> The pos tag separator can already be configured in the class
> >>> which reads the input, but this parameter cannot be set by the cli
> >>> tool.
> >>>
> >>> +1 to make both configurable from the command line.
> >>
> >> Nice.
> >>
> >> At least the idea has been proposed. If I have time...
> >>
> >>> Jörn
> >>>
> >>>
> >>> On 06/20/2012 03:02 PM, Nicolas Hernandez wrote:
> >>>>
> >>>> Hi Everyone
> >>>>
> >>>> I need to train the POS tagger on multi-word terms. In other words,
> >>>> some of my lexical units are made of several tokens separated by
> >>>> whitespace characters (like "traffic light", "feu rouge", "in order
> >>>> to", ...).
> >>>>
> >>>> I think the training API allows handling that, but the command line
> >>>> tools cannot. The former takes the words of a sentence as an array of
> >>>> strings. The latter assumes that the whitespace character is the
> >>>> lexical unit separator.
> >>>> A convention like concatenating all the words which are part of a
> >>>> multi-word term is not a solution, since in that case models built by
> >>>> the command line and by the API will be different.
> >>>>
> >>>> It would be great if we could set by parameter what the lexical
> >>>> unit separator is, as well as the pos tag separator.
> >>>>
> >>>> What do you think?
> >>>>
> >>>> /Nicolas
> >>>>
> >>>> [1]
> >>>> http://incubator.apache.org/opennlp/documentation/manual/opennlp.html#tools.postagger.tagging.api
> >>>
> >>
>
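To illustrate the API side mentioned in the original mail above: with the training API, multi-word terms can simply stay as single array entries. A minimal sketch, assuming the POSSample class from the current 1.5.x API (written from memory, please double-check the constructor against the Javadoc):

    import opennlp.tools.postag.POSSample;

    public class MweSampleDemo {
        public static void main(String[] args) {
            // The multi-word term "in order to" is kept as one lexical unit with one tag.
            String[] sentence = { "Nico", "wants", "to", "get", "to", "bed",
                                  "earlier", "in order to", "sleep", "longer" };
            String[] tags     = { "NNP", "VBP", "TO", "VB", "TO", "NN",
                                  "RB", "IN", "VB", "RB" };
            POSSample sample = new POSSample(sentence, tags);
            System.out.println(sample);
        }
    }

As far as I know, the trainer consumes a stream of such samples, which is why the cli should be able to produce the same segmentation.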
> --
> Dr. Nicolas Hernandez
> Associate Professor (Maître de Conférences)
> Université de Nantes - LINA CNRS UMR 6241
> http://enicolashernandez.blogspot.com
> http://www.univ-nantes.fr/hernandez-n
> +33 (0)2 51 12 53 94
> +33 (0)2 40 30 60 67
