Hello,

I have similar issues in the Portuguese corpus.

I tried two approaches:
1) I trained an OpenNLP Name Finder model with the MWEs extracted from the corpus;
2) I split the MWEs into separate tokens when training the POS Tagger, for example "por_favor_intj" --> "por_B-intj favor_I-intj"

Both approaches work OK in my system. I evaluated both, but I don't have the figures at hand right now; I can post the values here another day. I prefer the second method because it is one module less, but on the other hand it increases the number of outcomes.
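In case it helps, the conversion for the second approach is just string manipulation before training. A rough sketch in plain Java (the class and method names are mine, this is not part of the OpenNLP API):

    // Turn an underscore-joined MWE with a trailing POS label into
    // BIO-style tokens, e.g. "por_favor_intj" -> "por_B-intj favor_I-intj".
    public class MweSplitter {

        public static String splitMwe(String mwe) {
            int lastSep = mwe.lastIndexOf('_');
            String tag = mwe.substring(lastSep + 1);                // "intj"
            String[] words = mwe.substring(0, lastSep).split("_");  // {"por", "favor"}

            StringBuilder out = new StringBuilder();
            for (int i = 0; i < words.length; i++) {
                if (i > 0) out.append(' ');
                out.append(words[i]).append('_')
                   .append(i == 0 ? "B-" : "I-").append(tag);
            }
            return out.toString();
        }

        public static void main(String[] args) {
            System.out.println(splitMwe("por_favor_intj")); // por_B-intj favor_I-intj
        }
    }

Single-word entries are passed through unchanged, of course.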
William

On Fri, Jun 22, 2012 at 6:13 AM, Nicolas Hernandez <[email protected]> wrote:
> On Thu, Jun 21, 2012 at 5:55 PM, Jörn Kottmann <[email protected]> wrote:
> > To make this reliable the tokenization on new unseen
> > text must be done correctly. For the Spanish data we had
> > a special chunker to put the multi-word units into one token.
> >
> > Do you use something like that?
>
> Kind of.
>
> A POS tagger should always be used with the same tokenizer that was used
> to train it.
> The more you manage to automatically recognize multi-word units, the
> more you should consider them in your training; it makes the subsequent
> syntactic analysis easier.
> (Interaction can exist between the two processes, i.e. recognizing
> multi-word units can itself require POS tagging of simple words, but I
> won't go into that here.)
>
> In the FTB, the notion of multi-word expression (compound) is a bit
> fuzzy, and it is not simple to automatically reproduce the tokenization
> it assumes.
>
> > What do you think about outputting a special pos tag to indicate
> > that it is a multi-word tag?
>
> I do not see the point at the POS analysis level. We should try to
> re-use what already exists to preserve compatibility.
> Here I should mention that [Green:2011:MEI:2145432.2145516] (see
> below) introduced multi-word tags (one for each POS label: mw noun, mw
> verb, mw preposition, ...) at the chunk and syntax levels. This is
> probably a nice idea.
>
> The cli should offer a way to specify which units in the data are
> multi-word expressions.
> This can be done with a parameter that sets the token separator
> character.
>
> Models built from the cli and from the API should be the same.
> One way to achieve that is to use a parameter to set the multi-word
> separator character and to turn this separator into whitespace before
> training the model.
> For example, with " " as the token separator character, "_" as the
> multi-word separator character and "/" as the pos tag separator, the
> following sentence
> Nico/NNP wants/VBP to/TO get/VB to/TO bed/NN earlier/RB in_order_to/IN
> sleep/VB longer/RB
> should be turned into
> String[] sentence = {"Nico", "wants", "to", "get", "to", "bed",
> "earlier", "in order to", "sleep", "longer"};
> (note "in order to")
> What do you think about that?
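To make that concrete, here is a rough sketch of the conversion in plain Java (class and method names are made up; this is not existing OpenNLP code). It parses one training line with the separators described above and keeps each multi-word term as a single token:

    import java.util.Arrays;

    // Parse one tagged training line such as
    //   "Nico/NNP wants/VBP ... in_order_to/IN sleep/VB longer/RB"
    // with " " as token separator, "/" as tag separator and "_" as
    // multi-word separator. "_" is turned into whitespace inside tokens.
    public class MweLineReader {

        public static String[][] readLine(String line) {
            String[] taggedTokens = line.split(" ");
            String[] sentence = new String[taggedTokens.length];
            String[] tags = new String[taggedTokens.length];
            for (int i = 0; i < taggedTokens.length; i++) {
                int sep = taggedTokens[i].lastIndexOf('/');
                // "in_order_to" becomes the single token "in order to"
                sentence[i] = taggedTokens[i].substring(0, sep).replace('_', ' ');
                tags[i] = taggedTokens[i].substring(sep + 1);
            }
            return new String[][] { sentence, tags };
        }

        public static void main(String[] args) {
            String[][] s = readLine("Nico/NNP wants/VBP to/TO get/VB to/TO bed/NN "
                    + "earlier/RB in_order_to/IN sleep/VB longer/RB");
            System.out.println(Arrays.toString(s[0])); // ..., in order to, ...
            System.out.println(Arrays.toString(s[1])); // ..., IN, ...
        }
    }

The resulting sentence/tags arrays are exactly what the API-based training already expects, so with such a convention the cli and the API would produce the same models.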
>
> Best
>
> /Nicolas
>
>
> @inproceedings{Green:2011:MEI:2145432.2145516,
>   author    = {Green, Spence and de Marneffe, Marie-Catherine and Bauer, John and Manning, Christopher D.},
>   title     = {Multiword expression identification with tree substitution grammars: a parsing tour de force with French},
>   booktitle = {Proceedings of the Conference on Empirical Methods in Natural Language Processing},
>   series    = {EMNLP '11},
>   year      = {2011},
>   isbn      = {978-1-937284-11-4},
>   location  = {Edinburgh, United Kingdom},
>   pages     = {725--735},
>   numpages  = {11},
>   url       = {http://dl.acm.org/citation.cfm?id=2145432.2145516},
>   acmid     = {2145516},
>   publisher = {Association for Computational Linguistics},
>   address   = {Stroudsburg, PA, USA},
> }
>
>
> > Jörn
> >
> > On 06/21/2012 04:47 PM, Nicolas Hernandez wrote:
> >>
> >> Hi Jörn
> >>
> >> On Thu, Jun 21, 2012 at 9:50 AM, Jörn Kottmann <[email protected]> wrote:
> >>>
> >>> Hello,
> >>>
> >>> the lexical unit in the POS Tagger is a token. For the
> >>> Spanish POS models multi-token chunks were converted
> >>> into one token, with the words separated by a "_".
> >>>
> >>> To what would you set the lexical unit separator in your case?
> >>
> >> I do the same, but I'm a bit uneasy about doing that because
> >> 1. I do not like to pre- and post-process my data (here, adding/removing an
> >> underscore in the multi-word terms)
> >> 2. A model trained with the API, which allows you not to preprocess
> >> your data, will be different from the model trained with the cli on
> >> the same data
> >> 3. Finally, when you get a model you do not know which segmentation it
> >> assumes and how the multi-word terms are represented
> >>
> >> Since it is often convenient to use the cli, it would be nice to be able
> >> to set the token separator, at least to build the same models as with
> >> the API.
> >>
> >>> The pos tag separator can already be configured in the class
> >>> which reads the input, but this parameter cannot be set by the cli
> >>> tool.
> >>>
> >>> +1 to make both configurable from the command line.
> >>
> >> Nice.
> >>
> >> At least the idea has been proposed. If I have time...
> >>
> >>> Jörn
> >>>
> >>>
> >>> On 06/20/2012 03:02 PM, Nicolas Hernandez wrote:
> >>>>
> >>>> Hi Everyone
> >>>>
> >>>> I need to train the POS tagger on multi-word terms. In other words,
> >>>> some of my lexical units are made of several tokens separated by
> >>>> whitespace characters (like "traffic light", "feu rouge", "in order
> >>>> to", ...).
> >>>>
> >>>> I think the training API allows handling that, but the command line
> >>>> tools cannot. The former takes the words of a sentence as an array of
> >>>> strings. The latter assumes that the whitespace character is the
> >>>> lexical unit separator.
> >>>> A convention like concatenating all the words which are part of a
> >>>> multi-word term is not a solution, since in that case models built by
> >>>> the command line and by the API will be different.
> >>>>
> >>>> It would be great if we could set by parameter what the lexical
> >>>> unit separator is, as well as the pos tag separator.
> >>>>
> >>>> What do you think?
> >>>>
> >>>> /Nicolas
> >>>>
> >>>> [1]
> >>>> http://incubator.apache.org/opennlp/documentation/manual/opennlp.html#tools.postagger.tagging.api
> >>>
> >>
>
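To illustrate the API side mentioned in the original mail above: with the training API, multi-word terms can simply stay as single array entries. A minimal sketch, assuming the POSSample class from the current 1.5.x API (written from memory, please double-check the constructor against the Javadoc):

    import opennlp.tools.postag.POSSample;

    public class MweSampleDemo {
        public static void main(String[] args) {
            // The multi-word term "in order to" is kept as one lexical unit with one tag.
            String[] sentence = { "Nico", "wants", "to", "get", "to", "bed",
                                  "earlier", "in order to", "sleep", "longer" };
            String[] tags     = { "NNP", "VBP", "TO", "VB", "TO", "NN",
                                  "RB", "IN", "VB", "RB" };
            POSSample sample = new POSSample(sentence, tags);
            System.out.println(sample);
        }
    }

As far as I know, the trainer consumes a stream of such samples, which is why the cli should be able to produce the same segmentation.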
> --
> Dr. Nicolas Hernandez
> Associate Professor (Maître de Conférences)
> Université de Nantes - LINA CNRS UMR 6241
> http://enicolashernandez.blogspot.com
> http://www.univ-nantes.fr/hernandez-n
> +33 (0)2 51 12 53 94
> +33 (0)2 40 30 60 67
