Re: POS tagging in Lucene

2016-10-19 Thread Tommaso Teofili
I think it might be helpful to handle POS tags as TypeAttributes, so that
the input and output texts would be cleaner and you could still filter and
retrieve tokens by type (e.g. with TypeTokenFilter).
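
As a rough illustration of that idea, the sketch below keeps the tag apart
from the term, so that it can act as the token's type, and then filters by
type. It is plain Java and does not call Lucene; TypedTokenSketch, TypedToken,
parse, keepTypes and the default type "word" are hypothetical names made up
for this example. In a real analysis chain the term and the type would
presumably live in Lucene's CharTermAttribute and TypeAttribute instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java illustration only; no Lucene classes are used here.
final class TypedTokenSketch {

    // Hypothetical stand-in for a token whose POS tag is stored as its type.
    static final class TypedToken {
        final String term;
        final String type;

        TypedToken(String term, String type) {
            this.term = term;
            this.type = type;
        }

        @Override
        public String toString() {
            return term + "/" + type;
        }
    }

    // Splits "love[VBP]" into term "love" and type "VBP".
    // Tokens without a trailing "[TAG]" get the assumed default type "word".
    static TypedToken parse(String tagged) {
        int open = tagged.lastIndexOf('[');
        if (open > 0 && tagged.endsWith("]")) {
            return new TypedToken(tagged.substring(0, open),
                                  tagged.substring(open + 1, tagged.length() - 1));
        }
        return new TypedToken(tagged, "word");
    }

    // Keeps only the tokens whose type is in the given set (a whitelist).
    static List<TypedToken> keepTypes(List<TypedToken> tokens, Set<String> types) {
        List<TypedToken> kept = new ArrayList<>();
        for (TypedToken t : tokens) {
            if (types.contains(t.type)) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<TypedToken> tokens = new ArrayList<>();
        for (String s : "I[PRP] love[VBP] Lucene[NNP]".split(" ")) {
            tokens.add(parse(s));
        }
        Set<String> wanted = new HashSet<>(Arrays.asList("VBP", "NNP"));
        System.out.println(keepTypes(tokens, wanted)); // prints [love/VBP, Lucene/NNP]
    }
}
```

Because the term no longer carries the bracketed tag, ordinary term-level
processing can run on it unchanged while the tag stays available for
filtering.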

My 2 cents,
Tommaso


On Wed, 19 Oct 2016 at 11:56, Niki Pavlopoulou wrote:

> Hi Steve,
>
> thank you for your answer. I created a custom Lucene Analyser in the end.
> Just to clarify on what I mean, Lucene works perfectly for pure words, but
> since it does not support POS tagging some workaround needs to be done for
> the analysis of tokens with POS tags. For example:
>
> Input without POS tags: "I love Lucene's library. It is perfect."
> Output: List(love, lucene, library, perfect)
>
> Input with POS tags: "I[PRP] love[VBP] Lucene's[NNP] library[NN] It[PRP]
> is[VBZ] perfect[JJ]"
> Output: List(i[prp], love[vbp], lucene's[nnp], library[nn], it[prp],
> is[vbz], perfect[jj])
> *Desired output*: List(love[vbp], lucene[nnp], library[nn], perfect[jj])
>
> If the POS tagging is done after the analysis, the tags might be wrong,
> because the analysis has already removed words and changed the text the
> tagger relies on. This is why the POS tagging needs to happen first,
> followed by the analysis.
>
> Regards,
> Niki.


Re: POS tagging in Lucene

2016-10-19 Thread Niki Pavlopoulou
Hi Steve,

thank you for your answer. I created a custom Lucene Analyser in the end.
Just to clarify on what I mean, Lucene works perfectly for pure words, but
since it does not support POS tagging some workaround needs to be done for
the analysis of tokens with POS tags. For example:

Input without POS tags: "I love Lucene's library. It is perfect."
Output: List(love, lucene, library, perfect)

Input with POS tags: "I[PRP] love[VBP] Lucene's[NNP] library[NN] It[PRP]
is[VBZ] perfect[JJ]"
Output: List(i[prp], love[vbp], lucene's[nnp], library[nn], it[prp],
is[vbz], perfect[jj])
*Desired output*: List(love[vbp], lucene[nnp], library[nn], perfect[jj])

If the POS tagging is done after the analysis, the tags might be wrong,
because the analysis has already removed words and changed the text the
tagger relies on. This is why the POS tagging needs to happen first,
followed by the analysis.
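
One possible shape for such a workaround, sketched in plain Java rather than
as an actual Lucene Analyzer: split each token into word and tag, normalise
only the word, and reattach the lowercased tag. The class name
TaggedAnalysisSketch, the method analyze and the tiny stop-word list are
assumptions made for this example, not part of Lucene or of the custom
analyser mentioned above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java illustration only; no Lucene classes are used here.
final class TaggedAnalysisSketch {

    // Assumed stop-word list, just large enough for the example sentence.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("i", "it", "is", "a", "an", "the"));

    // Takes whitespace-separated word[TAG] tokens that were tagged beforehand
    // and normalises the word part while leaving the tag attached.
    static List<String> analyze(String taggedText) {
        List<String> out = new ArrayList<>();
        for (String token : taggedText.trim().split("\\s+")) {
            int open = token.lastIndexOf('[');
            if (open <= 0 || !token.endsWith("]")) {
                continue; // skip anything that is not of the form word[TAG]
            }
            String word = token.substring(0, open).toLowerCase(Locale.ROOT);
            String tag = token.substring(open).toLowerCase(Locale.ROOT); // keeps the brackets
            if (word.endsWith("'s")) {
                word = word.substring(0, word.length() - 2); // drop the possessive
            }
            if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                out.add(word + tag);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String input =
                "I[PRP] love[VBP] Lucene's[NNP] library[NN] It[PRP] is[VBZ] perfect[JJ]";
        System.out.println(analyze(input));
        // prints [love[vbp], lucene[nnp], library[nn], perfect[jj]]
    }
}
```

The tagger has already seen the full sentence by the time this runs, so
dropping stop words here no longer affects the tags.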

Regards,
Niki.

On 18 October 2016 at 19:59, Steve Rowe wrote:

> Hi Niki,
>
> To create your bigrams, check out ShingleFilter:
> <http://lucene.apache.org/core/6_2_1/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html>
>
> I’m not sure what you mean by “the analysis could be wrong if I add the
> POS tags as an input in Lucene” - can you give an example?
>
> You may be interested in the work-in-progress addition of OpenNLP
> integration with Lucene in JIRA issue LUCENE-2899.
>
> --
> Steve
> www.lucidworks.com


Re: POS tagging in Lucene

2016-10-18 Thread Steve Rowe
Hi Niki,

> On Oct 18, 2016, at 7:27 AM, Niki Pavlopoulou wrote:
> 
> Hi all,
> 
> I am using Lucene and OpenNLP for POS tagging. I would like to support
> biGrams with POS tags as well. For example, I would like something like
> that:
> 
> Input: (I[PRP], am[VBP], using[VBG], Lucene[NNP])
> Output: (I[PRP] am[VBP], am[VBP] using[VBG], using[VBG] Lucene[NNP])
> 
> The problem above is that I do not have "pure" tokens, like "I", "am" etc.,
> so the analysis could be wrong if I add the POS tags as an input in Lucene.
> Is there a way to solve this, apart from creating my own custom Lucene
> analyser?

To create your bigrams, check out ShingleFilter: 
<http://lucene.apache.org/core/6_2_1/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html>
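
To make the bigram step concrete, here is a minimal plain-Java sketch of what
pairing adjacent tagged tokens looks like. It is not ShingleFilter itself, and
TaggedBigramSketch and bigrams are hypothetical names used only for this
example. As far as I know, ShingleFilter also emits the single tokens by
default unless unigram output is switched off, so its raw output would
contain more than the pairs shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; no Lucene classes are used here.
final class TaggedBigramSketch {

    // Joins each token with its right-hand neighbour, separated by one space.
    static List<String> bigrams(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            out.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("I[PRP]", "am[VBP]", "using[VBG]", "Lucene[NNP]");
        System.out.println(bigrams(tokens));
        // prints [I[PRP] am[VBP], am[VBP] using[VBG], using[VBG] Lucene[NNP]]
    }
}
```

The pairing only looks at token order, so it works the same whether the tag
is embedded in the token text, as here, or carried separately.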

I’m not sure what you mean by “the analysis could be wrong if I add the POS 
tags as an input in Lucene” - can you give an example?

You may be interested in the work-in-progress addition of OpenNLP integration 
with Lucene in JIRA issue LUCENE-2899.

--
Steve
www.lucidworks.com


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org