Re: POS tagging in Lucene

2016-10-18 Thread Steve Rowe
Hi Niki,

> On Oct 18, 2016, at 7:27 AM, Niki Pavlopoulou wrote:
> 
> Hi all,
> 
> I am using Lucene and OpenNLP for POS tagging. I would like to support
> biGrams with POS tags as well. For example, I would like something like
> that:
> 
> Input: (I[PRP], am[VBP], using[VBG], Lucene[NNP])
> Output: (I[PRP] am[VBP], am[VBP] using[VBG], using[VBG] Lucene[NNP])
> 
> The problem above is that I do not have "pure" tokens, like "I", "am" etc.,
> so the analysis could be wrong if I add the POS tags as an input in Lucene.
> Is there a way to solve this, apart from creating my custom Lucene
> analyser?

To create your bigrams, check out ShingleFilter: 
<http://lucene.apache.org/core/6_2_1/analyzers-common/org/apache/lucene/analysis/shingle/ShingleFilter.html>
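
A plain-Java sketch of the bigram output asked for above, not Lucene code: it assumes the tagged text is split on whitespace only, so each tag stays attached to its word, and then joins adjacent tokens with a single space. In Lucene terms this would roughly correspond to a whitespace tokenizer followed by ShingleFilter with minimum and maximum shingle size 2 and unigram output disabled; that mapping is an assumption about the setup, and the class name PosBigrams is made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only.
class PosBigrams {
    // Joins each pair of adjacent tokens with a single space, which mirrors
    // size-2 shingles with unigram output switched off.
    static List<String> bigrams(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            out.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Splitting on whitespace only keeps "I[PRP]" as one token.
        String tagged = "I[PRP] am[VBP] using[VBG] Lucene[NNP]";
        List<String> tokens = Arrays.asList(tagged.split("\\s+"));
        // prints [I[PRP] am[VBP], am[VBP] using[VBG], using[VBG] Lucene[NNP]]
        System.out.println(bigrams(tokens));
    }
}
```

The point of the sketch is that the tokenizer, not the shingling step, decides whether the tags survive: a tokenizer that splits on punctuation would break "I[PRP]" apart before any bigrams are built.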

I’m not sure what you mean by “the analysis could be wrong if I add the POS 
tags as an input in Lucene” - can you give an example?

You may be interested in the work-in-progress addition of OpenNLP integration 
with Lucene here: 

--
Steve
www.lucidworks.com


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to add ASCIIFoldingFilter in ClassicAnalyzer

2016-10-18 Thread Kumaran Ramasubramanian
Hi Adrien

How do I do this? Any pointers?

> If it is fine to add the ascii folding filter at the end of the analysis
> chain, then you could use AnalyzerWrapper.

--
Kumaran R

On Tue, Oct 11, 2016 at 9:59 PM, Kumaran Ramasubramanian wrote:

>
>
> @Ahmet, Uwe: Thanks a lot for your suggestions. I have already written a
> custom analyzer as you said, but I am trying to avoid adding a new component
> to my search flow.
>
> @Adrien: How do I add a filter using AnalyzerWrapper? Any pointers?
>
> On Tue, Oct 11, 2016 at 8:16 PM, Uwe Schindler wrote:
>
>> I'd suggest using CustomAnalyzer to define your own analyzer. It lets
>> you build an analyzer from the components (tokenizers and filters) you
>> want.
>>
>> Uwe
>>
>> -
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: u...@thetaphi.de
>>
>> > -----Original Message-----
>> > From: Adrien Grand [mailto:jpou...@gmail.com]
>> > Sent: Tuesday, October 11, 2016 4:37 PM
>> > To: java-user@lucene.apache.org
>> > Subject: Re: How to add ASCIIFoldingFilter in ClassicAnalyzer
>> >
>> > Hi Kumaran,
>> >
>> > If it is fine to add the ascii folding filter at the end of the analysis
>> > chain, then you could use AnalyzerWrapper. Otherwise, you need to create a
>> > new analyzer that has the same analysis chain as ClassicAnalyzer, plus an
>> > ASCIIFoldingFilter.
>> >
>> > On Tue, Oct 11, 2016 at 16:22, Kumaran Ramasubramanian wrote:
>> >
>> > > Hi All,
>> > >
>> > >   Is there any way to add ASCIIFoldingFilter over ClassicAnalyzer without
>> > > writing a new custom analyzer? Should I extend StopwordAnalyzerBase again?
>> > >
>> > > I know that ClassicAnalyzer is final. Is there any special reason for
>> > > making it final? StandardAnalyzer was not final before.
>> > >
>> > > public final class ClassicAnalyzer extends StopwordAnalyzerBase
>> > >
>> > > --
>> > > Kumaran R
>> > >
>>
>>
>
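
For reference, a minimal sketch of the AnalyzerWrapper approach mentioned in this thread. It assumes the Lucene 6.x API (lucene-core and lucene-analyzers-common on the classpath), where AnalyzerWrapper exposes wrapComponents() and TokenStreamComponents exposes getTokenizer() and getTokenStream(); other versions may differ. The class name FoldingClassicAnalyzer, the analyze() helper, and the field name "body" are made up for illustration. The filter is appended at the end of the wrapped chain, which is the case Adrien describes as suitable for a wrapper.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.AnalyzerWrapper;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.standard.ClassicAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical wrapper: delegates to an existing analyzer and appends
// ASCIIFoldingFilter as the last step of its analysis chain.
class FoldingClassicAnalyzer extends AnalyzerWrapper {
    private final Analyzer delegate;

    FoldingClassicAnalyzer(Analyzer delegate) {
        // Reuse the wrapped analyzer's own reuse strategy.
        super(delegate.getReuseStrategy());
        this.delegate = delegate;
    }

    @Override
    protected Analyzer getWrappedAnalyzer(String fieldName) {
        return delegate;
    }

    @Override
    protected TokenStreamComponents wrapComponents(String fieldName,
                                                   TokenStreamComponents components) {
        // Keep the original tokenizer; only the end of the chain changes.
        TokenStream folded = new ASCIIFoldingFilter(components.getTokenStream());
        return new TokenStreamComponents(components.getTokenizer(), folded);
    }

    // Helper for trying the analyzer out: collects the emitted terms.
    static List<String> analyze(Analyzer analyzer, String text) throws IOException {
        List<String> terms = new ArrayList<>();
        try (TokenStream ts = analyzer.tokenStream("body", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                terms.add(term.toString());
            }
            ts.end();
        }
        return terms;
    }

    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new FoldingClassicAnalyzer(new ClassicAnalyzer());
        System.out.println(analyze(analyzer, "Café résumé"));
    }
}
```

Because folding runs after stop-word removal here, a stop word written with accents would not be removed; if that matters, a full custom chain (for example via CustomAnalyzer) is the alternative already suggested in the thread.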


All in one query

2016-10-18 Thread betty john
Hi
Is there any function that performs exact match, fuzzy search, and
prefix search in one query?


Re: What does "found existing value for PerFieldPostingsFormat.format" mean?

2016-10-18 Thread Adrien Grand
We already have CheckIndex that verifies that Fields.iterator() returns a
sorted iterator so I think we should improve the javadocs of
Fields.iterator() to make it explicit.

On Tue, Oct 18, 2016 at 05:15, Trejkaz wrote:

> Continuation, found a bug but I'm not sure whether it's in Lucene or
> Lucene's Javadoc.
>
> In MultiFields:
>
>   @SuppressWarnings({"unchecked","rawtypes"})
>   @Override
>   public Iterator<String> iterator() {
>     Iterator<String> subIterators[] = new Iterator[subs.length];
>     for (int i = 0; i < subs.length; i++) {
>       subIterators[i] = subs[i].iterator();
>     }
>     return new MergedIterator<>(subIterators);
>   }
>
> MergedIterator says in the Javadoc:
>
> "The behavior is undefined if the iterators are not actually sorted."
>
> And indeed, the iterators are _not_ actually sorted. So I look at
> where they come from, Fields#iterator(), which is documented fairly
> tersely:
>
> "Returns an iterator that will step through all fields names.
> This will not return null."
>
> Which doesn't say anything about the names being in order. So I assume
> that either:
>
>   (a) Fields#iterator() is actually supposed to be sorted and the
> documentation should specify it but doesn't, or
>
>   (b) Fields#iterator() is not supposed to be sorted, but either
> MultiFields#iterator() or MergedIterator is supposed to be handling
> this better.
>
> Either way, I think it's a bug in Lucene. But since I don't know which
> direction it's in, and I don't have a reproducible test case I can
> just hand over, I can't easily file it. :/
>
> TX
>
>
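
A small plain-Java sketch for narrowing this down: it walks any Iterator<String> and reports the first name that is not strictly greater than the one before it, using String.compareTo. Since Fields is an Iterable<String>, the iterator of a suspect Fields instance could be passed in directly. The class name FieldOrderCheck is made up for illustration, and treating String.compareTo order as the expected order is an assumption based on the CheckIndex behaviour mentioned in the reply above.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical diagnostic helper, not part of Lucene.
class FieldOrderCheck {
    // Returns the first name that is not strictly greater than its
    // predecessor, or null when the sequence is strictly increasing.
    static String firstOutOfOrder(Iterator<String> names) {
        String last = null;
        while (names.hasNext()) {
            String name = names.next();
            if (last != null && name.compareTo(last) <= 0) {
                return name;
            }
            last = name;
        }
        return null;
    }

    public static void main(String[] args) {
        // "author" sorts before "title", so it is reported; prints "author".
        System.out.println(
            firstOutOfOrder(Arrays.asList("body", "title", "author").iterator()));
    }
}
```

Running such a check against each sub-reader's field iterator would show which Fields implementation hands unsorted names to MergedIterator, which is the missing piece for a reproducible test case.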


java-user-subscribe

2016-10-18 Thread Kunio, Piotr
java-user-subscribe


Search optimization - regd

2016-10-18 Thread krish mohan
Hi,
I am building a search for my application. For the entered search term
(foo):
1) I look for an exact match (foo); if it returns NULL,
2) I use fuzzy search (foo~); if it returns NULL,
3) I use a wildcard (foo*).

Is this an efficient way? Or is there any Lucene method to do all of these?
Thanks.
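
One possible alternative to the three-step cascade above is to send a single query that ORs the three forms together, with a boost on the exact term so that exact hits tend to rank first. The sketch below only builds such a string in the classic query parser syntax already used in the steps (foo~, foo*); it is plain Java, not Lucene code. The class name CombinedQuery and the boost value 4 are arbitrary choices for illustration, and the sketch assumes a single search term made of letters and digits only, rejecting anything else rather than attempting to escape it.

```java
// Hypothetical helper, for illustration only.
class CombinedQuery {
    // Builds "term^4 OR term~ OR term*": exact (boosted), fuzzy, prefix.
    static String build(String term) {
        // Assumption: only plain letters and digits are accepted, so no
        // query-syntax characters can leak into the result.
        if (!term.matches("[\\p{L}\\p{N}]+")) {
            throw new IllegalArgumentException("unsupported term: " + term);
        }
        return term + "^4 OR " + term + "~ OR " + term + "*";
    }

    public static void main(String[] args) {
        // prints foo^4 OR foo~ OR foo*
        System.out.println(build("foo"));
    }
}
```

The trade-off is one round trip with mixed results ranked by score, versus up to three round trips where each step only runs if the previous one found nothing.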


POS tagging in Lucene

2016-10-18 Thread Niki Pavlopoulou
Hi all,

I am using Lucene and OpenNLP for POS tagging. I would like to support
biGrams with POS tags as well. For example, I would like something like
that:

Input: (I[PRP], am[VBP], using[VBG], Lucene[NNP])
Output: (I[PRP] am[VBP], am[VBP] using[VBG], using[VBG] Lucene[NNP])

The problem above is that I do not have "pure" tokens, like "I", "am" etc.,
so the analysis could be wrong if I add the POS tags as an input in Lucene.
Is there a way to solve this, apart from creating my custom Lucene
analyser?

Thank you in advance.

Regards,
Niki.