RE: Partial word match using n-grams

2013-07-30 Thread Becker, Thomas
Just to close the loop on this, I upgraded to 4.4 and the improvements to the NGramTokenizer were just what I needed.  I switched to using 1-2 grams (the default), and now that the tokenizer emits the tokens in an order that makes sense I'm in business.  At search time I split on whitespace, ngram the results and AND them together.  So matching "quota_tommy" with "quo tom" works as expected.  The ngram improvements are much appreciated!
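A minimal sketch of the search-time side being described: split the query on whitespace, produce 1-2 character grams of each part, and require every part to match (AND). This is plain Java with no Lucene dependency, and the gram ordering (at each position, the shorter gram first) is my reading of the 4.4 NGramTokenizer behavior, not something confirmed in this thread:

```java
import java.util.ArrayList;
import java.util.List;

public class QueryGrams {
    // Approximates NGramTokenizer output for minGram=1, maxGram=2:
    // at each character position, emit the 1-gram, then the 2-gram.
    static List<String> grams(String term) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < term.length(); i++) {
            out.add(term.substring(i, i + 1));
            if (i + 2 <= term.length()) {
                out.add(term.substring(i, i + 2));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Search-time: split on whitespace, gram each part, AND the parts.
        for (String part : "quo tom".split("\\s+")) {
            System.out.println(part + " -> " + grams(part));
        }
    }
}
```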


-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Friday, July 19, 2013 2:42 PM
To: java-user
Subject: Re: Partial word match using n-grams

Well, it depends on what you put between your tokenizer and ngram filter. Putting WordDelimiterFilterFactory would break up on the underscore (and lots of other things besides) and submit the separate tokens, which would then be n-grammed separately. That has other implications, of course, but you get the idea.

There are a zillion possibilities here in terms of combining various filterFactories.

Best,
Erick

RE: Partial word match using n-grams

2013-07-19 Thread Becker, Thomas
In general the data for this field is that simple, but additional characters are allowed beyond [a-z_].  Do I need to tokenize on whitespace?  I really don't know.  Essentially, the question is whether we expect "quota tom" to match "quota_tom" or not.  I spoke to some colleagues and they thought it should, since both "quota" and "tom" are partial matches that would AND together.  Tokenizing the entire input, whitespace and all, precludes this match.  I'd appreciate some input from anyone on what the best user experience would be here; I'm trying to operate on the principle of least surprise ;)

With regard to the padding suggestion, I'm still not sure this will work.  Because again, at indexing time there is typically no whitespace.  So padding "quota_tommy_1234" to "##quota_tommy_1234##" before trigramming is not going to produce the "to#" token that I would need in order for "quota to" to match.

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Friday, July 19, 2013 7:58 AM
To: java-user@lucene.apache.org
Subject: RE: Partial word match using n-grams

Got it...almost.  

Y. You're right. FuzzyQuery is not at all what you want.

Don't know if your data is actually as simple as this example.  Do you need to 
tokenize on whitespace?   Would it make sense to replace spaces in the query 
with underscores and then trigramify the whole query as if it were a single 
term?  



RE: Partial word match using n-grams

2013-07-19 Thread Becker, Thomas
Sorry, at indexing time it's not broken on anything.  In other words, "quota_tommy" yields these tokens: quo, uot, ota, ta_, a_t, _to, tom, omm, mmy.  I've thought about trying to determine boundaries and breaking on them at indexing time, but that will require some more thought.  It doesn't have to be an underscore; that's only one possible convention.
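The token list above is just the left-to-right character trigrams of the unbroken name, which a few lines of plain Java can reproduce (no Lucene involved; this only illustrates the enumeration):

```java
import java.util.ArrayList;
import java.util.List;

public class TrigramTokens {
    // Character trigrams, left to right — the token stream described above.
    static List<String> trigrams(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            out.add(s.substring(i, i + 3));
        }
        return out;
    }

    public static void main(String[] args) {
        // [quo, uot, ota, ta_, a_t, _to, tom, omm, mmy]
        System.out.println(trigrams("quota_tommy"));
    }
}
```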

-Original Message-
From: Shai Erera [mailto:ser...@gmail.com] 
Sent: Friday, July 19, 2013 8:53 AM
To: java-user@lucene.apache.org
Subject: Re: Partial word match using n-grams

Wait, I didn't mean to pad the entire string. If the string is broken on "_" already, then NGramFilter already receives the individual terms, and you can put a Filter in front that will pass through a padded token?

Shai



Partial word match using n-grams

2013-07-18 Thread Becker, Thomas
One of our main use-cases for search is to find objects based on partial name matches.  I've implemented this using n-grams and it works pretty well.  However we're currently using trigrams, and that causes an interesting problem when searching for things like "abc ab", since we first split on whitespace and then construct PhraseQuerys containing each trigram yielded by the word.  Obviously we cannot get a trigram out of "ab".  So our choices would seem to be either to discard this part of the search term, which seems unwise, or to reduce the minimum n-gram size.  But I'm slightly concerned about the resulting bloat in both the number of Terms stored in the index and the number contained in queries.  Is this something I should be concerned about?  It just feels like a query for the word "abcdef" shouldn't require a PhraseQuery of 15 terms (assuming n-gram sizes 1 to 3).  Is this the best way to do partial word matches?  Thanks in advance.

-Tommy
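The 15-term figure above follows from a simple sum: a term of length L yields L - n + 1 grams of each size n. A quick sketch of that arithmetic (plain Java, purely illustrative):

```java
public class GramCount {
    // Number of character n-grams of sizes minGram..maxGram in a term
    // of the given length: sum of (len - n + 1) over each gram size n.
    static int count(int len, int minGram, int maxGram) {
        int total = 0;
        for (int n = minGram; n <= maxGram; n++) {
            if (len >= n) {
                total += len - n + 1;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // "abcdef" (6 chars) with gram sizes 1..3: 6 + 5 + 4 = 15 terms.
        System.out.println(count(6, 1, 3)); // 15
    }
}
```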




RE: Partial word match using n-grams

2013-07-18 Thread Becker, Thomas
Thanks for the reply Tim.  I really should have been clearer.  Let's say I have an object named "quota_tommy_1234".  I'd like to match that object with any 3 character (or more) substring of that name.  So for example:

quo
tom 
234
quota
etc.

Further, at search time I'm splitting input on whitespace before tokenizing 
into PhraseQueries and then ANDing them together.  So using the example above I 
also want the following queries to match:

quo tom
quo 234 
"quota to" - this is the problem because there are no trigrams of "to"

That said, in response to your points:

1)  Not sure FuzzyQuery is what I need; I'm not trying to match via 
misspellings, which is the main function of FuzzyQuery is it not?

2) The original names are all going to be >= 3 characters, so there are no 1 or 2 letter terms at indexing time.  So generating the bigram "to" at search time will never match anything, unless I switch to bigrams at indexing time also, which is what I'm asking about.

3) Again, the names are all >= 3 characters, so I don't need to pad at indexing time.

4) Hopefully my explanation above clarifies.

I should point out that I'm a Lucene novice and am not at all sure that what 
I'm doing is optimal.  But I have been impressed with how easy it is to get 
something working very quickly!


From: Allison, Timothy B. [talli...@mitre.org]
Sent: Thursday, July 18, 2013 7:49 PM
To: java-user@lucene.apache.org
Subject: RE: Partial word match using n-grams

Tommy,
  I'm sure that I don't fully understand your use case and your data.  Some 
thoughts:

1) I assume that fuzzy term search (edit distance <= 2) isn't meeting your needs or else you wouldn't have gone the ngram route.  If fuzzy term search + phrase/proximity search would meet your needs, see if ComplexPhraseQueryParser would work (although it looks like you're already building your own queries).

2) Would it make sense to modify NGramFilter so that it outputs a bigram for a two letter term and a unigram for a one letter term?  Might be messy... and "ab" in this scenario would never match "abc".

3) Would it make sense to pad your terms behind the scenes with "##"?  This would add bloat, but not nearly as much as variable gram sizes with 1 <= n <= 3:

"ab" -> "##ab##" yields trigrams ##a, #ab, ab#, b##

4) How partial and what types of partial do you need?  This is related to 1).  If minimum edit distance is sufficient, use it, especially with the blazing fast automaton (thank you, Robert Muir).  If you have a smallish dataset you might consider allowing leading wildcards so that you could easily find all words, for example, containing "abc" with "*abc*".  If your dataset is larger, you might consider something like ReversedWildcardFilterFactory (Solr) to speed up this type of matching.

I look forward to other opinions from the list.
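The padding in point 3 above is easy to sanity-check: wrap a short term in "##" and enumerate its trigrams. A small plain-Java illustration (no Lucene; just the string mechanics):

```java
import java.util.ArrayList;
import java.util.List;

public class PadDemo {
    // Pad a short term with "##" on both sides, then take trigrams,
    // as suggested in point 3 above.
    static List<String> paddedTrigrams(String term) {
        String s = "##" + term + "##";
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            out.add(s.substring(i, i + 3));
        }
        return out;
    }

    public static void main(String[] args) {
        // [##a, #ab, ab#, b##]
        System.out.println(paddedTrigrams("ab"));
    }
}
```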




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org





RE: query on exact match in lucene

2013-07-17 Thread Becker, Thomas
Sounds like you need a PhraseQuery.

-Original Message-
From: madan mp [mailto:madan20...@gmail.com] 
Sent: Wednesday, July 17, 2013 7:40 AM
To: java-user@lucene.apache.org
Subject: query on exact match in lucene

How do I get an exact string match?

For example, I am searching for files containing the exact string "i am fine", but the search keeps returning files containing "am i fine".  I need only the files containing "i am fine".

Please help me out on this one.

regards

madan




What to do with Lucene Version parameter on upgrade

2013-06-20 Thread Becker, Thomas
I'm relatively new to Lucene and am in the process of upgrading from 4.0 to 
4.3.1.  I'm trying to figure out if I need to leave my version at LUCENE_40 or 
if it is safe to change it to LUCENE_43.  Does this parameter directly 
determine the index format?  I have some existing indexes from 4.0 but am fine 
with whatever changes Lucene makes to them; I'm not concerned with the internal 
format.  But if I set my version to LUCENE_43 will the IndexReader/Writer not 
work with my old indexes?

Regards,
Tommy


Detecting when an index was not closed properly

2013-04-05 Thread Becker, Thomas
We are doing some crash resiliency testing of our application.  One of the 
things we found is that the Lucene index seems to get out of sync with the 
database pretty easily.  I suspect this is because we are using near real time 
readers and never actually calling IndexWriter.commit().  I'm trying to decide 
on the best way to handle this problem.  One is obviously we could move to 
calling commit() when we update the index.  Alternatively, we could rebuild the 
index fairly easily if we knew that it was closed improperly.  Is there an easy 
way to detect this?  Or am I wrong to avoid calling commit()?

Thanks,
Tommy


updateDocument question

2013-02-06 Thread Becker, Thomas
I've built a search prototype feature for my application using Lucene, and it 
works great.  The application monitors a remote system and currently indexes 
just a few core attributes of the objects on that system.  I get notifications 
when objects change, and I then update the Lucene index to keep things in sync. 
  The thing is that even when objects on the remote system are updated, it's 
relatively unlikely that the specific attributes I'm indexing (like name) were 
changed.  From what I can see, IndexWriter.updateDocument() makes no effort to 
determine if the existing document is actually dirty compared to the provided 
one.  My questions are:

Is this true that documents are assumed to be changed and not actually checked 
before replacement?

Has such a feature been considered?

Is it worth it to query for the document, manually dirty check it and then 
delete/re-add only if it's different if changes to the indexed fields are 
relatively uncommon?  My concern is that I'm inadvertently causing a lot of 
segment churn for things that aren't actually changing.

Thanks in advance,
Tommy
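The manual dirty check contemplated above boils down to comparing the indexed attribute values before deciding whether to call updateDocument at all. A minimal sketch, with hypothetical names: `existing` would come from looking the document up by its key, `incoming` from the change notification (plain Java; the Lucene lookup and update calls themselves are omitted):

```java
import java.util.Map;

public class DirtyCheck {
    // Hypothetical helper: reindex only when one of the indexed
    // attributes actually changed, to avoid needless segment churn.
    static boolean needsReindex(Map<String, String> existing, Map<String, String> incoming) {
        return !existing.equals(incoming);
    }

    public static void main(String[] args) {
        Map<String, String> existing = Map.of("name", "quota_tommy_1234");
        // Same indexed attributes: skip the update entirely.
        System.out.println(needsReindex(existing, Map.of("name", "quota_tommy_1234"))); // false
        // Name changed: delete/re-add (updateDocument) is warranted.
        System.out.println(needsReindex(existing, Map.of("name", "quota_tommy_9999"))); // true
    }
}
```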