Fuzzy search expansion problem on 6.6.3

2018-10-25 Thread Ryan Wilson
Hello all,

I am running a Solr 6.6.3 3-shard cloud with one main collection that
contains 587,371,821 rows of data. One of the fields in this collection is
names. We are currently running into an issue with fuzzy searches on name,
where the query seems unable to return all possible values for a number of
different names, even when allowing only 1 change (~1).

I asked this question in the distant past, and the answer I
received at the time was to modify org.apache.lucene.search.FuzzyQuery to
have a larger defaultMaxExpansions value. For disclosure, we also set
defaultTranspositions to false, as the customers did not like the query
results they were getting with it on. For a time this worked. However,
within the last 6 months or so we've started seeing signs of this issue
cropping up again.

The two things that have changed since the original email are that we've
migrated from 4.7.1 to 6.6.3 and that we have almost doubled the number of
records in the index. Hoping that the old solution would still work, I've
raised defaultMaxExpansions as high as 10240, with the requisite change to
maxBooleanClauses to match, and it seems to have made no difference, to the
point that I suspect the change is not being picked up at all. I am
in the process of setting up a much more focused testing environment for
just names, but figured I'd send this out to get some initial advice or
suggestions on what I might have missed or should investigate.

I've reviewed patch notes for versions before and after 6.6.3 to check for
breaking changes from 4.7.1 or fixes in future versions and haven't seen
anything.

Thanks,
Ryan Wilson


Solr 4.5 - Solr Cloud is creating new cores on random nodes

2013-12-18 Thread Ryan Wilson
Hello all,

I am currently in the process of building out a SolrCloud cluster with
Solr 4.5 on 4 nodes with some pretty hefty hardware. When we create the
collection we have a replication factor of 2 and store 2 replicas per node.

While we have been experimenting, which has involved bringing nodes up and
down as well as tanking them with OOM errors while messing with JVM
settings, we have observed a disturbing trend: we bring nodes back up and
suddenly shard x has 6 replicas spread across the nodes. These replicas
are created with no action on our part, and we would much rather they not
be created at all.

I have not been able to determine whether this is a bug or a feature. If
it's a bug, I will happily provide what I can to track it down. If it is a
feature, I would very much like to turn it off.

Any information is appreciated.

Regards,
Ryan Wilson
rpwils...@gmail.com


RE: Strange fuzzy behavior in 4.2.1

2013-05-16 Thread Ryan Wilson
In answer to your first questions, any changes we’ve been making have been
followed by a reindex.

The data that is being indexed generally looks something like this (with
"space" indicating an actual space):

TIM space , space JULIO
JULIE space , space JIM

So based on what we see from looking at the top terms in the field and the
analysis tool, at index time these records are being broken up such that
TIM , JULIO can be found with tim or Julio.

Just to make sure I’m not misunderstanding something about Solr/Lucene:
when a record is indexed, the index analysis chain result (tim ,
julio) is what is written to disk, correct? As far as I understand it, it’s
the query analysis chain that has the issue, with most filters not being
applied during wildcard and fuzzy queries.
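
As a rough illustration of the index-time behaviour described above, here is a minimal sketch that approximates only the whitespace tokenizer and the lowercase filter from the field type posted in this thread. It is an assumption-laden simplification: the WordDelimiterFilterFactory and RemoveDuplicatesTokenFilterFactory steps are left out, and the names `index_tokens` and `term_matches` are hypothetical helpers, not Solr or Lucene APIs.

```python
def index_tokens(value):
    """Approximate the index-time chain: split on whitespace, then lowercase."""
    return [token.lower() for token in value.split()]


def term_matches(query_term, value):
    """Exact term lookup against the tokens produced at index time."""
    return query_term in index_tokens(value)


record = "TIM , JULIO"

# Tokens that would be written to the index under this simplified chain.
print(index_tokens(record))

# A lowercased query term matches an indexed token; a mixed-case one does not
# unless the query side lowercases it as well.
print(term_matches("julio", record))
print(term_matches("Julio", record))
print(term_matches("Julio".lower(), record))
```

The point of the sketch is only that matching happens against the analysed tokens, so a query term that skips part of the analysis chain is compared in whatever form it was typed.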



Finally, some clarification, as I’ve realized my original email might not
have made this point well. I can have a particular record with a primary
key of X and a name value of LEWIS , JULIA and be able to find that exact
record with bulia~1 but not aulia~1. Likewise, GUERRERO , JULIAN , JULIAN
can be found with julan~1 but not julia~1. It’s not that records go missing
when searched for with fuzzy, but rather that the fuzzy terms that will find
them seem, to my eyes, inconsistent.

Regards,
Ryan Wilson
rpwils...@gmail.com


Re: Strange fuzzy behavior in 4.2.1

2013-05-16 Thread Ryan Wilson
This might explain why our dev database of 400,000 records doesn't seem to
suffer from this.  When we started seeing this in our test environment of
300,000,000 records, we thought we just weren't finding records in dev that
were having the problem.

One thing that this does not explain is that we have located a few terms
that find nothing but the original term, despite having possible matches
one edit away. For example, albert will not find anything but albert,
despite there being alberta, albart, etc. I am reading up on the
maxExpansions variable and how it functions as I write this, so I might
be missing the connection.

I note that you say this is a hard-coded behavior. Would I be safe in
assuming that I will need to build a custom solr.war to change this
setting? I want to see whether sliding this number up or down will let me
confirm that maxExpansions is indeed the problem.

Finally, if maxExpansions is the problem, is there any solution
beyond the aforementioned custom war?

-Ryan Wilson
On Thu, May 16, 2013 at 8:40 AM, Jack Krupansky j...@basetechnology.com wrote:

 Maybe you are running into the same problem I posted on another message
 thread about the hard-coded maxExpansions limit of 50. In other words, once
 Lucene finds 50 terms that do match, it won't find the additional matches.
 And that is not necessarily the top 50, but the first 50 in the index.
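
 The behaviour described in the reply above can be sketched with a toy model. This is not Lucene's code and its real rewrite logic may differ in detail; it only mirrors the description: walk the terms in index (sorted) order, collect the ones within the edit distance, and stop once the cap is reached. The function names, the synthetic term list, and the sample names are all assumptions made for illustration.

```python
import string


def levenshtein(a, b):
    """Plain edit distance (insert, delete, substitute), no transpositions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,            # delete
                               current[j - 1] + 1,         # insert
                               previous[j - 1] + (ch_a != ch_b)))  # substitute
        previous = current
    return previous[-1]


def fuzzy_expand(sorted_terms, query, max_edits=1, max_expansions=50):
    """Collect matching terms in index order and stop at the cap."""
    matches = []
    for term in sorted_terms:
        if levenshtein(term, query) <= max_edits:
            matches.append(term)
            if len(matches) == max_expansions:
                break
    return matches


def synthetic_terms(word):
    """Build a large sorted term list: every one-letter substitution of word."""
    variants = {word, word + "n"}
    for i in range(len(word)):
        for letter in string.ascii_lowercase:
            variants.add(word[:i] + letter + word[i + 1:])
    return sorted(variants)


small_index = sorted(["julia", "julie", "julian", "julio", "tim", "jim"])
large_index = synthetic_terms("julia")

# Small index: the cap is never reached, so every close term is found.
print(len(fuzzy_expand(small_index, "julia")))

# Large index: only the first 50 matching terms in sorted order are kept,
# so a term that sorts late (here "zulia") is silently dropped.
print(len(fuzzy_expand(large_index, "julia")))
print("zulia" in fuzzy_expand(large_index, "julia"))
print("zulia" in fuzzy_expand(large_index, "julia", max_expansions=10240))
```

 Under this model a small data set never hits the cap, while a large term dictionary can lose matches purely because of where they fall in sort order.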

 See if you can reproduce the problem with a small data set of no more than
 a couple dozen documents.

 -- Jack Krupansky
 -Original Message-
 From: Ryan Wilson
 Sent: Thursday, May 16, 2013 9:28 AM
 To: solr-user@lucene.apache.org
 Subject: RE: Strange fuzzy behavior in 4.2.1


Strange fuzzy behavior in 4.2.1

2013-05-14 Thread Ryan Wilson
Hello all,

I am currently trying to determine the cause of some odd behaviour
when performing fuzzy queries in Solr 4.2.1. I have a field that is
configured as follows:

<field type="textSomeField" indexed="true" stored="false"
       multiValued="false" name="stuff" />

<fieldType name="textSomeField" omitTermFreqAndPositions="false"
           omitNorms="true" termVectors="false" termPositions="false"
           termOffsets="false" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" preserveOriginal="1" generateNumberParts="0"
            catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" preserveOriginal="1" generateNumberParts="0"
            catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Fuzzy searches on this field (and others) get some darned weird results.
For example, the names julie, julia, julian, julio, and juliar are indexed.
The following occurs:

stuff:(julia~1) - only finds julia
stuff:(julie~1) - finds julia and julie
stuff:(julian~1) - only finds julian
stuff:(julin~1) - finds julian, julia, julie, etc.
stuff:(juliz~1) - finds julia, julio, julie, etc.

This is one of the simple examples of the behaviour we are seeing. I will
happily provide more if necessary.

My question is: why exactly am I getting these results from fuzzy? My
understanding is that fuzzy matching is based on the Levenshtein distance
from one word to the next. Therefore, julia, julie, and julio should each
return the others' names at an edit distance of 1, yet that is definitely
not the behavior I am observing. I am uncertain whether I have done
something wrong with the indexing or querying, or am simply
misunderstanding how fuzzy functions. Any help or clarification
would be appreciated.
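
To make the expectation above concrete, here is a minimal sketch that computes plain Levenshtein distance (insertions, deletions, substitutions, with no transpositions) between each query term from the post and the five indexed names. It only shows what an uncapped distance-1 match would return; it says nothing about how Solr arrives at its results. The helper names `levenshtein` and `expected_matches` are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


# The names listed as indexed in the post.
INDEXED_NAMES = ["julie", "julia", "julian", "julio", "juliar"]


def expected_matches(query, max_edits=1):
    """Names within max_edits of the query, in the order listed above."""
    return [name for name in INDEXED_NAMES
            if levenshtein(query, name) <= max_edits]


for query in ["julia", "julie", "julian", "julin", "juliz"]:
    print(query, "->", expected_matches(query))
```

By this measure julia~1 would be expected to reach all five names and julian~1 to reach julia, julian, and juliar, which is more than the results reported above.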

Regards,
Ryan Wilson
rpwils...@gmail.com