Fuzzy search expansion problem on 6.6.3
Hello all,

I am running a Solr 6.6.3 3-shard cloud with one main collection that contains 587,371,821 rows of data. One of the fields in this collection is names. We are currently running into an issue with fuzzy searches on name: they seem unable to return all possible values for a number of different names, even when querying with only one edit (~1).

I asked this question in the distant past, and the answer I received at the time was to modify org.apache.lucene.search.FuzzyQuery to have a larger defaultMaxExpansions value. For full disclosure, we also set defaultTranspositions to false, as the customers did not like the query results they were getting with it on. For a time this worked. However, within the last 6 months or so we've started seeing signs of this issue cropping up again. The two things that have changed since the original email are that we've migrated from 4.7.1 to 6.6.3 and that we have almost doubled the number of records in the index.

With the hope that the old solution would still work, I've raised defaultMaxExpansions as high as 10240, with the requisite change to maxBooleanClauses to match, and the results have not changed. So much so that I suspect the change is not being picked up at all.

I am in the process of setting up a much more focused testing environment for just names, but figured I'd send this out to get some initial advice or suggestions on what I might have missed or should investigate. I've reviewed the patch notes for versions before and after 6.6.3 to check for breaking changes from 4.7.1 or fixes in future versions and haven't seen anything.

Thanks,
Ryan Wilson
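For readers unsure what turning transpositions off changes, here is a minimal Python sketch of an edit-distance function with an optional adjacent-swap rule. It is a standalone illustration, not Lucene's implementation; the function name edit_distance and the sample term juila are made up for the example.

```python
def edit_distance(a, b, transpositions=True):
    """Edit distance between a and b.

    With transpositions=True an adjacent swap counts as one edit
    (optimal string alignment); with False this is classic Levenshtein.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]


# "juila" is a hypothetical misspelling that swaps two adjacent letters.
print(edit_distance("julia", "juila", transpositions=True))
print(edit_distance("julia", "juila", transpositions=False))
```

With the swap rule on, a ~1 query can reach terms that differ only by two swapped neighbours; with it off, those terms need ~2.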
Solr 4.5 - Solr Cloud is creating new cores on random nodes
Hello all,

I am currently in the process of building out a Solr cloud with Solr 4.5 on 4 nodes with some pretty hefty hardware. When we create the collection we have a replication factor of 2 and store 2 replicas per node.

While we have been experimenting, which has involved bringing nodes up and down as well as tanking them with OOM errors while messing with JVM settings, we have observed a disturbing trend: we bring nodes back up and suddenly shard x has 6 replicas spread across the nodes. These replicas are created with no action on our part, and we would much rather they not be created at all.

I have not been able to determine whether this is a bug or a feature. If it's a bug, I will happily provide what I can to track it down. If it is a feature, I would very much like to turn it off.

Any information is appreciated.

Regards,
Ryan Wilson
rpwils...@gmail.com
RE: Strange fuzzy behavior in 4.2.1
To answer your first question, every change we've been making has been followed by a reindex.

The data that is being indexed generally looks something like this (with actual spaces around the comma):

    TIM , JULIO
    JULIE , JIM

Based on what we see from looking at the top terms in the field and the analysis tool, at index time these records are being broken up such that TIM , JULIO can be found with tim or julio. Just to make sure I’m not misunderstanding something about Solr/Lucene: when a record is indexed, the result of the index analysis chain (tim , julio) is what is written to disk, correct? As far as I understand it, it is the query analysis chain that has the issue, with most filters not being applied during wildcard and fuzzy queries.

Finally, some clarification, as I’ve realized my original email might not have made this point well. I can have a particular record with a primary key of X and a name value of LEWIS , JULIA and be able to find that exact record with bulia~1 but not aulia~1. Likewise, GUERRERO , JULIAN , JULIAN can be found with julan~1 but not julia~1. It’s not that records go missing when searched for with fuzzy, but rather that the fuzzy terms that will find them seem, to my eyes, inconsistent.

Regards,
Ryan Wilson
rpwils...@gmail.com
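As a rough picture of the index-time question, the sketch below mimics only two steps of the configured chain, whitespace tokenization and lowercasing, in plain Python. It is a simplification under stated assumptions: WordDelimiterFilterFactory and RemoveDuplicatesTokenFilterFactory are not modelled, and the function name index_tokens is invented for the example.

```python
def index_tokens(value):
    """Approximate the indexed terms for one field value.

    Models only whitespace tokenization followed by lowercasing.
    The word-delimiter and duplicate-removal filters from the real
    chain are deliberately left out of this sketch.
    """
    return [token.lower() for token in value.split()]


for record in ["TIM , JULIO", "JULIE , JIM"]:
    print(record, "->", index_tokens(record))
```

The terms produced at index time are what a fuzzy query is later compared against, so the Analysis screen for the field remains the authoritative check.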
Re: Strange fuzzy behavior in 4.2.1
This might explain why our dev database of 400,000 records doesn't seem to suffer from this. When we started seeing this in our test environment of 300,000,000 records, we thought we just weren't finding the records in dev that had the problem.

One thing that this does not explain is that we have located a few terms that find nothing but the original term, despite there being possible matches one edit away. For example, albert will not find anything but albert, despite there being alberta, albart, etc. I am reading up on the maxExpansions variable and how it functions as I write this, so I might be missing the connection.

I note that you say this is a hard-coded behavior. Would I be safe in assuming that I will need to build a custom solr.war to change this setting? I want to see whether sliding this number up or down will let me confirm that it is indeed maxExpansions that is the problem. Finally, if it is maxExpansions that is the problem, is there any solution beyond the aforementioned custom war?

-Ryan Wilson

On Thu, May 16, 2013 at 8:40 AM, Jack Krupansky j...@basetechnology.com wrote:

> Maybe you are running into the same problem I posted on another message thread about the hard-coded maxExpansions limit of 50. In other words, once Lucene finds 50 terms that do match, it won't find the additional matches. And that is not necessarily the top 50, but the first 50 in the index.
>
> See if you can reproduce the problem with a small data set of no more than a couple dozen documents.
>
> -- Jack Krupansky
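The expansion cap discussed in this message can be pictured with a toy Python model: walk a sorted term list, keep terms within the edit distance, and stop once the cap is reached. This follows the "first N in the index" description given in the thread and is not Lucene's actual term-selection code, which may rank candidates differently. The term list and the name expand are invented for the example.

```python
def levenshtein(a, b):
    """Classic Levenshtein distance using two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def expand(query, terms, max_edits, max_expansions):
    """Toy model: take matching terms in sorted order until the cap is hit."""
    matches = []
    for term in sorted(terms):
        if levenshtein(query, term) <= max_edits:
            matches.append(term)
            if len(matches) == max_expansions:
                break  # later matching terms are never considered
    return matches


# Hypothetical term dictionary for a names field.
TERMS = ["albart", "albert", "alberta", "alberto", "alvert", "elbert", "gilbert"]

print(expand("albert", TERMS, max_edits=1, max_expansions=2))
print(expand("albert", TERMS, max_edits=1, max_expansions=50))
```

In this model a small index rarely has more matching terms than the cap, which would fit the dev database not showing the problem while the larger test environment does.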
Strange fuzzy behavior in 4.2.1
Hello all,

I am currently trying to determine the cause of some odd behaviour when performing fuzzy queries in Solr 4.2.1. I have a field that is configured as follows:

    <field type="textSomeField" indexed="true" stored="false" multiValued="false" name="stuff" />

    <fieldType name="textSomeField" omitTermFreqAndPositions="false" omitNorms="true" termVectors="false" termPositions="false" termOffsets="false" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" preserveOriginal="1" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" preserveOriginal="1" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

Fuzzy searches on this field (and others) get some darned weird results. For example, the names julie, julia, julian, julio, and juliar are indexed. The following occurs:

    stuff:(julia~1)  - only finds julia
    stuff:(julie~1)  - finds julia and julie
    stuff:(julian~1) - only finds julian
    stuff:(julin~1)  - finds julian, julia, julie, etc.
    stuff:(juliz~1)  - finds julia, julio, julie, etc.

This is one of the simpler examples of the behaviour we are seeing. I will happily provide more if necessary.

My question is why exactly I am getting these results from fuzzy. My understanding of fuzzy is that it matches on the Levenshtein distance from one word to the next. Therefore julia, julie, and julio should each return results containing the others' names with an edit distance of 1, yet that is definitely not the behavior I am observing.

I am uncertain whether I have done something wrong with the indexing or the querying, or am simply misunderstanding how fuzzy functions. Any help or clarification would be appreciated.

Regards,
Ryan Wilson
rpwils...@gmail.com
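To make the expectation in this message concrete, here is a small Python sketch that computes classic Levenshtein distance and lists which of the five indexed names lie within one edit of a query term. It only checks the arithmetic behind the expectation and says nothing about what Solr does internally; the helper names levenshtein and within are invented for the example.

```python
def levenshtein(a, b):
    """Classic Levenshtein distance using two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def within(query, terms, max_edits=1):
    """Return the terms whose distance from query is at most max_edits."""
    return sorted(t for t in terms if levenshtein(query, t) <= max_edits)


# The five names the message says are indexed.
NAMES = ["julie", "julia", "julian", "julio", "juliar"]

for query in ["julia", "julie", "julian", "julin", "juliz"]:
    print(query + "~1", "->", within(query, NAMES))
```

By pure edit distance, julia~1 should reach all five names, so a result containing only julia points at something other than the distance calculation itself.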