RE: DirectSolrSpellChecker : vastly varying spellcheck QTime times.
One of our main concerns is that Solr returns the best match based on what it considers best: it uses Levenshtein distance to rank suggestions. Can we tune this to put more weight on frequency/hits than on the number of edits? If we can, the corrected suggestions would seem more relevant. It would also be great if we could do this while keeping maxCollations = 1 and maxCollationTries at some reasonable number, so that QTime does not get out of control. Any insights into this would be appreciated.

Thanks for your help.

Regards,
-- Sandeep
RE: DirectSolrSpellChecker : vastly varying spellcheck QTime times.
When getting collations there are two steps.

First, the spellchecker gets individual word choices for each misspelled word. By default, these are sorted by string distance first, then by document frequency. You can override this by specifying <str name="comparatorClass">freq</str> in your spellchecker component configuration in solrconfig.xml. The example provided in the distribution has a commented-out section explaining this.

In the second step, one correction is taken off each list and the combination is checked against the index to see if it is a valid collation. To be valid, it needs to return at least 1 hit. The order in which word combinations are tried is dictated by the first step. Once it runs out of tries, runs out of suggestions, or has enough valid collations, it stops.

You cannot configure this to try a bunch of collations and sort them by number of hits or anything like that. You would have to ask for a large number of collations to be returned and do the sorting in your application, but this runs the risk of high QTimes. So you can sort by frequency, but not by hits. Sorting by hits would mean trying a lot of collations, and that is probably too expensive.

One caveat is that sorting by frequency could result in far-afield results being returned to the user. You might find that lower-frequency, smaller-edit-distance suggestions give the user what they want more often than higher-edit-distance, higher-frequency suggestions. Just because a word is very common doesn't mean it is the right word. This is why distance is the default and not freq.

James Dyer
Ingram Content Group
(615) 213-4311
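To make the trade-off between the two orderings concrete, here is a minimal sketch that ranks the "hapyness" suggestions reported elsewhere in this thread both ways. It assumes plain Levenshtein distance as a stand-in for the spellchecker's internal distance measure, which may score words differently; the function names are hypothetical and not part of Solr.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# (word, document frequency) pairs for "hapyness", copied from the thread.
SUGGESTIONS = [
    ("happyness", 175),
    ("hapiness", 62),
    ("hayness", 1),
    ("happiness", 7788),
    ("harkness", 324),
]


def by_distance_then_freq(misspelled, suggestions):
    """Default-style ordering: smallest distance first, ties by frequency."""
    ranked = sorted(suggestions,
                    key=lambda s: (levenshtein(misspelled, s[0]), -s[1]))
    return [word for word, _ in ranked]


def by_freq(suggestions):
    """Frequency-only ordering, roughly what comparatorClass=freq asks for."""
    return [word for word, _ in sorted(suggestions, key=lambda s: -s[1])]


print(by_distance_then_freq("hapyness", SUGGESTIONS))
print(by_freq(SUGGESTIONS))
```

Under this stand-in metric, distance-first ordering puts "happyness" at the top, while frequency-only ordering promotes "happiness" and "harkness", which illustrates the far-afield caveat above.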
RE: DirectSolrSpellChecker : vastly varying spellcheck QTime times.
James,

Is there a way to determine how many times the collations were tried? Is there a parameter that can be issued that can return this in debug information? This would be very helpful. Appreciate your help with this.

Thanks.
-- Sandeep
RE: DirectSolrSpellChecker : vastly varying spellcheck QTime times.
If you enable debug-level logging for class org.apache.solr.spelling.SpellCheckCollator, you should get a log message for every collation it tries like this: Collation: will return zzz hits.

James Dyer
Ingram Content Group
(615) 213-4311
RE: DirectSolrSpellChecker : vastly varying spellcheck QTime times.
James,

Thanks. That was very helpful. It helped me understand count and alternativeTermCount a bit more. I also have the following case, as pointed out earlier.

My query:
http://host/solr/select?q=&spellcheck.q=chocolat%20factry&spellcheck=true&df=spell&fl=&indent=on&wt=xml&rows=10&version=2.2&echoParams=explicit

In this case, the intent is to correct "chocolat factry" to "chocolate factory", which exists in my spell field index. I see a QTime for the above query of somewhere between 350 and 400 ms. I run a similar query, replacing the spellcheck terms with "pursut hapyness" ("pursuit happyness" actually exists in my spell field), and I see a QTime of 15-17 ms. Both queries produce collations correctly, and picking the first suggestions and applying them as the collation finds what I am looking for, but there is an order of magnitude difference in QTime. There is one edit per term in both cases, or 2 edits in each query. The lengths of the words in both queries seem identical. I'd like to understand why there is this vast difference in QTime. Also, "Chocolate factory" and "Pursuit happyness" are both indexed as-is in the spell field. I would appreciate any help with this, since I am not sure how I can get any meaningful performance numbers and attribute the slowness to anything in particular.

Thanks.

Regards,
-- Sandeep
RE: DirectSolrSpellChecker : vastly varying spellcheck QTime times.
On both queries, set spellcheck.extendedResults=true and also spellcheck.collateExtendedResults=true, then post the full spelling response. Also, how long does each query take on average with spellcheck turned off?

James Dyer
Ingram Content Group
(615) 213-4311
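As a sketch of the request being asked for here, the snippet below builds the spellcheck-only URL with both extended-results flags switched on. The host and path are the placeholders used in the thread, the helper name is hypothetical, and the remaining parameters mirror the poster's query rather than anything Solr requires.

```python
from urllib.parse import urlencode


def spellcheck_url(base, phrase, spellcheck=True):
    """Build a spellcheck-only request URL with extended results enabled.

    Pass spellcheck=False to time the same request with spellcheck off.
    """
    params = [
        ("q", ""),                      # no main query, as in the thread
        ("spellcheck.q", phrase),
        ("spellcheck", "true" if spellcheck else "false"),
        ("spellcheck.extendedResults", "true"),
        ("spellcheck.collateExtendedResults", "true"),
        ("df", "spell"),
        ("wt", "xml"),
        ("rows", "10"),
    ]
    return base + "?" + urlencode(params)


url = spellcheck_url("http://host/solr/select", "chocolat factry")
print(url)
```

Calling the same helper with spellcheck=False gives a baseline request for answering the second question, how long the query takes with spellcheck turned off.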
RE: DirectSolrSpellChecker : vastly varying spellcheck QTime times.
Chocolat Factry

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">77</int>
</lst>
<result name="response" numFound="0" start="0">
</result>
<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="chocolat">
      <int name="numFound">1</int>
      <int name="startOffset">0</int>
      <int name="endOffset">8</int>
      <int name="origFreq">615</int>
      <arr name="suggestion">
        <lst><str name="word">chocolate</str><int name="freq">6544</int></lst>
      </arr>
    </lst>
    <lst name="factry">
      <int name="numFound">5</int>
      <int name="startOffset">9</int>
      <int name="endOffset">15</int>
      <int name="origFreq">6</int>
      <arr name="suggestion">
        <lst><str name="word">factory</str><int name="freq">23614</int></lst>
        <lst><str name="word">factor</str><int name="freq">5128</int></lst>
        <lst><str name="word">factus</str><int name="freq">290</int></lst>
        <lst><str name="word">factum</str><int name="freq">178</int></lst>
        <lst><str name="word">factae</str><int name="freq">102</int></lst>
      </arr>
    </lst>
    <bool name="correctlySpelled">false</bool>
    <lst name="collation">
      <str name="collationQuery">chocolate factory</str>
      <int name="hits">85</int>
      <lst name="misspellingsAndCorrections">
        <str name="chocolat">chocolate</str>
        <str name="factry">factory</str>
      </lst>
    </lst>
  </lst>
</lst>
</response>

Pursut Hapyness

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">16</int>
</lst>
<result name="response" numFound="0" start="0">
</result>
<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="pursut">
      <int name="numFound">5</int>
      <int name="startOffset">0</int>
      <int name="endOffset">6</int>
      <int name="origFreq">0</int>
      <arr name="suggestion">
        <lst><str name="word">pursuit</str><int name="freq">1209</int></lst>
        <lst><str name="word">pursue</str><int name="freq">108</int></lst>
        <lst><str name="word">pursit</str><int name="freq">1</int></lst>
        <lst><str name="word">perdut</str><int name="freq">94</int></lst>
        <lst><str name="word">purdue</str><int name="freq">70</int></lst>
      </arr>
    </lst>
    <lst name="hapyness">
      <int name="numFound">5</int>
      <int name="startOffset">7</int>
      <int name="endOffset">15</int>
      <int name="origFreq">0</int>
      <arr name="suggestion">
        <lst><str name="word">happyness</str><int name="freq">175</int></lst>
        <lst><str name="word">hapiness</str><int name="freq">62</int></lst>
        <lst><str name="word">hayness</str><int name="freq">1</int></lst>
        <lst><str name="word">happiness</str><int name="freq">7788</int></lst>
        <lst><str name="word">harkness</str><int name="freq">324</int></lst>
      </arr>
    </lst>
    <bool name="correctlySpelled">false</bool>
    <lst name="collation">
      <str name="collationQuery">pursuit happyness</str>
      <int name="hits">10</int>
      <lst name="misspellingsAndCorrections">
        <str name="pursut">pursuit</str>
        <str name="hapyness">happyness</str>
      </lst>
    </lst>
  </lst>
</lst>
</response>

Spellcheck is used separately and we are not using any q along with spellcheck. Our search query also queries other fields, not just spellcheck, and therefore does not give a good representation of QTime. We use groupings in the search query.

For Chocolate Factory, I get a search QTime of 198ms.
For Pursuit Happyness, I get a search QTime of 318ms.

Would appreciate your insights.

Thanks.
-- Sandeep
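For anyone who wants to inspect such responses programmatically, for example to sort several returned collations by hits in the application, here is a minimal sketch using Python's standard ElementTree. The SAMPLE string is a trimmed copy of the "chocolat factry" response above, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the extended collation section from the response above.
SAMPLE = """<response>
<lst name="spellcheck">
  <lst name="suggestions">
    <bool name="correctlySpelled">false</bool>
    <lst name="collation">
      <str name="collationQuery">chocolate factory</str>
      <int name="hits">85</int>
      <lst name="misspellingsAndCorrections">
        <str name="chocolat">chocolate</str>
        <str name="factry">factory</str>
      </lst>
    </lst>
  </lst>
</lst>
</response>"""


def extract_collations(xml_text):
    """Return (collationQuery, hits, corrections) tuples, most hits first."""
    root = ET.fromstring(xml_text)
    found = []
    for coll in root.findall(".//lst[@name='collation']"):
        query = coll.find("str[@name='collationQuery']").text
        hits = int(coll.find("int[@name='hits']").text)
        fixes = coll.find("lst[@name='misspellingsAndCorrections']")
        corrections = {s.get("name"): s.text for s in fixes}
        found.append((query, hits, corrections))
    # Sorting by hits only matters when more than one collation is returned.
    return sorted(found, key=lambda c: -c[1])


print(extract_collations(SAMPLE))
```

This relies on spellcheck.collateExtendedResults=true, since the hits and misspellingsAndCorrections entries only appear in the extended collation format shown above.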
RE: DirectSolrSpellChecker : vastly varying spellcheck QTime times.
This doesn't make a lot of sense to me, as in both cases the very first collation it tries is the one it is returning. So you're getting a very optimized spellcheck in both cases. But in both cases it does have to run a query 2 times: the first time, it tries the user's main query and, finding there are not enough hits, it then tries the collation query to see how many hits that will return. Could it be that these two queries are just less/more expensive, and that the difference gets magnified by running each twice?

James Dyer
Ingram Content Group
(615) 213-4311
RE: DirectSolrSpellChecker : vastly varying spellcheck QTime times.
I guess the first thing I'd do is set maxCollationTries to zero. This means it will only run your main query once and not re-run it to check the collations. Now see if your queries have consistent QTimes. One easy explanation is that with maxCollationTries=10, it may be running your query up to 11 times to check up to 10 possible collations. If the query takes 50ms by itself, then you've got 550ms total to not find spelling corrections. Unfortunately, the worst case here is the one that gives the user nothing back.

Another thing to look at: with maxCollationTries at zero, set maxCollations to 10. This will give you a list of the 10 collations it would have tried. You can figure out whether the one that gets hits is far enough down the list to explain the high total QTime when maxCollationTries=10.

If this explains it, then the obvious solution is to set maxCollationTries to something lower than 10 (you'll need to weigh how long you're willing to make your users wait to possibly get spelling suggestions). Or possibly use spellcheck.q to give it an easier query to evaluate than the main query (but one that can still give valid collations).

Also, see https://issues.apache.org/jira/browse/SOLR-3240, which is an optimization for this feature.

James Dyer
Ingram Content Group
(615) 213-4311

-Original Message-
From: SandeepM [mailto:skmi...@hotmail.com]
Sent: Thursday, April 18, 2013 11:33 PM
To: solr-user@lucene.apache.org
Subject: DirectSolrSpellChecker : vastly varying spellcheck QTime times.

Hi! I am using SOLR 4.2.1.
My solrconfig.xml contains the following:

<searchComponent name="MySpellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">text_spell</str>
  <lst name="spellchecker">
    <str name="name">MySpellchecker</str>
    <str name="field">spell</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="distanceMeasure">internal</str>
    <float name="accuracy">0.5</float>
    <int name="maxEdits">2</int>
    <int name="minPrefix">1</int>
    <int name="maxInspections">5</int>
    <int name="minQueryLength">3</int>
    <float name="maxQueryFrequency">0.01</float>
  </lst>
</searchComponent>

<requestHandler name="/select" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <int name="rows">10</int>
    <str name="df">id</str>
    <str name="spellcheck.dictionary">MySpellchecker</str>
    <str name="spellcheck">on</str>
    <str name="spellcheck.extendedResults">false</str>
    <str name="spellcheck.count">10</str>
    <str name="spellcheck.alternativeTermCount">10</str>
    <str name="spellcheck.maxResultsForSuggest">35</str>
    <str name="spellcheck.onlyMorePopular">true</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.collateExtendedResults">false</str>
    <str name="spellcheck.maxCollationTries">10</str>
    <str name="spellcheck.maxCollations">1</str>
    <str name="spellcheck.collateParam.q.op">AND</str>
  </lst>
  <arr name="last-components">
    <str>MySpellcheck</str>
  </arr>
</requestHandler>

schema.xml with the spell field looks like:

<fieldType name="text_spell" class="solr.TextField" positionIncrementGap="100" sortMissingLast="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" enablePositionIncrements="true" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" enablePositionIncrements="true" />
  </analyzer>
</fieldType>
<field name="spell" type="text_spell" indexed="true" stored="false" multiValued="true" />
<copyField source="title" dest="spell" />
<copyField source="artist" dest="spell" />

My query:
http://host/solr/select?q=&spellcheck.q=chocolat%20factry&spellcheck=true&df=spell&fl=&indent=on&wt=xml&rows=10&version=2.2&echoParams=explicit

In this case, the intent is to correct "chocolat factry" to "chocolate factory", which exists in my spell field index. I see a QTime for the above query of somewhere between 350 and 400 ms. I run a similar query, replacing the spellcheck terms with "pursut hapyness" ("pursuit happyness" actually exists in my spell field), and I see a QTime of 15-17 ms. Both queries produce collations correctly, but there is an order of magnitude difference in QTime. There is one edit per term in both cases, or 2 edits in each query. The lengths of the words in both queries seem identical. I'd like to understand why there is this vast difference in QTime. I would appreciate
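The worst-case estimate in the reply above (one run of the main query plus one run per collation tried) can be written down as a one-line calculation. This is only a back-of-the-envelope sketch: the function name is hypothetical, and it assumes every collation check costs about as much as the main query, which need not hold in practice.

```python
def worst_case_qtime_ms(per_query_ms, max_collation_tries):
    """Upper-bound estimate: main query once, plus one query per try."""
    return per_query_ms * (1 + max_collation_tries)


# The figures from the reply: 50 ms per query, maxCollationTries=10.
print(worst_case_qtime_ms(50, 10))

# With maxCollationTries=0 only the main query runs.
print(worst_case_qtime_ms(50, 0))
```

Lowering maxCollationTries shrinks this bound linearly, at the cost of sometimes returning no collation at all.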
RE: DirectSolrSpellChecker : vastly varying spellcheck QTime times.
James,

Thanks for the reply. I see your point and, sure enough, reducing maxCollationTries does reduce the time, although it may then not produce results. It seems the time is taken by the collation re-runs. Is there any way we can activate caching for collations? The same query takes the same amount of time when repeated. My query caches are activated, but I don't believe they get used for spellchecks.

Thanks.
-- Sandeep
RE: DirectSolrSpellChecker : vastly varying spellcheck QTime times.
I do not know what it would take to have the collation tests make better use of the QueryResultCache. However, outside of a test scenario, I do not know if this would help a lot. Hopefully you wouldn't have a lot of users issuing the exact same query with the exact same misspelled words over and over.

In the real world, if you find that a collation is a better query than the one the user initially issued, then when that user pages through results, etc., your application should use the corrected query and not re-run the incorrect query over and over again. In the case of maxResultsForSuggest, if a user does the first query and then rejects any did-you-mean suggestions, you can just turn spellcheck off if they page, facet, etc., so that you don't have to generate these suggestions over and over again.

When setting maxCollationTries, you do have to weigh whether it is acceptable to make a user with a misspelled query wait 1/2 second or so to (hopefully) get a correction, or whether you want to simply reduce the maximum time someone will have to wait. If you find that it usually needs 10 tries to find a good collation, then you probably need to try a different distance algorithm, or play with the various accuracy settings to see if you can get better corrections nearer the top of the individual-word lists.

Also, try setting alternativeTermCount lower than count (maybe set alternativeTermCount to half of what you have for count). This will reduce the number of terms it has to try combinations of. If you set maxResultsForSuggest to a lower value (like 2-3, maybe), then it won't try to return did-you-mean suggestions for queries returning (was it 35?!) hits.

As I mentioned, SOLR-3240 does have promise of speeding this feature up, so maybe we won't have to talk about these kinds of trade-offs so much in the future.
James Dyer
Ingram Content Group
(615) 213-4311
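To see why fewer suggestions per term means fewer combinations to try, here is a small sketch that counts the candidate collations for a multi-word query. It is only an upper bound on the search space under the assumption that one suggestion is picked per term; it does not model the order in which Solr actually tries combinations, and the function names are hypothetical.

```python
from math import prod


def candidate_combinations(suggestions_per_term):
    """Number of ways to pick one suggestion for each misspelled term."""
    return prod(suggestions_per_term)


def max_tries_used(suggestions_per_term, max_collation_tries):
    """Collation checks are capped by maxCollationTries."""
    return min(candidate_combinations(suggestions_per_term),
               max_collation_tries)


# Two misspelled terms with up to 10 suggestions each, versus 5 each.
print(candidate_combinations([10, 10]))
print(candidate_combinations([5, 5]))

# Either way, maxCollationTries=10 bounds the number of re-run queries.
print(max_tries_used([10, 10], 10))
```

The cap means the suggestion counts mostly affect which combinations are reachable within the allowed tries, which is why getting good corrections near the top of each word list matters.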