Funny behavior in facet query on large dataset
I am doing a facet query in Solr (3.4) and getting very bad performance. This is in a Solr shard with 22 million records, but I am specifically querying a small time slice. However, even if I take the time slice query out it takes the same amount of time, so it seems to be searching the entire data set.

I am trying to find all documents that contain the word dude or thedude or anotherdude and count how many of these were written by eldudearino (of course names are changed here to protect the innocent...).

My query is like this:

http://myserver:8080/solr/select/?fq=created_at:NOW-5MINUTES&q=(+(text:(%22dude%22+%22thedude%22+%22%23anotherdude%22))+)&facet=true&indent=on&facet.mincount=1&wt=xml&version=2.2&rows=0&fl=author_username,author_id&facet.field=author_username&fq=author_username:(%22@eldudearino%22)

Any ideas what I could be doing wrong?

Thanks in advance!

--
View this message in context: http://lucene.472066.n3.nabble.com/Funny-behavior-in-facet-query-on-large-dataset-tp4012584.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Funny behavior in facet query on large dataset
Faceting at that scale takes time to warm up. If you've got your caches and such configured appropriately, then successive searches will be very fast. However, you'll still need to do the cache warming (how much depends on the faceting implementation you're using; in this case you're probably using the FieldCache).

Faceting performance doesn't depend on the filters or the query: the caches that need to be built span the entire index.

Erik

On Oct 8, 2012, at 16:26, kevinlieb wrote:

> I am doing a facet query in Solr (3.4) and getting very bad performance. [...]
Re: Funny behavior in facet query on large dataset
: a small time slice. However even if I take the time slice query out it
: takes the same amount of time, so it seems to be searching the entire data
: set.

a) You might try using facet.method=enum. In some special cases it may be faster than the default (facet.method=fc).

: I am trying to find all documents that contain the word dude or thedude
: or anotherdude and count how many of these were written by eldudearino
: (of course names are changed here to protect the innocent...).

b) Field faceting isn't really designed for this type of problem. Field faceting is very suitable for questions like "find all docs matching QUERY, and for all of those docs, give me a list of the top N authors and how many docs were written by those authors."

c) If you just want to query for the docs written by a single author, you can use an fq like you do in your example, and then look at the numFound to know the total. In that case the faceting is just making extra work to generate counts of 0 for all of the other authors.

d) If you want to query for an arbitrary set of documents, and then know how many of those documents were written by a particular author (or each of a particular set of authors), try facet.query instead:

...&facet=true&facet.query=author_username:(%22@eldudearino%22)

-Hoss
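To illustrate point (d) for a set of authors, here is a minimal Python sketch that emits one facet.query per author. The helper name `build_facet_query_params` is hypothetical, the keywords are joined with an explicit OR (the original query relied on the default operator), and usernames are assumed to contain no double quotes that would need escaping.

```python
from urllib.parse import urlencode


def build_facet_query_params(keywords, authors, time_filter=None):
    """Build Solr select parameters that count matches per author via facet.query.

    Hypothetical helper for illustration; field names follow the thread.
    """
    # OR the quoted keywords together inside the text field.
    text_clause = " OR ".join('"%s"' % k for k in keywords)
    params = [
        ("q", "text:(%s)" % text_clause),
        ("rows", "0"),       # only the counts are needed, not the documents
        ("facet", "true"),
        ("wt", "xml"),
    ]
    if time_filter:
        params.append(("fq", time_filter))
    # One facet.query per author: Solr returns one count for each of them.
    for author in authors:
        params.append(("facet.query", 'author_username:"%s"' % author))
    return params


params = build_facet_query_params(
    ["dude", "thedude", "#anotherdude"],
    ["@eldudearino", "@zeedudearino"],
)
query_string = urlencode(params)  # percent-encodes quotes, '#', '@' and ':'
url = "http://myserver:8080/solr/select/?" + query_string
```

Each facet.query count is computed only against the documents matching q and any fq, so no per-field facet structure has to be built for author_username.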
Re: Funny behavior in facet query on large dataset
Thanks for all the replies. I oversimplified the problem for the purposes of making my post small and concise. I am really trying to find the counts of documents by a list of 10 different authors that match those keywords. Of course, when looking up a single author there is no reason to do a facet query.

To be clearer: find all documents that contain the word dude or thedude or anotherdude and count how many of these were written by eldudearino and zeedudearino and adudearino and beedudearino.

I tried facet.query as well as facet.method=fc and neither really helped.

We are constantly adding documents to the Solr index and committing, every few seconds, which is probably why this is not working well.

Seems we need to re-architect the way we are doing this...
Re: Funny behavior in facet query on large dataset
On 10/8/2012 4:09 PM, kevinlieb wrote:

> [...] We are constantly adding documents to the solr index and committing, every few seconds, which is probably why this is not working well. Seems we need to re-architect the way we are doing this...

I would definitely consider increasing the amount of time between commits. You can add documents at whatever interval you want, but if you only do commits every minute or two, your caches will be much more useful.

Your time slice filter query (NOW-5MINUTES) will never produce a filter cache hit, because NOW is measured in milliseconds and will therefore be different for every query. You might consider doing NOW/MINUTE-5MINUTES instead, or even [NOW/MINUTE-5MINUTES TO *] so that you actually are dealing with a range. For the space of that minute (at least until the cache gets invalidated by a commit), the filter cache entry will be valid.

Some general questions that may matter: How big are all your index directories on this server, how much RAM is in the server, and how much RAM are you giving to Java? I'm also curious how big your Solr caches are, what the autowarm counts are, and how long it is taking for your caches to warm up after each commit. You can get the warm times from the cache statistics in the admin interface.

Thanks,
Shawn
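As a small sketch of the rounding suggestion, the following Python helper builds the range filter string. The name `rounded_time_filter` is hypothetical, and it assumes the unit is one of Solr's date math units that takes a plural "S" form (MINUTE, HOUR, DAY and so on).

```python
def rounded_time_filter(field, amount, unit="MINUTE"):
    """Return an fq value covering the last `amount` units, rounded down to the unit.

    Hypothetical helper for illustration only.
    """
    if amount <= 0:
        raise ValueError("amount must be positive")
    # NOW/MINUTE rounds down to the start of the current minute, so every
    # request sent within that minute resolves to the same range and can
    # reuse the same filter cache entry (until a commit invalidates it).
    return "%s:[NOW/%s-%d%sS TO *]" % (field, unit, amount, unit)


fq = rounded_time_filter("created_at", 5)
```

Here `fq` is the string `created_at:[NOW/MINUTE-5MINUTES TO *]`, which can be passed as the fq parameter in place of created_at:NOW-5MINUTES.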
Re: Funny behavior in facet query on large dataset
Hi Kevin,

Right, it's the very frequent commits, most likely. Change commits to, say, every 60 or 120 seconds and compare the performance. I think you guys use SPM, so check the Cache graphs (hit % specifically) before and after the above change.

Otis
--
Search Analytics - http://sematext.com/search-analytics/index.html
Performance Monitoring - http://sematext.com/spm/index.html

On Mon, Oct 8, 2012 at 6:09 PM, kevinlieb ke...@politear.com wrote:

> [...] We are constantly adding documents to the solr index and committing, every few seconds, which is probably why this is not working well. [...]
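One way to space commits out on the indexing side is a small throttle like the Python sketch below. The class `CommitThrottle` and its 60 second default are illustrative assumptions, not part of Solr or of any client library; actually sending the commit is left to whatever client the indexer already uses. Solr's autoCommit setting in solrconfig.xml may achieve the same thing server-side.

```python
import time


class CommitThrottle:
    """Decide when a commit is due so commits happen at most once per interval.

    Hypothetical helper: it only tracks timing, it does not talk to Solr.
    """

    def __init__(self, min_interval_seconds=60.0, clock=time.monotonic):
        self.min_interval = min_interval_seconds
        self.clock = clock            # injectable so the logic can be tested
        self.last_commit = clock()
        self.pending = 0              # documents added since the last commit

    def record_add(self, count=1):
        """Note that `count` documents were sent to the index without a commit."""
        self.pending += count

    def should_commit(self):
        """True only if something is pending and the interval has elapsed."""
        if self.pending == 0:
            return False
        return self.clock() - self.last_commit >= self.min_interval

    def mark_committed(self):
        """Call after a commit has actually been sent."""
        self.last_commit = self.clock()
        self.pending = 0
```

In an indexing loop, call `record_add()` after each batch of adds, and send a commit followed by `mark_committed()` only when `should_commit()` returns true. Documents keep flowing at whatever rate is needed while the caches survive for at least the chosen interval.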