Funny behavior in facet query on large dataset

2012-10-08 Thread kevinlieb
I am doing a facet query in Solr (3.4) and getting very bad performance. 
This is in a solr shard with 22 million records, but I am specifically doing
a small time slice.  However even if I take the time slice query out it
takes the same amount of time, so it seems to be searching the entire data
set.

I am trying to find all documents that contain the word dude or thedude
or anotherdude and count how many of these were written by eldudearino
(of course names are changed here to protect the innocent...).

My query is like this: 

http://myserver:8080/solr/select/?fq=created_at:NOW-5MINUTES&q=(+(text:(%22dude%22+%22thedude%22+%22%23anotherdude%22))+)&facet=true&indent=on&facet.mincount=1&wt=xml&version=2.2&rows=0&fl=author_username,author_id&facet.field=author_username&fq=author_username:(%22@eldudearino%22)

Any ideas what I could be doing wrong?

Thanks in advance!





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Funny-behavior-in-facet-query-on-large-dataset-tp4012584.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Funny behavior in facet query on large dataset

2012-10-08 Thread Erik Hatcher
Faceting at that scale takes time to warm up.  If you've got your caches and 
such configured appropriately, then successive searches will be very fast; 
however, you'll still need to do the cache warming (how much depends on the 
faceting implementation you're using; in this case you're probably using the 
FieldCache).

Faceting performance here doesn't depend on the filters or the query: the 
caches that need to be built cover the entire index.

Erik



Re: Funny behavior in facet query on large dataset

2012-10-08 Thread Chris Hostetter

: a small time slice.  However even if I take the time slice query out it
: takes the same amount of time, so it seems to be searching the entire data
: set.

a) you might try using facet.method=enum - in some special cases it may be 
faster than the default (facet.method=fc).

: I am trying to find all documents that contain the word dude or thedude
: or anotherdude and count how many of these were written by eldudearino
: (of course names are changed here to protect the innocent...).

b) field faceting isn't really designed for this type of problem.  field 
faceting is very suitable for questions like "find all docs matching 
QUERY, and for all of those docs, give me a list of the top N authors and 
how many docs were written by those authors".

c) If you just want to query for the docs written by a single author, 
you can use an fq like you do in your example, and then look at the 
numFound to know the total -- but in that case the faceting is just making 
extra work to generate counts of 0 for all of the other authors.

d) if you want to query for an arbitrary set of documents, and then know 
how many of those documents were written by a particular author (or each 
of a particular set of authors) try facet.query instead.

...&facet=true&facet.query=author_username:(%22@eldudearino%22)
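
For a handful of authors, one way to extend this is to repeat facet.query once per author. Below is a minimal Python sketch that only builds the URL and sends nothing. The host, field names and author handles are the placeholders used in this thread, and build_facet_url is a hypothetical helper name, not anything Solr provides.

```python
from urllib.parse import urlencode

def build_facet_url(base, keywords, authors, time_filter=None):
    """Build a select URL with one facet.query per author (hypothetical helper)."""
    # Space-separated quoted terms, relying on the schema's default operator
    # just as the original query in this thread does.
    q = "text:(" + " ".join('"%s"' % k for k in keywords) + ")"
    params = [("q", q), ("rows", "0"), ("facet", "true"), ("wt", "xml")]
    if time_filter:
        params.append(("fq", time_filter))
    # Repeating facet.query asks for one count per listed author, without a
    # field facet over every author in the index.
    for author in authors:
        params.append(("facet.query", 'author_username:"%s"' % author))
    return base + "?" + urlencode(params)

url = build_facet_url(
    "http://myserver:8080/solr/select/",
    ["dude", "thedude", "#anotherdude"],
    ["@eldudearino", "@zeedudearino"],
)
print(url)
```

Each facet.query count should come back under facet_queries in the response, keyed by the query string, so the client can map counts back to authors.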


-Hoss


Re: Funny behavior in facet query on large dataset

2012-10-08 Thread kevinlieb
Thanks for all the replies. 

I oversimplified the problem for the purposes of making my post small and
concise.  I am really trying to find the counts of documents by a list of 10
different authors that match those keywords.  Of course on looking up a
single author there is no reason to do a facet query.  To be clearer:
Find all documents that contain the word dude or thedude or
anotherdude and count how many of these were written by eldudearino and
zeedudearino and adudearino and beedudearino

I tried facet.query as well as facet.method=fc and neither really helped.

We are constantly adding documents to the solr index and committing, every
few seconds, which is probably why this is not working well.

Seems we need to re-architect the way we are doing this... 





Re: Funny behavior in facet query on large dataset

2012-10-08 Thread Shawn Heisey

On 10/8/2012 4:09 PM, kevinlieb wrote:

> Thanks for all the replies.
>
> I oversimplified the problem for the purposes of making my post small and
> concise.  I am really trying to find the counts of documents by a list of 10
> different authors that match those keywords.  Of course on looking up a
> single author there is no reason to do a facet query.  To be clearer:
> Find all documents that contain the word dude or thedude or
> anotherdude and count how many of these were written by eldudearino and
> zeedudearino and adudearino and beedudearino
>
> I tried facet.query as well as facet.method=fc and neither really helped.
>
> We are constantly adding documents to the solr index and committing, every
> few seconds, which is probably why this is not working well.
>
> Seems we need to re-architect the way we are doing this...


I would definitely consider increasing the amount of time between 
commits.  You can add documents at whatever interval you want, but if 
you only do commits every minute or two, your caches will be much more 
useful.


Your time slice filter query (NOW-5MINUTES) will never get a cache hit, 
because NOW is measured in milliseconds and will therefore be different 
for every query.  You might consider doing NOW/MINUTE-5MINUTES instead, 
or even [NOW/MINUTE-5MINUTES TO *] so that you actually are dealing 
with a range.  For the space of that minute (at least until the cache 
gets invalidated by a commit), the filter cache entry will be valid.
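
To illustrate why the rounding matters, here is a small Python sketch. It only mimics how the two expressions resolve to a timestamp. It is a simplified illustration, not Solr's date math code, and the function names and start timestamp are made up for the example.

```python
MINUTE_MS = 60_000

def resolve_lower_bound(now_ms, round_to_minute):
    """Mimic how 'NOW-5MINUTES' vs 'NOW/MINUTE-5MINUTES' resolve to a timestamp.

    Simplified illustration only; this is not Solr's date math implementation.
    """
    if round_to_minute:
        now_ms -= now_ms % MINUTE_MS  # NOW/MINUTE: round down to the minute
    return now_ms - 5 * MINUTE_MS

def distinct_filters(query_times_ms, round_to_minute):
    """Count distinct resolved filters, i.e. distinct cache entries needed."""
    return len({resolve_lower_bound(t, round_to_minute) for t in query_times_ms})

# Ten queries, 1.5 seconds apart, all inside one clock minute.
start = 1_349_730_000_000  # arbitrary timestamp that falls on a minute boundary
times = [start + i * 1_500 for i in range(10)]

print(distinct_filters(times, round_to_minute=False))  # 10: every query is a new filter
print(distinct_filters(times, round_to_minute=True))   # 1: all ten share one filter
```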


Some general questions that may matter: How big are all your index 
directories on this server, how much RAM is in the server, and how much 
RAM are you giving to Java?  I'm also curious how big your Solr caches 
are, what the autowarm counts are, and how long it is taking for your 
caches to warm up after each commit.  You can get the warm times from 
the cache statistics in the admin interface.


Thanks,
Shawn



Re: Funny behavior in facet query on large dataset

2012-10-08 Thread Otis Gospodnetic
Hi Kevin,

Right, it's the very frequent commits, most likely.  Change commits
to, say, every 60 or 120 seconds and compare the performance.  I think
you guys use SPM, so check the Cache graphs (hit % specifically)
before and after the above change.
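
As a rough intuition for what the hit % should do, here is a toy Python model, not a measurement. It assumes one identical query per second, that every commit empties the cache, and that there is no autowarming. Real numbers will depend on your query mix and cache settings.

```python
def hit_rate(commit_every_s, duration_s=600):
    """Toy model: one identical query per second; each commit empties the cache.

    Assumes no autowarming, so the first query after a commit is a miss.
    """
    hits = 0
    cached = False
    for second in range(duration_s):
        if second % commit_every_s == 0:
            cached = False  # commit opens a new searcher; old entries are gone
        if cached:
            hits += 1
        cached = True  # the query that just ran populates the cache
    return hits / duration_s

for interval in (2, 60, 120):
    print(interval, round(hit_rate(interval), 3))
```

In this model a commit every 2 seconds caps the hit rate at 50%, while 60 or 120 second intervals push it above 98%.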

Otis
--
Search Analytics - http://sematext.com/search-analytics/index.html
Performance Monitoring - http://sematext.com/spm/index.html

