Getting a large number of documents by id

2013-07-18 Thread Brian Hurt
I have a situation which is common in our current use case, where I need to
get a large number (many hundreds) of documents by id.  What I'm doing
currently is creating a large query of the form id:12345 OR id:23456 OR
... and sending it off.  Unfortunately, this query is taking a long time,
especially the first time it's executed.  I'm seeing times of like 4+
seconds for this query to return, to get 847 documents.

So, my question is: what should I be looking at to improve the performance
here?

Brian


Re: Getting a large number of documents by id

2013-07-18 Thread Alexandre Rafalovitch
You could start by using id:(12345 23456) to reduce the query length and
possibly speed up parsing.
You could also move the query from 'q' parameter to 'fq' parameter, since
you probably don't care about ranking ('fq' does not rank).
If these are unique every time, you could probably look at not caching
(can't remember exact syntax).
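Taken together, those suggestions might look like this. This is only a sketch in Python to show the parameter shapes, not actual client code: the field names in `fl` are made up, and `{!cache=false}` is the local-params syntax for skipping the filter cache.

```python
# Sketch: look up many documents by id in one request, using an
# uncached filter query instead of a ranked q clause.
from urllib.parse import urlencode

def build_id_lookup_params(ids, fields="id"):
    # id:(12345 23456 ...) is shorter than id:12345 OR id:23456 OR ...
    id_clause = "id:(%s)" % " ".join(str(i) for i in ids)
    return urlencode({
        "q": "*:*",                          # match-all; ranking is irrelevant
        "fq": "{!cache=false}" + id_clause,  # filter, and skip the filter cache
        "fl": fields,                        # only fetch the fields you need
        "rows": len(ids),
    })

print(build_id_lookup_params([12345, 23456, 34567]))
```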

That's all I can think of at the moment without digging deep into why you
need to do this at all.

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Thu, Jul 18, 2013 at 10:46 AM, Brian Hurt bhur...@gmail.com wrote:




Re: Getting a large number of documents by id

2013-07-18 Thread Jack Krupansky
Solr really isn't designed for that kind of use case. If it happens to work 
well for your particular situation, great, but don't be surprised when 
performance suffers once you are well outside the normal usage for a search 
engine (10, 20, 50, or 100 results per page, with modest-sized query strings).


If you must get these 847 documents, fetch them in reasonable-sized batches, 
like 20, 50, or 100 at a time.
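That batching is trivial to do client-side; a minimal sketch (plain Python, with the batch size just an example):

```python
def batches(ids, size=100):
    # Yield successive size-sized chunks of ids, so each chunk can
    # become one reasonably-sized Solr query.
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

for chunk in batches(list(range(250)), 100):
    print(len(chunk))  # issue one query per chunk
```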


That said, there may be something else going on here, since a query for 847 
results should not take 4 seconds anyway.


Check QTime - is it 4 seconds?

Add debugQuery=true to your query and check the individual module times - 
which ones are the biggest hogs? Or, maybe it is none of them and the 
problem is elsewhere, like formatting the response, network problems, etc.


Hmmm... I wonder if the new real-time Get API would be better for your 
case. It takes a comma-separated list of document IDs (keys). Check it out:


http://wiki.apache.org/solr/RealTimeGet
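As a sketch, the request could be assembled like this (the host, port, and core name are placeholders; /get is the path the real-time get handler is registered under by default in recent Solr versions):

```python
# Sketch: fetch many documents by key through the real-time get handler,
# which takes a comma-separated ids parameter instead of a query.
from urllib.parse import urlencode

def realtime_get_url(base, ids):
    return "%s/get?%s" % (base, urlencode({"ids": ",".join(str(i) for i in ids)}))

print(realtime_get_url("http://localhost:8983/solr/collection1",
                       [12345, 23456, 34567]))
```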

-- Jack Krupansky




Re: Getting a large number of documents by id

2013-07-18 Thread Michael Della Bitta
Brian,

Have you tried the realtime get handler? It supports multiple documents.

http://wiki.apache.org/solr/RealTimeGet

Michael Della Bitta

Applications Developer

o: +1 646 532 3062  | c: +1 917 477 7906

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions (https://twitter.com/Appinions) | g+: plus.google.com/appinions
w: appinions.com (http://www.appinions.com/)


On Thu, Jul 18, 2013 at 10:46 AM, Brian Hurt bhur...@gmail.com wrote:




Re: Getting a large number of documents by id

2013-07-18 Thread Roman Chyla
Look at the speed of reading the data - it likely takes a long time to assemble
a big response, especially if there are many long fields. You may want to
try SSD disks, if you have that option.

Also, to gain a better understanding: start your Solr, start jvisualvm and
attach it to your running Solr. Start sending queries and observe where the
most time is spent - it is very easy, you don't have to be a programmer to
do it.

The crucial parts (though they will show up under different names) are:

1. query parsing
2. search execution
3. response assembly

Quite likely, your query is a huge boolean OR clause, which may not be as
efficient as a filter query.

Your use case is actually not at all exotic. There will soon be a JIRA
ticket that makes the scenario of sending/querying with a large number of IDs
less painful.

http://lucene.472066.n3.nabble.com/Solr-large-boolean-filter-td4070747.html#a4070964
http://lucene.472066.n3.nabble.com/ACL-implementation-Pseudo-join-performance-amp-Atomic-Updates-td4077894.html

But I would really recommend you do the jvisualvm measurement - that's
like bringing light into the darkness.

roman


On Thu, Jul 18, 2013 at 10:46 AM, Brian Hurt bhur...@gmail.com wrote:




Re: Getting a large number of documents by id

2013-07-18 Thread Alexandre Rafalovitch
And I guess, if only a subset of fields is being requested but other large
fields are present, there could be a cost to loading those extra fields into
memory before discarding them. In that case, using enableLazyFieldLoading may
help.
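For reference, that is a flag in the query section of solrconfig.xml; a minimal fragment, with the rest of the file omitted:

```xml
<!-- solrconfig.xml (fragment): defer loading of large stored fields
     until they are actually requested -->
<query>
  <enableLazyFieldLoading>true</enableLazyFieldLoading>
</query>
```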

Regards,
   Alex.



On Thu, Jul 18, 2013 at 11:47 AM, Roman Chyla roman.ch...@gmail.com wrote:




Re: Getting a large number of documents by id

2013-07-18 Thread Brian Hurt
Thanks everyone for the response.

On Thu, Jul 18, 2013 at 11:22 AM, Alexandre Rafalovitch
arafa...@gmail.com wrote:

 You could start from doing id:(12345 23456) to reduce the query length and
 possibly speed up parsing.


I didn't know about this syntax - it looks useful.


 You could also move the query from 'q' parameter to 'fq' parameter, since
 you probably don't care about ranking ('fq' does not rank).


Yes, I don't care about rank, so this helps.


 If these are unique every time, you could probably look at not caching
 (can't remember exact syntax).


 That's all I can think of at the moment without digging deep into why you
 need to do this at all.


Short version of a long story: I'm implementing a graph database on top of
Solr.  Which is not what Solr is designed for, I know.  This is a case
where I'm following a set of edges from a given node to its 847 children,
and I need to get the children.  And yes, I've looked at neo4j - it doesn't
help.


