Getting a large number of documents by id
I have a situation, common in our current use case, where I need to fetch a large number (many hundreds) of documents by id. What I'm doing currently is building a large query of the form id:12345 OR id:23456 OR ... and sending it off. Unfortunately, this query takes a long time, especially the first time it's executed. I'm seeing times of 4+ seconds for this query to return 847 documents. So, my question is: what should I be looking at to improve the performance here?

Brian
Re: Getting a large number of documents by id
You could start by using id:(12345 23456) to reduce the query length and possibly speed up parsing. You could also move the query from the 'q' parameter to the 'fq' parameter, since you probably don't care about ranking ('fq' does not rank). If these ID sets are unique every time, you could also look at not caching the filter (can't remember the exact syntax). That's all I can think of at the moment without digging deeper into why you need to do this at all.

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
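A minimal sketch of what Alex is suggesting, building the request parameters in Python. The {!cache=false} local param is the "not caching" syntax he couldn't recall; worth verifying against your Solr version's documentation before relying on it:

```python
from urllib.parse import urlencode

def build_id_filter_params(ids, skip_cache=True):
    """Build Solr params that fetch documents by id via a non-scoring fq.

    Assumes 'id' is the uniqueKey field. {!cache=false} asks Solr not to
    store this one-off filter in the filter cache.
    """
    prefix = "{!cache=false}" if skip_cache else ""
    fq = prefix + "id:(" + " ".join(str(i) for i in ids) + ")"
    return {
        "q": "*:*",        # match everything; the fq does the selection
        "fq": fq,
        "rows": len(ids),  # return all matches in one page
    }

params = build_id_filter_params([12345, 23456, 34567])
print(urlencode(params))
```

Since fq clauses don't score, Solr skips the ranking work entirely for this query.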
Re: Getting a large number of documents by id
Solr really isn't designed for that kind of use case. If it happens to work well for your particular situation, great, but don't complain when you are well outside the normal usage for a search engine (10, 20, 50, 100 results paged at a time, with modest-sized query strings). If you must get these 847 documents, fetch them in reasonably sized batches, like 20, 50, or 100 at a time.

That said, there may be something else going on here, since a query for 847 results should not take 4 seconds anyway. Check QTime - is it 4 seconds? Add debugQuery=true to your query and check the individual module times - which ones are the biggest hogs? Or maybe it is none of them, and the problem is elsewhere, like formatting the response, network problems, etc.

Hmmm... I wonder if the new real-time Get API would be better for your case. It takes a comma-separated list of document IDs (keys). Check it out:
http://wiki.apache.org/solr/RealTimeGet

-- Jack Krupansky
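Jack's batching suggestion might look like the sketch below; solr_fetch here is a placeholder for whatever client call runs one actual Solr query:

```python
def chunked(ids, batch_size=100):
    """Yield successive slices of ids so each Solr query stays small."""
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]

def fetch_by_id_batched(ids, solr_fetch, batch_size=100):
    """Fetch docs in batches. solr_fetch(batch) -> list of docs is a
    placeholder for whatever client call runs one Solr query."""
    docs = []
    for batch in chunked(ids, batch_size):
        docs.extend(solr_fetch(batch))
    return docs

# With a fake fetcher that just echoes the batch, 847 ids -> 9 queries:
queries = []
def fake_fetch(batch):
    queries.append(batch)
    return batch

docs = fetch_by_id_batched(list(range(847)), fake_fetch)
print(len(queries), len(docs))  # 9 847
```

Smaller batches keep each query string modest and let results stream back incrementally instead of one 4-second wait.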
Re: Getting a large number of documents by id
Brian,

Have you tried the real-time get handler? It supports fetching multiple documents:
http://wiki.apache.org/solr/RealTimeGet

Michael Della Bitta
Applications Developer
o: +1 646 532 3062 | c: +1 917 477 7906
appinions inc.
"The Science of Influence Marketing"
18 East 41st Street
New York, NY 10017
t: @appinions https://twitter.com/Appinions | g+: plus.google.com/appinions
w: appinions.com http://www.appinions.com/
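The real-time get handler takes its keys in a single comma-separated ids parameter. A sketch of building that request (the host and core name are made up; /get is the handler's conventional path):

```python
from urllib.parse import urlencode

def realtime_get_url(base_url, ids):
    """Build a real-time get request: one 'ids' param, comma-separated keys.

    base_url is a hypothetical core URL, e.g.
    http://localhost:8983/solr/mycore
    """
    query = urlencode({"ids": ",".join(str(i) for i in ids), "wt": "json"})
    return base_url.rstrip("/") + "/get?" + query

print(realtime_get_url("http://localhost:8983/solr/mycore", [12345, 23456]))
# -> http://localhost:8983/solr/mycore/get?ids=12345%2C23456&wt=json
```

Because real-time get looks documents up by unique key (even uncommitted ones in the update log), it avoids running a search at all.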
Re: Getting a large number of documents by id
Look at the speed of reading the data - it likely takes a long time to assemble a big response, especially if there are many long fields - you may want to try SSD disks, if you have that option.

Also, to gain a better understanding: start your Solr, start jvisualvm, and attach to your running Solr. Start sending queries and observe where the most time is spent - it is very easy, you don't have to be a programmer to do it. The crucial parts (though they will show up under different names) are:

1. query parsing
2. search execution
3. response assembly

Quite likely your query is a huge boolean OR clause, which may not be as efficient as some filter query. Your use case is actually not at all exotic. There will soon be a JIRA ticket that makes the scenario of sending/querying with a large number of IDs less painful.

http://lucene.472066.n3.nabble.com/Solr-large-boolean-filter-td4070747.html#a4070964
http://lucene.472066.n3.nabble.com/ACL-implementation-Pseudo-join-performance-amp-Atomic-Updates-td4077894.html

But I would really recommend you to do the jvisualvm measurement - that's like bringing light into the darkness.

roman
Re: Getting a large number of documents by id
And I guess, if only a subset of fields is being requested but other large fields are present, there could be the cost of loading those extra fields into memory before discarding them. In that case, using enableLazyFieldLoading may help.

Regards,
   Alex.
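The flag Alex mentions lives in solrconfig.xml. A minimal fragment, on the assumption that your documents carry large stored fields you usually don't request:

```xml
<!-- solrconfig.xml: only materialize the stored fields that the 'fl'
     parameter actually asks for; other fields are loaded lazily -->
<enableLazyFieldLoading>true</enableLazyFieldLoading>
```

Combined with an fl parameter listing just the fields you need, this keeps large unrequested fields off the response-assembly path.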
Re: Getting a large number of documents by id
Thanks everyone for the responses.

On Thu, Jul 18, 2013 at 11:22 AM, Alexandre Rafalovitch arafa...@gmail.com wrote:

> You could start from doing id:(12345 23456) to reduce the query length and possibly speed up parsing.

I didn't know about this syntax - it looks useful.

> You could also move the query from 'q' parameter to 'fq' parameter, since you probably don't care about ranking ('fq' does not rank).

Yes, I don't care about rank, so this helps.

> If these are unique every time, you could probably look at not caching (can't remember exact syntax). That's all I can think of at the moment without digging deep into why you need to do this at all.

Short version of a long story: I'm implementing a graph database on top of Solr. Which is not what Solr is designed for, I know. This is a case where I'm following a set of edges from a given node to its 847 children, and I need to get the children. And yes, I've looked at neo4j - it doesn't help.

Brian