Re: Retrieving large num of docs
Strange. Did you ever figure out the source of the performance difference?

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch

----- Original Message -----
From: Raghuveer Kancherla raghuveer.kanche...@aplopio.com
To: solr-user@lucene.apache.org
Sent: Sat, December 5, 2009 12:05:49 PM
Subject: Re: Retrieving large num of docs

Hi Otis,

I think my experiments are not conclusive about the reduction in search time. I was playing around with various configurations to reduce the time to retrieve documents from Solr. I am sure that after changing the two multi-valued text fields from stored to un-stored, retrieval time (query time + time to load the stored fields) became very fast. I was expecting the lazy field loading setting in solrconfig to take care of this, but apparently it is not working as expected.

Out of curiosity, I removed these 2 fields from the index (this time I am not even indexing them) and my search time got better (10 times better). However, I am still trying to isolate the reason for the search time reduction. It may be because of 2 fewer fields to search in, or because of the reduction in the size of the index, or maybe something else. I am not sure if lazy field loading has any part in explaining this.

- Raghu

On Fri, Dec 4, 2009 at 3:07 AM, Otis Gospodnetic wrote:

Hm, hm, interesting. I was looking into something like this the other day (BIG indexed+stored text fields). After seeing enableLazyFieldLoading=true in solrconfig, and after seeing that fl didn't include those big fields, I thought: hm, so Lucene/Solr will not be pulling those large fields from disk, OK. You are saying that this may not be true, based on your experiment?

And what I'm calling your experiment means that you reindexed the same data, but without the 2 multi-valued text fields... and that was the only change you made, and got circa 10x search performance improvement? Sorry for repeating your words, just trying to confirm and understand.
Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch

----- Original Message -----
From: Raghuveer Kancherla
To: solr-user@lucene.apache.org
Sent: Thu, December 3, 2009 8:43:16 AM
Subject: Re: Retrieving large num of docs

Hi Hoss,

I was experimenting with various queries to solve this problem, and in one such test I remember that requesting only the ID did not change the retrieval time. To be sure, I tested it again using the curl command today, and it confirms my previous observation. Also, the enableLazyFieldLoading setting is set to true in my solrconfig.

Another general observation (off topic) is that having a moderately large multi-valued text field (~200 entries) in the index seems to slow down the search significantly. I removed the 2 multi-valued text fields from my index and my search got ~10 times faster. :)

- Raghu

On Thu, Dec 3, 2009 at 2:14 AM, Chris Hostetter wrote:

: I think I solved the problem of retrieving 300 docs per request for now. The
: problem was that I was storing 2 moderately large multivalued text fields
: though I was not retrieving them during search time. I reindexed all my
: data without storing these fields. Now the response time (time for Solr to
: return the http response) is very close to the QTime Solr is showing in the

Hmmm, two comments:

1) the example URL from your previous mail...

: http://localhost:1212/solr/select/?rows=300&q=%28ResumeAllText%3A%28%28%28%22java+j2ee%22+%28java+j2ee%29%29%29%5E4%29%5E1.0%29&start=0&wt=python

...doesn't match your earlier statement that you are only returning the id field (there is no fl param in that URL) ... are you certain you weren't returning those large stored fields in the response?

2) assuming you were actually using an fl param to limit the fields, make sure you have this setting in your solrconfig.xml...

<enableLazyFieldLoading>true</enableLazyFieldLoading>

...that should make it pretty fast to return only a few fields of each document, even if you do have some jumbo stored fields that aren't being returned.

-Hoss
Re: Retrieving large num of docs
Hi Otis,

I think my experiments are not conclusive about the reduction in search time. I was playing around with various configurations to reduce the time to retrieve documents from Solr. I am sure that after changing the two multi-valued text fields from stored to un-stored, retrieval time (query time + time to load the stored fields) became very fast. I was expecting the lazy field loading setting in solrconfig to take care of this, but apparently it is not working as expected.

Out of curiosity, I removed these 2 fields from the index (this time I am not even indexing them) and my search time got better (10 times better). However, I am still trying to isolate the reason for the search time reduction. It may be because of 2 fewer fields to search in, or because of the reduction in the size of the index, or maybe something else. I am not sure if lazy field loading has any part in explaining this.

- Raghu

On Fri, Dec 4, 2009 at 3:07 AM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:

Hm, hm, interesting. I was looking into something like this the other day (BIG indexed+stored text fields). After seeing enableLazyFieldLoading=true in solrconfig, and after seeing that fl didn't include those big fields, I thought: hm, so Lucene/Solr will not be pulling those large fields from disk, OK. You are saying that this may not be true, based on your experiment?

And what I'm calling your experiment means that you reindexed the same data, but without the 2 multi-valued text fields... and that was the only change you made, and got circa 10x search performance improvement? Sorry for repeating your words, just trying to confirm and understand.
Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch

----- Original Message -----
From: Raghuveer Kancherla raghuveer.kanche...@aplopio.com
To: solr-user@lucene.apache.org
Sent: Thu, December 3, 2009 8:43:16 AM
Subject: Re: Retrieving large num of docs

Hi Hoss,

I was experimenting with various queries to solve this problem, and in one such test I remember that requesting only the ID did not change the retrieval time. To be sure, I tested it again using the curl command today, and it confirms my previous observation. Also, the enableLazyFieldLoading setting is set to true in my solrconfig.

Another general observation (off topic) is that having a moderately large multi-valued text field (~200 entries) in the index seems to slow down the search significantly. I removed the 2 multi-valued text fields from my index and my search got ~10 times faster. :)

- Raghu

On Thu, Dec 3, 2009 at 2:14 AM, Chris Hostetter wrote:

: I think I solved the problem of retrieving 300 docs per request for now. The
: problem was that I was storing 2 moderately large multivalued text fields
: though I was not retrieving them during search time. I reindexed all my
: data without storing these fields. Now the response time (time for Solr to
: return the http response) is very close to the QTime Solr is showing in the

Hmmm, two comments:

1) the example URL from your previous mail...

: http://localhost:1212/solr/select/?rows=300&q=%28ResumeAllText%3A%28%28%28%22java+j2ee%22+%28java+j2ee%29%29%29%5E4%29%5E1.0%29&start=0&wt=python

...doesn't match your earlier statement that you are only returning the id field (there is no fl param in that URL) ... are you certain you weren't returning those large stored fields in the response?

2) assuming you were actually using an fl param to limit the fields, make sure you have this setting in your solrconfig.xml...

<enableLazyFieldLoading>true</enableLazyFieldLoading>

...that should make it pretty fast to return only a few fields of each document, even if you do have some jumbo stored fields that aren't being returned.

-Hoss
Re: Retrieving large num of docs
Hi Hoss,

I was experimenting with various queries to solve this problem, and in one such test I remember that requesting only the ID did not change the retrieval time. To be sure, I tested it again using the curl command today, and it confirms my previous observation. Also, the enableLazyFieldLoading setting is set to true in my solrconfig.

Another general observation (off topic) is that having a moderately large multi-valued text field (~200 entries) in the index seems to slow down the search significantly. I removed the 2 multi-valued text fields from my index and my search got ~10 times faster. :)

- Raghu

On Thu, Dec 3, 2009 at 2:14 AM, Chris Hostetter hossman_luc...@fucit.org wrote:

: I think I solved the problem of retrieving 300 docs per request for now. The
: problem was that I was storing 2 moderately large multivalued text fields
: though I was not retrieving them during search time. I reindexed all my
: data without storing these fields. Now the response time (time for Solr to
: return the http response) is very close to the QTime Solr is showing in the

Hmmm, two comments:

1) the example URL from your previous mail...

: http://localhost:1212/solr/select/?rows=300&q=%28ResumeAllText%3A%28%28%28%22java+j2ee%22+%28java+j2ee%29%29%29%5E4%29%5E1.0%29&start=0&wt=python

...doesn't match your earlier statement that you are only returning the id field (there is no fl param in that URL) ... are you certain you weren't returning those large stored fields in the response?

2) assuming you were actually using an fl param to limit the fields, make sure you have this setting in your solrconfig.xml...

<enableLazyFieldLoading>true</enableLazyFieldLoading>

...that should make it pretty fast to return only a few fields of each document, even if you do have some jumbo stored fields that aren't being returned.

-Hoss
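[For reference, the switch Hoss is quoting is a plain element in solrconfig.xml. A sketch of how it looks in context is below; in the stock example config it sits inside the query section, but the exact surroundings depend on your own config file:]

```xml
<query>
  <!-- materialize only the stored fields actually requested via fl;
       other stored fields are loaded lazily if touched later -->
  <enableLazyFieldLoading>true</enableLazyFieldLoading>
</query>
```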
Re: Retrieving large num of docs
Hm, hm, interesting. I was looking into something like this the other day (BIG indexed+stored text fields). After seeing enableLazyFieldLoading=true in solrconfig, and after seeing that fl didn't include those big fields, I thought: hm, so Lucene/Solr will not be pulling those large fields from disk, OK. You are saying that this may not be true, based on your experiment?

And what I'm calling your experiment means that you reindexed the same data, but without the 2 multi-valued text fields... and that was the only change you made, and got circa 10x search performance improvement? Sorry for repeating your words, just trying to confirm and understand.

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch

----- Original Message -----
From: Raghuveer Kancherla raghuveer.kanche...@aplopio.com
To: solr-user@lucene.apache.org
Sent: Thu, December 3, 2009 8:43:16 AM
Subject: Re: Retrieving large num of docs

Hi Hoss,

I was experimenting with various queries to solve this problem, and in one such test I remember that requesting only the ID did not change the retrieval time. To be sure, I tested it again using the curl command today, and it confirms my previous observation. Also, the enableLazyFieldLoading setting is set to true in my solrconfig.

Another general observation (off topic) is that having a moderately large multi-valued text field (~200 entries) in the index seems to slow down the search significantly. I removed the 2 multi-valued text fields from my index and my search got ~10 times faster. :)

- Raghu

On Thu, Dec 3, 2009 at 2:14 AM, Chris Hostetter wrote:

: I think I solved the problem of retrieving 300 docs per request for now. The
: problem was that I was storing 2 moderately large multivalued text fields
: though I was not retrieving them during search time. I reindexed all my
: data without storing these fields. Now the response time (time for Solr to
: return the http response) is very close to the QTime Solr is showing in the

Hmmm, two comments:

1) the example URL from your previous mail...

: http://localhost:1212/solr/select/?rows=300&q=%28ResumeAllText%3A%28%28%28%22java+j2ee%22+%28java+j2ee%29%29%29%5E4%29%5E1.0%29&start=0&wt=python

...doesn't match your earlier statement that you are only returning the id field (there is no fl param in that URL) ... are you certain you weren't returning those large stored fields in the response?

2) assuming you were actually using an fl param to limit the fields, make sure you have this setting in your solrconfig.xml...

<enableLazyFieldLoading>true</enableLazyFieldLoading>

...that should make it pretty fast to return only a few fields of each document, even if you do have some jumbo stored fields that aren't being returned.

-Hoss
Re: Retrieving large num of docs
: I think I solved the problem of retrieving 300 docs per request for now. The
: problem was that I was storing 2 moderately large multivalued text fields
: though I was not retrieving them during search time. I reindexed all my
: data without storing these fields. Now the response time (time for Solr to
: return the http response) is very close to the QTime Solr is showing in the

Hmmm, two comments:

1) the example URL from your previous mail...

: http://localhost:1212/solr/select/?rows=300&q=%28ResumeAllText%3A%28%28%28%22java+j2ee%22+%28java+j2ee%29%29%29%5E4%29%5E1.0%29&start=0&wt=python

...doesn't match your earlier statement that you are only returning the id field (there is no fl param in that URL) ... are you certain you weren't returning those large stored fields in the response?

2) assuming you were actually using an fl param to limit the fields, make sure you have this setting in your solrconfig.xml...

<enableLazyFieldLoading>true</enableLazyFieldLoading>

...that should make it pretty fast to return only a few fields of each document, even if you do have some jumbo stored fields that aren't being returned.

-Hoss
Re: Retrieving large num of docs
Hi Hoss/Andrew,

I think I solved the problem of retrieving 300 docs per request for now. The problem was that I was storing 2 moderately large multivalued text fields, though I was not retrieving them during search time. I reindexed all my data without storing these fields. Now the response time (time for Solr to return the http response) is very close to the QTime Solr is showing in the logs.

Thanks for all the help,
Raghu

On Mon, Nov 30, 2009 at 11:37 AM, Raghuveer Kancherla raghuveer.kanche...@aplopio.com wrote:

Thanks Hoss,

In my previous mail, I was measuring the system time difference between sending a (http) request and receiving a response. This was being run on a (different) client machine.

Like you suggested, I tried to time the response on the server itself as follows:

$ /usr/bin/time -p curl -sS -o solr.out http://localhost:1212/solr/select/?rows=300&q=%28ResumeAllText%3A%28%28%28%22java+j2ee%22+%28java+j2ee%29%29%29%5E4%29%5E1.0%29&start=0&wt=python
real 3.49
user 0.00
sys 0.00

The query time in the Solr log shows me QTime=600, and the size of solr.out is 843 kB. As you've mentioned, Solr shouldn't give these kinds of numbers for 300 docs, and we're quite perplexed as to what's going on.

Thanks,
Raghu

On Mon, Nov 30, 2009 at 6:00 AM, Chris Hostetter hossman_luc...@fucit.org wrote:

: I am using Solr1.4 for searching through half a million documents. The
: problem is, I want to retrieve nearly 200 documents for each search query.
: The query time in Solr logs is showing 0.02 seconds and I am fairly happy
: with that. However Solr is taking a long time (4 to 5 secs) to return the
: results (I think it is because of the number of docs I am requesting). I
: tried returning only the id's (unique key) without any other stored fields,
: but it is not helping me improve the response times (time to return the id's
: of matching documents).

What exactly does your request URL look like, and how exactly are you timing the total response time?

200 isn't a very big number for the rows param -- people who want to get 100K documents back in their response at a time may have problems, but 200 is not that big. So like I said: how exactly are you timing things?

My guess: it's more likely that network overhead or the performance of your client code (reading the data off the wire) is causing your timing code to seem slow, than it is that Solr is taking 5 seconds to write out those document IDs. I suspect if you try hitting the same exact URL using curl via localhost, you'll see the total response time be a lot less than 5 seconds.

Here's an example of a query that asks Solr to return *every* field from 500 documents, in the XML format. And these are not small documents...

$ /usr/bin/time -p curl -sS -o /tmp/solr.out http://localhost:5051/solr/select/?q=doctype:product&version=2.2&start=0&rows=500&indent=on
real 0.07
user 0.00
sys 0.00
[chr...@c18-ssa-so-dfll-qry1 ~]$ du -sh /tmp/solr.out
1.6M    /tmp/solr.out

...that's 1.6 MB of 500 Solr documents with all of their fields in verbose XML format (including indenting), fetched in 70ms. If it's taking 5 seconds for you to get just the ids of 200 docs, you've got a problem somewhere, and I'm 99% certain it's not in Solr. What does a similar timed curl command for your URL look like when you run it on your Solr server?

-Hoss
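The client-vs-server timing split Hoss describes can also be reproduced from Python, which is what the original poster's client uses. This is a sketch only: the host, port, and field names are assumptions to be adjusted, and the actual request of course needs a running Solr instance.

```python
import time
import urllib.parse
import urllib.request

# Hypothetical host/port and field names -- adjust to your setup.
params = {
    "q": 'ResumeAllText:"java j2ee"',
    "fl": "id",      # ask for the unique key only
    "rows": 300,
    "wt": "python",
}
url = "http://localhost:8983/solr/select/?" + urllib.parse.urlencode(params)

def timed_fetch(url):
    """Return (elapsed_seconds, body_size_in_bytes) for one request."""
    start = time.time()
    with urllib.request.urlopen(url) as resp:
        body = resp.read()
    return time.time() - start, len(body)
```

Comparing the elapsed time measured here against the QTime logged by Solr separates Solr's own search time from response serialization and network transfer, which is exactly the distinction this thread turns on.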
Re: Retrieving large num of docs
: I am using Solr1.4 for searching through half a million documents. The
: problem is, I want to retrieve nearly 200 documents for each search query.
: The query time in Solr logs is showing 0.02 seconds and I am fairly happy
: with that. However Solr is taking a long time (4 to 5 secs) to return the
: results (I think it is because of the number of docs I am requesting). I
: tried returning only the id's (unique key) without any other stored fields,
: but it is not helping me improve the response times (time to return the id's
: of matching documents).

What exactly does your request URL look like, and how exactly are you timing the total response time?

200 isn't a very big number for the rows param -- people who want to get 100K documents back in their response at a time may have problems, but 200 is not that big. So like I said: how exactly are you timing things?

My guess: it's more likely that network overhead or the performance of your client code (reading the data off the wire) is causing your timing code to seem slow, than it is that Solr is taking 5 seconds to write out those document IDs. I suspect if you try hitting the same exact URL using curl via localhost, you'll see the total response time be a lot less than 5 seconds.

Here's an example of a query that asks Solr to return *every* field from 500 documents, in the XML format. And these are not small documents...

$ /usr/bin/time -p curl -sS -o /tmp/solr.out http://localhost:5051/solr/select/?q=doctype:product&version=2.2&start=0&rows=500&indent=on
real 0.07
user 0.00
sys 0.00
[chr...@c18-ssa-so-dfll-qry1 ~]$ du -sh /tmp/solr.out
1.6M    /tmp/solr.out

...that's 1.6 MB of 500 Solr documents with all of their fields in verbose XML format (including indenting), fetched in 70ms. If it's taking 5 seconds for you to get just the ids of 200 docs, you've got a problem somewhere, and I'm 99% certain it's not in Solr. What does a similar timed curl command for your URL look like when you run it on your Solr server?

-Hoss
Re: Retrieving large num of docs
Thanks Hoss,

In my previous mail, I was measuring the system time difference between sending a (http) request and receiving a response. This was being run on a (different) client machine.

Like you suggested, I tried to time the response on the server itself as follows:

$ /usr/bin/time -p curl -sS -o solr.out http://localhost:1212/solr/select/?rows=300&q=%28ResumeAllText%3A%28%28%28%22java+j2ee%22+%28java+j2ee%29%29%29%5E4%29%5E1.0%29&start=0&wt=python
real 3.49
user 0.00
sys 0.00

The query time in the Solr log shows me QTime=600, and the size of solr.out is 843 kB. As you've mentioned, Solr shouldn't give these kinds of numbers for 300 docs, and we're quite perplexed as to what's going on.

Thanks,
Raghu

On Mon, Nov 30, 2009 at 6:00 AM, Chris Hostetter hossman_luc...@fucit.org wrote:

: I am using Solr1.4 for searching through half a million documents. The
: problem is, I want to retrieve nearly 200 documents for each search query.
: The query time in Solr logs is showing 0.02 seconds and I am fairly happy
: with that. However Solr is taking a long time (4 to 5 secs) to return the
: results (I think it is because of the number of docs I am requesting). I
: tried returning only the id's (unique key) without any other stored fields,
: but it is not helping me improve the response times (time to return the id's
: of matching documents).

What exactly does your request URL look like, and how exactly are you timing the total response time?

200 isn't a very big number for the rows param -- people who want to get 100K documents back in their response at a time may have problems, but 200 is not that big. So like I said: how exactly are you timing things?

My guess: it's more likely that network overhead or the performance of your client code (reading the data off the wire) is causing your timing code to seem slow, than it is that Solr is taking 5 seconds to write out those document IDs. I suspect if you try hitting the same exact URL using curl via localhost, you'll see the total response time be a lot less than 5 seconds.

Here's an example of a query that asks Solr to return *every* field from 500 documents, in the XML format. And these are not small documents...

$ /usr/bin/time -p curl -sS -o /tmp/solr.out http://localhost:5051/solr/select/?q=doctype:product&version=2.2&start=0&rows=500&indent=on
real 0.07
user 0.00
sys 0.00
[chr...@c18-ssa-so-dfll-qry1 ~]$ du -sh /tmp/solr.out
1.6M    /tmp/solr.out

...that's 1.6 MB of 500 Solr documents with all of their fields in verbose XML format (including indenting), fetched in 70ms. If it's taking 5 seconds for you to get just the ids of 200 docs, you've got a problem somewhere, and I'm 99% certain it's not in Solr. What does a similar timed curl command for your URL look like when you run it on your Solr server?

-Hoss
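As an aside, the percent-encoded q parameter in the URL quoted above can be decoded with Python's standard library to see the query Solr actually receives:

```python
from urllib.parse import unquote_plus

# The q parameter exactly as it appears in the thread's URL.
encoded = ("%28ResumeAllText%3A%28%28%28%22java+j2ee%22+"
           "%28java+j2ee%29%29%29%5E4%29%5E1.0%29")

# unquote_plus also turns '+' into spaces, matching form encoding.
print(unquote_plus(encoded))
# -> (ResumeAllText:((("java j2ee" (java j2ee)))^4)^1.0)
```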
Re: Retrieving large num of docs
Hi Andrew,

I applied the patch you suggested. I am not finding any significant changes in the response times, and I am wondering if I forgot some important configuration setting etc. Here is what I did:

1. Wrote a small program using solrj to use EmbeddedSolrServer (most of the code is from the Solr wiki), ran the server on an index of ~700k docs, and noted down the avg response time
2. Applied the SOLR-797.patch to the source code of Solr1.4
3. Compiled the source code and rebuilt the jar files
4. Reran step 1 using the new jar files

Am I supposed to make any other config changes in order to see the performance jump that you are able to achieve?

Thanks a lot,
Raghu

On Fri, Nov 27, 2009 at 3:16 PM, AHMET ARSLAN iori...@yahoo.com wrote:

> Hi Andrew, We are running solr using its http interface from python. From the resources I could find, EmbeddedSolrServer is possible only if I am using solr from a java program. It will be useful to understand if a significant part of the performance increase is due to bypassing HTTP before going down this path. In the mean time I am trying my luck with the other suggestions. Can you share the patch that helps cache solr documents instead of lucene documents?

Maybe these links can help:
http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
http://www.lucidimagination.com/Downloads/LucidGaze-for-Solr

How often do you update your index? Is your index optimized? Configuring caching can also help:
http://wiki.apache.org/solr/SolrCaching
http://wiki.apache.org/solr/SolrPerformanceFactors
Re: Retrieving large num of docs
Hi Raghu,

Let me describe our use case in more detail; probably that will clarify things. The usual use case for Lucene/Solr is retrieving a small portion of the result set (10-20 documents). In our case we need to read the whole result set, and this creates a huge load on the Lucene index, meaning a lot of IO. Keep in mind that we have a large number of stored fields in the index.

In our case there's one thing that makes things simpler: our index is so small that we can get every document in cache. This means that even if we retrieve all documents for every result set, we don't retrieve them from the Lucene index, and then the performance should be OK. But here we've got 2 problems:

1. Solr caches Lucene's Document instances, and in the case of retrieving the whole result set it recreates SolrDocument instances every time. This creates a load on the CPU and in particular on Java GC.
2. EmbeddedSolrServer converts the whole response into a byte array and then restores it back, converting Lucene's Document and DocList instances to Solr's SolrDocument and SolrDocumentList instances. This creates additional load on the CPU and GC.

We patched Solr to eliminate those things, and that fixed our performance problems. I think that if you don't place all your documents in caches and/or you don't use stored fields, retrieving the ID field only, then those improvements probably won't help you. I suggest you first find your bottlenecks: look at IO, memory usage, etc. Using a profiler is the best thing too. Probably you can use some tools from Lucid Imagination for profiling.

On Sat, Nov 28, 2009 at 4:47 PM, Raghuveer Kancherla raghuveer.kanche...@aplopio.com wrote:

Hi Andrew,

I applied the patch you suggested. I am not finding any significant changes in the response times, and I am wondering if I forgot some important configuration setting etc. Here is what I did:

1. Wrote a small program using solrj to use EmbeddedSolrServer (most of the code is from the Solr wiki), ran the server on an index of ~700k docs, and noted down the avg response time
2. Applied the SOLR-797.patch to the source code of Solr1.4
3. Compiled the source code and rebuilt the jar files
4. Reran step 1 using the new jar files

Am I supposed to make any other config changes in order to see the performance jump that you are able to achieve?

Thanks a lot,
Raghu

On Fri, Nov 27, 2009 at 3:16 PM, AHMET ARSLAN iori...@yahoo.com wrote:

> Hi Andrew, We are running solr using its http interface from python. From the resources I could find, EmbeddedSolrServer is possible only if I am using solr from a java program. It will be useful to understand if a significant part of the performance increase is due to bypassing HTTP before going down this path. In the mean time I am trying my luck with the other suggestions. Can you share the patch that helps cache solr documents instead of lucene documents?

Maybe these links can help:
http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
http://www.lucidimagination.com/Downloads/LucidGaze-for-Solr

How often do you update your index? Is your index optimized? Configuring caching can also help:
http://wiki.apache.org/solr/SolrCaching
http://wiki.apache.org/solr/SolrPerformanceFactors

--
Andrew Klochkov
Senior Software Engineer, Grid Dynamics
Re: Retrieving large num of docs
Hi Andrew,

We are running Solr using its http interface from Python. From the resources I could find, EmbeddedSolrServer is possible only if I am using Solr from a Java program. It will be useful to understand if a significant part of the performance increase is due to bypassing HTTP before going down this path. In the meantime I am trying my luck with the other suggestions. Can you share the patch that helps cache Solr documents instead of Lucene documents?

On a different note, I am wondering why it takes 4-5 seconds for Solr to return the IDs of ranked documents when it can rank the results in about 20 milliseconds. Am I missing something here?

Thanks,
Raghu

On Fri, Nov 27, 2009 at 2:15 AM, Andrey Klochkov akloch...@griddynamics.com wrote:

Hi,

We obtain ALL documents for every query; the index size is about 50k. We use a number of stored fields. Often the result set size is several thousands of docs. We did the following things to make it faster:

1. Use EmbeddedSolrServer
2. Patch Solr to avoid unnecessary marshalling while using EmbeddedSolrServer (there's an issue in Solr JIRA)
3. Patch Solr to cache SolrDocument instances instead of Lucene's Document instances. I was going to share this patch, but then decided that our usage of Solr is not common and this functionality is useless in most cases
4. We have all documents in cache
5. In fact our index is stored in a data grid, not a file system. But as tests showed, this is not important, because the standard FSDirectory is faster if you have enough RAM free for OS caches.

These changes improved the performance very much, so in the end we have performance comparable (about 3-5 times slower) to the proper Solr usage (obtaining the first 20 documents). To get more details on how different Solr components perform, we injected perf4j statements into key points in the code. And a profiler was helpful too. Hope it helps somehow.

On Thu, Nov 26, 2009 at 8:48 PM, Raghuveer Kancherla raghuveer.kanche...@aplopio.com wrote:

Hi,

I am using Solr1.4 for searching through half a million documents. The problem is, I want to retrieve nearly 200 documents for each search query. The query time in Solr logs is showing 0.02 seconds and I am fairly happy with that. However, Solr is taking a long time (4 to 5 secs) to return the results (I think it is because of the number of docs I am requesting). I tried returning only the id's (unique key) without any other stored fields, but it is not helping me improve the response times (time to return the id's of matching documents). I understand that retrieving 200 documents for each search term is impractical in most scenarios, but I don't have any other option. Any pointers on how to improve the response times will be a great help.

Thanks,
Raghu

--
Andrew Klochkov
Senior Software Engineer, Grid Dynamics
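Since the thread's queries use wt=python from a Python client, the response body is a Python literal that can be parsed without any XML/JSON layer. A small sketch with a hand-made sample response (the document values are invented; the shape follows Solr's python response writer):

```python
import ast

# Invented sample in the shape of Solr's wt=python output.
sample = ("{'responseHeader':{'status':0,'QTime':20},"
          "'response':{'numFound':3,'start':0,"
          "'docs':[{'id':'doc1'},{'id':'doc2'},{'id':'doc3'}]}}")

# ast.literal_eval parses literals only, so it is safer than eval().
data = ast.literal_eval(sample)
ids = [doc["id"] for doc in data["response"]["docs"]]
print(ids)
# -> ['doc1', 'doc2', 'doc3']
```

Timing just the `urlopen`/read step against `QTime` in `responseHeader` is one way to see how much of the 4-5 seconds is spent outside Solr's search itself.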
Re: Retrieving large num of docs
> Hi Andrew, We are running solr using its http interface from python. From the resources I could find, EmbeddedSolrServer is possible only if I am using solr from a java program. It will be useful to understand if a significant part of the performance increase is due to bypassing HTTP before going down this path. In the mean time I am trying my luck with the other suggestions. Can you share the patch that helps cache solr documents instead of lucene documents?

Maybe these links can help:
http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
http://www.lucidimagination.com/Downloads/LucidGaze-for-Solr

How often do you update your index? Is your index optimized? Configuring caching can also help:
http://wiki.apache.org/solr/SolrCaching
http://wiki.apache.org/solr/SolrPerformanceFactors
Re: Retrieving large num of docs
Hi,

We obtain ALL documents for every query; the index size is about 50k. We use a number of stored fields. Often the result set size is several thousands of docs. We did the following things to make it faster:

1. Use EmbeddedSolrServer
2. Patch Solr to avoid unnecessary marshalling while using EmbeddedSolrServer (there's an issue in Solr JIRA)
3. Patch Solr to cache SolrDocument instances instead of Lucene's Document instances. I was going to share this patch, but then decided that our usage of Solr is not common and this functionality is useless in most cases
4. We have all documents in cache
5. In fact our index is stored in a data grid, not a file system. But as tests showed, this is not important, because the standard FSDirectory is faster if you have enough RAM free for OS caches.

These changes improved the performance very much, so in the end we have performance comparable (about 3-5 times slower) to the proper Solr usage (obtaining the first 20 documents). To get more details on how different Solr components perform, we injected perf4j statements into key points in the code. And a profiler was helpful too. Hope it helps somehow.

On Thu, Nov 26, 2009 at 8:48 PM, Raghuveer Kancherla raghuveer.kanche...@aplopio.com wrote:

Hi,

I am using Solr1.4 for searching through half a million documents. The problem is, I want to retrieve nearly 200 documents for each search query. The query time in Solr logs is showing 0.02 seconds and I am fairly happy with that. However, Solr is taking a long time (4 to 5 secs) to return the results (I think it is because of the number of docs I am requesting). I tried returning only the id's (unique key) without any other stored fields, but it is not helping me improve the response times (time to return the id's of matching documents). I understand that retrieving 200 documents for each search term is impractical in most scenarios, but I don't have any other option. Any pointers on how to improve the response times will be a great help.

Thanks,
Raghu

--
Andrew Klochkov
Senior Software Engineer, Grid Dynamics