Re: Improving performance to return 2000+ documents
Thanks Erick/Jagdish. Just to give some background on my queries:

1. All my queries are unique. A query can be: "ipod" and "ipod 8gb" (but these are unique). There are about 1.2M queries in total. So I assume setting a high queryResultCache, queryResultWindowSize and queryResultMaxDocsCached won't help.

2. I have these cache settings:

<documentCache class="solr.LRUCache" size="1000000" initialSize="1000000" autowarmCount="0" cleanupThread="true"/>
// My understanding is that documentCache will help me the most, because Solr will cache retrieved documents.
// Stats for documentCache: http://apaste.info/hknh

<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0" cleanupThread="true"/>
// Default, since my queries are unique.

<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
// Not sure how I can use filterCache, so I am keeping the default.

<enableLazyFieldLoading>true</enableLazyFieldLoading>
<queryResultWindowSize>100</queryResultWindowSize>
<queryResultMaxDocsCached>100</queryResultMaxDocsCached>

I think the question can also be framed as: how can I optimize Solr response time over a 50M-product catalog for unique queries that retrieve 2000 documents in one go?

I looked at writing a custom Solr search component, but I think writing a proxy around Solr was easier, so I went ahead with that approach.

Thanks,
-Utkarsh

On Sun, Jun 30, 2013 at 6:54 PM, Jagdish Nomula jagd...@simplyhired.com wrote: ...
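On the filterCache question above: filterCache is only exercised when part of the request is expressed as an fq rather than packed into q, so one way to get value out of it even with unique keyword queries is to move any constraint that repeats across queries into a filter. A minimal SolrJ 4.x sketch; the host, core name and the inStock field are hypothetical, not taken from this thread:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FilterCacheExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical host/core; adjust to your cluster.
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/prodinfo");

        // The unique keyword part stays in q (it won't cache well),
        // but a constraint moved into fq is cached in filterCache
        // and reused across otherwise-unique queries.
        SolrQuery q = new SolrQuery("allText:ipod");
        q.addFilterQuery("inStock:true");   // hypothetical repeated constraint
        q.setRows(2000);

        QueryResponse rsp = solr.query(q);
        System.out.println("found: " + rsp.getResults().getNumFound());
    }
}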
Re: Improving performance to return 2000+ documents
Thanks Erick/Peter. This is an offline process, used by a relevancy engine implemented around Solr. The engine computes boost scores for related keywords based on clickstream data.

I.e., say the clickstream has: ipad=upc1,upc2,upc3. I query Solr with the keyword "ipad" (to get 2000 documents) and then make 3 individual queries for upc1, upc2, upc3 (which are fast). The data is then used to compute keywords related to "ipad" with their boost values. So I cannot really replace that, since I need full-text search over my dataset to retrieve the top 2000 documents.

I tried paging: I retrieve 500 Solr documents 4 times (0-500, 500-1000, ...), but don't see any improvement.

Some questions:

1. Maybe the JVM size might help? This is what I see in the dashboard:
Physical Memory: 76.2%
Swap Space: NaN% (don't have any swap space, running on AWS EBS)
File Descriptor Count: 4.7%
JVM-Memory: 73.8%
Screenshot: http://i.imgur.com/aegKzP6.png

2. Will reducing the shards from 3 to 1 improve performance (maybe while increasing the RAM from 30 to 60GB)? The problem I will face in that case will be fitting 50M documents on 1 machine.

Thanks,
-Utkarsh

On Sat, Jun 29, 2013 at 3:58 PM, Peter Sturge peter.stu...@gmail.com wrote: ...
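For concreteness, the flow described above (one expensive top-2000 keyword query, then a few cheap per-UPC lookups) could look roughly like this in SolrJ; the host, core and the "upc" field name are assumptions for illustration:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocumentList;

public class RelevancyFetch {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/prodinfo");

        // Step 1: the expensive query -- top 2000 docs for the keyword.
        SolrQuery kw = new SolrQuery("allText:ipad");
        kw.setRows(2000);
        SolrDocumentList top = solr.query(kw).getResults();

        // Step 2: cheap individual lookups for each UPC from the clickstream.
        for (String upc : new String[] {"upc1", "upc2", "upc3"}) {
            SolrQuery q = new SolrQuery("upc:" + upc); // "upc" field is hypothetical
            q.setRows(1);
            SolrDocumentList hit = solr.query(q).getResults();
            // ...feed 'top' and 'hit' into the boost computation...
        }
    }
}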
Re: Improving performance to return 2000+ documents
50M documents, depending on a bunch of things, may not be unreasonable for a single node; only testing will tell. But the question I have is whether you should be using standard Solr queries for this, or building a custom component that goes at the base Lucene index and does the right thing. Or even re-indexing your entire corpus periodically to add this kind of data.

FWIW,
Erick

On Sun, Jun 30, 2013 at 2:00 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: ...
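A component that "goes at the base Lucene index", as suggested above, would at its core do something like the sketch below against the Lucene 4.x API. The index path, field name and query are illustrative only, and a real Solr SearchComponent would borrow the core's searcher rather than opening the index itself:

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class DirectLuceneSearch {
    public static void main(String[] args) throws Exception {
        // Hypothetical index path -- point at the core's data/index dir.
        DirectoryReader reader = DirectoryReader.open(
                FSDirectory.open(new File("/var/solr/prodinfo/data/index")));
        IndexSearcher searcher = new IndexSearcher(reader);

        Query q = new QueryParser(Version.LUCENE_43, "allText",
                new StandardAnalyzer(Version.LUCENE_43)).parse("huggies diapers size 1");

        // Collect only docids+scores for the top 2000, then fetch
        // stored fields selectively instead of full documents.
        TopDocs top = searcher.search(q, 2000);
        for (ScoreDoc sd : top.scoreDocs) {
            Document doc = reader.document(sd.doc);
            // ...pull just the fields the relevancy engine needs...
        }
        reader.close();
    }
}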
Re: Improving performance to return 2000+ documents
Solrconfig.xml has entries which you can tweak for your use case. One of them is queryResultWindowSize. You can try using a value of 2000 and see if it helps improve performance. Please make sure you have enough memory allocated for the queryResultCache.

A combination of sharding and distribution of workload (requesting 2000/number-of-shards from each shard) with an aggregator would be a good way to maximize performance.

Thanks,
Jagdish

On Sun, Jun 30, 2013 at 6:48 PM, Erick Erickson erickerick...@gmail.com wrote: ...
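A client-side aggregator along the lines Jagdish describes might look like the sketch below: it asks each shard core directly (distrib=false) for 2000/number-of-shards rows and merges by score. The shard URLs are hypothetical, and note this yields an approximate top 2000 unless each shard is asked for the full 2000 rows:

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;

public class ShardAggregator {
    public static void main(String[] args) throws Exception {
        String[] shards = {            // hypothetical shard core URLs
            "http://shard1:8983/solr/prodinfo",
            "http://shard2:8983/solr/prodinfo",
            "http://shard3:8983/solr/prodinfo"
        };
        int rowsPerShard = 2000 / shards.length;

        List<SolrDocument> all = new ArrayList<SolrDocument>();
        for (String url : shards) {
            SolrQuery q = new SolrQuery("allText:ipad");
            q.setRows(rowsPerShard);
            q.set("distrib", "false");   // query this core only, skip the merge phases
            q.setFields("id", "score");  // keep the per-shard payload small
            all.addAll(new HttpSolrServer(url).query(q).getResults());
        }

        // Merge by score, descending; 'all' now holds ~2000 docs total.
        Collections.sort(all, new Comparator<SolrDocument>() {
            public int compare(SolrDocument a, SolrDocument b) {
                return Float.compare((Float) b.getFieldValue("score"),
                                     (Float) a.getFieldValue("score"));
            }
        });
    }
}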
Re: Improving performance to return 2000+ documents
Well, depending on how many docs get served from the cache, the time will vary. But this is just ugly; if you can avoid this use-case it would be a Good Thing.

The problem here is that each and every shard must assemble the list of 2,000 documents (just ID and sort criteria, usually score). Then the node serving the original request merges the sub-lists to pick the top 2,000. Then the node sends another request to each shard to get the full documents. Then the node merges these into the full list to return to the user.

Solr really isn't built for this use-case; is it actually a compelling situation? And having your document cache set at 1M is kinda high if you have very big documents.

FWIW,
Erick

On Fri, Jun 28, 2013 at 8:44 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: ...
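To put numbers from the ab runs in this thread on that second phase: each response is 1,538,537 bytes, so at the observed 46.05 requests/sec the cluster ships about 46.05 x 1.5 MB, roughly 69 MB/s of JSON, which matches ab's reported transfer rate of 69187.90 Kbytes/sec. Much of the cost of this use-case is simply assembling and moving that payload.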
Re: Improving performance to return 2000+ documents
Hello Utkarsh,

This may or may not be relevant for your use-case, but the way we deal with this scenario is to retrieve the top N documents 5, 10, 20 or 100 at a time (user selectable). We can then page the results, changing the start parameter to return the next set. This allows us to 'retrieve' millions of documents - we just do it at the user's leisure, rather than make them wait for the whole lot in one go. This works well because users very rarely want to see ALL 2000 (or whatever number) documents at one time - it's simply too much to take in at once.

If your use-case involves an automated or offline procedure (e.g. running a report or some data-mining op), then presumably it doesn't matter so much if it takes a bit longer (as long as it returns in some reasonable time).

Have you looked at doing paging on the client side? This will hugely speed up your search time.

HTH
Peter

On Sat, Jun 29, 2013 at 6:17 PM, Erick Erickson erickerick...@gmail.com wrote: ...
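Paging as Peter describes is just a matter of advancing start while keeping rows small; a minimal SolrJ sketch, with host and core hypothetical:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocumentList;

public class PagingExample {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/prodinfo");
        int pageSize = 100;

        for (int start = 0; start < 2000; start += pageSize) {
            SolrQuery q = new SolrQuery("allText:ipad");
            q.setStart(start);
            q.setRows(pageSize);
            SolrDocumentList page = solr.query(q).getResults();
            if (page.isEmpty()) break;   // fewer than 2000 matches
            // ...process this page...
        }
    }
}

Note that each page still re-executes the search on the server, which is likely why Utkarsh saw no win from 4 x 500: paging shrinks the per-request payload and improves perceived latency for interactive users, but it doesn't reduce the total work for a batch job that consumes all 2000 documents anyway.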
Improving performance to return 2000+ documents
Hello,

I have a usecase where I need to retrieve the top 2000 documents matching a query. What are the parameters (in query, solrconfig, schema) I should look at to improve this?

I have 45M documents in a 3-node SolrCloud 4.3.1 cluster with 3 shards, with 30GB RAM, 8 vCPUs and 7GB JVM heap size per node. I have documentCache:

<documentCache class="solr.LRUCache" size="1000000" initialSize="1000000" autowarmCount="0"/>

allText is a copyField.

This is the result I get:

ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 'http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json'

Benchmarking x.amazonaws.com (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Finished 500 requests

Server Software:
Server Hostname:        x.amazonaws.com
Server Port:            8983
Document Path:          /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
Document Length:        1538537 bytes
Concurrency Level:      10
Time taken for tests:   35.999 seconds
Complete requests:      500
Failed requests:        21
   (Connect: 0, Receive: 0, Length: 21, Exceptions: 0)
Write errors:           0
Non-2xx responses:      2
Total transferred:      766221660 bytes
HTML transferred:       766191806 bytes
Requests per second:    13.89 [#/sec] (mean)
Time per request:       719.981 [ms] (mean)
Time per request:       71.998 [ms] (mean, across all concurrent requests)
Transfer rate:          20785.65 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.6      0       8
Processing:     9  717 2339.6    199   12611
Waiting:        9  635 2233.6    164   12580
Total:          9  718 2339.6    199   12611

Percentage of the requests served within a certain time (ms)
  50%    199
  66%    236
  75%    263
  80%    281
  90%    548
  95%    838
  98%  12475
  99%  12545
 100%  12611 (longest request)

--
Thanks,
-Utkarsh
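Worth noting what the response size above implies: 1,538,537 bytes per response at rows=2000 is roughly 770 bytes of JSON per document, so every query ships about 1.5 MB, and this 500-request run transferred ~766 MB in total (matching ab's "Total transferred: 766221660 bytes"). That per-query payload, not the matching itself, is a large part of what's being measured here.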
Re: Improving performance to return 2000+ documents
Also, I don't see a consistent response time from Solr. I ran ab again and I get this:

ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 'http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json'

Benchmarking x.amazonaws.com (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Finished 500 requests

Server Software:
Server Hostname:        x.amazonaws.com
Server Port:            8983
Document Path:          /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
Document Length:        1538537 bytes
Concurrency Level:      10
Time taken for tests:   10.858 seconds
Complete requests:      500
Failed requests:        8
   (Connect: 0, Receive: 0, Length: 8, Exceptions: 0)
Write errors:           0
Total transferred:      769297992 bytes
HTML transferred:       769268492 bytes
Requests per second:    46.05 [#/sec] (mean)
Time per request:       217.167 [ms] (mean)
Time per request:       21.717 [ms] (mean, across all concurrent requests)
Transfer rate:          69187.90 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.3      0       2
Processing:   110  215  72.0    190     497
Waiting:       91  180  70.5    152     473
Total:        112  216  72.0    191     497

Percentage of the requests served within a certain time (ms)
  50%    191
  66%    225
  75%    252
  80%    272
  90%    319
  95%    364
  98%    420
  99%    453
 100%    497 (longest request)

Sometimes it takes a lot of time, sometimes it's pretty quick.

Thanks,
-Utkarsh

On Fri, Jun 28, 2013 at 5:39 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: ...