Re: Improving performance to return 2000+ documents

2013-07-01 Thread Utkarsh Sengar
Thanks Erick/Jagdish.

Just to give some background on my queries.

1. All my queries are unique. A query can be "ipod" or "ipod 8gb" (but
each query string is distinct). There are about 1.2M queries in total.
So, I assume setting a high queryResultCache, queryResultWindowSize and
queryResultMaxDocsCached won't help.

2. I have these cache settings:
<documentCache class="solr.LRUCache"
   size="1000000"
   initialSize="1000000"
   autowarmCount="0"
   cleanupThread="true"/>
//My understanding is, documentCache will help me the most because Solr
//will cache the documents retrieved.
//Stats for documentCache: http://apaste.info/hknh

<queryResultCache class="solr.LRUCache"
 size="512"
 initialSize="512"
 autowarmCount="0"
 cleanupThread="true"/>
//Default, since my queries are unique.

<filterCache class="solr.FastLRUCache"
 size="512"
 initialSize="512"
 autowarmCount="0"/>
//Not sure how I can use filterCache, so I am keeping the default.

<enableLazyFieldLoading>true</enableLazyFieldLoading>
<queryResultWindowSize>100</queryResultWindowSize>
<queryResultMaxDocsCached>100</queryResultMaxDocsCached>
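
For reference, a rough SolrJ sketch of how filterCache would get used: move a
repeated, structured constraint into fq (the merchant field below is just a
made-up example). Since every query here is unique this may not apply, but a
repeated fq clause is the main way to exercise that cache:

// Rough sketch: structured constraints moved into fq are cached in filterCache
// and reused across requests ("merchant" is a hypothetical field name).
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FilterCacheSketch {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/prodinfo");
        SolrQuery q = new SolrQuery("allText:ipad");   // the part that changes per request
        q.addFilterQuery("merchant:acme");             // repeated filter, cached as a doc set
        q.setRows(2000);
        QueryResponse rsp = solr.query(q);
        System.out.println(rsp.getResults().getNumFound() + " matches");
    }
}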


I think the question can also be framed as: how can I optimize Solr
response time for a 50M-product catalog when unique queries retrieve
2000 documents in one go?
I looked at writing a Solr search component, but writing a proxy around Solr
was easier, so I went ahead with that approach.
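
One thing that may help regardless of caching is trimming the response: with
rows=2000 each response is around 1.5MB, so asking only for the fields the
relevancy engine actually needs (field names below are placeholders) should
cut transfer and serialization time. A rough sketch:

// Rough sketch: request only the fields needed downstream (names are placeholders),
// so Solr doesn't ship unneeded stored fields for all 2000 documents.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class TrimmedFields {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/prodinfo");
        SolrQuery q = new SolrQuery("allText:huggies diapers size 1");
        q.setRows(2000);
        q.setFields("id", "upc", "score");   // fl=id,upc,score
        System.out.println(solr.query(q).getResults().size() + " docs returned");
    }
}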


Thanks,
-Utkarsh




On Sun, Jun 30, 2013 at 6:54 PM, Jagdish Nomula jagd...@simplyhired.com wrote:

 Solrconfig.xml has entries which you can tweak for your use case. One
 of them is queryResultWindowSize. You can try using a value of 2000 and
 see if it helps improve performance. Please make sure you have enough
 memory allocated for the queryResultCache.
 A combination of sharding and distribution of workload (requesting
 2000/number of shards) with an aggregator would be a good way to maximize
 performance.

 Thanks,

 Jagdish


 On Sun, Jun 30, 2013 at 6:48 PM, Erick Erickson erickerick...@gmail.com
 wrote:

  50M documents, depending on a bunch of things,
  may not be unreasonable for a single node, only
  testing will tell.
 
  But the question I have is whether you should be
  using standard Solr queries for this or building a custom
  component that goes at the base Lucene index
  and does the right thing. Or even re-indexing your
  entire corpus periodically to add this kind of data.
 
  FWIW,
  Erick
 
 
  On Sun, Jun 30, 2013 at 2:00 PM, Utkarsh Sengar utkarsh2...@gmail.com
  wrote:
 
   Thanks Erick/Peter.
  
   This is an offline process, used by a relevancy engine implemented
 around
   solr. The engine computes boost scores for related keywords based on
   clickstream data.
   i.e.: say clickstream has: ipad=upc1,upc2,upc3
   I query solr with keyword: ipad (to get 2000 documents) and then
 make 3
   individual queries for upc1,upc2,upc3 (which are fast).
   The data is then used to compute related keywords to ipad with their
   boost values.
  
   So, I cannot really replace that, since I need full text search over my
   dataset to retrieve top 2000 documents.
  
   I tried paging: I retrieve 500 solr documents 4 times (0-500,
  500-1000...),
   but don't see any improvements.
  
  
   Some questions:
   1. Maybe the JVM size might help?
   This is what I see in the dashboard:
   Physical Memory 76.2%
   Swap Space NaN% (don't have any swap space, running on AWS EBS)
   File Descriptor Count 4.7%
   JVM-Memory 73.8%
  
   Screenshot: http://i.imgur.com/aegKzP6.png
  
   2. Will reducing the shards from 3 to 1 improve performance? (maybe
   increase the RAM from 30 to 60GB) The problem I will face in that case
  will
   be fitting 50M documents on 1 machine.
  
   Thanks,
   -Utkarsh
  
  
   On Sat, Jun 29, 2013 at 3:58 PM, Peter Sturge peter.stu...@gmail.com
   wrote:
  
Hello Utkarsh,
This may or may not be relevant for your use-case, but the way we
 deal
   with
this scenario is to retrieve the top N documents 5, 10, 20, or 100 at a
 time
(user selectable). We can then page the results, changing the start
parameter to return the next set. This allows us to 'retrieve'
 millions
   of
documents - we just do it at the user's leisure, rather than make
 them
   wait
for the whole lot in one go.
This works well because users very rarely want to see ALL 2000 (or
   whatever
number) documents at one time - it's simply too much to take in at
 one
time.
If your use-case involves an automated or offline procedure (e.g.
   running a
report or some data-mining op), then presumably it doesn't matter so
  much
if it takes a bit longer (as long as it returns in some reasonable time).
Have you looked at doing paging on the client-side - this will hugely
speed-up your search time.
HTH
Peter
   
   
   
On Sat, Jun 29, 2013 at 6:17 PM, Erick 

Re: Improving performance to return 2000+ documents

2013-06-30 Thread Utkarsh Sengar
Thanks Erick/Peter.

This is an offline process, used by a relevancy engine implemented around
Solr. The engine computes boost scores for related keywords based on
clickstream data.
i.e.: say the clickstream has: ipad=upc1,upc2,upc3
I query Solr with the keyword "ipad" (to get 2000 documents) and then make 3
individual queries for upc1, upc2, upc3 (which are fast).
The data is then used to compute keywords related to "ipad" with their
boost values.
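
Roughly, the per-keyword step looks like this in SolrJ terms (the upc field
name and core URL are placeholders):

// Rough sketch of the per-keyword step described above (field/core names are placeholders):
// one 2000-row full-text query for the keyword, then one small lookup per clicked UPC.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocumentList;

public class KeywordBoosts {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/prodinfo");

        SolrQuery kw = new SolrQuery("allText:ipad");
        kw.setRows(2000);                                  // the slow part
        SolrDocumentList candidates = solr.query(kw).getResults();

        for (String upc : new String[]{"upc1", "upc2", "upc3"}) {  // from clickstream
            SolrQuery byUpc = new SolrQuery("upc:" + upc); // "upc" field name is an assumption
            byUpc.setRows(1);
            SolrDocumentList hit = solr.query(byUpc).getResults();
            // ...compare hit against candidates to compute the boost for this keyword
        }
    }
}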

So, I cannot really replace that, since I need full-text search over my
dataset to retrieve the top 2000 documents.

I tried paging: I retrieve 500 Solr documents 4 times (0-500, 500-1000, ...),
but don't see any improvement.


Some questions:
1. Maybe the JVM size might help?
This is what I see in the dashboard:
Physical Memory 76.2%
Swap Space NaN% (don't have any swap space, running on AWS EBS)
File Descriptor Count 4.7%
JVM-Memory 73.8%

Screenshot: http://i.imgur.com/aegKzP6.png

2. Will reducing the shards from 3 to 1 improve performance? (maybe
increase the RAM from 30 to 60GB) The problem I will face in that case will
be fitting 50M documents on 1 machine.

Thanks,
-Utkarsh


On Sat, Jun 29, 2013 at 3:58 PM, Peter Sturge peter.stu...@gmail.com wrote:

 Hello Utkarsh,
 This may or may not be relevant for your use-case, but the way we deal with
 this scenario is to retrieve the top N documents 5, 10, 20, or 100 at a time
 (user selectable). We can then page the results, changing the start
 parameter to return the next set. This allows us to 'retrieve' millions of
 documents - we just do it at the user's leisure, rather than make them wait
 for the whole lot in one go.
 This works well because users very rarely want to see ALL 2000 (or whatever
 number) documents at one time - it's simply too much to take in at one
 time.
 If your use-case involves an automated or offline procedure (e.g. running a
 report or some data-mining op), then presumably it doesn't matter so much
 if it takes a bit longer (as long as it returns in some reasonable time).
 Have you looked at doing paging on the client-side - this will hugely
 speed-up your search time.
 HTH
 Peter



 On Sat, Jun 29, 2013 at 6:17 PM, Erick Erickson erickerick...@gmail.com
 wrote:

  Well, depending on how many docs get served
  from the cache the time will vary. But this is
  just ugly, if you can avoid this use-case it would
  be a Good Thing.
 
  Problem here is that each and every shard must
  assemble the list of 2,000 documents (just ID and
  sort criteria, usually score).
 
  Then the node serving the original request merges
  the sub-lists to pick the top 2,000. Then the node
  sends another request to each shard to get
  the full document. Then the node merges this
  into the full list to return to the user.
 
  Solr really isn't built for this use-case, is it actually
  a compelling situation?
 
  And having your document cache set at 1M is kinda
  high if you have very big documents.
 
  FWIW,
  Erick
 
 
  On Fri, Jun 28, 2013 at 8:44 PM, Utkarsh Sengar utkarsh2...@gmail.com
  wrote:
 
   Also, I don't see a consistent response time from solr, I ran ab again
  and
   I get this:
  
   ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 
  
  
 
  http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
   
  
  
   Benchmarking x.amazonaws.com (be patient)
   Completed 100 requests
   Completed 200 requests
   Completed 300 requests
   Completed 400 requests
   Completed 500 requests
   Finished 500 requests
  
  
   Server Software:
   Server Hostname:   x.amazonaws.com
   Server Port:8983
  
   Document Path:
  
  
 
  /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
   Document Length:1538537 bytes
  
   Concurrency Level:  10
   Time taken for tests:   10.858 seconds
   Complete requests:  500
   Failed requests:8
  (Connect: 0, Receive: 0, Length: 8, Exceptions: 0)
   Write errors:   0
   Total transferred:  769297992 bytes
   HTML transferred:   769268492 bytes
   Requests per second:46.05 [#/sec] (mean)
   Time per request:   217.167 [ms] (mean)
   Time per request:   21.717 [ms] (mean, across all concurrent
  requests)
   Transfer rate:  69187.90 [Kbytes/sec] received
  
    Connection Times (ms)
                  min  mean[+/-sd] median   max
    Connect:        0    0   0.3      0       2
    Processing:   110  215  72.0    190     497
    Waiting:       91  180  70.5    152     473
    Total:        112  216  72.0    191     497

    Percentage of the requests served within a certain time (ms)
      50%    191
      66%    225
      75%    252
      80%    272
      90%    319
      95%    364
      98%    420
      99%    453
     100%    497 (longest request)
  
  
    Sometimes it takes a lot of time, sometimes it's pretty quick.
  
   Thanks,
   -Utkarsh
  
  
   On Fri, Jun 28, 2013 at 5:39 PM, Utkarsh Sengar utkarsh2...@gmail.com
   wrote:
  
Hello,
   
I have a 

Re: Improving performance to return 2000+ documents

2013-06-30 Thread Erick Erickson
50M documents, depending on a bunch of things,
may not be unreasonable for a single node, only
testing will tell.

But the question I have is whether you should be
using standard Solr queries for this or building a custom
component that goes at the base Lucene index
and does the right thing. Or even re-indexing your
entire corpus periodically to add this kind of data.
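
For what it's worth, a bare-bones offline pass straight against the Lucene
index might look roughly like the sketch below; the index path and field names
are only placeholders, and a raw TermQuery bypasses Solr's analysis chain and
SolrCloud routing entirely:

// Rough sketch of an offline pass over the raw Lucene index (paths and field
// names are assumptions); avoids HTTP and distributed merging, but also skips
// Solr's query parsing, analysis chain and SolrCloud routing.
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import java.io.File;

public class RawIndexScan {
    public static void main(String[] args) throws Exception {
        DirectoryReader reader = DirectoryReader.open(
            FSDirectory.open(new File("/path/to/solr/prodinfo/data/index")));
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs top = searcher.search(new TermQuery(new Term("allText", "ipad")), 2000);
        for (ScoreDoc sd : top.scoreDocs) {
            String id = searcher.doc(sd.doc).get("id");   // read stored fields per hit
            // ...feed (id, sd.score) into the boost computation
        }
        reader.close();
    }
}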

FWIW,
Erick


On Sun, Jun 30, 2013 at 2:00 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote:

 Thanks Erick/Peter.

 This is an offline process, used by a relevancy engine implemented around
 solr. The engine computes boost scores for related keywords based on
 clickstream data.
 i.e.: say clickstream has: ipad=upc1,upc2,upc3
 I query solr with keyword: ipad (to get 2000 documents) and then make 3
 individual queries for upc1,upc2,upc3 (which are fast).
 The data is then used to compute related keywords to ipad with their
 boost values.

 So, I cannot really replace that, since I need full text search over my
 dataset to retrieve top 2000 documents.

 I tried paging: I retrieve 500 solr documents 4 times (0-500, 500-1000...),
 but don't see any improvements.


 Some questions:
 1. Maybe the JVM size might help?
 This is what I see in the dashboard:
 Physical Memory 76.2%
 Swap Space NaN% (don't have any swap space, running on AWS EBS)
 File Descriptor Count 4.7%
 JVM-Memory 73.8%

 Screenshot: http://i.imgur.com/aegKzP6.png

 2. Will reducing the shards from 3 to 1 improve performance? (maybe
 increase the RAM from 30 to 60GB) The problem I will face in that case will
 be fitting 50M documents on 1 machine.

 Thanks,
 -Utkarsh


 On Sat, Jun 29, 2013 at 3:58 PM, Peter Sturge peter.stu...@gmail.com
 wrote:

  Hello Utkarsh,
  This may or may not be relevant for your use-case, but the way we deal
 with
   this scenario is to retrieve the top N documents 5, 10, 20, or 100 at a time
  (user selectable). We can then page the results, changing the start
  parameter to return the next set. This allows us to 'retrieve' millions
 of
  documents - we just do it at the user's leisure, rather than make them
 wait
  for the whole lot in one go.
  This works well because users very rarely want to see ALL 2000 (or
 whatever
  number) documents at one time - it's simply too much to take in at one
  time.
  If your use-case involves an automated or offline procedure (e.g.
 running a
  report or some data-mining op), then presumably it doesn't matter so much
   if it takes a bit longer (as long as it returns in some reasonable time).
  Have you looked at doing paging on the client-side - this will hugely
  speed-up your search time.
  HTH
  Peter
 
 
 
  On Sat, Jun 29, 2013 at 6:17 PM, Erick Erickson erickerick...@gmail.com
  wrote:
 
   Well, depending on how many docs get served
   from the cache the time will vary. But this is
   just ugly, if you can avoid this use-case it would
   be a Good Thing.
  
   Problem here is that each and every shard must
   assemble the list of 2,000 documents (just ID and
   sort criteria, usually score).
  
   Then the node serving the original request merges
   the sub-lists to pick the top 2,000. Then the node
   sends another request to each shard to get
   the full document. Then the node merges this
   into the full list to return to the user.
  
   Solr really isn't built for this use-case, is it actually
   a compelling situation?
  
   And having your document cache set at 1M is kinda
   high if you have very big documents.
  
   FWIW,
   Erick
  
  
   On Fri, Jun 28, 2013 at 8:44 PM, Utkarsh Sengar utkarsh2...@gmail.com
   wrote:
  
Also, I don't see a consistent response time from solr, I ran ab
 again
   and
I get this:
   
ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 
   
   
  
 
  http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json

   
   
Benchmarking x.amazonaws.com (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Finished 500 requests
   
   
Server Software:
Server Hostname:   x.amazonaws.com
Server Port:8983
   
Document Path:
   
   
  
 
  /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
Document Length:1538537 bytes
   
Concurrency Level:  10
Time taken for tests:   10.858 seconds
Complete requests:  500
Failed requests:8
   (Connect: 0, Receive: 0, Length: 8, Exceptions: 0)
Write errors:   0
Total transferred:  769297992 bytes
HTML transferred:   769268492 bytes
Requests per second:46.05 [#/sec] (mean)
Time per request:   217.167 [ms] (mean)
Time per request:   21.717 [ms] (mean, across all concurrent
   requests)
Transfer rate:  69187.90 [Kbytes/sec] received
   
Connection Times (ms)
  min  mean[+/-sd] median   max
Connect:00   0.3  0  

Re: Improving performance to return 2000+ documents

2013-06-30 Thread Jagdish Nomula
Solrconfig.xml has entries which you can tweak for your use case. One
of them is queryResultWindowSize. You can try using a value of 2000 and
see if it helps improve performance. Please make sure you have enough
memory allocated for the queryResultCache.
A combination of sharding and distribution of workload (requesting
2000/number of shards) with an aggregator would be a good way to maximize
performance.
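
As a very rough sketch of that aggregator idea (the shard URLs are
placeholders, and splitting rows evenly only recovers the true top 2000 when
the best hits are spread evenly across shards):

// Rough sketch of a client-side aggregator: hit each shard core directly
// (distrib=false) for rows = 2000 / numShards and merge by score.
// Shard URLs are placeholders; this trades exactness for latency.
import java.util.*;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;

public class ShardAggregator {
    public static void main(String[] args) throws Exception {
        String[] shards = {
            "http://shard1:8983/solr/prodinfo",
            "http://shard2:8983/solr/prodinfo",
            "http://shard3:8983/solr/prodinfo"};
        int rowsPerShard = 2000 / shards.length;
        List<SolrDocument> merged = new ArrayList<SolrDocument>();
        for (String url : shards) {
            SolrQuery q = new SolrQuery("allText:ipad");
            q.setRows(rowsPerShard);
            q.set("distrib", "false");           // query this core only
            q.setFields("id", "score");          // keep the transfer small
            merged.addAll(new HttpSolrServer(url).query(q).getResults());
        }
        // sort the merged candidates by score, highest first
        Collections.sort(merged, new Comparator<SolrDocument>() {
            public int compare(SolrDocument a, SolrDocument b) {
                return ((Float) b.getFieldValue("score")).compareTo((Float) a.getFieldValue("score"));
            }
        });
    }
}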

Thanks,

Jagdish


On Sun, Jun 30, 2013 at 6:48 PM, Erick Erickson erickerick...@gmail.com wrote:

 50M documents, depending on a bunch of things,
 may not be unreasonable for a single node, only
 testing will tell.

 But the question I have is whether you should be
 using standard Solr queries for this or building a custom
 component that goes at the base Lucene index
 and does the right thing. Or even re-indexing your
 entire corpus periodically to add this kind of data.

 FWIW,
 Erick


 On Sun, Jun 30, 2013 at 2:00 PM, Utkarsh Sengar utkarsh2...@gmail.com
 wrote:

  Thanks Erick/Peter.
 
  This is an offline process, used by a relevancy engine implemented around
  solr. The engine computes boost scores for related keywords based on
  clickstream data.
  i.e.: say clickstream has: ipad=upc1,upc2,upc3
  I query solr with keyword: ipad (to get 2000 documents) and then make 3
  individual queries for upc1,upc2,upc3 (which are fast).
  The data is then used to compute related keywords to ipad with their
  boost values.
 
  So, I cannot really replace that, since I need full text search over my
  dataset to retrieve top 2000 documents.
 
  I tried paging: I retrieve 500 solr documents 4 times (0-500,
 500-1000...),
  but don't see any improvements.
 
 
  Some questions:
  1. Maybe the JVM size might help?
  This is what I see in the dashboard:
  Physical Memory 76.2%
  Swap Space NaN% (don't have any swap space, running on AWS EBS)
  File Descriptor Count 4.7%
  JVM-Memory 73.8%
 
  Screenshot: http://i.imgur.com/aegKzP6.png
 
  2. Will reducing the shards from 3 to 1 improve performance? (maybe
  increase the RAM from 30 to 60GB) The problem I will face in that case
 will
  be fitting 50M documents on 1 machine.
 
  Thanks,
  -Utkarsh
 
 
  On Sat, Jun 29, 2013 at 3:58 PM, Peter Sturge peter.stu...@gmail.com
  wrote:
 
   Hello Utkarsh,
   This may or may not be relevant for your use-case, but the way we deal
  with
    this scenario is to retrieve the top N documents 5, 10, 20, or 100 at a time
   (user selectable). We can then page the results, changing the start
   parameter to return the next set. This allows us to 'retrieve' millions
  of
   documents - we just do it at the user's leisure, rather than make them
  wait
   for the whole lot in one go.
   This works well because users very rarely want to see ALL 2000 (or
  whatever
   number) documents at one time - it's simply too much to take in at one
   time.
   If your use-case involves an automated or offline procedure (e.g.
  running a
   report or some data-mining op), then presumably it doesn't matter so
 much
    if it takes a bit longer (as long as it returns in some reasonable time).
   Have you looked at doing paging on the client-side - this will hugely
   speed-up your search time.
   HTH
   Peter
  
  
  
   On Sat, Jun 29, 2013 at 6:17 PM, Erick Erickson 
 erickerick...@gmail.com
   wrote:
  
Well, depending on how many docs get served
from the cache the time will vary. But this is
just ugly, if you can avoid this use-case it would
be a Good Thing.
   
Problem here is that each and every shard must
assemble the list of 2,000 documents (just ID and
sort criteria, usually score).
   
Then the node serving the original request merges
the sub-lists to pick the top 2,000. Then the node
sends another request to each shard to get
the full document. Then the node merges this
into the full list to return to the user.
   
Solr really isn't built for this use-case, is it actually
a compelling situation?
   
And having your document cache set at 1M is kinda
high if you have very big documents.
   
FWIW,
Erick
   
   
On Fri, Jun 28, 2013 at 8:44 PM, Utkarsh Sengar 
 utkarsh2...@gmail.com
wrote:
   
 Also, I don't see a consistent response time from solr, I ran ab
  again
and
 I get this:

 ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 


   
  
 
  http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
 


 Benchmarking x.amazonaws.com (be patient)
 Completed 100 requests
 Completed 200 requests
 Completed 300 requests
 Completed 400 requests
 Completed 500 requests
 Finished 500 requests


 Server Software:
 Server Hostname:   x.amazonaws.com
 Server Port:8983

 Document Path:


   
  
 
  /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
 Document Length:1538537 bytes

 

Re: Improving performance to return 2000+ documents

2013-06-29 Thread Erick Erickson
Well, depending on how many docs get served
from the cache, the time will vary. But this is
just ugly; if you can avoid this use-case it would
be a Good Thing.

Problem here is that each and every shard must
assemble the list of 2,000 documents (just ID and
sort criteria, usually score).

Then the node serving the original request merges
the sub-lists to pick the top 2,000. Then the node
sends another request to each shard to get
the full document. Then the node merges this
into the full list to return to the user.
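
In pseudo-code, that first id+score phase is roughly a top-N merge like the
sketch below; it is only an illustration of the idea, not Solr's actual code,
but it shows why every extra row multiplies work across all shards:

// Rough sketch (not Solr's actual implementation) of the first, id+score phase:
// every shard contributes up to 2000 candidates, the coordinator keeps the
// global top 2000, and only those ids go back out in the second fetch phase.
import java.util.*;

class ShardHit {
    final String id; final float score;
    ShardHit(String id, float score) { this.id = id; this.score = score; }
}

class CoordinatorMerge {
    static List<ShardHit> topN(List<List<ShardHit>> perShardTopN, int n) {
        PriorityQueue<ShardHit> heap = new PriorityQueue<ShardHit>(n,
            new Comparator<ShardHit>() {                 // min-heap on score
                public int compare(ShardHit a, ShardHit b) {
                    return Float.compare(a.score, b.score);
                }
            });
        for (List<ShardHit> shard : perShardTopN)        // e.g. 3 shards x 2000 hits each
            for (ShardHit h : shard) {
                heap.offer(h);
                if (heap.size() > n) heap.poll();        // drop the current worst
            }
        List<ShardHit> out = new ArrayList<ShardHit>(heap);
        Collections.sort(out, new Comparator<ShardHit>() {
            public int compare(ShardHit a, ShardHit b) {
                return Float.compare(b.score, a.score);  // best first
            }
        });
        return out;                                      // these ids are then fetched per shard
    }
}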

Solr really isn't built for this use-case; is it actually
a compelling situation?

And having your document cache set at 1M is kinda
high if you have very big documents.

FWIW,
Erick


On Fri, Jun 28, 2013 at 8:44 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote:

 Also, I don't see a consistent response time from solr, I ran ab again and
 I get this:

 ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 

 http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
 


 Benchmarking x.amazonaws.com (be patient)
 Completed 100 requests
 Completed 200 requests
 Completed 300 requests
 Completed 400 requests
 Completed 500 requests
 Finished 500 requests


 Server Software:
 Server Hostname:   x.amazonaws.com
 Server Port:8983

 Document Path:

 /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
 Document Length:1538537 bytes

 Concurrency Level:  10
 Time taken for tests:   10.858 seconds
 Complete requests:  500
 Failed requests:8
(Connect: 0, Receive: 0, Length: 8, Exceptions: 0)
 Write errors:   0
 Total transferred:  769297992 bytes
 HTML transferred:   769268492 bytes
 Requests per second:46.05 [#/sec] (mean)
 Time per request:   217.167 [ms] (mean)
 Time per request:   21.717 [ms] (mean, across all concurrent requests)
 Transfer rate:  69187.90 [Kbytes/sec] received

  Connection Times (ms)
                min  mean[+/-sd] median   max
  Connect:        0    0   0.3      0       2
  Processing:   110  215  72.0    190     497
  Waiting:       91  180  70.5    152     473
  Total:        112  216  72.0    191     497

  Percentage of the requests served within a certain time (ms)
    50%    191
    66%    225
    75%    252
    80%    272
    90%    319
    95%    364
    98%    420
    99%    453
   100%    497 (longest request)


  Sometimes it takes a lot of time, sometimes it's pretty quick.

 Thanks,
 -Utkarsh


 On Fri, Jun 28, 2013 at 5:39 PM, Utkarsh Sengar utkarsh2...@gmail.com
 wrote:

  Hello,
 
   I have a use case where I need to retrieve the top 2000 documents matching a
  query.
   What are the parameters (in query, solrconfig, schema) I should look at to
  improve this?
 
  I have 45M documents in 3node solrcloud 4.3.1 with 3 shards, with 30GB
  RAM, 8vCPU and 7GB JVM heap size.
 
  I have documentCache:
  <documentCache class="solr.LRUCache" size="1000000"
   initialSize="1000000" autowarmCount="0"/>
 
  allText is a copyField.
 
  This is the result I get:
  ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 
 
 http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
  
 
  Benchmarking x.amazonaws.com (be patient)
  Completed 100 requests
  Completed 200 requests
  Completed 300 requests
  Completed 400 requests
  Completed 500 requests
  Finished 500 requests
 
 
  Server Software:
  Server Hostname:x.amazonaws.com
  Server Port:8983
 
  Document Path:
 
 /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
  Document Length:1538537 bytes
 
  Concurrency Level:  10
  Time taken for tests:   35.999 seconds
  Complete requests:  500
  Failed requests:21
 (Connect: 0, Receive: 0, Length: 21, Exceptions: 0)
  Write errors:   0
  Non-2xx responses:  2
  Total transferred:  766221660 bytes
  HTML transferred:   766191806 bytes
  Requests per second:13.89 [#/sec] (mean)
  Time per request:   719.981 [ms] (mean)
  Time per request:   71.998 [ms] (mean, across all concurrent
 requests)
  Transfer rate:  20785.65 [Kbytes/sec] received
 
   Connection Times (ms)
                 min  mean[+/-sd] median   max
   Connect:        0    0   0.6      0       8
   Processing:     9  717 2339.6    199   12611
   Waiting:        9  635 2233.6    164   12580
   Total:          9  718 2339.6    199   12611

   Percentage of the requests served within a certain time (ms)
     50%    199
     66%    236
     75%    263
     80%    281
     90%    548
     95%    838
     98%  12475
     99%  12545
    100%  12611 (longest request)
 
  --
  Thanks,
  -Utkarsh
 



 --
 Thanks,
 -Utkarsh



Re: Improving performance to return 2000+ documents

2013-06-29 Thread Peter Sturge
Hello Utkarsh,
This may or may not be relevant for your use-case, but the way we deal with
this scenario is to retrieve the top N documents 5, 10, 20, or 100 at a time
(user selectable). We can then page the results, changing the start
parameter to return the next set. This allows us to 'retrieve' millions of
documents - we just do it at the user's leisure, rather than make them wait
for the whole lot in one go.
This works well because users very rarely want to see ALL 2000 (or whatever
number) documents at one time - it's simply too much to take in at one time.
If your use-case involves an automated or offline procedure (e.g. running a
report or some data-mining op), then presumably it doesn't matter so much
if it takes a bit longer (as long as it returns in some reasonable time).
Have you looked at doing paging on the client-side - this will hugely
speed up your search time.
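
A minimal SolrJ loop along those lines might look like this (the core URL and
page size are just placeholders):

// Rough sketch of start/rows paging: fetch the 2000 hits one page at a time
// rather than in a single 2000-row response (URL and page size are placeholders).
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocumentList;

public class PagedFetch {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/prodinfo");
        int pageSize = 100, wanted = 2000;
        for (int start = 0; start < wanted; start += pageSize) {
            SolrQuery q = new SolrQuery("allText:ipad");
            q.setStart(start);
            q.setRows(pageSize);
            SolrDocumentList page = solr.query(q).getResults();
            if (page.isEmpty()) break;       // fewer than 2000 matches
            // process this page before asking for the next one
        }
    }
}
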
HTH
Peter



On Sat, Jun 29, 2013 at 6:17 PM, Erick Erickson erickerick...@gmail.com wrote:

 Well, depending on how many docs get served
 from the cache the time will vary. But this is
 just ugly, if you can avoid this use-case it would
 be a Good Thing.

 Problem here is that each and every shard must
 assemble the list of 2,000 documents (just ID and
 sort criteria, usually score).

 Then the node serving the original request merges
 the sub-lists to pick the top 2,000. Then the node
 sends another request to each shard to get
 the full document. Then the node merges this
 into the full list to return to the user.

 Solr really isn't built for this use-case, is it actually
 a compelling situation?

 And having your document cache set at 1M is kinda
 high if you have very big documents.

 FWIW,
 Erick


 On Fri, Jun 28, 2013 at 8:44 PM, Utkarsh Sengar utkarsh2...@gmail.com
 wrote:

  Also, I don't see a consistent response time from solr, I ran ab again
 and
  I get this:
 
  ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 
 
 
  http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
  
 
 
  Benchmarking x.amazonaws.com (be patient)
  Completed 100 requests
  Completed 200 requests
  Completed 300 requests
  Completed 400 requests
  Completed 500 requests
  Finished 500 requests
 
 
  Server Software:
  Server Hostname:   x.amazonaws.com
  Server Port:8983
 
  Document Path:
 
 
  /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
  Document Length:1538537 bytes
 
  Concurrency Level:  10
  Time taken for tests:   10.858 seconds
  Complete requests:  500
  Failed requests:8
 (Connect: 0, Receive: 0, Length: 8, Exceptions: 0)
  Write errors:   0
  Total transferred:  769297992 bytes
  HTML transferred:   769268492 bytes
  Requests per second:46.05 [#/sec] (mean)
  Time per request:   217.167 [ms] (mean)
  Time per request:   21.717 [ms] (mean, across all concurrent
 requests)
  Transfer rate:  69187.90 [Kbytes/sec] received
 
   Connection Times (ms)
                 min  mean[+/-sd] median   max
   Connect:        0    0   0.3      0       2
   Processing:   110  215  72.0    190     497
   Waiting:       91  180  70.5    152     473
   Total:        112  216  72.0    191     497

   Percentage of the requests served within a certain time (ms)
     50%    191
     66%    225
     75%    252
     80%    272
     90%    319
     95%    364
     98%    420
     99%    453
    100%    497 (longest request)
 
 
   Sometimes it takes a lot of time, sometimes it's pretty quick.
 
  Thanks,
  -Utkarsh
 
 
  On Fri, Jun 28, 2013 at 5:39 PM, Utkarsh Sengar utkarsh2...@gmail.com
  wrote:
 
   Hello,
  
    I have a use case where I need to retrieve the top 2000 documents matching a
   query.
    What are the parameters (in query, solrconfig, schema) I should look at
 to
   improve this?
  
   I have 45M documents in 3node solrcloud 4.3.1 with 3 shards, with 30GB
   RAM, 8vCPU and 7GB JVM heap size.
  
   I have documentCache:
  <documentCache class="solr.LRUCache" size="1000000"
    initialSize="1000000" autowarmCount="0"/>
  
   allText is a copyField.
  
   This is the result I get:
   ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 
  
 
  http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
   
  
   Benchmarking x.amazonaws.com (be patient)
   Completed 100 requests
   Completed 200 requests
   Completed 300 requests
   Completed 400 requests
   Completed 500 requests
   Finished 500 requests
  
  
   Server Software:
   Server Hostname:x.amazonaws.com
   Server Port:8983
  
   Document Path:
  
 
  /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
   Document Length:1538537 bytes
  
   Concurrency Level:  10
   Time taken for tests:   35.999 seconds
   Complete requests:  500
   Failed requests:21
  (Connect: 0, Receive: 0, Length: 21, Exceptions: 0)
   Write errors:   0
   Non-2xx responses:  2
   Total 

Improving performance to return 2000+ documents

2013-06-28 Thread Utkarsh Sengar
Hello,

I have a use case where I need to retrieve the top 2000 documents matching a
query.
What are the parameters (in query, solrconfig, schema) I should look at to
improve this?

I have 45M documents in a 3-node SolrCloud 4.3.1 cluster with 3 shards, with 30GB RAM,
8 vCPUs and a 7GB JVM heap size.

I have documentCache:
  <documentCache class="solr.LRUCache" size="1000000"
initialSize="1000000" autowarmCount="0"/>

allText is a copyField.

This is the result I get:
ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 
http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json


Benchmarking x.amazonaws.com (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Finished 500 requests


Server Software:
Server Hostname:x.amazonaws.com
Server Port:8983

Document Path:
/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
Document Length:1538537 bytes

Concurrency Level:  10
Time taken for tests:   35.999 seconds
Complete requests:  500
Failed requests:21
   (Connect: 0, Receive: 0, Length: 21, Exceptions: 0)
Write errors:   0
Non-2xx responses:  2
Total transferred:  766221660 bytes
HTML transferred:   766191806 bytes
Requests per second:13.89 [#/sec] (mean)
Time per request:   719.981 [ms] (mean)
Time per request:   71.998 [ms] (mean, across all concurrent requests)
Transfer rate:  20785.65 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.6      0       8
Processing:     9  717 2339.6    199   12611
Waiting:        9  635 2233.6    164   12580
Total:          9  718 2339.6    199   12611

Percentage of the requests served within a certain time (ms)
  50%    199
  66%    236
  75%    263
  80%    281
  90%    548
  95%    838
  98%  12475
  99%  12545
 100%  12611 (longest request)

-- 
Thanks,
-Utkarsh


Re: Improving performance to return 2000+ documents

2013-06-28 Thread Utkarsh Sengar
Also, I don't see a consistent response time from solr, I ran ab again and
I get this:

ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 
http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json



Benchmarking x.amazonaws.com (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Finished 500 requests


Server Software:
Server Hostname:   x.amazonaws.com
Server Port:8983

Document Path:
/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
Document Length:1538537 bytes

Concurrency Level:  10
Time taken for tests:   10.858 seconds
Complete requests:  500
Failed requests:8
   (Connect: 0, Receive: 0, Length: 8, Exceptions: 0)
Write errors:   0
Total transferred:  769297992 bytes
HTML transferred:   769268492 bytes
Requests per second:46.05 [#/sec] (mean)
Time per request:   217.167 [ms] (mean)
Time per request:   21.717 [ms] (mean, across all concurrent requests)
Transfer rate:  69187.90 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.3      0       2
Processing:   110  215  72.0    190     497
Waiting:       91  180  70.5    152     473
Total:        112  216  72.0    191     497

Percentage of the requests served within a certain time (ms)
  50%    191
  66%    225
  75%    252
  80%    272
  90%    319
  95%    364
  98%    420
  99%    453
 100%    497 (longest request)


Sometimes it takes a lot of time, sometimes it's pretty quick.

Thanks,
-Utkarsh


On Fri, Jun 28, 2013 at 5:39 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote:

 Hello,

 I have a use case where I need to retrieve the top 2000 documents matching a
 query.
 What are the parameters (in query, solrconfig, schema) I should look at to
 improve this?

 I have 45M documents in 3node solrcloud 4.3.1 with 3 shards, with 30GB
 RAM, 8vCPU and 7GB JVM heap size.

 I have documentCache:
    <documentCache class="solr.LRUCache" size="1000000"
  initialSize="1000000" autowarmCount="0"/>

 allText is a copyField.

 This is the result I get:
 ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 
 http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
 

 Benchmarking x.amazonaws.com (be patient)
 Completed 100 requests
 Completed 200 requests
 Completed 300 requests
 Completed 400 requests
 Completed 500 requests
 Finished 500 requests


 Server Software:
 Server Hostname:x.amazonaws.com
 Server Port:8983

 Document Path:
 /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
 Document Length:1538537 bytes

 Concurrency Level:  10
 Time taken for tests:   35.999 seconds
 Complete requests:  500
 Failed requests:21
(Connect: 0, Receive: 0, Length: 21, Exceptions: 0)
 Write errors:   0
 Non-2xx responses:  2
 Total transferred:  766221660 bytes
 HTML transferred:   766191806 bytes
 Requests per second:13.89 [#/sec] (mean)
 Time per request:   719.981 [ms] (mean)
 Time per request:   71.998 [ms] (mean, across all concurrent requests)
 Transfer rate:  20785.65 [Kbytes/sec] received

 Connection Times (ms)
               min  mean[+/-sd] median   max
 Connect:        0    0   0.6      0       8
 Processing:     9  717 2339.6    199   12611
 Waiting:        9  635 2233.6    164   12580
 Total:          9  718 2339.6    199   12611

 Percentage of the requests served within a certain time (ms)
   50%    199
   66%    236
   75%    263
   80%    281
   90%    548
   95%    838
   98%  12475
   99%  12545
  100%  12611 (longest request)

 --
 Thanks,
 -Utkarsh




-- 
Thanks,
-Utkarsh