Re: Improving performance to return 2000+ documents

2013-07-01 Thread Utkarsh Sengar
Thanks Erick/Jagdish.

Just to give some background on my queries.

1. All my queries are unique. A query can be "ipod" or "ipod 8gb" (but
each query string is distinct). There are about 1.2M queries in total.
So, I assume setting a high queryResultCache, queryResultWindowSize and
queryResultMaxDocsCached won't help.

2. I have these cache settings:
<documentCache class="solr.LRUCache"
   size="1000000"
   initialSize="1000000"
   autowarmCount="0"
   cleanupThread="true"/>
//My understanding is, documentCache will help me the most because Solr
//will cache the documents retrieved.
//Stats for documentCache: http://apaste.info/hknh

<queryResultCache class="solr.LRUCache"
 size="512"
 initialSize="512"
 autowarmCount="0"
 cleanupThread="true"/>
//Default, since my queries are unique.

<filterCache class="solr.FastLRUCache"
 size="512"
 initialSize="512"
 autowarmCount="0"/>
//Not sure how I can use filterCache, so I am keeping the default.

<enableLazyFieldLoading>true</enableLazyFieldLoading>
<queryResultWindowSize>100</queryResultWindowSize>
<queryResultMaxDocsCached>100</queryResultMaxDocsCached>
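
For reference, a rough SolrJ sketch of how filterCache would get used: move a
repeated, structured constraint into fq (the merchant field below is just a
made-up example). Since every query here is unique this may not apply, but a
repeated fq clause is the main way to exercise that cache:

// Rough sketch: structured constraints moved into fq are cached in filterCache
// and reused across requests ("merchant" is a hypothetical field name).
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FilterCacheSketch {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/prodinfo");
        SolrQuery q = new SolrQuery("allText:ipad");   // the part that changes per request
        q.addFilterQuery("merchant:acme");             // repeated filter, cached as a doc set
        q.setRows(2000);
        QueryResponse rsp = solr.query(q);
        System.out.println(rsp.getResults().getNumFound() + " matches");
    }
}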


I think the question can also be framed as: how can I optimize Solr
response time for a 50M-product catalog when unique queries retrieve
2000 documents in one go?
I looked at writing a Solr search component, but writing a proxy around Solr
was easier, so I went ahead with that approach.
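
One thing that may help regardless of caching is trimming the response: with
rows=2000 each response is around 1.5MB, so asking only for the fields the
relevancy engine actually needs (field names below are placeholders) should
cut transfer and serialization time. A rough sketch:

// Rough sketch: request only the fields needed downstream (names are placeholders),
// so Solr doesn't ship unneeded stored fields for all 2000 documents.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class TrimmedFields {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/prodinfo");
        SolrQuery q = new SolrQuery("allText:huggies diapers size 1");
        q.setRows(2000);
        q.setFields("id", "upc", "score");   // fl=id,upc,score
        System.out.println(solr.query(q).getResults().size() + " docs returned");
    }
}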


Thanks,
-Utkarsh




On Sun, Jun 30, 2013 at 6:54 PM, Jagdish Nomula jagd...@simplyhired.com wrote:

 Solrconfig.xml has entries which you can tweak for your use case. One
 of them is queryResultWindowSize. You can try using a value of 2000 and
 see if it helps improve performance. Please make sure you have enough
 memory allocated for the queryResultCache.
 A combination of sharding and distribution of workload (requesting
 2000/number of shards) with an aggregator would be a good way to maximize
 performance.

 Thanks,

 Jagdish


 On Sun, Jun 30, 2013 at 6:48 PM, Erick Erickson erickerick...@gmail.com
 wrote:

  50M documents, depending on a bunch of things,
  may not be unreasonable for a single node, only
  testing will tell.
 
  But the question I have is whether you should be
  using standard Solr queries for this or building a custom
  component that goes at the base Lucene index
  and does the right thing. Or even re-indexing your
  entire corpus periodically to add this kind of data.
 
  FWIW,
  Erick
 
 
  On Sun, Jun 30, 2013 at 2:00 PM, Utkarsh Sengar utkarsh2...@gmail.com
  wrote:
 
   Thanks Erick/Peter.
  
   This is an offline process, used by a relevancy engine implemented
 around
   solr. The engine computes boost scores for related keywords based on
   clickstream data.
   i.e.: say clickstream has: ipad=upc1,upc2,upc3
   I query solr with keyword: ipad (to get 2000 documents) and then
 make 3
   individual queries for upc1,upc2,upc3 (which are fast).
   The data is then used to compute related keywords to ipad with their
   boost values.
  
   So, I cannot really replace that, since I need full text search over my
   dataset to retrieve top 2000 documents.
  
   I tried paging: I retrieve 500 solr documents 4 times (0-500,
  500-1000...),
   but don't see any improvements.
  
  
   Some questions:
   1. Maybe the JVM size might help?
   This is what I see in the dashboard:
   Physical Memory 76.2%
   Swap Space NaN% (don't have any swap space, running on AWS EBS)
   File Descriptor Count 4.7%
   JVM-Memory 73.8%
  
   Screenshot: http://i.imgur.com/aegKzP6.png
  
   2. Will reducing the shards from 3 to 1 improve performance? (maybe
   increase the RAM from 30 to 60GB) The problem I will face in that case
  will
   be fitting 50M documents on 1 machine.
  
   Thanks,
   -Utkarsh
  
  
   On Sat, Jun 29, 2013 at 3:58 PM, Peter Sturge peter.stu...@gmail.com
   wrote:
  
Hello Utkarsh,
This may or may not be relevant for your use-case, but the way we
 deal
   with
this scenario is to retrieve the top N documents 5, 10, 20, or 100 at a
 time
(user selectable). We can then page the results, changing the start
parameter to return the next set. This allows us to 'retrieve'
 millions
   of
documents - we just do it at the user's leisure, rather than make
 them
   wait
for the whole lot in one go.
This works well because users very rarely want to see ALL 2000 (or
   whatever
number) documents at one time - it's simply too much to take in at
 one
time.
If your use-case involves an automated or offline procedure (e.g.
   running a
report or some data-mining op), then presumably it doesn't matter so
  much
if it takes a bit longer (as long as it returns in some reasonable time).
Have you looked at doing paging on the client-side - this will hugely
speed-up your search time.
HTH
Peter
   
   
   
On Sat, Jun 29, 2013 at 6:17 PM, Erick 

Re: Improving performance to return 2000+ documents

2013-06-30 Thread Utkarsh Sengar
Thanks Erick/Peter.

This is an offline process, used by a relevancy engine implemented around
Solr. The engine computes boost scores for related keywords based on
clickstream data.
i.e.: say the clickstream has: ipad=upc1,upc2,upc3
I query Solr with the keyword "ipad" (to get 2000 documents) and then make 3
individual queries for upc1, upc2, upc3 (which are fast).
The data is then used to compute keywords related to "ipad" with their
boost values.
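
Roughly, the per-keyword step looks like this in SolrJ terms (the upc field
name and core URL are placeholders):

// Rough sketch of the per-keyword step described above (field/core names are placeholders):
// one 2000-row full-text query for the keyword, then one small lookup per clicked UPC.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocumentList;

public class KeywordBoosts {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/prodinfo");

        SolrQuery kw = new SolrQuery("allText:ipad");
        kw.setRows(2000);                                  // the slow part
        SolrDocumentList candidates = solr.query(kw).getResults();

        for (String upc : new String[]{"upc1", "upc2", "upc3"}) {  // from clickstream
            SolrQuery byUpc = new SolrQuery("upc:" + upc); // "upc" field name is an assumption
            byUpc.setRows(1);
            SolrDocumentList hit = solr.query(byUpc).getResults();
            // ...compare hit against candidates to compute the boost for this keyword
        }
    }
}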

So, I cannot really replace that, since I need full-text search over my
dataset to retrieve the top 2000 documents.

I tried paging: I retrieve 500 Solr documents 4 times (0-500, 500-1000, ...),
but don't see any improvement.


Some questions:
1. Maybe the JVM size might help?
This is what I see in the dashboard:
Physical Memory 76.2%
Swap Space NaN% (don't have any swap space, running on AWS EBS)
File Descriptor Count 4.7%
JVM-Memory 73.8%

Screenshot: http://i.imgur.com/aegKzP6.png

2. Will reducing the shards from 3 to 1 improve performance? (maybe
increase the RAM from 30 to 60GB) The problem I will face in that case will
be fitting 50M documents on 1 machine.

Thanks,
-Utkarsh


On Sat, Jun 29, 2013 at 3:58 PM, Peter Sturge peter.stu...@gmail.com wrote:

 Hello Utkarsh,
 This may or may not be relevant for your use-case, but the way we deal with
 this scenario is to retrieve the top N documents 5, 10, 20, or 100 at a time
 (user selectable). We can then page the results, changing the start
 parameter to return the next set. This allows us to 'retrieve' millions of
 documents - we just do it at the user's leisure, rather than make them wait
 for the whole lot in one go.
 This works well because users very rarely want to see ALL 2000 (or whatever
 number) documents at one time - it's simply too much to take in at one
 time.
 If your use-case involves an automated or offline procedure (e.g. running a
 report or some data-mining op), then presumably it doesn't matter so much
 if it takes a bit longer (as long as it returns in some reasonable time).
 Have you looked at doing paging on the client-side - this will hugely
 speed-up your search time.
 HTH
 Peter



 On Sat, Jun 29, 2013 at 6:17 PM, Erick Erickson erickerick...@gmail.com
 wrote:

  Well, depending on how many docs get served
  from the cache the time will vary. But this is
  just ugly, if you can avoid this use-case it would
  be a Good Thing.
 
  Problem here is that each and every shard must
  assemble the list of 2,000 documents (just ID and
  sort criteria, usually score).
 
  Then the node serving the original request merges
  the sub-lists to pick the top 2,000. Then the node
  sends another request to each shard to get
  the full document. Then the node merges this
  into the full list to return to the user.
 
  Solr really isn't built for this use-case, is it actually
  a compelling situation?
 
  And having your document cache set at 1M is kinda
  high if you have very big documents.
 
  FWIW,
  Erick
 
 
  On Fri, Jun 28, 2013 at 8:44 PM, Utkarsh Sengar utkarsh2...@gmail.com
  wrote:
 
   Also, I don't see a consistent response time from solr, I ran ab again
  and
   I get this:
  
   ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 
  
  
 
  http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
   
  
  
   Benchmarking x.amazonaws.com (be patient)
   Completed 100 requests
   Completed 200 requests
   Completed 300 requests
   Completed 400 requests
   Completed 500 requests
   Finished 500 requests
  
  
   Server Software:
   Server Hostname:   x.amazonaws.com
   Server Port:8983
  
   Document Path:
  
  
 
  /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
   Document Length:1538537 bytes
  
   Concurrency Level:  10
   Time taken for tests:   10.858 seconds
   Complete requests:  500
   Failed requests:8
  (Connect: 0, Receive: 0, Length: 8, Exceptions: 0)
   Write errors:   0
   Total transferred:  769297992 bytes
   HTML transferred:   769268492 bytes
   Requests per second:46.05 [#/sec] (mean)
   Time per request:   217.167 [ms] (mean)
   Time per request:   21.717 [ms] (mean, across all concurrent
  requests)
   Transfer rate:  69187.90 [Kbytes/sec] received
  
    Connection Times (ms)
                  min  mean[+/-sd] median   max
    Connect:        0    0   0.3      0       2
    Processing:   110  215  72.0    190     497
    Waiting:       91  180  70.5    152     473
    Total:        112  216  72.0    191     497

    Percentage of the requests served within a certain time (ms)
      50%    191
      66%    225
      75%    252
      80%    272
      90%    319
      95%    364
      98%    420
      99%    453
     100%    497 (longest request)
  
  
    Sometimes it takes a lot of time, sometimes it's pretty quick.
  
   Thanks,
   -Utkarsh
  
  
   On Fri, Jun 28, 2013 at 5:39 PM, Utkarsh Sengar utkarsh2...@gmail.com
   wrote:
  
Hello,
   
I have a 

Re: Improving performance to return 2000+ documents

2013-06-30 Thread Erick Erickson
50M documents, depending on a bunch of things,
may not be unreasonable for a single node, only
testing will tell.

But the question I have is whether you should be
using standard Solr queries for this or building a custom
component that goes at the base Lucene index
and does the right thing. Or even re-indexing your
entire corpus periodically to add this kind of data.
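
For what it's worth, a bare-bones offline pass straight against the Lucene
index might look roughly like the sketch below; the index path and field names
are only placeholders, and a raw TermQuery bypasses Solr's analysis chain and
SolrCloud routing entirely:

// Rough sketch of an offline pass over the raw Lucene index (paths and field
// names are assumptions); avoids HTTP and distributed merging, but also skips
// Solr's query parsing, analysis chain and SolrCloud routing.
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import java.io.File;

public class RawIndexScan {
    public static void main(String[] args) throws Exception {
        DirectoryReader reader = DirectoryReader.open(
            FSDirectory.open(new File("/path/to/solr/prodinfo/data/index")));
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs top = searcher.search(new TermQuery(new Term("allText", "ipad")), 2000);
        for (ScoreDoc sd : top.scoreDocs) {
            String id = searcher.doc(sd.doc).get("id");   // read stored fields per hit
            // ...feed (id, sd.score) into the boost computation
        }
        reader.close();
    }
}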

FWIW,
Erick


On Sun, Jun 30, 2013 at 2:00 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote:

 Thanks Erick/Peter.

 This is an offline process, used by a relevancy engine implemented around
 solr. The engine computes boost scores for related keywords based on
 clickstream data.
 i.e.: say clickstream has: ipad=upc1,upc2,upc3
 I query solr with keyword: ipad (to get 2000 documents) and then make 3
 individual queries for upc1,upc2,upc3 (which are fast).
 The data is then used to compute related keywords to ipad with their
 boost values.

 So, I cannot really replace that, since I need full text search over my
 dataset to retrieve top 2000 documents.

 I tried paging: I retrieve 500 solr documents 4 times (0-500, 500-1000...),
 but don't see any improvements.


 Some questions:
 1. Maybe the JVM size might help?
 This is what I see in the dashboard:
 Physical Memory 76.2%
 Swap Space NaN% (don't have any swap space, running on AWS EBS)
 File Descriptor Count 4.7%
 JVM-Memory 73.8%

 Screenshot: http://i.imgur.com/aegKzP6.png

 2. Will reducing the shards from 3 to 1 improve performance? (maybe
 increase the RAM from 30 to 60GB) The problem I will face in that case will
 be fitting 50M documents on 1 machine.

 Thanks,
 -Utkarsh


 On Sat, Jun 29, 2013 at 3:58 PM, Peter Sturge peter.stu...@gmail.com
 wrote:

  Hello Utkarsh,
  This may or may not be relevant for your use-case, but the way we deal
 with
   this scenario is to retrieve the top N documents 5, 10, 20, or 100 at a time
  (user selectable). We can then page the results, changing the start
  parameter to return the next set. This allows us to 'retrieve' millions
 of
  documents - we just do it at the user's leisure, rather than make them
 wait
  for the whole lot in one go.
  This works well because users very rarely want to see ALL 2000 (or
 whatever
  number) documents at one time - it's simply too much to take in at one
  time.
  If your use-case involves an automated or offline procedure (e.g.
 running a
  report or some data-mining op), then presumably it doesn't matter so much
   if it takes a bit longer (as long as it returns in some reasonable time).
  Have you looked at doing paging on the client-side - this will hugely
  speed-up your search time.
  HTH
  Peter
 
 
 
  On Sat, Jun 29, 2013 at 6:17 PM, Erick Erickson erickerick...@gmail.com
  wrote:
 
   Well, depending on how many docs get served
   from the cache the time will vary. But this is
   just ugly, if you can avoid this use-case it would
   be a Good Thing.
  
   Problem here is that each and every shard must
   assemble the list of 2,000 documents (just ID and
   sort criteria, usually score).
  
   Then the node serving the original request merges
   the sub-lists to pick the top 2,000. Then the node
   sends another request to each shard to get
   the full document. Then the node merges this
   into the full list to return to the user.
  
   Solr really isn't built for this use-case, is it actually
   a compelling situation?
  
   And having your document cache set at 1M is kinda
   high if you have very big documents.
  
   FWIW,
   Erick
  
  
   On Fri, Jun 28, 2013 at 8:44 PM, Utkarsh Sengar utkarsh2...@gmail.com
   wrote:
  
Also, I don't see a consistent response time from solr, I ran ab
 again
   and
I get this:
   
ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 
   
   
  
 
  http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json

   
   
Benchmarking x.amazonaws.com (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Finished 500 requests
   
   
Server Software:
Server Hostname:   x.amazonaws.com
Server Port:8983
   
Document Path:
   
   
  
 
  /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
Document Length:1538537 bytes
   
Concurrency Level:  10
Time taken for tests:   10.858 seconds
Complete requests:  500
Failed requests:8
   (Connect: 0, Receive: 0, Length: 8, Exceptions: 0)
Write errors:   0
Total transferred:  769297992 bytes
HTML transferred:   769268492 bytes
Requests per second:46.05 [#/sec] (mean)
Time per request:   217.167 [ms] (mean)
Time per request:   21.717 [ms] (mean, across all concurrent
   requests)
Transfer rate:  69187.90 [Kbytes/sec] received
   
Connection Times (ms)
  min  mean[+/-sd] median   max
Connect:00   0.3  0  

Re: Improving performance to return 2000+ documents

2013-06-30 Thread Jagdish Nomula
Solrconfig.xml has entries which you can tweak for your use case. One
of them is queryResultWindowSize. You can try using a value of 2000 and
see if it helps improve performance. Please make sure you have enough
memory allocated for the queryResultCache.
A combination of sharding and distribution of workload (requesting
2000/number of shards) with an aggregator would be a good way to maximize
performance.
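
As a very rough sketch of that aggregator idea (the shard URLs are
placeholders, and splitting rows evenly only recovers the true top 2000 when
the best hits are spread evenly across shards):

// Rough sketch of a client-side aggregator: hit each shard core directly
// (distrib=false) for rows = 2000 / numShards and merge by score.
// Shard URLs are placeholders; this trades exactness for latency.
import java.util.*;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;

public class ShardAggregator {
    public static void main(String[] args) throws Exception {
        String[] shards = {
            "http://shard1:8983/solr/prodinfo",
            "http://shard2:8983/solr/prodinfo",
            "http://shard3:8983/solr/prodinfo"};
        int rowsPerShard = 2000 / shards.length;
        List<SolrDocument> merged = new ArrayList<SolrDocument>();
        for (String url : shards) {
            SolrQuery q = new SolrQuery("allText:ipad");
            q.setRows(rowsPerShard);
            q.set("distrib", "false");           // query this core only
            q.setFields("id", "score");          // keep the transfer small
            merged.addAll(new HttpSolrServer(url).query(q).getResults());
        }
        // sort the merged candidates by score, highest first
        Collections.sort(merged, new Comparator<SolrDocument>() {
            public int compare(SolrDocument a, SolrDocument b) {
                return ((Float) b.getFieldValue("score")).compareTo((Float) a.getFieldValue("score"));
            }
        });
    }
}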

Thanks,

Jagdish


On Sun, Jun 30, 2013 at 6:48 PM, Erick Erickson erickerick...@gmail.com wrote:

 50M documents, depending on a bunch of things,
 may not be unreasonable for a single node, only
 testing will tell.

 But the question I have is whether you should be
 using standard Solr queries for this or building a custom
 component that goes at the base Lucene index
 and does the right thing. Or even re-indexing your
 entire corpus periodically to add this kind of data.

 FWIW,
 Erick


 On Sun, Jun 30, 2013 at 2:00 PM, Utkarsh Sengar utkarsh2...@gmail.com
 wrote:

  Thanks Erick/Peter.
 
  This is an offline process, used by a relevancy engine implemented around
  solr. The engine computes boost scores for related keywords based on
  clickstream data.
  i.e.: say clickstream has: ipad=upc1,upc2,upc3
  I query solr with keyword: ipad (to get 2000 documents) and then make 3
  individual queries for upc1,upc2,upc3 (which are fast).
  The data is then used to compute related keywords to ipad with their
  boost values.
 
  So, I cannot really replace that, since I need full text search over my
  dataset to retrieve top 2000 documents.
 
  I tried paging: I retrieve 500 solr documents 4 times (0-500,
 500-1000...),
  but don't see any improvements.
 
 
  Some questions:
  1. Maybe the JVM size might help?
  This is what I see in the dashboard:
  Physical Memory 76.2%
  Swap Space NaN% (don't have any swap space, running on AWS EBS)
  File Descriptor Count 4.7%
  JVM-Memory 73.8%
 
  Screenshot: http://i.imgur.com/aegKzP6.png
 
  2. Will reducing the shards from 3 to 1 improve performance? (maybe
  increase the RAM from 30 to 60GB) The problem I will face in that case
 will
  be fitting 50M documents on 1 machine.
 
  Thanks,
  -Utkarsh
 
 
  On Sat, Jun 29, 2013 at 3:58 PM, Peter Sturge peter.stu...@gmail.com
  wrote:
 
   Hello Utkarsh,
   This may or may not be relevant for your use-case, but the way we deal
  with
    this scenario is to retrieve the top N documents 5, 10, 20, or 100 at a time
   (user selectable). We can then page the results, changing the start
   parameter to return the next set. This allows us to 'retrieve' millions
  of
   documents - we just do it at the user's leisure, rather than make them
  wait
   for the whole lot in one go.
   This works well because users very rarely want to see ALL 2000 (or
  whatever
   number) documents at one time - it's simply too much to take in at one
   time.
   If your use-case involves an automated or offline procedure (e.g.
  running a
   report or some data-mining op), then presumably it doesn't matter so
 much
    if it takes a bit longer (as long as it returns in some reasonable time).
   Have you looked at doing paging on the client-side - this will hugely
   speed-up your search time.
   HTH
   Peter
  
  
  
   On Sat, Jun 29, 2013 at 6:17 PM, Erick Erickson 
 erickerick...@gmail.com
   wrote:
  
Well, depending on how many docs get served
from the cache the time will vary. But this is
just ugly, if you can avoid this use-case it would
be a Good Thing.
   
Problem here is that each and every shard must
assemble the list of 2,000 documents (just ID and
sort criteria, usually score).
   
Then the node serving the original request merges
the sub-lists to pick the top 2,000. Then the node
sends another request to each shard to get
the full document. Then the node merges this
into the full list to return to the user.
   
Solr really isn't built for this use-case, is it actually
a compelling situation?
   
And having your document cache set at 1M is kinda
high if you have very big documents.
   
FWIW,
Erick
   
   
On Fri, Jun 28, 2013 at 8:44 PM, Utkarsh Sengar 
 utkarsh2...@gmail.com
wrote:
   
 Also, I don't see a consistent response time from solr, I ran ab
  again
and
 I get this:

 ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 


   
  
 
  http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
 


 Benchmarking x.amazonaws.com (be patient)
 Completed 100 requests
 Completed 200 requests
 Completed 300 requests
 Completed 400 requests
 Completed 500 requests
 Finished 500 requests


 Server Software:
 Server Hostname:   x.amazonaws.com
 Server Port:8983

 Document Path:


   
  
 
  /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
 Document Length:1538537 bytes

 

Re: Improving performance to return 2000+ documents

2013-06-29 Thread Erick Erickson
Well, depending on how many docs get served
from the cache, the time will vary. But this is
just ugly; if you can avoid this use-case it would
be a Good Thing.

Problem here is that each and every shard must
assemble the list of 2,000 documents (just ID and
sort criteria, usually score).

Then the node serving the original request merges
the sub-lists to pick the top 2,000. Then the node
sends another request to each shard to get
the full document. Then the node merges this
into the full list to return to the user.
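
In pseudo-code, that first id+score phase is roughly a top-N merge like the
sketch below; it is only an illustration of the idea, not Solr's actual code,
but it shows why every extra row multiplies work across all shards:

// Rough sketch (not Solr's actual implementation) of the first, id+score phase:
// every shard contributes up to 2000 candidates, the coordinator keeps the
// global top 2000, and only those ids go back out in the second fetch phase.
import java.util.*;

class ShardHit {
    final String id; final float score;
    ShardHit(String id, float score) { this.id = id; this.score = score; }
}

class CoordinatorMerge {
    static List<ShardHit> topN(List<List<ShardHit>> perShardTopN, int n) {
        PriorityQueue<ShardHit> heap = new PriorityQueue<ShardHit>(n,
            new Comparator<ShardHit>() {                 // min-heap on score
                public int compare(ShardHit a, ShardHit b) {
                    return Float.compare(a.score, b.score);
                }
            });
        for (List<ShardHit> shard : perShardTopN)        // e.g. 3 shards x 2000 hits each
            for (ShardHit h : shard) {
                heap.offer(h);
                if (heap.size() > n) heap.poll();        // drop the current worst
            }
        List<ShardHit> out = new ArrayList<ShardHit>(heap);
        Collections.sort(out, new Comparator<ShardHit>() {
            public int compare(ShardHit a, ShardHit b) {
                return Float.compare(b.score, a.score);  // best first
            }
        });
        return out;                                      // these ids are then fetched per shard
    }
}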

Solr really isn't built for this use-case; is it actually
a compelling situation?

And having your document cache set at 1M is kinda
high if you have very big documents.

FWIW,
Erick


On Fri, Jun 28, 2013 at 8:44 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote:

 Also, I don't see a consistent response time from solr, I ran ab again and
 I get this:

 ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 

 http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
 


 Benchmarking x.amazonaws.com (be patient)
 Completed 100 requests
 Completed 200 requests
 Completed 300 requests
 Completed 400 requests
 Completed 500 requests
 Finished 500 requests


 Server Software:
 Server Hostname:   x.amazonaws.com
 Server Port:8983

 Document Path:

 /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
 Document Length:1538537 bytes

 Concurrency Level:  10
 Time taken for tests:   10.858 seconds
 Complete requests:  500
 Failed requests:8
(Connect: 0, Receive: 0, Length: 8, Exceptions: 0)
 Write errors:   0
 Total transferred:  769297992 bytes
 HTML transferred:   769268492 bytes
 Requests per second:46.05 [#/sec] (mean)
 Time per request:   217.167 [ms] (mean)
 Time per request:   21.717 [ms] (mean, across all concurrent requests)
 Transfer rate:  69187.90 [Kbytes/sec] received

  Connection Times (ms)
                min  mean[+/-sd] median   max
  Connect:        0    0   0.3      0       2
  Processing:   110  215  72.0    190     497
  Waiting:       91  180  70.5    152     473
  Total:        112  216  72.0    191     497

  Percentage of the requests served within a certain time (ms)
    50%    191
    66%    225
    75%    252
    80%    272
    90%    319
    95%    364
    98%    420
    99%    453
   100%    497 (longest request)


  Sometimes it takes a lot of time, sometimes it's pretty quick.

 Thanks,
 -Utkarsh


 On Fri, Jun 28, 2013 at 5:39 PM, Utkarsh Sengar utkarsh2...@gmail.com
 wrote:

  Hello,
 
   I have a use case where I need to retrieve the top 2000 documents matching a
  query.
   What are the parameters (in query, solrconfig, schema) I should look at to
  improve this?
 
  I have 45M documents in 3node solrcloud 4.3.1 with 3 shards, with 30GB
  RAM, 8vCPU and 7GB JVM heap size.
 
  I have documentCache:
  <documentCache class="solr.LRUCache" size="1000000"
   initialSize="1000000" autowarmCount="0"/>
 
  allText is a copyField.
 
  This is the result I get:
  ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 
 
 http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
  
 
  Benchmarking x.amazonaws.com (be patient)
  Completed 100 requests
  Completed 200 requests
  Completed 300 requests
  Completed 400 requests
  Completed 500 requests
  Finished 500 requests
 
 
  Server Software:
  Server Hostname:x.amazonaws.com
  Server Port:8983
 
  Document Path:
 
 /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
  Document Length:1538537 bytes
 
  Concurrency Level:  10
  Time taken for tests:   35.999 seconds
  Complete requests:  500
  Failed requests:21
 (Connect: 0, Receive: 0, Length: 21, Exceptions: 0)
  Write errors:   0
  Non-2xx responses:  2
  Total transferred:  766221660 bytes
  HTML transferred:   766191806 bytes
  Requests per second:13.89 [#/sec] (mean)
  Time per request:   719.981 [ms] (mean)
  Time per request:   71.998 [ms] (mean, across all concurrent
 requests)
  Transfer rate:  20785.65 [Kbytes/sec] received
 
   Connection Times (ms)
                 min  mean[+/-sd] median   max
   Connect:        0    0   0.6      0       8
   Processing:     9  717 2339.6    199   12611
   Waiting:        9  635 2233.6    164   12580
   Total:          9  718 2339.6    199   12611

   Percentage of the requests served within a certain time (ms)
     50%    199
     66%    236
     75%    263
     80%    281
     90%    548
     95%    838
     98%  12475
     99%  12545
    100%  12611 (longest request)
 
  --
  Thanks,
  -Utkarsh
 



 --
 Thanks,
 -Utkarsh



Re: Improving performance to return 2000+ documents

2013-06-29 Thread Peter Sturge
Hello Utkarsh,
This may or may not be relevant for your use-case, but the way we deal with
this scenario is to retrieve the top N documents 5, 10, 20, or 100 at a time
(user selectable). We can then page the results, changing the start
parameter to return the next set. This allows us to 'retrieve' millions of
documents - we just do it at the user's leisure, rather than make them wait
for the whole lot in one go.
This works well because users very rarely want to see ALL 2000 (or whatever
number) documents at one time - it's simply too much to take in at one time.
If your use-case involves an automated or offline procedure (e.g. running a
report or some data-mining op), then presumably it doesn't matter so much
if it takes a bit longer (as long as it returns in some reasonable time).
Have you looked at doing paging on the client-side - this will hugely
speed up your search time.
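
A minimal SolrJ loop along those lines might look like this (the core URL and
page size are just placeholders):

// Rough sketch of start/rows paging: fetch the 2000 hits one page at a time
// rather than in a single 2000-row response (URL and page size are placeholders).
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocumentList;

public class PagedFetch {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/prodinfo");
        int pageSize = 100, wanted = 2000;
        for (int start = 0; start < wanted; start += pageSize) {
            SolrQuery q = new SolrQuery("allText:ipad");
            q.setStart(start);
            q.setRows(pageSize);
            SolrDocumentList page = solr.query(q).getResults();
            if (page.isEmpty()) break;       // fewer than 2000 matches
            // process this page before asking for the next one
        }
    }
}
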
HTH
Peter



On Sat, Jun 29, 2013 at 6:17 PM, Erick Erickson erickerick...@gmail.com wrote:

 Well, depending on how many docs get served
 from the cache the time will vary. But this is
 just ugly, if you can avoid this use-case it would
 be a Good Thing.

 Problem here is that each and every shard must
 assemble the list of 2,000 documents (just ID and
 sort criteria, usually score).

 Then the node serving the original request merges
 the sub-lists to pick the top 2,000. Then the node
 sends another request to each shard to get
 the full document. Then the node merges this
 into the full list to return to the user.

 Solr really isn't built for this use-case, is it actually
 a compelling situation?

 And having your document cache set at 1M is kinda
 high if you have very big documents.

 FWIW,
 Erick


 On Fri, Jun 28, 2013 at 8:44 PM, Utkarsh Sengar utkarsh2...@gmail.com
 wrote:

  Also, I don't see a consistent response time from solr, I ran ab again
 and
  I get this:
 
  ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 
 
 
  http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
  
 
 
  Benchmarking x.amazonaws.com (be patient)
  Completed 100 requests
  Completed 200 requests
  Completed 300 requests
  Completed 400 requests
  Completed 500 requests
  Finished 500 requests
 
 
  Server Software:
  Server Hostname:   x.amazonaws.com
  Server Port:8983
 
  Document Path:
 
 
  /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
  Document Length:1538537 bytes
 
  Concurrency Level:  10
  Time taken for tests:   10.858 seconds
  Complete requests:  500
  Failed requests:8
 (Connect: 0, Receive: 0, Length: 8, Exceptions: 0)
  Write errors:   0
  Total transferred:  769297992 bytes
  HTML transferred:   769268492 bytes
  Requests per second:46.05 [#/sec] (mean)
  Time per request:   217.167 [ms] (mean)
  Time per request:   21.717 [ms] (mean, across all concurrent
 requests)
  Transfer rate:  69187.90 [Kbytes/sec] received
 
   Connection Times (ms)
                 min  mean[+/-sd] median   max
   Connect:        0    0   0.3      0       2
   Processing:   110  215  72.0    190     497
   Waiting:       91  180  70.5    152     473
   Total:        112  216  72.0    191     497

   Percentage of the requests served within a certain time (ms)
     50%    191
     66%    225
     75%    252
     80%    272
     90%    319
     95%    364
     98%    420
     99%    453
    100%    497 (longest request)
 
 
   Sometimes it takes a lot of time, sometimes it's pretty quick.
 
  Thanks,
  -Utkarsh
 
 
  On Fri, Jun 28, 2013 at 5:39 PM, Utkarsh Sengar utkarsh2...@gmail.com
  wrote:
 
   Hello,
  
    I have a use case where I need to retrieve the top 2000 documents matching a
   query.
    What are the parameters (in query, solrconfig, schema) I should look at
 to
   improve this?
  
   I have 45M documents in 3node solrcloud 4.3.1 with 3 shards, with 30GB
   RAM, 8vCPU and 7GB JVM heap size.
  
   I have documentCache:
  <documentCache class="solr.LRUCache" size="1000000"
    initialSize="1000000" autowarmCount="0"/>
  
   allText is a copyField.
  
   This is the result I get:
   ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 
  
 
  http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
   
  
   Benchmarking x.amazonaws.com (be patient)
   Completed 100 requests
   Completed 200 requests
   Completed 300 requests
   Completed 400 requests
   Completed 500 requests
   Finished 500 requests
  
  
   Server Software:
   Server Hostname:x.amazonaws.com
   Server Port:8983
  
   Document Path:
  
 
  /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
   Document Length:1538537 bytes
  
   Concurrency Level:  10
   Time taken for tests:   35.999 seconds
   Complete requests:  500
   Failed requests:21
  (Connect: 0, Receive: 0, Length: 21, Exceptions: 0)
   Write errors:   0
   Non-2xx responses:  2
   Total 

Improving performance to return 2000+ documents

2013-06-28 Thread Utkarsh Sengar
Hello,

I have a use case where I need to retrieve the top 2000 documents matching a
query.
What are the parameters (in query, solrconfig, schema) I should look at to
improve this?

I have 45M documents in a 3-node SolrCloud 4.3.1 cluster with 3 shards, with 30GB RAM,
8 vCPUs and a 7GB JVM heap size.

I have documentCache:
  <documentCache class="solr.LRUCache" size="1000000"
initialSize="1000000" autowarmCount="0"/>

allText is a copyField.

This is the result I get:
ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 
http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json


Benchmarking x.amazonaws.com (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Finished 500 requests


Server Software:
Server Hostname:x.amazonaws.com
Server Port:8983

Document Path:
/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
Document Length:1538537 bytes

Concurrency Level:  10
Time taken for tests:   35.999 seconds
Complete requests:  500
Failed requests:21
   (Connect: 0, Receive: 0, Length: 21, Exceptions: 0)
Write errors:   0
Non-2xx responses:  2
Total transferred:  766221660 bytes
HTML transferred:   766191806 bytes
Requests per second:13.89 [#/sec] (mean)
Time per request:   719.981 [ms] (mean)
Time per request:   71.998 [ms] (mean, across all concurrent requests)
Transfer rate:  20785.65 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.6      0       8
Processing:     9  717 2339.6    199   12611
Waiting:        9  635 2233.6    164   12580
Total:          9  718 2339.6    199   12611

Percentage of the requests served within a certain time (ms)
  50%    199
  66%    236
  75%    263
  80%    281
  90%    548
  95%    838
  98%  12475
  99%  12545
 100%  12611 (longest request)

-- 
Thanks,
-Utkarsh


Re: Improving performance to return 2000+ documents

2013-06-28 Thread Utkarsh Sengar
Also, I don't see a consistent response time from solr, I ran ab again and
I get this:

ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 
http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json



Benchmarking x.amazonaws.com (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Finished 500 requests


Server Software:
Server Hostname:   x.amazonaws.com
Server Port:8983

Document Path:
/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
Document Length:1538537 bytes

Concurrency Level:  10
Time taken for tests:   10.858 seconds
Complete requests:  500
Failed requests:8
   (Connect: 0, Receive: 0, Length: 8, Exceptions: 0)
Write errors:   0
Total transferred:  769297992 bytes
HTML transferred:   769268492 bytes
Requests per second:46.05 [#/sec] (mean)
Time per request:   217.167 [ms] (mean)
Time per request:   21.717 [ms] (mean, across all concurrent requests)
Transfer rate:  69187.90 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.3      0       2
Processing:   110  215  72.0    190     497
Waiting:       91  180  70.5    152     473
Total:        112  216  72.0    191     497

Percentage of the requests served within a certain time (ms)
  50%    191
  66%    225
  75%    252
  80%    272
  90%    319
  95%    364
  98%    420
  99%    453
 100%    497 (longest request)


Sometimes it takes a lot of time, sometimes it's pretty quick.

Thanks,
-Utkarsh


On Fri, Jun 28, 2013 at 5:39 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote:

 Hello,

 I have a use case where I need to retrieve the top 2000 documents matching a
 query.
 What are the parameters (in query, solrconfig, schema) I should look at to
 improve this?

 I have 45M documents in 3node solrcloud 4.3.1 with 3 shards, with 30GB
 RAM, 8vCPU and 7GB JVM heap size.

 I have documentCache:
    <documentCache class="solr.LRUCache" size="1000000"
  initialSize="1000000" autowarmCount="0"/>

 allText is a copyField.

 This is the result I get:
 ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 
 http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
 

 Benchmarking x.amazonaws.com (be patient)
 Completed 100 requests
 Completed 200 requests
 Completed 300 requests
 Completed 400 requests
 Completed 500 requests
 Finished 500 requests


 Server Software:
 Server Hostname:x.amazonaws.com
 Server Port:8983

 Document Path:
 /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
 Document Length:1538537 bytes

 Concurrency Level:  10
 Time taken for tests:   35.999 seconds
 Complete requests:  500
 Failed requests:21
(Connect: 0, Receive: 0, Length: 21, Exceptions: 0)
 Write errors:   0
 Non-2xx responses:  2
 Total transferred:  766221660 bytes
 HTML transferred:   766191806 bytes
 Requests per second:13.89 [#/sec] (mean)
 Time per request:   719.981 [ms] (mean)
 Time per request:   71.998 [ms] (mean, across all concurrent requests)
 Transfer rate:  20785.65 [Kbytes/sec] received

 Connection Times (ms)
               min  mean[+/-sd] median   max
 Connect:        0    0   0.6      0       8
 Processing:     9  717 2339.6    199   12611
 Waiting:        9  635 2233.6    164   12580
 Total:          9  718 2339.6    199   12611

 Percentage of the requests served within a certain time (ms)
   50%    199
   66%    236
   75%    263
   80%    281
   90%    548
   95%    838
   98%  12475
   99%  12545
  100%  12611 (longest request)

 --
 Thanks,
 -Utkarsh




-- 
Thanks,
-Utkarsh