Thanks Erick/Peter.

This is an offline process, used by a relevancy engine implemented around
Solr. The engine computes boost scores for related keywords based on
clickstream data.
e.g.: say the clickstream has: ipad=upc1,upc2,upc3
I query Solr with the keyword "ipad" (to get 2000 documents) and then make 3
individual queries for upc1, upc2, upc3 (which are fast).
The data is then used to compute keywords related to "ipad" along with their
boost values.

So I cannot really replace that, since I need full-text search over my
dataset to retrieve the top 2000 documents.
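
In SolrJ terms, the per-keyword flow is roughly the following (a simplified
sketch: the core URL, the "upc" field name and the UPC values are stand-ins
for illustration, not the exact production setup):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocumentList;

public class RelatedKeywords {
    public static void main(String[] args) throws SolrServerException {
        HttpSolrServer solr =
            new HttpSolrServer("http://localhost:8983/solr/prodinfo");

        // Phase 1: full-text query for the keyword, fetch the top 2000 docs
        SolrQuery keywordQuery = new SolrQuery("allText:ipad");
        keywordQuery.setRows(2000);
        SolrDocumentList keywordDocs = solr.query(keywordQuery).getResults();

        // Phase 2: one cheap lookup per UPC from the clickstream entry
        for (String upc : new String[] {"upc1", "upc2", "upc3"}) {
            SolrQuery upcQuery = new SolrQuery("upc:" + upc);
            upcQuery.setRows(1);
            SolrDocumentList upcDoc = solr.query(upcQuery).getResults();
            // ...compare upcDoc against keywordDocs to compute the boost
            // values for keywords related to "ipad"
        }
    }
}

Phase 1 (the 2000-document fetch) is the expensive part; the per-UPC
lookups in phase 2 are fast.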

I tried paging: I retrieve 500 Solr documents 4 times (0-500, 500-1000, ...),
but I don't see any improvement.
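
Concretely, the paging loop looks like this (same sketch conventions and
imports as above):

// Page through the top 2000 in windows of 500 using start/rows
HttpSolrServer solr =
    new HttpSolrServer("http://localhost:8983/solr/prodinfo");
SolrQuery q = new SolrQuery("allText:ipad");
q.setRows(500);
for (int start = 0; start < 2000; start += 500) {
    q.setStart(start);  // windows 0-500, 500-1000, 1000-1500, 1500-2000
    SolrDocumentList page = solr.query(q).getResults();
    // ...accumulate this window of 500 docs client-side
}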


Some questions:
1. Would increasing the JVM heap size help?
This is what I see in the dashboard:
Physical Memory 76.2%
Swap Space NaN% (I don't have any swap space; running on AWS EBS)
File Descriptor Count 4.7%
JVM-Memory 73.8%

Screenshot: http://i.imgur.com/aegKzP6.png

2. Will reducing the shards from 3 to 1 improve performance (maybe while
increasing the RAM from 30 to 60GB)? The problem I will face in that case is
fitting 50M documents on 1 machine.

Thanks,
-Utkarsh


On Sat, Jun 29, 2013 at 3:58 PM, Peter Sturge <peter.stu...@gmail.com> wrote:

> Hello Utkarsh,
> This may or may not be relevant for your use-case, but the way we deal with
> this scenario is to retrieve the top N documents 5, 10, 20 or 100 at a time
> (user selectable). We can then page the results, changing the start
> parameter to return the next set. This allows us to 'retrieve' millions of
> documents - we just do it at the user's leisure, rather than make them wait
> for the whole lot in one go.
> This works well because users very rarely want to see ALL 2000 (or whatever
> number) documents at once - it's simply too much to take in at one time.
> If your use-case involves an automated or offline procedure (e.g. running a
> report or some data-mining op), then presumably it doesn't matter so much
> if it takes a bit longer (as long as it returns in some reasonable time).
> Have you looked at doing paging on the client side? This will hugely
> speed up your search time.
> HTH
> Peter
>
>
>
> On Sat, Jun 29, 2013 at 6:17 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
> > Well, depending on how many docs get served
> > from the cache, the time will vary. But this is
> > just ugly; if you can avoid this use-case, it would
> > be a Good Thing.
> >
> > The problem here is that each and every shard must
> > assemble the list of 2,000 documents (just ID and
> > sort criteria, usually score).
> >
> > Then the node serving the original request merges
> > the sub-lists to pick the top 2,000. It then sends
> > another request to each shard to get the full
> > documents, and finally merges everything into the
> > full list returned to the user.
> >
> > Solr really isn't built for this use-case. Is it actually
> > a compelling situation?
> >
> > And having your document cache set at 1M is kinda
> > high if you have very big documents.
> >
> > FWIW,
> > Erick
> >
> >
> > On Fri, Jun 28, 2013 at 8:44 PM, Utkarsh Sengar <utkarsh2...@gmail.com>
> > wrote:
> >
> > > Also, I don't see a consistent response time from Solr. I ran ab again
> > > and I get this:
> > >
> > > ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 "
> > > http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
> > > "
> > >
> > >
> > > Benchmarking x.amazonaws.com (be patient)
> > > Completed 100 requests
> > > Completed 200 requests
> > > Completed 300 requests
> > > Completed 400 requests
> > > Completed 500 requests
> > > Finished 500 requests
> > >
> > >
> > > Server Software:
> > > Server Hostname:       x.amazonaws.com
> > > Server Port:            8983
> > >
> > > Document Path:          /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
> > > Document Length:        1538537 bytes
> > >
> > > Concurrency Level:      10
> > > Time taken for tests:   10.858 seconds
> > > Complete requests:      500
> > > Failed requests:        8
> > >    (Connect: 0, Receive: 0, Length: 8, Exceptions: 0)
> > > Write errors:           0
> > > Total transferred:      769297992 bytes
> > > HTML transferred:       769268492 bytes
> > > Requests per second:    46.05 [#/sec] (mean)
> > > Time per request:       217.167 [ms] (mean)
> > > Time per request:       21.717 [ms] (mean, across all concurrent requests)
> > > Transfer rate:          69187.90 [Kbytes/sec] received
> > >
> > > Connection Times (ms)
> > >               min  mean[+/-sd] median   max
> > > Connect:        0    0   0.3      0       2
> > > Processing:   110  215  72.0    190     497
> > > Waiting:       91  180  70.5    152     473
> > > Total:        112  216  72.0    191     497
> > >
> > > Percentage of the requests served within a certain time (ms)
> > >   50%    191
> > >   66%    225
> > >   75%    252
> > >   80%    272
> > >   90%    319
> > >   95%    364
> > >   98%    420
> > >   99%    453
> > >  100%    497 (longest request)
> > >
> > >
> > > Sometimes it takes a lot of time, sometimes it's pretty quick.
> > >
> > > Thanks,
> > > -Utkarsh
> > >
> > >
> > > On Fri, Jun 28, 2013 at 5:39 PM, Utkarsh Sengar <utkarsh2...@gmail.com>
> > > wrote:
> > >
> > > > Hello,
> > > >
> > > > I have a use case where I need to retrieve the top 2000 documents
> > > > matching a query.
> > > > What are the parameters (in the query, solrconfig, or schema) I should
> > > > look at to improve this?
> > > >
> > > > I have 45M documents in a 3-node SolrCloud 4.3.1 cluster with 3 shards,
> > > > 30GB RAM, 8 vCPUs, and a 7GB JVM heap.
> > > >
> > > > I have documentCache:
> > > >   <documentCache class="solr.LRUCache"  size="1000000"
> > > > initialSize="1000000"   autowarmCount="0"/>
> > > >
> > > > allText is a copyField.
> > > >
> > > > This is the result I get:
> > > > ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 "
> > > > http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
> > > > "
> > > >
> > > > Benchmarking x.amazonaws.com (be patient)
> > > > Completed 100 requests
> > > > Completed 200 requests
> > > > Completed 300 requests
> > > > Completed 400 requests
> > > > Completed 500 requests
> > > > Finished 500 requests
> > > >
> > > >
> > > > Server Software:
> > > > Server Hostname:        x.amazonaws.com
> > > > Server Port:            8983
> > > >
> > > > Document Path:          /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
> > > > Document Length:        1538537 bytes
> > > >
> > > > Concurrency Level:      10
> > > > Time taken for tests:   35.999 seconds
> > > > Complete requests:      500
> > > > Failed requests:        21
> > > >    (Connect: 0, Receive: 0, Length: 21, Exceptions: 0)
> > > > Write errors:           0
> > > > Non-2xx responses:      2
> > > > Total transferred:      766221660 bytes
> > > > HTML transferred:       766191806 bytes
> > > > Requests per second:    13.89 [#/sec] (mean)
> > > > Time per request:       719.981 [ms] (mean)
> > > > Time per request:       71.998 [ms] (mean, across all concurrent requests)
> > > > Transfer rate:          20785.65 [Kbytes/sec] received
> > > >
> > > > Connection Times (ms)
> > > >               min  mean[+/-sd] median   max
> > > > Connect:        0    0   0.6      0       8
> > > > Processing:     9  717 2339.6    199   12611
> > > > Waiting:        9  635 2233.6    164   12580
> > > > Total:          9  718 2339.6    199   12611
> > > >
> > > > Percentage of the requests served within a certain time (ms)
> > > >   50%    199
> > > >   66%    236
> > > >   75%    263
> > > >   80%    281
> > > >   90%    548
> > > >   95%    838
> > > >   98%  12475
> > > >   99%  12545
> > > >  100%  12611 (longest request)
> > > >
> > > > --
> > > > Thanks,
> > > > -Utkarsh
> > > >
> > >
> > >
> > >
> > > --
> > > Thanks,
> > > -Utkarsh
> > >
> >
>



-- 
Thanks,
-Utkarsh
