Thanks Erick/Jagdish.

Just to give some background on my queries.

1. All my queries are unique. A query can be "ipod" or "ipod 8gb" (but
these are two distinct queries). There are about 1.2M queries in total.
So I assume that setting a high queryResultCache size, queryResultWindowSize,
and queryResultMaxDocsCached won't help.

2. I have these cache settings:
<documentCache class="solr.LRUCache"
                   size="10000"
                   initialSize="10000"
                   autowarmCount="0"
                   cleanupThread="true"/>
//My understanding is that documentCache will help me the most, because
Solr will cache the documents it retrieves.
//Stats for documentCache: http://apaste.info/hknh

<queryResultCache class="solr.LRUCache"
                     size="512"
                     initialSize="512"
                     autowarmCount="0"
                     cleanupThread="true"/>
//Left at the default, since my queries are unique.

<filterCache class="solr.FastLRUCache"
                 size="512"
                 initialSize="512"
                 autowarmCount="0"/>
//Not sure how I can use filterCache, so I am keeping it at the default.

<enableLazyFieldLoading>true</enableLazyFieldLoading>
<queryResultWindowSize>100</queryResultWindowSize>
<queryResultMaxDocsCached>100</queryResultMaxDocsCached>
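On filterCache: as I understand it, it only comes into play when a request
uses fq filter parameters, so one option would be to pull any constraint that
repeats across my unique keyword queries out of q and into fq. A rough sketch
of the request I mean (the inStock field is a hypothetical example, not from
my real schema):

```python
# Sketch: move a constraint that repeats across otherwise-unique keyword
# queries out of q and into fq, so Solr can cache its DocSet in filterCache.
# The inStock field is a hypothetical example, not from my real schema.
from urllib.parse import urlencode

def build_params(keyword, use_fq=True):
    params = {"q": "allText:" + keyword, "rows": 2000, "wt": "json"}
    if use_fq:
        # fq clauses are cached independently of q, so this filter is
        # reused even though every q string is unique.
        params["fq"] = "inStock:true"
    else:
        # Folding the constraint into q gives filterCache nothing to reuse.
        params["q"] += " AND inStock:true"
    return urlencode(params)

print(build_params("ipad"))
# q=allText%3Aipad&rows=2000&wt=json&fq=inStock%3Atrue
```

Of course, if every constraint really is baked into the unique q strings,
then filterCache won't buy anything and the default size is fine.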


I think the question can also be framed as: how can I optimize Solr
response time over a 50M-product catalog for unique queries that each
retrieve 2000 documents in one go?
I looked at writing a custom Solr search component, but writing a "proxy"
around Solr seemed easier, so I went ahead with that approach.
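For reference, the client-side paging I tried looks roughly like the sketch
below (the host is the placeholder from my ab runs, and the fl field list is
just an illustration, not what the proxy actually requests). Trimming the
returned fields with fl should also shrink the ~1.5MB responses if I don't
need every stored field:

```python
# Sketch of the client-side paging I tried: fetch the top 2000 documents
# as four 500-row pages via start/rows. Host and field list are placeholders.
from urllib.parse import urlencode

BASE = "http://x.amazonaws.com:8983/solr/prodinfo/select"

def page_offsets(total=2000, page_size=500):
    """(start, rows) pairs covering the first `total` results."""
    return [(start, page_size) for start in range(0, total, page_size)]

def page_urls(keyword, total=2000, page_size=500):
    urls = []
    for start, rows in page_offsets(total, page_size):
        qs = urlencode({
            "q": "allText:" + keyword,
            "start": start,
            "rows": rows,
            # Only fetch the fields the relevancy engine needs,
            # instead of the full stored document.
            "fl": "id,score",
            "wt": "json",
        })
        urls.append(BASE + "?" + qs)
    return urls

for url in page_urls("ipad"):
    print(url)
```

One caveat I take from Erick's explanation: with start/rows paging each shard
still has to assemble start+rows candidates for every page, so deep pages
don't get cheaper, which may be why I saw no improvement over a single
rows=2000 request.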


Thanks,
-Utkarsh




On Sun, Jun 30, 2013 at 6:54 PM, Jagdish Nomula <jagd...@simplyhired.com> wrote:

> solrconfig.xml has entries you can tweak for your use case. One of
> them is queryResultWindowSize. You can try a value of 2000 and see
> if it helps improve performance. Please make sure you have enough
> memory allocated for the queryResultCache.
> A combination of sharding and distribution of workload(requesting
> 2000/number of shards) with an aggregator would be a good way to maximize
> performance.
>
> Thanks,
>
> Jagdish
>
>
> On Sun, Jun 30, 2013 at 6:48 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
> > 50M documents, depending on a bunch of things,
> > may not be unreasonable for a single node, only
> > testing will tell.
> >
> > But the question I have is whether you should be
> > using standard Solr queries for this or building a custom
> > component that goes at the base Lucene index
> > and "does the right thing". Or even re-indexing your
> > entire corpus periodically to add this kind of data.
> >
> > FWIW,
> > Erick
> >
> >
> > On Sun, Jun 30, 2013 at 2:00 PM, Utkarsh Sengar <utkarsh2...@gmail.com>
> > wrote:
> >
> > > Thanks Erick/Peter.
> > >
> > > This is an offline process, used by a relevancy engine implemented
> > > around Solr. The engine computes boost scores for related keywords
> > > based on clickstream data.
> > > i.e.: say the clickstream has: ipad=upc1,upc2,upc3
> > > I query Solr with keyword "ipad" (to get 2000 documents) and then
> > > make 3 individual queries for upc1, upc2, upc3 (which are fast).
> > > The data is then used to compute related keywords to "ipad" with their
> > > boost values.
> > >
> > > So, I cannot really replace that, since I need full text search over my
> > > dataset to retrieve top 2000 documents.
> > >
> > > I tried paging: I retrieve 500 solr documents 4 times (0-500,
> > 500-1000...),
> > > but don't see any improvements.
> > >
> > >
> > > Some questions:
> > > 1. Maybe increasing the JVM heap size would help?
> > > This is what I see in the dashboard:
> > > Physical Memory 76.2%
> > > Swap Space NaN% (don't have any swap space, running on AWS EBS)
> > > File Descriptor Count 4.7%
> > > JVM-Memory 73.8%
> > >
> > > Screenshot: http://i.imgur.com/aegKzP6.png
> > >
> > > 2. Will reducing the shards from 3 to 1 improve performance? (Maybe
> > > increase the RAM from 30 to 60GB.) The problem I will face in that
> > > case will be fitting 50M documents on 1 machine.
> > >
> > > Thanks,
> > > -Utkarsh
> > >
> > >
> > > On Sat, Jun 29, 2013 at 3:58 PM, Peter Sturge <peter.stu...@gmail.com>
> > > wrote:
> > >
> > > > Hello Utkarsh,
> > > > This may or may not be relevant for your use-case, but the way we
> > > > deal with this scenario is to retrieve the top N documents 5, 10, 20,
> > > > or 100 at a time (user selectable). We can then page the results,
> > > > changing the start parameter to return the next set. This allows us
> > > > to 'retrieve' millions of documents - we just do it at the user's
> > > > leisure, rather than make them wait for the whole lot in one go.
> > > > This works well because users very rarely want to see ALL 2000 (or
> > > > whatever number) documents at one time - it's simply too much to
> > > > take in at once.
> > > > If your use-case involves an automated or offline procedure (e.g.
> > > > running a report or some data-mining op), then presumably it doesn't
> > > > matter so much if it takes a bit longer (as long as it returns in
> > > > some reasonable time).
> > > > Have you looked at doing paging on the client side? This will hugely
> > > > speed up your search time.
> > > > HTH
> > > > Peter
> > > >
> > > >
> > > >
> > > > On Sat, Jun 29, 2013 at 6:17 PM, Erick Erickson <erickerick...@gmail.com>
> > > > wrote:
> > > >
> > > > > Well, depending on how many docs get served
> > > > > from the cache the time will vary. But this is
> > > > > just ugly, if you can avoid this use-case it would
> > > > > be a Good Thing.
> > > > >
> > > > > Problem here is that each and every shard must
> > > > > assemble the list of 2,000 documents (just ID and
> > > > > sort criteria, usually score).
> > > > >
> > > > > Then the node serving the original request merges
> > > > > the sub-lists to pick the top 2,000. Then the node
> > > > > sends another request to each shard to get
> > > > > the full document. Then the node merges this
> > > > > into the full list to return to the user.
> > > > >
> > > > > Solr really isn't built for this use-case, is it actually
> > > > > a compelling situation?
> > > > >
> > > > > And having your document cache set at 1M is kinda
> > > > > high if you have very big documents.
> > > > >
> > > > > FWIW,
> > > > > Erick
> > > > >
> > > > >
> > > > > On Fri, Jun 28, 2013 at 8:44 PM, Utkarsh Sengar <utkarsh2...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Also, I don't see a consistent response time from Solr. I ran ab
> > > > > > again, and I get this:
> > > > > >
> > > > > > ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 "http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json"
> > > > > >
> > > > > >
> > > > > > Benchmarking x.amazonaws.com (be patient)
> > > > > > Completed 100 requests
> > > > > > Completed 200 requests
> > > > > > Completed 300 requests
> > > > > > Completed 400 requests
> > > > > > Completed 500 requests
> > > > > > Finished 500 requests
> > > > > >
> > > > > >
> > > > > > Server Software:
> > > > > > Server Hostname:       x.amazonaws.com
> > > > > > Server Port:            8983
> > > > > >
> > > > > > Document Path:          /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
> > > > > > Document Length:        1538537 bytes
> > > > > >
> > > > > > Concurrency Level:      10
> > > > > > Time taken for tests:   10.858 seconds
> > > > > > Complete requests:      500
> > > > > > Failed requests:        8
> > > > > >    (Connect: 0, Receive: 0, Length: 8, Exceptions: 0)
> > > > > > Write errors:           0
> > > > > > Total transferred:      769297992 bytes
> > > > > > HTML transferred:       769268492 bytes
> > > > > > Requests per second:    46.05 [#/sec] (mean)
> > > > > > Time per request:       217.167 [ms] (mean)
> > > > > > Time per request:       21.717 [ms] (mean, across all concurrent
> > > > > > requests)
> > > > > > Transfer rate:          69187.90 [Kbytes/sec] received
> > > > > >
> > > > > > Connection Times (ms)
> > > > > >               min  mean[+/-sd] median   max
> > > > > > Connect:        0    0   0.3      0       2
> > > > > > Processing:   110  215  72.0    190     497
> > > > > > Waiting:       91  180  70.5    152     473
> > > > > > Total:        112  216  72.0    191     497
> > > > > >
> > > > > > Percentage of the requests served within a certain time (ms)
> > > > > >   50%    191
> > > > > >   66%    225
> > > > > >   75%    252
> > > > > >   80%    272
> > > > > >   90%    319
> > > > > >   95%    364
> > > > > >   98%    420
> > > > > >   99%    453
> > > > > >  100%    497 (longest request)
> > > > > >
> > > > > >
> > > > > > Sometimes it takes a lot of time, sometimes its pretty quick.
> > > > > >
> > > > > > Thanks,
> > > > > > -Utkarsh
> > > > > >
> > > > > >
> > > > > > On Fri, Jun 28, 2013 at 5:39 PM, Utkarsh Sengar <utkarsh2...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > I have a use case where I need to retrieve the top 2000
> > > > > > > documents matching a query.
> > > > > > > What are the parameters (in the query, solrconfig, schema) I
> > > > > > > should look at to improve this?
> > > > > > >
> > > > > > > I have 45M documents in a 3-node SolrCloud 4.3.1 cluster with
> > > > > > > 3 shards, 30GB RAM, 8 vCPUs and a 7GB JVM heap size.
> > > > > > >
> > > > > > > I have documentCache:
> > > > > > >   <documentCache class="solr.LRUCache" size="1000000"
> > > > > > >                  initialSize="1000000" autowarmCount="0"/>
> > > > > > >
> > > > > > > allText is a copyField.
> > > > > > >
> > > > > > > This is the result I get:
> > > > > > > ubuntu@ip-10-149-6-68:~$ ab -c 10 -n 500 "http://x.amazonaws.com:8983/solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json"
> > > > > > >
> > > > > > > Benchmarking x.amazonaws.com (be patient)
> > > > > > > Completed 100 requests
> > > > > > > Completed 200 requests
> > > > > > > Completed 300 requests
> > > > > > > Completed 400 requests
> > > > > > > Completed 500 requests
> > > > > > > Finished 500 requests
> > > > > > >
> > > > > > >
> > > > > > > Server Software:
> > > > > > > Server Hostname:        x.amazonaws.com
> > > > > > > Server Port:            8983
> > > > > > >
> > > > > > > Document Path:          /solr/prodinfo/select?q=allText:huggies%20diapers%20size%201&rows=2000&wt=json
> > > > > > > Document Length:        1538537 bytes
> > > > > > >
> > > > > > > Concurrency Level:      10
> > > > > > > Time taken for tests:   35.999 seconds
> > > > > > > Complete requests:      500
> > > > > > > Failed requests:        21
> > > > > > >    (Connect: 0, Receive: 0, Length: 21, Exceptions: 0)
> > > > > > > Write errors:           0
> > > > > > > Non-2xx responses:      2
> > > > > > > Total transferred:      766221660 bytes
> > > > > > > HTML transferred:       766191806 bytes
> > > > > > > Requests per second:    13.89 [#/sec] (mean)
> > > > > > > Time per request:       719.981 [ms] (mean)
> > > > > > > Time per request:       71.998 [ms] (mean, across all
> > > > > > > concurrent requests)
> > > > > > > Transfer rate:          20785.65 [Kbytes/sec] received
> > > > > > >
> > > > > > > Connection Times (ms)
> > > > > > >               min  mean[+/-sd] median   max
> > > > > > > Connect:        0    0   0.6      0       8
> > > > > > > Processing:     9  717 2339.6    199   12611
> > > > > > > Waiting:        9  635 2233.6    164   12580
> > > > > > > Total:          9  718 2339.6    199   12611
> > > > > > >
> > > > > > > Percentage of the requests served within a certain time (ms)
> > > > > > >   50%    199
> > > > > > >   66%    236
> > > > > > >   75%    263
> > > > > > >   80%    281
> > > > > > >   90%    548
> > > > > > >   95%    838
> > > > > > >   98%  12475
> > > > > > >   99%  12545
> > > > > > >  100%  12611 (longest request)
> > > > > > >
> > > > > > > --
> > > > > > > Thanks,
> > > > > > > -Utkarsh
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Thanks,
> > > > > > -Utkarsh
> > > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Thanks,
> > > -Utkarsh
> > >
> >
>
>
>
> --
> *Jagdish Nomula*
> Sr. Manager Search
> Simply Hired, Inc.
> 370 San Aleso Ave., Ste 200
> Sunnyvale, CA 94085
>
> office - 408.400.4700
> cell - 408.431.2916
> email - jagd...@simplyhired.com
>
> www.simplyhired.com
>



-- 
Thanks,
-Utkarsh
