Re: Optimal size for queries?

2020-04-15 Thread Mark H. Wood
On Wed, Apr 15, 2020 at 10:09:59AM +0100, Colvin Cowie wrote:
> Hi, I can't answer the question as to what the optimal size of rows per
> request is. I would expect it to depend on the number of stored fields
> being marshaled, and their type, and your hardware.

It was a somewhat naive question, but I wasn't sure how to ask a
better one.  Having thought a bit more, I expect that the eventual
solution to my problem will include a number of different changes,
including larger pages, tuning several caches, providing a progress
indicator to the user, and (as you point out below) re-thinking how I
ask Solr for so many documents.

> But using start + rows is a *bad thing* for deep paging. You need to use
> cursorMark, which looks like it was added in 4.7 originally
> https://issues.apache.org/jira/browse/SOLR-5463
> There's a description on the newer reference guide
> https://lucene.apache.org/solr/guide/6_6/pagination-of-results.html#fetching-a-large-number-of-sorted-results-cursors
> and in the 4.10 PDF on page 305
> https://archive.apache.org/dist/lucene/solr/ref-guide/apache-solr-ref-guide-4.10.pdf
> 
> http://yonik.com/solr/paging-and-deep-paging/

Thank you for the links.  I think these will be very helpful.

-- 
Mark H. Wood
Lead Technology Analyst

University Library
Indiana University - Purdue University Indianapolis
755 W. Michigan Street
Indianapolis, IN 46202
317-274-0749
www.ulib.iupui.edu




Re: Optimal size for queries?

2020-04-15 Thread Colvin Cowie
Hi, I can't answer the question as to what the optimal size of rows per
request is. I would expect it to depend on the number of stored fields
being marshaled, and their type, and your hardware.

But using start + rows is a *bad thing* for deep paging. You need to use
cursorMark, which looks like it was added in 4.7 originally
https://issues.apache.org/jira/browse/SOLR-5463
There's a description on the newer reference guide
https://lucene.apache.org/solr/guide/6_6/pagination-of-results.html#fetching-a-large-number-of-sorted-results-cursors
and in the 4.10 PDF on page 305
https://archive.apache.org/dist/lucene/solr/ref-guide/apache-solr-ref-guide-4.10.pdf

http://yonik.com/solr/paging-and-deep-paging/
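
For anyone following those links, the cursor loop they describe looks
roughly like the sketch below. Several names here are assumptions and
not from this thread: `fetch_page` is a hypothetical callable that sends
the parameters to /select and returns the parsed response as a dict,
`make_fake_fetch` is only an in-memory stand-in so the loop can be run
without a Solr server, and `id` stands in for whatever the core's
uniqueKey field is called. What comes from Solr itself is the
`cursorMark` parameter starting at `*`, the `nextCursorMark` value in
each response, and the rule that the sort must include the uniqueKey
field as a tie-breaker.

```python
from typing import Any, Callable, Dict, Iterator, List

Params = Dict[str, Any]
Response = Dict[str, Any]


def iterate_with_cursor(
    fetch_page: Callable[[Params], Response],
    base_params: Params,
    rows: int = 1000,
) -> Iterator[Dict[str, Any]]:
    """Yield every matching document, one cursor page at a time."""
    params = dict(base_params)
    params["rows"] = rows
    params.pop("start", None)  # a cursor replaces the start offset
    cursor = "*"               # "*" asks Solr for the first page
    while True:
        params["cursorMark"] = cursor
        response = fetch_page(params)
        for doc in response["response"]["docs"]:
            yield doc
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:
            return             # an unchanged cursor means nothing is left
        cursor = next_cursor


def make_fake_fetch(docs: List[Dict[str, Any]]) -> Callable[[Params], Response]:
    """Hypothetical stand-in for the HTTP call; the cursor is a list offset."""
    def fetch(params: Params) -> Response:
        mark = params["cursorMark"]
        offset = 0 if mark == "*" else int(mark)
        page = docs[offset:offset + params["rows"]]
        return {
            "response": {"docs": page},
            "nextCursorMark": str(offset + len(page)),
        }
    return fetch


if __name__ == "__main__":
    sample = [{"id": i} for i in range(7)]
    query = {"q": "*:*", "sort": "time asc, id asc"}  # "id" is an assumed uniqueKey
    for doc in iterate_with_cursor(make_fake_fetch(sample), query, rows=3):
        print(doc)
```

Because the loop is a generator, the caller can update running totals
(or a progress indicator) as each page arrives instead of holding every
document in memory.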


On Fri, 10 Apr 2020 at 19:05, Mark H. Wood wrote:

> I need to pull a *lot* of records out of a core, to be statistically
> analyzed and the stats presented to the user, who is sitting at a
> browser waiting.  So far I haven't seen a way to calculate the stats
> I need in Solr itself.  It's difficult to know the size of the total
> result, so I'm running the query repeatedly and windowing the results
> with 'start' and 'rows'.  I just guessed that a window of 1000
> documents would be reasonable.  We currently have about 48GB in the
> core.
>
> The product uses Solr 4.10.  Yes, I know that's very old.
>
> What I got is that every three seconds or so I get another 1000
> documents, totalling around 500KB per response.  For a user request
> for a large range, this is taking way longer than the user's browser
> is willing to wait.  The single CPU on my test box is at 99%
> continuously, and Solr's memory use is around 90% of 8GB.  The test
> hardware is a VMWare guest on an 'Intel(R) Xeon(R) Gold 6150 CPU @
> 2.70GHz'.
>
> A sample query:
>
> 0:0:0:0:0:0:0:1 - - [10/Apr/2020:13:34:18 -0400] "GET
> /solr/statistics/select?q=*%3A*=1000=%2Btype%3A0+%2BbundleName%3AORIGINAL+%2Bstatistics_type%3Aview=%2BisBot%3Afalse=%2Btime%3A%5B2018-01-01T05%3A00%3A00Z+TO+2020-01-01T04%3A59%3A59Z%5D=time+asc=867000=javabin=2
> HTTP/1.1" 200 497475 "-"
> "Solr[org.apache.solr.client.solrj.impl.HttpSolrServer] 1.0"
>
> As you can see, my test was getting close to 1000 windows.  It's still
> going.  I don't know how far along that is.
>
> So I'm wondering:
>
> o  how can I do better than guessing that 1000 is a good window size?
>    How big a response is too big?
>
> o  what else should I be thinking about?
>
> o  given that my test on a full-sized copy of the live data has been
>    running for an hour and is still going, is it totally impractical
>    to expect that I can improve the process enough to give a response
>    to an ad-hoc query while-you-wait?
>
> --
> Mark H. Wood
> Lead Technology Analyst
>
> University Library
> Indiana University - Purdue University Indianapolis
> 755 W. Michigan Street
> Indianapolis, IN 46202
> 317-274-0749
> www.ulib.iupui.edu
>


Optimal size for queries?

2020-04-10 Thread Mark H. Wood
I need to pull a *lot* of records out of a core, to be statistically
analyzed and the stats presented to the user, who is sitting at a
browser waiting.  So far I haven't seen a way to calculate the stats
I need in Solr itself.  It's difficult to know the size of the total
result, so I'm running the query repeatedly and windowing the results
with 'start' and 'rows'.  I just guessed that a window of 1000
documents would be reasonable.  We currently have about 48GB in the
core.

The product uses Solr 4.10.  Yes, I know that's very old.

What I got is that every three seconds or so I get another 1000
documents, totalling around 500KB per response.  For a user request
for a large range, this is taking way longer than the user's browser
is willing to wait.  The single CPU on my test box is at 99%
continuously, and Solr's memory use is around 90% of 8GB.  The test
hardware is a VMWare guest on an 'Intel(R) Xeon(R) Gold 6150 CPU @
2.70GHz'.

A sample query:

0:0:0:0:0:0:0:1 - - [10/Apr/2020:13:34:18 -0400] "GET 
/solr/statistics/select?q=*%3A*=1000=%2Btype%3A0+%2BbundleName%3AORIGINAL+%2Bstatistics_type%3Aview=%2BisBot%3Afalse=%2Btime%3A%5B2018-01-01T05%3A00%3A00Z+TO+2020-01-01T04%3A59%3A59Z%5D=time+asc=867000=javabin=2
 HTTP/1.1" 200 497475 "-" 
"Solr[org.apache.solr.client.solrj.impl.HttpSolrServer] 1.0"

As you can see, my test was getting close to 1000 windows.  It's still
going.  I don't know how far along that is.
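
A rough back-of-the-envelope sketch of what that windowing costs. It
assumes, as the Solr reference guide describes for deep paging, that a
request with start=N has to collect and sort the first N+rows matches
before it can return the last `rows` of them. The function name is made
up for illustration; the figures are the ones from the sample query
(rows=1000, start=867000).

```python
def docs_ranked_with_start_rows(docs_fetched: int, rows: int) -> int:
    """Total documents Solr has to rank when paging with start/rows.

    Assumption: the request for window k (start = k * rows) ranks the
    first start + rows matches, so each window costs more than the last.
    """
    windows = -(-docs_fetched // rows)  # ceiling division
    return sum(min((k + 1) * rows, docs_fetched) for k in range(windows))


if __name__ == "__main__":
    fetched = 867_000 + 1_000  # last start seen in the log, plus one window
    ranked = docs_ranked_with_start_rows(fetched, 1_000)
    print(f"documents returned: {fetched}")
    print(f"documents ranked:   {ranked}")
    print(f"overhead factor:    {ranked / fetched:.1f}")
```

Under that assumption, the 868 windows fetched so far return 868,000
documents but make Solr rank about 377 million, roughly 434 times as
many, and the ratio keeps growing with every further window.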

So I'm wondering:

o  how can I do better than guessing that 1000 is a good window size?
   How big a response is too big?

o  what else should I be thinking about?

o  given that my test on a full-sized copy of the live data has been
   running for an hour and is still going, is it totally impractical
   to expect that I can improve the process enough to give a response
   to an ad-hoc query while-you-wait?

-- 
Mark H. Wood
Lead Technology Analyst

University Library
Indiana University - Purdue University Indianapolis
755 W. Michigan Street
Indianapolis, IN 46202
317-274-0749
www.ulib.iupui.edu

