I need to pull a *lot* of records out of a core so that they can be
statistically analyzed and the statistics presented to a user who is
sitting at a browser, waiting. So far I haven't found a way to
calculate the statistics I need in Solr itself. It's difficult to know
the size of the total result in advance, so I'm running the query
repeatedly and windowing the results with 'start' and 'rows'. I just
guessed that a window of 1000 documents would be reasonable. We
currently have about 48GB in the core.
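
For concreteness, the windowing loop amounts to something like the sketch below. It is written in Python for brevity (the real client is SolrJ), and `fetch_page` is a hypothetical stand-in for the actual Solr request, not a real API. Because the total size isn't known up front, the loop simply stops at the first window that comes back short.

```python
from typing import Callable, Iterator, List


def fetch_all(fetch_page: Callable[[int, int], List[dict]],
              rows: int = 1000) -> Iterator[dict]:
    """Yield every document by requesting successive start/rows windows."""
    start = 0
    while True:
        # Hypothetical callable: one request per window (start, rows) -> docs.
        page = fetch_page(start, rows)
        yield from page
        # A short (or empty) window means the result set is exhausted.
        if len(page) < rows:
            break
        start += rows


# Stand-in for the core: 2500 fake documents served by list slicing.
fake_core = [{"id": i} for i in range(2500)]
docs = list(fetch_all(lambda start, rows: fake_core[start:start + rows]))
print(len(docs))
```

Every pass re-runs the same query with a larger 'start', so the number of requests grows linearly with the size of the result.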
The product uses Solr 4.10. Yes, I know that's very old.
What I see is that every three seconds or so I get another 1000
documents, totalling around 500KB per response. For a user request
covering a large range, this takes far longer than the user's browser
is willing to wait. The single CPU on my test box is at 99%
continuously, and Solr's memory use is around 90% of 8GB. The test
hardware is a VMware guest on an 'Intel(R) Xeon(R) Gold 6150 CPU @
2.70GHz'.
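
As a back-of-the-envelope check on those figures, here is a small sketch that turns them into rates. It assumes the observed rate (1000 documents and roughly 500KB every three seconds) stays constant for the whole run, which may well not hold; `minutes_to_fetch` and the one-million-document figure are purely illustrative.

```python
import math

DOCS_PER_WINDOW = 1000
SECONDS_PER_WINDOW = 3.0      # "every three seconds or so"
BYTES_PER_WINDOW = 500_000    # "around 500KB per response"

docs_per_second = DOCS_PER_WINDOW / SECONDS_PER_WINDOW    # about 333 docs/s
bytes_per_doc = BYTES_PER_WINDOW / DOCS_PER_WINDOW        # about 500 bytes/doc
bytes_per_second = BYTES_PER_WINDOW / SECONDS_PER_WINDOW  # about 167 KB/s


def minutes_to_fetch(total_docs: int) -> float:
    """Estimated wall-clock minutes to window through total_docs documents."""
    windows = math.ceil(total_docs / DOCS_PER_WINDOW)
    return windows * SECONDS_PER_WINDOW / 60


# Illustrative only: a result of one million documents at this rate.
print(minutes_to_fetch(1_000_000))  # 50.0
```

At roughly 167 KB/s the network is unlikely to be the limit; the time is going somewhere on the Solr side, which fits the pegged CPU.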
A sample query:
0:0:0:0:0:0:0:1 - - [10/Apr/2020:13:34:18 -0400] "GET
/solr/statistics/select?q=*%3A*&rows=1000&fq=%2Btype%3A0+%2BbundleName%3AORIGINAL+%2Bstatistics_type%3Aview&fq=%2BisBot%3Afalse&fq=%2Btime%3A%5B2018-01-01T05%3A00%3A00Z+TO+2020-01-01T04%3A59%3A59Z%5D&sort=time+asc&start=867000&wt=javabin&version=2
HTTP/1.1" 200 497475 "-"
"Solr[org.apache.solr.client.solrj.impl.HttpSolrServer] 1.0"
As you can see from the offset of 867000, my test was getting close
to 1000 windows. It's still going. I don't know how far through the
full result set that is.
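
Decoded for readability, the filter values in that sample request are shown by the sketch below. It only URL-decodes the three encoded strings copied from the log line, using Python's standard `urllib.parse.unquote_plus`; the variable names are mine.

```python
from urllib.parse import unquote_plus

# The three URL-encoded filter values, copied from the logged request.
encoded_filters = [
    "%2Btype%3A0+%2BbundleName%3AORIGINAL+%2Bstatistics_type%3Aview",
    "%2BisBot%3Afalse",
    "%2Btime%3A%5B2018-01-01T05%3A00%3A00Z+TO+2020-01-01T04%3A59%3A59Z%5D",
]

# unquote_plus turns '+' into a space and then resolves the %XX escapes.
decoded_filters = [unquote_plus(f) for f in encoded_filters]

for f in decoded_filters:
    print(f)
# +type:0 +bundleName:ORIGINAL +statistics_type:view
# +isBot:false
# +time:[2018-01-01T05:00:00Z TO 2020-01-01T04:59:59Z]
```

So the request covers two years of non-bot "view" events on ORIGINAL bundles, sorted by ascending time.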
So I'm wondering:
o how can I do better than guessing that 1000 is a good window size?
How big a response is too big?
o what else should I be thinking about?
o given that my test on a full-sized copy of the live data has been
running for an hour and is still going, is it totally impractical
to expect that I can improve the process enough to give a response
to an ad-hoc query while-you-wait?
--
Mark H. Wood
Lead Technology Analyst
University Library
Indiana University - Purdue University Indianapolis
755 W. Michigan Street
Indianapolis, IN 46202
317-274-0749
www.ulib.iupui.edu