On 6/22/2018 6:50 AM, Arturas Mazeika wrote:
I grabbed the 2.7.1 version of Solr, created a 4 core setup with
replication factor 2 on Windows using [1], restarted the setup with
2GB for each node [2], inserted the HTML docs from the German Wikipedia
archive [3], and obtained the top 10 terms for the whole collection vs one
specific shard:
http://localhost:9999/solr/de_wiki_all/terms?terms.limit=10&terms.fl=text&wt=json
{
"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":5287},
"terms":{
"text":[
"8",670564,
"application",670564,
"articles",670564,
"charset",670564,
"de",670564,
"f",670564,
"utf",670564,
"wiki",670564,
"xhtml",670564,
"xml",670564]}}
http://localhost:9999/solr/de_wiki_all/terms?terms.limit=10&terms.fl=text&wt=json&shards=localhost:9999/solr/de_wiki_all_shard1_replica_n1&shards.qt=de_wiki_all_shard1_replica_n1
{
"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":20274},
"terms":{
"text":{
"8":671396,
"application":671396,
"articles":671396,
"charset":671396,
"de":671396,
"f":671396,
"utf":671396,
"wiki":671396,
"xhtml":671396,
"xml":671396}}}
The value of 'shards.qt' should be /terms, not the name of a core.
Here's something you might want to try instead for the second query, so
you won't need shards.qt at all:
http://localhost:9999/solr/de_wiki_all_shard1_replica_n1/terms?terms.limit=10&terms.fl=text&wt=json&distrib=false
You might actually want to add shards.qt=/terms to the first query, or
even to the definition of the /terms handler in solrconfig.xml so that
all distributed queries are sent to the same handler instead of going to
/select.
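As a rough illustration of the two requests suggested above, here is a small Python sketch that only builds the URLs. The host, port, collection, and core names are copied from the URLs in this thread; the helper name terms_url is made up for the example, and nothing here contacts Solr.

```python
from urllib.parse import urlencode

# Host and port as they appear in the URLs quoted in this thread.
BASE = "http://localhost:9999/solr"


def terms_url(target, extra=None):
    """Build a /terms request URL for a collection or a single core.

    `target` is a collection name or a core name; `extra` holds any
    additional request parameters. (Hypothetical helper, not a Solr API.)
    """
    params = {"terms.limit": 10, "terms.fl": "text", "wt": "json"}
    params.update(extra or {})
    # safe="/" keeps shards.qt=/terms readable instead of encoding it as %2F
    return f"{BASE}/{target}/terms?{urlencode(params, safe='/')}"


# Whole collection, with the per-shard sub-requests sent to /terms
whole_collection = terms_url("de_wiki_all", {"shards.qt": "/terms"})

# One core only, with distributed search switched off
single_core = terms_url("de_wiki_all_shard1_replica_n1", {"distrib": "false"})

print(whole_collection)
print(single_core)
```

Comparing the output of these two requests should give a cleaner picture than the original pair, because neither one relies on a handler name that doesn't exist.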
reveals:
(1) querying one shard takes 20 secs vs 5 secs for the whole index
That is strange. With the shards.qt parameter set to a core name, I'm
surprised you got anything at all on the second query, but maybe when it
couldn't find a handler with that name, it just defaulted to /select
like it would if you didn't include the parameter. I wonder if having
an invalid handler name contributed to the slow response.
(2) the counts for one shard are higher than for the whole index
If you're not changing the index between the requests, and it doesn't
sound like you are, I have no idea why that might happen.
(3) the f: hard drive is a Samsung SSD 850 EVO 4TB (CrystalDiskMark shows
~500MB/s seq and ~300MB/s random read/writes), CPU: i7-6400 @3.4GHz. Querying
for 20 secs shows that the java process is pushing neither the CPU nor
the SSD to its limits. What is the bottleneck in this computation?
If the amount of memory in the system (NOT talking about heap size here)
is not sufficient to effectively cache the index, then Solr must
actually hit the disk to satisfy a query. Even an SSD is not as fast as
memory. You haven't indicated how much disk space is being consumed by
the eight index cores or how much total memory the system has. A little
more than 8GB of the system's memory is being taken up by the four Solr
processes. Because you've asked for two replicas, there are two
complete copies of the index on the system, and both copies will count
in the total amount of resources that are required.
If there *is* sufficient memory for effective index caching, then the
disk will barely see any usage during queries, because Solr will get
most of the data it needs from the OS disk cache (system memory). This
will also reduce the impact on the CPU, because it will not be waiting
for I/O.
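The memory reasoning above comes down to simple arithmetic, sketched here in Python. The function name is made up, and the total RAM and index size in the example call are placeholders, since neither was given in the original message. Only the four nodes at 2GB heap each come from the thread. The sketch also ignores memory used by the OS itself and by other programs.

```python
def page_cache_headroom_gb(total_ram_gb, heap_gb_per_node, nodes, index_gb_on_disk):
    """Estimate how much RAM is left for the OS disk cache after the index.

    index_gb_on_disk must cover every core on the machine, so with two
    replicas it includes both copies of the index. A negative result
    suggests the index cannot be fully cached and queries will hit the disk.
    """
    heap_total_gb = heap_gb_per_node * nodes
    available_for_cache_gb = total_ram_gb - heap_total_gb
    return available_for_cache_gb - index_gb_on_disk


# Placeholder values: 16GB of RAM and 40GB of index are assumptions,
# not figures from the thread. 4 nodes x 2GB heap is from the thread.
headroom = page_cache_headroom_gb(
    total_ram_gb=16, heap_gb_per_node=2, nodes=4, index_gb_on_disk=40
)
print(headroom)
```

Plugging in the real numbers for total memory and on-disk index size would show whether the machine is in the "must hit the disk" situation or not.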
Running a query is not going to read the entire index. If it did, Solr
would not be fast.
(4) the output format is slightly different (compare ',' vs ':' and vector
vs list). I wonder why
That I cannot explain. The first response doesn't look right to me. It
passes RFC 4627 validation, but the software parsing the response would
have to be very different for each of the output formats.
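To illustrate what "very different" parsing means, here is a minimal Python sketch of client code that accepts both shapes seen above: the flat list that alternates term and count, and the map from term to count. The function name is hypothetical, and the sample values are shortened from the two responses in this thread.

```python
def terms_to_dict(field_terms):
    """Normalize one field's terms into a {term: count} dict.

    Accepts either the flat list [term, count, term, count, ...] or the
    {term: count} map, the two shapes shown in the responses above.
    """
    if isinstance(field_terms, dict):
        return dict(field_terms)
    if isinstance(field_terms, list):
        if len(field_terms) % 2 != 0:
            raise ValueError("flat list must alternate term, count")
        # Even positions hold terms, odd positions hold their counts.
        return dict(zip(field_terms[0::2], field_terms[1::2]))
    raise TypeError(f"unexpected terms container: {type(field_terms).__name__}")


# Shortened samples of the two shapes from the responses in this thread
flat_sample = ["8", 670564, "application", 670564]
map_sample = {"8": 671396, "application": 671396}

print(terms_to_dict(flat_sample))
print(terms_to_dict(map_sample))
```

A client written this way would not care which of the two formats comes back, although it does not explain why the formats differ.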
Thanks,
Shawn