On 6/22/2018 6:50 AM, Arturas Mazeika wrote:
I grabbed the 2.7.1 version of Solr and created a 4 core setup with
replication factor 2 on Windows using [1]. I restarted the setup with
2GB for each node [2], inserted the HTML docs from the German Wikipedia
archive [3], and obtained the top 10 terms for the whole collection vs one
specific shard:
http://localhost:9999/solr/de_wiki_all/terms?terms.limit=10&terms.fl=text&wt=json
{
   "responseHeader":{
     "zkConnected":true,
     "status":0,
     "QTime":5287},
   "terms":{
     "text":[
       "8",670564,
       "application",670564,
       "articles",670564,
       "charset",670564,
       "de",670564,
       "f",670564,
       "utf",670564,
       "wiki",670564,
       "xhtml",670564,
       "xml",670564]}}

http://localhost:9999/solr/de_wiki_all/terms?terms.limit=10&terms.fl=text&wt=json&shards=localhost:9999/solr/de_wiki_all_shard1_replica_n1&shards.qt=de_wiki_all_shard1_replica_n1

{
   "responseHeader":{
     "zkConnected":true,
     "status":0,
     "QTime":20274},
   "terms":{
     "text":{
       "8":671396,
       "application":671396,
       "articles":671396,
       "charset":671396,
       "de":671396,
       "f":671396,
       "utf":671396,
       "wiki":671396,
       "xhtml":671396,
       "xml":671396}}}

The value of 'shards.qt' should be /terms, not the name of a core.  Here's something you might want to try instead for the second query, so you won't need shards.qt at all:

http://localhost:9999/solr/de_wiki_all_shard1_replica_n1/terms?terms.limit=10&terms.fl=text&wt=json&distrib=false

You might actually want to add shards.qt=/terms to the first query, or even to the definition of the /terms handler in solrconfig.xml so that all distributed queries are sent to the same handler instead of going to /select.
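
The two request shapes suggested above can be sketched in a few lines of Python. This is only an illustration: the host, port, collection, and core names are copied from the URLs in this thread, and the helper name `terms_url` is made up for the example.

```python
from urllib.parse import urlencode

# Host and port as used in the URLs quoted in this thread.
BASE = "http://localhost:9999/solr"


def terms_url(target, **extra):
    """Build a /terms request URL for a collection or a single core.

    `target` is the collection or core name; `extra` holds any
    additional query parameters to append.
    """
    params = {"terms.limit": 10, "terms.fl": "text", "wt": "json"}
    params.update(extra)
    # safe="/" keeps "/terms" readable instead of encoding it as %2Fterms.
    return f"{BASE}/{target}/terms?{urlencode(params, safe='/')}"


# Whole collection, with the per-shard sub-requests routed to /terms.
collection_url = terms_url("de_wiki_all", **{"shards.qt": "/terms"})

# One core only, with distributed search switched off.
core_url = terms_url("de_wiki_all_shard1_replica_n1", distrib="false")

print(collection_url)
print(core_url)
```

Requesting the core directly with distrib=false avoids the shards and shards.qt parameters entirely, which is why it is the simpler way to look at a single shard.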

reveals:
(1) querying one shard takes 20 secs vs 5 secs for the whole index

That is strange.  With the shards.qt parameter set to a core name, I'm surprised you got anything at all on the second query, but maybe when it couldn't find a handler with that name, it just defaulted to /select like it would if you didn't include the parameter.  I wonder if the invalid handler name contributed to the slow response.

(2) the counts for one shard are higher than for the whole index

If you're not changing the index between the requests, and it doesn't sound like you are, I have no idea why that might happen.

(3) the f: drive is a Samsung SSD 850 EVO 4TB (CrystalDiskMark shows
~500MB/s seq and ~300MB/s random read/writes), CPU: i7-6400 @3.4GHz. During
the 20-sec query, the java process pushes neither the CPU nor the SSD to
its limits. What is the bottleneck in this computation?

If the amount of memory in the system (NOT talking about heap size here) is not sufficient to effectively cache the index, then Solr must actually hit the disk to satisfy a query.  Even an SSD is not as fast as memory.  You haven't indicated how much disk space is being consumed by the eight index cores or how much total memory the system has.  A little more than 8GB of the system's memory is being taken up by the four Solr processes (four nodes at 2GB of heap each, plus JVM overhead).  Because you've asked for two replicas, there are two complete copies of the index on the system, and both copies will count in the total amount of resources that are required.

If there *is* sufficient memory for effective index caching, then the disk will barely see any usage during queries, because Solr will get most of the data it needs from the OS disk cache (system memory).  This will also reduce the impact on the CPU, because it will not be waiting for I/O.

Running a query is not going to read the entire index.  If it did, Solr would not be fast.
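
As a rough back-of-the-envelope check of the caching argument, the arithmetic can be sketched like this. The 2GB heap and four nodes come from this thread; the 16GB of RAM and 40GB of index are made-up placeholders, since neither number has been reported, and the estimate ignores JVM overhead and every other process on the machine.

```python
def cacheable_fraction(total_ram_gb, heap_gb_per_node, nodes, index_size_gb):
    """Rough upper bound on the share of the on-disk index the OS can cache.

    index_size_gb must cover ALL cores on the machine, so with two
    replicas it includes both copies of the index.
    """
    # Memory left for the OS disk cache once the Java heaps are subtracted.
    free_for_cache = max(total_ram_gb - heap_gb_per_node * nodes, 0)
    return min(free_for_cache / index_size_gb, 1.0)


# Hypothetical machine: 16GB RAM, 40GB of index across all eight cores.
print(cacheable_fraction(16, 2, 4, 40))  # 0.2
```

A result well below 1.0 would suggest that queries have to go to disk regularly; a result at or near 1.0 would point away from disk I/O as the bottleneck.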

(4) the output format is slightly different (compare ',' vs ':' and vector
vs list). I wonder why.

That I cannot explain.  The first response doesn't look right to me.  It passes RFC 4627 validation, but the software parsing the response would have to be very different for each of the output formats.
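
If a client has to cope with both shapes in the meantime, a small normalizer is one option. This is a minimal sketch, with `terms_to_dict` being a made-up helper name; it assumes the flat form always alternates term and count, as in the first response above. As far as I know, Solr's json.nl parameter controls how such name/value lists are rendered, and its flat style resembles the first response, but treat that as something to verify rather than as an explanation of the difference seen here.

```python
def terms_to_dict(field_terms):
    """Normalize one field's terms into a {term: count} dict.

    Accepts either the flat alternating list ["a", 1, "b", 2]
    or the map form {"a": 1, "b": 2}.
    """
    if isinstance(field_terms, dict):
        return dict(field_terms)
    # Pair every even-indexed term with the odd-indexed count after it.
    return dict(zip(field_terms[0::2], field_terms[1::2]))


flat = ["8", 670564, "application", 670564]
mapped = {"8": 671396, "application": 671396}
print(terms_to_dict(flat))
print(terms_to_dict(mapped))
```

With both responses in the same shape, the per-term counts from the collection and from the single core can be compared directly.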

Thanks,
Shawn
