Re: top 10 query overall vs shard

Shawn Heisey Fri, 22 Jun 2018 07:13:26 -0700

On 6/22/2018 6:50 AM, Arturas Mazeika wrote:

I grabbed the 2.7.1 version of Solr, created a 4 core setup with
replication factor 2 on Windows using [1], restarted the setup with
2GB for each node [2], inserted the HTML docs from the German Wikipedia
archive [3], and obtained the top 10 terms for the whole collection vs one
specific shard:
http://localhost:9999/solr/de_wiki_all/terms?terms.limit=10&terms.fl=text&wt=json
{
   "responseHeader":{
     "zkConnected":true,
     "status":0,
     "QTime":5287},
   "terms":{
     "text":[
       "8",670564,
       "application",670564,
       "articles",670564,
       "charset",670564,
       "de",670564,
       "f",670564,
       "utf",670564,
       "wiki",670564,
       "xhtml",670564,
       "xml",670564]}}

http://localhost:9999/solr/de_wiki_all/terms?terms.limit=10&terms.fl=text&wt=json&shards=localhost:9999/solr/de_wiki_all_shard1_replica_n1&shards.qt=de_wiki_all_shard1_replica_n1

{
   "responseHeader":{
     "zkConnected":true,
     "status":0,
     "QTime":20274},
   "terms":{
     "text":{
       "8":671396,
       "application":671396,
       "articles":671396,
       "charset":671396,
       "de":671396,
       "f":671396,
       "utf":671396,
       "wiki":671396,
       "xhtml":671396,
       "xml":671396}}}

The value of 'shards.qt' should be /terms, not the name of a core. Here's something you might want to try instead for the second query, so you won't need shards.qt at all:


http://localhost:9999/solr/de_wiki_all_shard1_replica_n1/terms?terms.limit=10&terms.fl=text&wt=json&distrib=false

You might actually want to add shards.qt=/terms to the first query, or even to the definition of the /terms handler in solrconfig.xml, so that all distributed queries are sent to the same handler instead of going to /select.
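A minimal sketch of how the two suggested requests could be assembled, using only the Python standard library. The helper name `terms_url` is hypothetical; the host, port, collection, and core names are copied from the queries above, and nothing here contacts Solr.

```python
from urllib.parse import urlencode

# Host and port as they appear in the queries above.
BASE = "http://localhost:9999/solr"


def terms_url(target, distrib=None, shards_qt=None):
    """Build a /terms URL for a collection or a single core (hypothetical helper)."""
    params = {"terms.limit": 10, "terms.fl": "text", "wt": "json"}
    if distrib is not None:
        params["distrib"] = "true" if distrib else "false"
    if shards_qt is not None:
        params["shards.qt"] = shards_qt
    # safe="/" keeps the handler path readable instead of encoding it as %2F
    return f"{BASE}/{target}/terms?{urlencode(params, safe='/')}"


# Whole collection, with sub-requests routed to /terms on each shard.
collection_url = terms_url("de_wiki_all", shards_qt="/terms")

# One core only: address the core directly and switch off distribution.
core_url = terms_url("de_wiki_all_shard1_replica_n1", distrib=False)

print(collection_url)
print(core_url)
```

Building both URLs from one parameter set makes it harder for the two requests to drift apart in anything other than the target and the distribution settings.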

reveals:
(1) querying one shard takes 20 secs vs 5 secs for the whole index

That is strange. With the shards.qt parameter set to a core name, I'm surprised you got anything at all on the second query, but maybe when it couldn't find a handler with that name, it just defaulted to /select like it would if you didn't include the parameter. I wonder if having an invalid handler contributed to the speed.

(2) the counts for one shard are higher than for the whole index

If you're not changing the index between the requests, and it doesn't sound like you are, I have no idea why that might happen.

(3) the f: hard drive is a Samsung SSD 850 EVO 4TB (CrystalDiskMark shows
~500MB/s seq and ~300MB/s random read/writes), CPU: i7-6400 @3.4GHz. Querying
for 20 secs shows that the java process is being pushed to the limits neither
on the CPU nor on the SSD side. What is the bottleneck in this computation?

If the amount of memory in the system (NOT talking about heap size here) is not sufficient to effectively cache the index, then Solr must actually hit the disk to satisfy a query. Even an SSD is not as fast as memory. You haven't indicated how much disk space is being consumed by the eight index cores or how much total memory the system has. A little more than 8GB of the system's memory is being taken up by the four Solr processes. Because you've asked for two replicas, there are two complete copies of the index on the system, and both copies will count in the total amount of resources that are required.

If there *is* sufficient memory for effective index caching, then the disk will barely see any usage during queries, because Solr will get most of the data it needs from the OS disk cache (system memory). This will also reduce the impact on the CPU, because it will not be waiting for I/O.

Running a query is not going to read the entire index. If it did, Solr would not be fast.
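The memory reasoning above can be roughed out with a little arithmetic. This is only a sketch: the function name is hypothetical, the 2GB heap and four nodes come from the thread, and the total RAM and on-disk index size are made-up placeholders because neither was reported. It also treats heap as the only memory the Solr processes use, which understates their real footprint.

```python
def cache_headroom_gb(total_ram_gb, heap_gb_per_node, node_count, index_on_disk_gb):
    """Memory left for the OS disk cache after the Java heaps, minus the index size.

    A negative result means the whole index cannot fit in the disk cache.
    index_on_disk_gb should be the total for all cores, replicas included.
    """
    heap_total_gb = heap_gb_per_node * node_count
    os_cache_gb = total_ram_gb - heap_total_gb
    return os_cache_gb - index_on_disk_gb


# 2GB heap per node and four nodes are from the thread.
# 16GB of RAM and a 30GB index are hypothetical values for illustration only.
headroom = cache_headroom_gb(
    total_ram_gb=16, heap_gb_per_node=2, node_count=4, index_on_disk_gb=30
)
print(f"headroom: {headroom} GB")
```

A negative number does not by itself mean queries will be slow, since effective caching only needs the frequently accessed parts of the index in memory, but a large shortfall makes disk reads during queries more likely.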

(4) the output format is slightly different (compare ',' vs ':' and vector
vs list). I wonder why

That I cannot explain. The first response doesn't look right to me. It passes RFC 4627 validation, but the software parsing the response would have to be very different for each of the output formats.
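For a client that has to cope with both shapes in the meantime, a small normalizing step is one option. The sketch below assumes only the two layouts shown in the responses above (a flat list of alternating term and count, or a term-to-count object); the function name `term_counts` is hypothetical and the sample data is trimmed from those responses.

```python
import json


def term_counts(field_value):
    """Return {term: count} from either layout seen in the two responses."""
    if isinstance(field_value, dict):
        # Second response: already a term -> count object.
        return dict(field_value)
    if isinstance(field_value, list):
        # First response: ["term", count, "term", count, ...]
        if len(field_value) % 2:
            raise ValueError("flat term list must have an even number of items")
        return dict(zip(field_value[0::2], field_value[1::2]))
    raise TypeError(f"unexpected terms layout: {type(field_value).__name__}")


flat = json.loads('{"terms":{"text":["8",670564,"application",670564]}}')
mapped = json.loads('{"terms":{"text":{"8":671396,"application":671396}}}')

print(term_counts(flat["terms"]["text"]))
print(term_counts(mapped["terms"]["text"]))
```

This does not explain why the formats differ; it only keeps the calling code from depending on which one comes back.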


Thanks,
Shawn
