Hello,

We recently upgraded to SolrCloud 4.7 (went from a single-node Solr 4.0
instance to a 3-node Solr 4.7 cluster).

Part of our application does an automated traversal of all documents that
match a specific query.  It does this by paging through results with the
start and rows parameters: start=0&rows=1000, then start=1000&rows=1000,
then start=2000&rows=1000, and so on.

We do this in parallel with multiple workers on multiple nodes.  It's easy
to divide up the work: we find out how many total results there are, create
'chunks' (0-1000, 1000-2000, 2000-3000, ...), and hand each chunk to a
worker in a pool of multi-threaded workers.
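For what it's worth, the chunking itself is trivial; a minimal sketch
(make_chunks is just an illustrative name, not anything in our code):

```python
def make_chunks(num_found, chunk_size=1000):
    """Split a result set of num_found docs into (start, rows) pairs."""
    return [(start, min(chunk_size, num_found - start))
            for start in range(0, num_found, chunk_size)]
```

Each (start, rows) pair then becomes one worker's query.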

This worked well for us with a single server.  However, after upgrading to
SolrCloud, we've found that this quickly (within the first 4 or 5 requests)
causes an OutOfMemoryError on the coordinating node that receives the
query.  I don't fully understand what's going on here, but it looks like
the coordinating node receives the query and forwards it to the requested
shard.  For example, given:

shards=shard3&sort=id+asc&start=4000&q=*:*&rows=1000

The coordinating node sends this query to shard3:

NOW=1395086719189&shard.url=
http://shard3_url_goes_here:8080/solr/collection1/&fl=id&sort=id+asc&start=0&q=*:*&distrib=false&wt=javabin&isShard=true&fsv=true&version=2&rows=5000

Notice the rows parameter is 5000 (start + rows).  If the coordinating node
is able to process the result set (this works for the first few pages;
after that it quickly runs out of memory), it eventually issues this
request back to shard3:

NOW=1395086719189&shard.url=
http://10.128.215.226:8080/extera-search/gemindex/&start=4000&ids=a..bunch...(1000)..of..doc..ids..go..here&q=*:*&distrib=false&wt=javabin&isShard=true&version=2&rows=1000

and then finally returns the response to the client.
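If I'm reading those shard requests right, the deep-paging cost scales with
start: each shard is asked for start + rows entries (ids and sort values,
given fl=id&fsv=true), and the coordinator buffers all of them to merge.  A
back-of-the-envelope sketch (coordinator_buffer is my name for it, and this
assumes the query fans out to every shard; my shards=shard3 example above
pins just one):

```python
def coordinator_buffer(num_shards, start, rows):
    # Each shard returns start + rows (id, sort-value) entries for the
    # coordinator to merge, so the buffer grows linearly with page depth.
    return num_shards * (start + rows)
```

For start=4000, rows=1000 across 3 shards that's already 15,000 entries
held at once, and it keeps climbing as the traversal goes deeper.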

One possible workaround:  We've found that if we issue non-distributed
requests directly to specific shards, we get performance in line with what
we saw before.  E.g., issue a query with shards=shard3&distrib=false
directly to the URL of the shard3 instance, rather than going through the
CloudSolrServer SolrJ API.
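Concretely, that workaround amounts to building a request like this (the
shard_query_url helper and base URL are made up for illustration; the
parameters are the ones from my example above):

```python
from urllib.parse import urlencode

def shard_query_url(shard_base_url, shard_name, start, rows, q="*:*"):
    # Hit one shard's core directly with distrib=false, bypassing the
    # distributed merge step on the coordinating node entirely.
    params = {
        "q": q,
        "shards": shard_name,
        "distrib": "false",
        "sort": "id asc",
        "start": start,
        "rows": rows,
        "wt": "json",
    }
    return "%s/select?%s" % (shard_base_url.rstrip("/"), urlencode(params))
```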

The other workaround is to adopt the new cursorMark functionality.  I've
manually tried a few requests; it is quite efficient and doesn't cause the
OOM errors on the coordinating node.  However, I've only done this in a
single-threaded manner.  I'm wondering if there is a way to get cursor
marks for an entire result set at a given page interval, so that they could
be fed to the pool of parallel workers and the results fetched in parallel
rather than single-threaded.  Is there a way to do this?
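To make the single-threaded case concrete, here is the shape of the cursor
loop I've been running, sketched with a stand-in fetch_page callable in
place of the real Solr request.  The contract mirrors Solr's cursorMark
behavior as I understand it: send cursorMark=* on the first request, then
pass each nextCursorMark back, and stop when the returned mark equals the
one you sent:

```python
def iterate_with_cursor(fetch_page, rows=1000):
    """Drain a result set with cursorMark-style paging.

    fetch_page(cursor_mark, rows) must return (docs, next_cursor_mark).
    """
    cursor = "*"  # Solr's "start of results" cursor mark
    while True:
        docs, next_cursor = fetch_page(cursor, rows)
        for doc in docs:
            yield doc
        if next_cursor == cursor:  # Solr signals "done" by echoing the mark
            break
        cursor = next_cursor
```

The problem is that each nextCursorMark only becomes known after the
previous page is fetched, which is exactly what makes this inherently
serial and why I'm asking about pre-computing marks at page intervals.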

Any other possible solutions?  Thanks in advance.

Mike
