I should add that each node has 16 GB of RAM, 8 GB of which is allocated to
the JVM.  Each node has about 200k docs and happily uses only about 3 or
4 GB of RAM during normal operation.  It's only during this deep pagination
that we have seen OOM errors.


On Mon, Mar 17, 2014 at 3:14 PM, Mike Hugo <m...@piragua.com> wrote:

> Hello,
>
> We recently upgraded to Solr Cloud 4.7 (went from a single-node Solr 4.0
> instance to a 3-node Solr 4.7 cluster).
>
> Part of our application does an automated traversal of all documents that
> match a specific query.  It does this by iterating through results by
> setting the start and rows parameters, starting with start=0 and rows=1000,
> then start=1000, rows=1000, start=2000, rows=1000, and so on.
>
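> In rough SolrJ terms, each worker runs a loop like the following (the
> ZooKeeper host and collection name here are placeholders, not our real
> ones):
>
>     import org.apache.solr.client.solrj.SolrQuery;
>     import org.apache.solr.client.solrj.impl.CloudSolrServer;
>     import org.apache.solr.client.solrj.response.QueryResponse;
>
>     public class PagedTraversal {
>         public static void main(String[] args) throws Exception {
>             CloudSolrServer server = new CloudSolrServer("zk1:2181");
>             server.setDefaultCollection("collection1");
>             int rows = 1000;
>             long total = Long.MAX_VALUE;  // tightened after first response
>             for (long start = 0; start < total; start += rows) {
>                 SolrQuery q = new SolrQuery("*:*");
>                 q.setSort("id", SolrQuery.ORDER.asc);
>                 q.setStart((int) start);
>                 q.setRows(rows);
>                 QueryResponse rsp = server.query(q);
>                 total = rsp.getResults().getNumFound();
>                 // ... hand rsp.getResults() to the processing code ...
>             }
>             server.shutdown();
>         }
>     }
>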
> We do this in parallel with multiple workers on multiple nodes.  It's
> easy to chunk up the work by figuring out how many total results there
> are, creating 'chunks' (0-1000, 1000-2000, 2000-3000), and sending each
> chunk to a worker in a multi-threaded worker pool.
>
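> The chunking itself is just arithmetic over the total hit count, e.g.
> (fetchChunk is a stand-in for the query loop above; the pool size and
> total are illustrative):
>
>     import java.util.concurrent.ExecutorService;
>     import java.util.concurrent.Executors;
>
>     // total comes from an initial rows=0 query; fetchChunk(start, rows)
>     // runs the start/rows query shown in the previous sketch.
>     final long total = 600000;      // e.g. 3 nodes x ~200k docs
>     final int chunkSize = 1000;
>     ExecutorService pool = Executors.newFixedThreadPool(8);
>     for (long start = 0; start < total; start += chunkSize) {
>         final long s = start;
>         pool.submit(new Runnable() {
>             public void run() { fetchChunk(s, chunkSize); }
>         });
>     }
>     pool.shutdown();
>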
> This worked well for us with a single server.  However, upon upgrading to
> Solr Cloud, we've found that this quickly (within the first 4 or 5
> requests) causes an OutOfMemoryError on the coordinating node that
> receives the query.   I don't fully understand what's going on here, but it
> looks like the coordinating node receives the query and sends it to the
> requested shard.  For example, given:
>
> shards=shard3&sort=id+asc&start=4000&q=*:*&rows=1000
>
> The coordinating node sends this query to shard3:
>
> NOW=1395086719189&shard.url=http://shard3_url_goes_here:8080/solr/collection1/&fl=id&sort=id+asc&start=0&q=*:*&distrib=false&wt=javabin&isShard=true&fsv=true&version=2&rows=5000
>
> Notice the rows parameter is 5000 (start + rows).  If the coordinating node
> is able to process the result set (this works for the first few pages;
> after that it quickly runs out of memory), it eventually issues this
> request back to shard3:
>
> NOW=1395086719189&shard.url=http://10.128.215.226:8080/extera-search/gemindex/&start=4000&ids=a..bunch...(1000)..of..doc..ids..go..here&q=*:*&distrib=false&wt=javabin&isShard=true&version=2&rows=1000
>
> and then finally returns the response to the client.
>
> One possible workaround:  We've found that if we issue non-distributed
> requests to specific shards, we get performance along the same lines as
> before.  E.g. issue a query with shards=shard3&distrib=false directly to
> the URL of the shard3 instance, rather than going through the
> CloudSolrServer SolrJ API.
>
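> Concretely, a sketch of that workaround (the shard URL is a placeholder):
>
>     import org.apache.solr.client.solrj.SolrQuery;
>     import org.apache.solr.client.solrj.impl.HttpSolrServer;
>
>     // Point HttpSolrServer at one shard's core and keep the query local.
>     HttpSolrServer shard = new HttpSolrServer(
>             "http://shard3-host:8080/solr/collection1");
>     SolrQuery q = new SolrQuery("*:*");
>     q.setSort("id", SolrQuery.ORDER.asc);
>     q.setStart(4000);
>     q.setRows(1000);
>     q.set("distrib", "false");  // no fan-out, no coordinator merge
>     System.out.println(shard.query(q).getResults().size());
>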
> The other workaround is to adapt to use the new cursorMark
> functionality.  I've manually tried a few requests and it is pretty
> efficient, and doesn't result in OOM errors on the coordinating node.
> However, I've only done this in a single-threaded manner.  I'm wondering
> if there is a way to get the cursor marks for an entire result set at a
> given page interval, so that they could then be fed to the pool of
> parallel workers and the results fetched in parallel rather than
> single-threaded.  Is there a way to do this?
>
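> For reference, my single-threaded test looks roughly like this, reusing
> the CloudSolrServer from the earlier sketch:
>
>     import org.apache.solr.common.params.CursorMarkParams;
>
>     String cursor = CursorMarkParams.CURSOR_MARK_START;  // "*"
>     while (true) {
>         SolrQuery q = new SolrQuery("*:*");
>         // cursorMark requires a sort ending on the uniqueKey field
>         q.setSort("id", SolrQuery.ORDER.asc);
>         q.setRows(1000);
>         q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
>         QueryResponse rsp = server.query(q);
>         // ... process rsp.getResults() ...
>         String next = rsp.getNextCursorMark();
>         if (cursor.equals(next)) break;  // unchanged mark == done
>         cursor = next;
>     }
>
> The catch is that each next mark only comes back with the previous
> page's response, which is what makes handing marks out to workers up
> front non-obvious.
>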
> Any other possible solutions?  Thanks in advance.
>
> Mike
>
