Thanks Steve,

That certainly looks like it could be the culprit. Any word on a release
date for 4.7.1? Days? Weeks? Months?
Mike

On Mon, Mar 17, 2014 at 3:31 PM, Steve Rowe <sar...@gmail.com> wrote:

> Hi Mike,
>
> The OOM you're seeing is likely a result of the bug described in (and
> fixed by a commit under) SOLR-5875:
> <https://issues.apache.org/jira/browse/SOLR-5875>
>
> If you can build from source, it would be great if you could confirm
> the fix addresses the issue you're facing.
>
> This fix will be part of a to-be-released Solr 4.7.1.
>
> Steve
>
> On Mar 17, 2014, at 4:14 PM, Mike Hugo <m...@piragua.com> wrote:
>
> > Hello,
> >
> > We recently upgraded to Solr Cloud 4.7 (went from a single-node Solr
> > 4.0 instance to a 3-node Solr 4.7 cluster).
> >
> > Part of our application does an automated traversal of all documents
> > that match a specific query. It does this by iterating through the
> > results using the start and rows parameters, starting with start=0
> > and rows=1000, then start=1000, rows=1000, then start=2000,
> > rows=1000, etc.
> >
> > We do this in parallel with multiple workers on multiple nodes. It's
> > easy to divide up the work by figuring out how many total results
> > there are and then creating chunks (0-1000, 1000-2000, 2000-3000)
> > and sending each chunk to a worker in a pool of multi-threaded
> > workers.
> >
> > This worked well for us with a single server. However, upon
> > upgrading to Solr Cloud, we've found that this quickly (within the
> > first 4 or 5 requests) causes an OutOfMemory error on the
> > coordinating node that receives the query. I don't fully understand
> > what's going on here, but it looks like the coordinating node
> > receives the query and sends it to the shard requested. For example,
> > given:
> >
> > shards=shard3&sort=id+asc&start=4000&q=*:*&rows=1000
> >
> > the coordinating node sends this query to shard3:
> >
> > NOW=1395086719189&shard.url=http://shard3_url_goes_here:8080/solr/collection1/&fl=id&sort=id+asc&start=0&q=*:*&distrib=false&wt=javabin&isShard=true&fsv=true&version=2&rows=5000
> >
> > Notice the rows parameter is 5000 (start + rows). If the coordinating
> > node is able to process the result set (which works for the first few
> > pages; after that it quickly runs out of memory), it eventually
> > issues this request back to shard3:
> >
> > NOW=1395086719189&shard.url=http://10.128.215.226:8080/extera-search/gemindex/&start=4000&ids=a..bunch...(1000)..of..doc..ids..go..here&q=*:*&distrib=false&wt=javabin&isShard=true&version=2&rows=1000
> >
> > and then finally returns the response to the client.
> >
> > One possible workaround: we've found that if we issue non-distributed
> > requests to specific shards, we get performance along the same lines
> > as before. E.g., issue a query with shards=shard3&distrib=false
> > directly to the URL of the shard3 instance, rather than going through
> > the CloudSolrServer SolrJ API.
> >
> > The other workaround is to adapt to the new cursorMark functionality.
> > I've manually tried a few requests and it is pretty efficient, and
> > doesn't result in the OOM errors on the coordinating node. However,
> > I've only done this in a single-threaded manner. I'm wondering if
> > there is a way to get cursor marks for an entire result set at a
> > given page interval, so that they could then be fed to the pool of
> > parallel workers to fetch the results in parallel rather than
> > single-threaded. Is there a way to do this so we could process the
> > results in parallel?
> >
> > Any other possible solutions? Thanks in advance.
> >
> > Mike
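For anyone else following along, here is roughly the pattern our workers
use for the chunked start/rows traversal described in the quoted thread
(a simplified sketch, not our production code; the ZooKeeper hosts,
collection name, and pool size are placeholders). This is the pattern
that triggers the coordinator OOM under 4.7:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrDocumentList;

    public class ChunkedTraversal {

        static final int CHUNK_SIZE = 1000;

        public static void main(String[] args) throws Exception {
            // Placeholder ZooKeeper ensemble and collection name
            final CloudSolrServer server =
                new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            server.setDefaultCollection("collection1");

            // First query with rows=0 just to learn the total hit count
            SolrQuery countQuery = new SolrQuery("*:*");
            countQuery.setRows(0);
            long numFound = server.query(countQuery).getResults().getNumFound();

            // Submit one task per chunk: (0-1000), (1000-2000), ...
            ExecutorService pool = Executors.newFixedThreadPool(8);
            List<Future<Integer>> futures = new ArrayList<Future<Integer>>();
            for (long start = 0; start < numFound; start += CHUNK_SIZE) {
                final long chunkStart = start;
                futures.add(pool.submit(new Callable<Integer>() {
                    public Integer call() throws Exception {
                        SolrQuery q = new SolrQuery("*:*");
                        q.setSort("id", SolrQuery.ORDER.asc);
                        q.setStart((int) chunkStart);
                        q.setRows(CHUNK_SIZE);
                        SolrDocumentList docs = server.query(q).getResults();
                        // ... process docs ...
                        return docs.size();
                    }
                }));
            }
            for (Future<Integer> f : futures) {
                f.get(); // propagate any worker failure
            }
            pool.shutdown();
            server.shutdown();
        }
    }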
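And a sketch of the first workaround: pointing an HttpSolrServer
directly at one shard's core and setting distrib=false so the query
stays local to that shard instead of fanning out through a coordinator
(the host and core URL below are placeholders):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrDocumentList;

    public class DirectShardQuery {
        public static void main(String[] args) throws Exception {
            // Placeholder URL: one shard's core, bypassing CloudSolrServer
            HttpSolrServer shard =
                new HttpSolrServer("http://shard3-host:8080/solr/collection1");

            SolrQuery q = new SolrQuery("*:*");
            q.setSort("id", SolrQuery.ORDER.asc);
            q.setStart(4000);
            q.setRows(1000);
            q.set("distrib", false); // keep the query local to this shard

            SolrDocumentList docs = shard.query(q).getResults();
            System.out.println("got " + docs.size() + " docs from shard3");
            shard.shutdown();
        }
    }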
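Finally, the single-threaded cursorMark traversal I've been
experimenting with looks roughly like this (same placeholder connection
details; note that cursorMark requires the sort to include the uniqueKey
field as a tiebreaker):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.params.CursorMarkParams;

    public class CursorTraversal {
        public static void main(String[] args) throws Exception {
            CloudSolrServer server =
                new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            server.setDefaultCollection("collection1");

            SolrQuery q = new SolrQuery("*:*");
            q.setRows(1000);
            q.setSort("id", SolrQuery.ORDER.asc); // uniqueKey tiebreak

            String cursorMark = CursorMarkParams.CURSOR_MARK_START; // "*"
            boolean done = false;
            while (!done) {
                q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
                QueryResponse rsp = server.query(q);
                // ... process rsp.getResults() ...
                String next = rsp.getNextCursorMark();
                // when the cursor stops advancing, the result set is exhausted
                done = cursorMark.equals(next);
                cursorMark = next;
            }
            server.shutdown();
        }
    }

As noted above, what I haven't figured out is how to obtain cursor marks
at fixed page intervals up front, so that the loop body could be handed
out to parallel workers instead of running sequentially.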