Cursor mark definitely seems like the way to go. If I can get it to work in parallel, that's an additional bonus. Something like the sketch below is roughly what I have in mind.
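Rough, untested sketch follows. As far as I can tell you can't pre-compute cursor marks for arbitrary page offsets, since each mark only comes back in the response to the previous page. So instead of chunking by start offset, this partitions the result set into disjoint slices with filter queries on the id field (assuming id is the uniqueKey) and runs an independent cursorMark traversal per slice. The zkHost string, collection name, and id ranges are all placeholders:

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class ParallelCursorTraversal {

    public static void main(String[] args) throws Exception {
        // CloudSolrServer is thread-safe, so one instance can be shared.
        final CloudSolrServer solr = new CloudSolrServer("zkhost1:2181,zkhost2:2181");
        solr.setDefaultCollection("collection1");

        // Invented disjoint id ranges; in practice these would come from
        // sampling the id field (range facets, known key prefixes, etc.).
        List<String> ranges = Arrays.asList(
                "id:[* TO g}", "id:[g TO n}", "id:[n TO *]");

        ExecutorService pool = Executors.newFixedThreadPool(ranges.size());
        for (final String range : ranges) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        traverseSlice(solr, range);
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
    }

    // Each slice is walked with its own independent cursor.
    static void traverseSlice(CloudSolrServer solr, String rangeFq) throws Exception {
        SolrQuery q = new SolrQuery("*:*");
        q.addFilterQuery(rangeFq);
        q.setRows(1000);
        // cursorMark requires a sort ending on the uniqueKey field
        q.addSort("id", SolrQuery.ORDER.asc);

        String cursor = CursorMarkParams.CURSOR_MARK_START;
        while (true) {
            q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
            QueryResponse rsp = solr.query(q);
            for (SolrDocument doc : rsp.getResults()) {
                process(doc);
            }
            String next = rsp.getNextCursorMark();
            if (cursor.equals(next)) {
                break;  // cursor did not advance: slice exhausted
            }
            cursor = next;
        }
    }

    static void process(SolrDocument doc) {
        System.out.println(doc.getFieldValue("id"));
    }
}

Each worker only ever asks for the next 1000 docs of its own slice, so no node should have to buffer start+rows documents the way the start/rows approach does.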
On Mon, Mar 17, 2014 at 5:41 PM, Greg Pendlebury <greg.pendleb...@gmail.com> wrote:

> Shouldn't all deep pagination against a cluster use the new cursor mark
> feature instead of 'start' and 'rows'?
>
> 4 or 5 requests still seems a very low limit to be running into an OOM
> issue though, so perhaps it is both issues combined?
>
> Ta,
> Greg
>
>
> On 18 March 2014 07:49, Mike Hugo <m...@piragua.com> wrote:
>
> > Thanks!
> >
> >
> > On Mon, Mar 17, 2014 at 3:47 PM, Steve Rowe <sar...@gmail.com> wrote:
> >
> > > Mike,
> > >
> > > Days. I plan on making a 4.7.1 release candidate a week from today,
> > > and assuming nobody finds any problems with the RC, it will be
> > > released roughly four days thereafter (three days for voting + one
> > > day for release propagation to the Apache mirrors): i.e., next
> > > Friday-ish.
> > >
> > > Steve
> > >
> > > On Mar 17, 2014, at 4:40 PM, Mike Hugo <m...@piragua.com> wrote:
> > >
> > > > Thanks Steve,
> > > >
> > > > That certainly looks like it could be the culprit. Any word on a
> > > > release date for 4.7.1? Days? Weeks? Months?
> > > >
> > > > Mike
> > > >
> > > >
> > > > On Mon, Mar 17, 2014 at 3:31 PM, Steve Rowe <sar...@gmail.com> wrote:
> > > >
> > > > > Hi Mike,
> > > > >
> > > > > The OOM you're seeing is likely a result of the bug described in
> > > > > (and fixed by a commit under) SOLR-5875:
> > > > > <https://issues.apache.org/jira/browse/SOLR-5875>.
> > > > >
> > > > > If you can build from source, it would be great if you could
> > > > > confirm the fix addresses the issue you're facing.
> > > > >
> > > > > This fix will be part of a to-be-released Solr 4.7.1.
> > > > >
> > > > > Steve
> > > > >
> > > > > On Mar 17, 2014, at 4:14 PM, Mike Hugo <m...@piragua.com> wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > We recently upgraded to Solr Cloud 4.7 (went from a single-node
> > > > > > Solr 4.0 instance to a 3-node Solr 4.7 cluster).
> > > > > >
> > > > > > Part of our application does an automated traversal of all
> > > > > > documents that match a specific query. It does this by iterating
> > > > > > through results using the start and rows parameters, starting
> > > > > > with start=0 and rows=1000, then start=1000, rows=1000,
> > > > > > start=2000, rows=1000, and so on.
> > > > > >
> > > > > > We do this in parallel with multiple workers on multiple nodes.
> > > > > > It's easy to chunk up the work by figuring out how many total
> > > > > > results there are, creating chunks (0-1000, 1000-2000,
> > > > > > 2000-3000), and sending each chunk to a worker in a pool of
> > > > > > multi-threaded workers.
> > > > > >
> > > > > > This worked well for us with a single server. However, upon
> > > > > > upgrading to Solr Cloud, we've found that this quickly (within
> > > > > > the first 4 or 5 requests) causes an OutOfMemory error on the
> > > > > > coordinating node that receives the query. I don't fully
> > > > > > understand what's going on here, but it looks like the
> > > > > > coordinating node receives the query and sends it to the shard
> > > > > > requested. For example, given:
> > > > > >
> > > > > > shards=shard3&sort=id+asc&start=4000&q=*:*&rows=1000
> > > > > >
> > > > > > the coordinating node sends this query to shard3:
> > > > > >
> > > > > > NOW=1395086719189&shard.url=http://shard3_url_goes_here:8080/solr/collection1/&fl=id&sort=id+asc&start=0&q=*:*&distrib=false&wt=javabin&isShard=true&fsv=true&version=2&rows=5000
> > > > > >
> > > > > > Notice the rows parameter is 5000 (start + rows). If the
> > > > > > coordinating node is able to process the result set (this works
> > > > > > for the first few pages; after that it quickly runs out of
> > > > > > memory), it eventually issues this request back to shard3:
> > > > > >
> > > > > > NOW=1395086719189&shard.url=http://10.128.215.226:8080/extera-search/gemindex/&start=4000&ids=a..bunch...(1000)..of..doc..ids..go..here&q=*:*&distrib=false&wt=javabin&isShard=true&version=2&rows=1000
> > > > > >
> > > > > > and then finally returns the response to the client.
> > > > > >
> > > > > > One possible workaround: we've found that if we issue
> > > > > > non-distributed requests to specific shards, we get performance
> > > > > > along the same lines as before. E.g., issue a query with
> > > > > > shards=shard3&distrib=false directly to the URL of the shard3
> > > > > > instance, rather than going through the CloudSolrServer SolrJ
> > > > > > API.
> > > > > >
> > > > > > The other workaround is to adapt to use the new cursorMark
> > > > > > functionality. I've manually tried a few requests and it is
> > > > > > pretty efficient, and doesn't result in OOM errors on the
> > > > > > coordinating node. However, I've only done this in a
> > > > > > single-threaded manner. I'm wondering if there would be a way to
> > > > > > get cursor marks for an entire result set at a given page
> > > > > > interval, so that they could be fed to the pool of parallel
> > > > > > workers to fetch the results in parallel rather than single
> > > > > > threaded. Is there a way to do this so we could process the
> > > > > > results in parallel?
> > > > > >
> > > > > > Any other possible solutions? Thanks in advance.
> > > > > >
> > > > > > Mike
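P.S. For anyone finding this in the archives: the non-distributed direct-to-shard workaround described above looks roughly like this in SolrJ (also an untested sketch; the host, port, and core path are placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DirectShardPage {
    public static void main(String[] args) throws Exception {
        // Talk to one replica of shard3 directly, so no coordinating node
        // has to buffer start+rows documents while merging.
        HttpSolrServer shard3 = new HttpSolrServer("http://shard3-host:8080/solr/collection1");

        SolrQuery q = new SolrQuery("*:*");
        q.set("distrib", false);               // keep the query local to this core
        q.addSort("id", SolrQuery.ORDER.asc);
        q.setStart(4000);
        q.setRows(1000);

        QueryResponse rsp = shard3.query(q);
        System.out.println("got " + rsp.getResults().size() + " docs");
    }
}

Deep paging still builds a start+rows priority queue on that one core, but nothing is merged or buffered on a coordinating node, which matches the single-server behavior we saw before the upgrade.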