Cursor mark definitely seems like the way to go. If I can get it to work in parallel, that's an additional bonus. Something like the sketch below is roughly what I have in mind.
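Rough, untested sketch follows. As far as I can tell you can't pre-compute cursor marks for arbitrary page offsets, since each mark only comes back in the response to the previous page. So instead of chunking by start offset, this partitions the result set into disjoint slices with filter queries on the id field (assuming id is the uniqueKey) and runs an independent cursorMark traversal per slice. The zkHost string, collection name, and id ranges are all placeholders:

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class ParallelCursorTraversal {

    public static void main(String[] args) throws Exception {
        // CloudSolrServer is thread-safe, so one instance can be shared.
        final CloudSolrServer solr = new CloudSolrServer("zkhost1:2181,zkhost2:2181");
        solr.setDefaultCollection("collection1");

        // Invented disjoint id ranges; in practice these would come from
        // sampling the id field (range facets, known key prefixes, etc.).
        List<String> ranges = Arrays.asList(
                "id:[* TO g}", "id:[g TO n}", "id:[n TO *]");

        ExecutorService pool = Executors.newFixedThreadPool(ranges.size());
        for (final String range : ranges) {
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        traverseSlice(solr, range);
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
    }

    // Each slice is walked with its own independent cursor.
    static void traverseSlice(CloudSolrServer solr, String rangeFq) throws Exception {
        SolrQuery q = new SolrQuery("*:*");
        q.addFilterQuery(rangeFq);
        q.setRows(1000);
        // cursorMark requires a sort ending on the uniqueKey field
        q.addSort("id", SolrQuery.ORDER.asc);

        String cursor = CursorMarkParams.CURSOR_MARK_START;
        while (true) {
            q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
            QueryResponse rsp = solr.query(q);
            for (SolrDocument doc : rsp.getResults()) {
                process(doc);
            }
            String next = rsp.getNextCursorMark();
            if (cursor.equals(next)) {
                break;  // cursor did not advance: slice exhausted
            }
            cursor = next;
        }
    }

    static void process(SolrDocument doc) {
        System.out.println(doc.getFieldValue("id"));
    }
}

Each worker only ever asks for the next 1000 docs of its own slice, so no node should have to buffer start+rows documents the way the start/rows approach does.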
On Mon, Mar 17, 2014 at 5:41 PM, Greg Pendlebury <greg.pendleb...@gmail.com> wrote:

> Shouldn't all deep pagination against a cluster use the new cursor mark
> feature instead of 'start' and 'rows'?
>
> 4 or 5 requests still seems a very low limit to be running into an OOM
> issue though, so perhaps it is both issues combined?
>
> Ta,
> Greg
>
>
> On 18 March 2014 07:49, Mike Hugo <m...@piragua.com> wrote:
>
> > Thanks!
> >
> >
> > On Mon, Mar 17, 2014 at 3:47 PM, Steve Rowe <sar...@gmail.com> wrote:
> >
> > > Mike,
> > >
> > > Days. I plan on making a 4.7.1 release candidate a week from today,
> > > and assuming nobody finds any problems with the RC, it will be
> > > released roughly four days thereafter (three days for voting + one
> > > day for release propagation to the Apache mirrors): i.e., next
> > > Friday-ish.
> > >
> > > Steve
> > >
> > > On Mar 17, 2014, at 4:40 PM, Mike Hugo <m...@piragua.com> wrote:
> > >
> > > > Thanks Steve,
> > > >
> > > > That certainly looks like it could be the culprit. Any word on a
> > > > release date for 4.7.1? Days? Weeks? Months?
> > > >
> > > > Mike
> > > >
> > > >
> > > > On Mon, Mar 17, 2014 at 3:31 PM, Steve Rowe <sar...@gmail.com> wrote:
> > > >
> > > > > Hi Mike,
> > > > >
> > > > > The OOM you're seeing is likely a result of the bug described in
> > > > > (and fixed by a commit under) SOLR-5875:
> > > > > <https://issues.apache.org/jira/browse/SOLR-5875>.
> > > > >
> > > > > If you can build from source, it would be great if you could
> > > > > confirm the fix addresses the issue you're facing.
> > > > >
> > > > > This fix will be part of a to-be-released Solr 4.7.1.
> > > > >
> > > > > Steve
> > > > >
> > > > > On Mar 17, 2014, at 4:14 PM, Mike Hugo <m...@piragua.com> wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > We recently upgraded to Solr Cloud 4.7 (went from a single-node
> > > > > > Solr 4.0 instance to a 3-node Solr 4.7 cluster).
> > > > > >
> > > > > > Part of our application does an automated traversal of all
> > > > > > documents that match a specific query. It does this by iterating
> > > > > > through results using the start and rows parameters, starting
> > > > > > with start=0 and rows=1000, then start=1000, rows=1000,
> > > > > > start=2000, rows=1000, and so on.
> > > > > >
> > > > > > We do this in parallel with multiple workers on multiple nodes.
> > > > > > It's easy to chunk up the work by figuring out how many total
> > > > > > results there are, creating chunks (0-1000, 1000-2000,
> > > > > > 2000-3000), and sending each chunk to a worker in a pool of
> > > > > > multi-threaded workers.
> > > > > >
> > > > > > This worked well for us with a single server. However, upon
> > > > > > upgrading to Solr Cloud, we've found that this quickly (within
> > > > > > the first 4 or 5 requests) causes an OutOfMemory error on the
> > > > > > coordinating node that receives the query. I don't fully
> > > > > > understand what's going on here, but it looks like the
> > > > > > coordinating node receives the query and sends it to the shard
> > > > > > requested. For example, given:
> > > > > >
> > > > > > shards=shard3&sort=id+asc&start=4000&q=*:*&rows=1000
> > > > > >
> > > > > > the coordinating node sends this query to shard3:
> > > > > >
> > > > > > NOW=1395086719189&shard.url=http://shard3_url_goes_here:8080/solr/collection1/&fl=id&sort=id+asc&start=0&q=*:*&distrib=false&wt=javabin&isShard=true&fsv=true&version=2&rows=5000
> > > > > >
> > > > > > Notice the rows parameter is 5000 (start + rows). If the
> > > > > > coordinating node is able to process the result set (this works
> > > > > > for the first few pages; after that it quickly runs out of
> > > > > > memory), it eventually issues this request back to shard3:
> > > > > >
> > > > > > NOW=1395086719189&shard.url=http://10.128.215.226:8080/extera-search/gemindex/&start=4000&ids=a..bunch...(1000)..of..doc..ids..go..here&q=*:*&distrib=false&wt=javabin&isShard=true&version=2&rows=1000
> > > > > >
> > > > > > and then finally returns the response to the client.
> > > > > >
> > > > > > One possible workaround: we've found that if we issue
> > > > > > non-distributed requests to specific shards, we get performance
> > > > > > along the same lines as before. E.g., issue a query with
> > > > > > shards=shard3&distrib=false directly to the URL of the shard3
> > > > > > instance, rather than going through the CloudSolrServer SolrJ
> > > > > > API.
> > > > > >
> > > > > > The other workaround is to adapt to use the new cursorMark
> > > > > > functionality. I've manually tried a few requests and it is
> > > > > > pretty efficient, and doesn't result in OOM errors on the
> > > > > > coordinating node. However, I've only done this in a
> > > > > > single-threaded manner. I'm wondering if there would be a way to
> > > > > > get cursor marks for an entire result set at a given page
> > > > > > interval, so that they could be fed to the pool of parallel
> > > > > > workers to fetch the results in parallel rather than single
> > > > > > threaded. Is there a way to do this so we could process the
> > > > > > results in parallel?
> > > > > >
> > > > > > Any other possible solutions? Thanks in advance.
> > > > > >
> > > > > > Mike
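P.S. For anyone finding this in the archives: the non-distributed direct-to-shard workaround described above looks roughly like this in SolrJ (also an untested sketch; the host, port, and core path are placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DirectShardPage {
    public static void main(String[] args) throws Exception {
        // Talk to one replica of shard3 directly, so no coordinating node
        // has to buffer start+rows documents while merging.
        HttpSolrServer shard3 = new HttpSolrServer("http://shard3-host:8080/solr/collection1");

        SolrQuery q = new SolrQuery("*:*");
        q.set("distrib", false);               // keep the query local to this core
        q.addSort("id", SolrQuery.ORDER.asc);
        q.setStart(4000);
        q.setRows(1000);

        QueryResponse rsp = shard3.query(q);
        System.out.println("got " + rsp.getResults().size() + " docs");
    }
}

Deep paging still builds a start+rows priority queue on that one core, but nothing is merged or buffered on a coordinating node, which matches the single-server behavior we saw before the upgrade.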