Re: Another accumulator question
I'm not sure if it's an exact match, or just very close :-)

I don't think our problem is the workload on the driver; I think it's just memory. So while the solution proposed there would work, I believe it would also be sufficient for our purposes simply to clear each block as soon as it has been added into the canonical version, and to do so as early as possible. I could be misunderstanding some of the timing, though; I'm still investigating.

Combining on the worker before returning, as he suggests, would probably be even better.

On Fri, Nov 21, 2014 at 6:08 PM, Andrew Ash wrote:
> Hi Nathan,
>
> It sounds like what you're asking for has already been filed as
> https://issues.apache.org/jira/browse/SPARK-664
>
> Does that ticket match what you're proposing?
>
> Andrew

--
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone: +1-416-203-3003 x 238
Email: nkronenf...@oculusinfo.com
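The two ideas in this message (clearing each block as soon as it has been merged into the canonical version, and combining on the worker before anything is returned) can be illustrated with a small plain-Python sketch. This is not Spark code and does not describe what SPARK-664 actually proposes; the worker ids and the helper names `add_into`, `combine_per_worker`, and `combine_on_driver` are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def add_into(total: List[int], block: List[int]) -> None:
    """Add `block` element-wise into `total`, in place."""
    for i, value in enumerate(block):
        total[i] += value


def combine_per_worker(
    task_blocks: Iterable[Tuple[str, List[int]]], size: int
) -> Dict[str, List[int]]:
    """Merge every task's block into one running block per (hypothetical) worker."""
    per_worker: Dict[str, List[int]] = defaultdict(lambda: [0] * size)
    for worker_id, block in task_blocks:
        add_into(per_worker[worker_id], block)
        block.clear()  # empty the task's copy as soon as it has been merged
    return dict(per_worker)


def combine_on_driver(per_worker: Dict[str, List[int]], size: int) -> List[int]:
    """Merge the per-worker blocks; the driver sees one block per worker, not per task."""
    canonical = [0] * size
    for block in per_worker.values():
        add_into(canonical, block)
        block.clear()  # same idea on the driver side
    return canonical
```

With 400 tasks spread over a handful of workers, the driver in this sketch would receive a handful of blocks rather than 400, which is the memory saving being discussed.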
Re: Another accumulator question
Hi Nathan,

It sounds like what you're asking for has already been filed as
https://issues.apache.org/jira/browse/SPARK-664

Does that ticket match what you're proposing?

Andrew

On Fri, Nov 21, 2014 at 12:29 PM, Nathan Kronenfeld <nkronenf...@oculusinfo.com> wrote:
> We've done this with reduce - that definitely works.
>
> I've reworked the logic to use accumulators because, when it works, it's
> 5-10x faster.
Re: Another accumulator question
We've done this with reduce - that definitely works.

I've reworked the logic to use accumulators because, when it works, it's 5-10x faster.

On Fri, Nov 21, 2014 at 4:44 AM, Sean Owen wrote:
> This sounds more like a use case for reduce, or fold? It sounds like
> you're kind of cobbling together the same function on accumulators,
> when reduce/fold are simpler and have the behavior you suggest.
Re: Another accumulator question
This sounds more like a use case for reduce, or fold? It sounds like you're kind of cobbling together the same function on accumulators, when reduce/fold are simpler and have the behavior you suggest.

On Fri, Nov 21, 2014 at 5:46 AM, Nathan Kronenfeld wrote:
> I think I understand what is going on here, but I was hoping someone could
> confirm (or explain reality if I don't) what I'm seeing.
>
> We are collecting data using a rather sizable accumulator - essentially, an
> array of tens of thousands of entries. All told, about 1.3 MB of data.
>
> If I understand things correctly, it looks to me like, when our job is done,
> a copy of this array is retrieved from each individual task, all at once,
> for combination on the client - which means, with 400 tasks to the job, each
> collection is using up half a gig of memory on the client.
>
> Is this true? If so, does anyone know a way to get accumulators to
> accumulate as results collect, rather than all at once at the end, so we
> only have to hold a few in memory at a time, rather than all 400?
>
> Thanks,
> -Nathan
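The shape of the reduce/fold suggestion can be sketched without Spark at all: an associative, commutative element-wise add, folded over the per-task arrays starting from a zero array. The sketch below uses Python's standard `functools.reduce` as a stand-in; `add_arrays` and `fold_arrays` are illustrative names, and nothing here reproduces the poster's actual job.

```python
from functools import reduce
from typing import Iterable, List


def add_arrays(left: List[int], right: List[int]) -> List[int]:
    """Element-wise sum; associative and commutative, so merge order does not matter."""
    return [a + b for a, b in zip(left, right)]


def fold_arrays(arrays: Iterable[List[int]], size: int) -> List[int]:
    """Fold all arrays into one, starting from a zero array of length `size`."""
    zero = [0] * size
    return reduce(add_arrays, arrays, zero)
```

Because the combining function only ever sees two arrays at a time, nothing in this formulation requires all 400 copies to be present together; whether a given Spark version actually merges them that way on the driver is a separate question.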
Another accumulator question
I think I understand what is going on here, but I was hoping someone could confirm (or explain reality if I don't) what I'm seeing.

We are collecting data using a rather sizable accumulator - essentially, an array of tens of thousands of entries. All told, about 1.3 MB of data.

If I understand things correctly, it looks to me like, when our job is done, a copy of this array is retrieved from each individual task, all at once, for combination on the client - which means, with 400 tasks to the job, each collection is using up half a gig of memory on the client.

Is this true? If so, does anyone know a way to get accumulators to accumulate as results collect, rather than all at once at the end, so we only have to hold a few in memory at a time, rather than all 400?

Thanks,
-Nathan
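The "half a gig" figure, and the difference between holding every task's copy at once and merging copies as they arrive, can be checked with a small plain-Python sketch. It is not Spark code and makes no claim about how the driver really schedules merges; it assumes "1.3m" means roughly 1.3 MB per copy, and the names `peak_bytes`, `task_results`, `merge_all_at_once`, and `merge_incrementally` are made up for illustration.

```python
from typing import Iterator, List

NUM_TASKS = 400          # task count quoted in the message
BYTES_PER_COPY = 1.3e6   # assumed: "about 1.3 MB" per accumulator copy


def peak_bytes(copies_held: int, bytes_per_copy: float = BYTES_PER_COPY) -> float:
    """Memory needed to hold `copies_held` accumulator copies at the same time."""
    return copies_held * bytes_per_copy


def task_results(num_tasks: int, size: int) -> Iterator[List[int]]:
    """Hypothetical stand-in for per-task accumulator arrays (all ones)."""
    for _ in range(num_tasks):
        yield [1] * size


def merge_all_at_once(num_tasks: int, size: int) -> List[int]:
    """Materialise every copy first, then combine: num_tasks copies are alive at once."""
    copies = list(task_results(num_tasks, size))
    return [sum(column) for column in zip(*copies)]


def merge_incrementally(num_tasks: int, size: int) -> List[int]:
    """Fold each copy into a running total; at most two copies are alive at any moment."""
    total = [0] * size
    for copy in task_results(num_tasks, size):
        for i, value in enumerate(copy):
            total[i] += value
    return total


if __name__ == "__main__":
    print(peak_bytes(NUM_TASKS) / 1e6, "MB if all copies are held together")
    print(peak_bytes(2) / 1e6, "MB if copies are merged as they arrive")
```

Both merge functions return the same totals; only the number of copies resident at the same time differs, which is the behaviour the question is asking for.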