subject:"Another accumulator question"

Re: Another accumulator question

2014-11-21 Thread Nathan Kronenfeld

I'm not sure if it's an exact match, or just very close :-)

I don't think our problem is the workload on the driver; I think it's just
memory. So while the solution proposed there would work, I believe it would
also be sufficient for our purposes simply to clear each block as soon as
it's added into the canonical version, and to do that as early as possible.
I could be misunderstanding some of the timing, though; I'm still
investigating.

Though to combine on the worker before returning, as he suggests, would
probably be even better.
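
A rough sketch of the two ideas above, in plain Python rather than Spark: each worker folds its own tasks' blocks into a single local array before returning anything, and the driver drops each block as soon as it has been added into the canonical array. The names `add_into` and `run`, the worker count of 8, and the task output of all ones are assumptions for illustration only; none of this is Spark's accumulator API.

```python
from typing import Dict, List, Tuple


def add_into(target: List[int], block: List[int]) -> None:
    """Add block into target in place, element by element."""
    for i, value in enumerate(block):
        target[i] += value


def run(num_tasks: int, num_workers: int, size: int) -> Tuple[List[int], int]:
    # Step 1: each worker folds its own tasks' blocks into one local array.
    per_worker: Dict[int, List[int]] = {w: [0] * size for w in range(num_workers)}
    for task_id in range(num_tasks):
        block = [1] * size  # hypothetical task output
        add_into(per_worker[task_id % num_workers], block)

    # Step 2: the driver merges one block per worker and drops each block
    # as soon as it has been added to the canonical array.
    canonical = [0] * size
    blocks_received = 0
    while per_worker:
        _, block = per_worker.popitem()
        add_into(canonical, block)
        blocks_received += 1
        del block  # nothing else refers to the merged block any more
    return canonical, blocks_received


canonical, blocks_received = run(num_tasks=400, num_workers=8, size=3)
print(canonical, blocks_received)
```

Under these assumptions the driver sees one block per worker instead of one per task, and never holds more than one unmerged block at a time.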

On Fri, Nov 21, 2014 at 6:08 PM, Andrew Ash wrote:

> Hi Nathan,
>
> It sounds like what you're asking for has already been filed as
> https://issues.apache.org/jira/browse/SPARK-664  Does that ticket match
> what you're proposing?
>
> Andrew


-- 
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone:  +1-416-203-3003 x 238
Email:  nkronenf...@oculusinfo.com

Re: Another accumulator question

2014-11-21 Thread Andrew Ash

Hi Nathan,

It sounds like what you're asking for has already been filed as
https://issues.apache.org/jira/browse/SPARK-664  Does that ticket match
what you're proposing?

Andrew

On Fri, Nov 21, 2014 at 12:29 PM, Nathan Kronenfeld <
nkronenf...@oculusinfo.com> wrote:

> We've done this with reduce - that definitely works.
>
> I've reworked the logic to use accumulators because, when it works, it's
> 5-10x faster.

Re: Another accumulator question

2014-11-21 Thread Nathan Kronenfeld

We've done this with reduce - that definitely works.

I've reworked the logic to use accumulators because, when it works, it's
5-10x faster.

On Fri, Nov 21, 2014 at 4:44 AM, Sean Owen wrote:

> This sounds more like a use case for reduce, or fold? It sounds like
> you're kind of cobbling together the same function on accumulators,
> when reduce/fold are simpler and have the behavior you suggest.




Re: Another accumulator question

2014-11-21 Thread Sean Owen

This sounds more like a use case for reduce, or fold? It sounds like
you're kind of cobbling together the same function on accumulators,
when reduce/fold are simpler and have the behavior you suggest.
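
As a loose illustration of the reduce-style combination suggested here, the following uses Python's `functools.reduce` on plain lists; it is an analogy, not Spark's RDD API. The helpers `merge` and `task_results` are hypothetical stand-ins, and each fake task simply emits an array filled with its own id.

```python
from functools import reduce
from typing import Iterator, List


def merge(total: List[int], partial: List[int]) -> List[int]:
    """Element-wise sum of two equal-length arrays."""
    return [a + b for a, b in zip(total, partial)]


def task_results(num_tasks: int, size: int) -> Iterator[List[int]]:
    """Hypothetical stand-in for per-task results, produced one at a time."""
    for task_id in range(num_tasks):
        yield [task_id] * size


# reduce pulls from the generator lazily, so only the running total and the
# current partial array are alive at any moment.
combined = reduce(merge, task_results(num_tasks=400, size=5))
print(combined)
```

Element-wise addition is associative and commutative, so the partial arrays can be combined in any order, which is the property a reduce-style merge relies on.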

On Fri, Nov 21, 2014 at 5:46 AM, Nathan Kronenfeld wrote:
> I think I understand what is going on here, but I was hoping someone could
> confirm (or explain reality if I don't) what I'm seeing.
>
> We are collecting data using a rather sizable accumulator - essentially, an
> array of tens of thousands of entries.  All told, about 1.3 MB of data.
>
> If I understand things correctly, it looks to me like, when our job is done,
> a copy of this array is retrieved from each individual task, all at once,
> for combination on the client - which means, with 400 tasks to the job, each
> collection is using up half a gig of memory on the client.
>
> Is this true?  If so, does anyone know a way to get accumulators to
> accumulate as results collect, rather than all at once at the end, so we
> only have to hold a few in memory at a time, rather than all 400?
>
> Thanks,
>   -Nathan

Another accumulator question

2014-11-20 Thread Nathan Kronenfeld

I think I understand what is going on here, but I was hoping someone could
confirm (or explain reality if I don't) what I'm seeing.

We are collecting data using a rather sizable accumulator - essentially, an
array of tens of thousands of entries.  All told, about 1.3 MB of data.

If I understand things correctly, it looks to me like, when our job is
done, a copy of this array is retrieved from each individual task, all at
once, for combination on the client - which means, with 400 tasks to the
job, each collection is using up half a gig of memory on the client.

Is this true?  If so, does anyone know a way to get accumulators to
accumulate as results collect, rather than all at once at the end, so we
only have to hold a few in memory at a time, rather than all 400?

Thanks,
  -Nathan
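
The arithmetic behind the "half a gig" figure can be checked with a quick estimate. This sketch assumes the size quoted above means roughly 1.3 MB (decimal) per task-level copy, and the figure of four copies in flight for the incremental case is purely hypothetical.

```python
# Back-of-the-envelope estimate of driver memory held by accumulator copies.
# Assumption: each task-level copy of the array is about 1.3 MB (decimal).
BYTES_PER_COPY = 1_300_000
NUM_TASKS = 400


def held_bytes(copies_held_at_once: int) -> int:
    """Memory taken by the copies that are waiting to be merged."""
    return copies_held_at_once * BYTES_PER_COPY


all_at_once = held_bytes(NUM_TASKS)  # every task's copy is kept until the end
incremental = held_bytes(4)          # hypothetical: only a few copies in flight

print(f"all at once: {all_at_once / 1e6:.1f} MB")
print(f"incremental: {incremental / 1e6:.1f} MB")
```

With all 400 copies held at once this comes to about 520 MB per collection, which matches the estimate in the message; merging as results arrive would keep it to a few MB.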


