RE: Scalability of group by
Richard,

The same problem occurs with sort. I have enough disk space, including in the tmp folder. The errors in the logs say out of memory. I wonder what it holds in memory?

Alexander

From: Richard Marscher [mailto:rmarsc...@localytics.com]
Sent: Tuesday, April 28, 2015 7:34 AM
To: Ulanov, Alexander
Cc: user@spark.apache.org
Subject: Re: Scalability of group by
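(One hedged way to investigate the out-of-memory question above, assuming a Spark 1.3-era setup: enable GC logging on the executors so the logs show how full the heap is when the failure happens. spark.executor.extraJavaOptions is a real Spark property; the specific JVM flags and their use here are illustrative, not advice from the thread.)

```scala
import org.apache.spark.SparkConf

// Illustrative only: surface executor heap behavior around the OOM
// by logging GC activity and dumping the heap on failure.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
    "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+HeapDumpOnOutOfMemoryError")
```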
Re: Scalability of group by
Hi,

I can offer a few ideas to investigate regarding your issue. I've run into resource problems doing shuffle operations with much smaller datasets than 2B rows. The data is saved to disk by the BlockManager as part of the shuffle and then redistributed across the cluster according to the group-by keys, so the data is replicated during the operation.

I'd suggest allocating more memory to your executors. You might also want to configure the shuffle behavior more explicitly (https://spark.apache.org/docs/latest/configuration.html#shuffle-behavior).

Also check the disk usage on the worker nodes: in our case we started with little disk space and were running out of temporary space for the shuffle operation. You should be able to find clearer errors in the worker-node logs as well, if you haven't checked them yet.
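(A minimal sketch of the kind of shuffle configuration Richard points to, assuming Spark 1.3; the property names are real settings from the linked page, but every value here is an illustrative assumption, not something verified in this thread.)

```scala
import org.apache.spark.SparkConf

// Illustrative Spark 1.3 shuffle tuning; all values are assumptions.
val conf = new SparkConf()
  .set("spark.executor.memory", "10g")        // more heap per executor
  .set("spark.shuffle.memoryFraction", "0.3") // heap fraction for shuffle buffers before spilling (default 0.2)
  .set("spark.shuffle.spill", "true")         // spill to disk rather than OOM (the default)
  .set("spark.local.dir", "/mnt/spark-tmp")   // hypothetical directory with room for spill files
```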
RE: Scalability of group by
It works on a smaller dataset of 100 rows. I could probably find the size at which it fails using binary search, but that would not help me, because I need to work with 2B rows.

From: ayan guha [mailto:guha.a...@gmail.com]
Sent: Monday, April 27, 2015 6:58 PM
To: Ulanov, Alexander
Cc: user@spark.apache.org
Subject: Re: Scalability of group by
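(A hedged sketch of the size probing discussed above, assuming the RDD from the original message is bound to a variable named data; RDD.sample is a real Spark API, and the fractions are arbitrary.)

```scala
// Run the failing query's input at a few fractions of the full 2B rows
// to see roughly where executors start getting lost.
val fractions = Seq(0.001, 0.01, 0.1)
for (f <- fractions) {
  val subset = data.sample(withReplacement = false, fraction = f, seed = 42L)
  println(s"fraction $f -> ${subset.count()} rows")
}
```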
Re: Scalability of group by
Hi,

Can you test on a smaller dataset to identify whether it is a cluster issue or a scaling issue in Spark?

On 28 Apr 2015 11:30, "Ulanov, Alexander" wrote:

> Hi,
>
> I am running a group by on a dataset of 2B rows of RDD[Row[id, time,
> value]] in Spark 1.3, as follows:
>
> “select id, time, first(value) from data group by id, time”
>
> My cluster is 8 nodes with 16GB RAM and one worker per node. Each
> executor is allocated 5GB of memory. However, all executors are being
> lost during the query execution and I get “ExecutorLostFailure”.
>
> Could you suggest what might be the reason for it? Could it be that
> “group by” is implemented as RDD.groupBy, so it holds the group-by
> result in memory? What is the workaround?
>
> Best regards, Alexander
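(On the workaround question in the quoted message, a hedged sketch: when only one value per (id, time) key is needed, a map-side-combining aggregation such as reduceByKey avoids buffering entire groups the way groupBy does. The variable name data, the column types, and treating "first" as "any one value" are all assumptions, not from the thread.)

```scala
import org.apache.spark.sql.Row

// Hypothetical rewrite of “select id, time, first(value) ... group by id, time”
// on the RDD API. reduceByKey combines values map-side before the shuffle,
// so no per-key group is ever materialized; keeping either value matches
// first(), which is non-deterministic in this query anyway.
val firstPerKey = data
  .map { case Row(id: Long, time: Long, value: Double) => ((id, time), value) }
  .reduceByKey((a, b) => a)
```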