Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-09-23 Thread tridib
Setting spark.sql.shuffle.partitions = 2000 solved my issue. I am now able
to join two 1-billion-row tables in 3 minutes.
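
For reference, this is the kind of one-liner that does it (a sketch assuming
a HiveContext named sqlContext, as in the original post):

// raise the number of post-shuffle partitions before running the join;
// 2000 worked here, but tune it for your own data volume
sqlContext.sql("SET spark.sql.shuffle.partitions=2000")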



Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-09-22 Thread tridib
By skewed, did you mean the data is not distributed uniformly across
partitions? All of my columns are strings of almost the same size, e.g.:

id1,field11,field12
id2,field21,field22
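
For reference, a sketch of how the key distribution can be checked (the
table name a and the join key column id are assumptions from the example
above):

// list the most frequent join keys -- a heavy head here means skew
val top = sqlContext.sql(
  "SELECT id, COUNT(*) AS cnt FROM a GROUP BY id ORDER BY cnt DESC LIMIT 20")
top.collect().foreach(println)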




Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-09-22 Thread dmytro
Could it be that your data is skewed? Do you have variable-length column
types?



RE: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-09-22 Thread java8964
Or at least tell us how many partitions you are using.
Yong

> Date: Tue, 22 Sep 2015 02:06:15 -0700
> From: belevts...@gmail.com
> To: user@spark.apache.org
> Subject: Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables
> 
> Could it be that your data is skewed? Do you have variable-length column
> types?


Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-09-21 Thread tridib
Did you ever find a solution to this? I am running into the same issue.



Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-05-04 Thread ayan guha
You can use a custom partitioner to redistribute the data by the join key
using partitionBy.
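
Roughly along these lines, going through the underlying RDD (dfA, dfB, the
key position, and the partition count are placeholders, not tested against
your data):

import org.apache.spark.HashPartitioner

// key both tables by the join column, co-partition them, then join;
// getString(0) assumes the join key is the first (string) column
val part = new HashPartitioner(2000)
val left = dfA.rdd.map(r => (r.getString(0), r)).partitionBy(part)
val right = dfB.rdd.map(r => (r.getString(0), r)).partitionBy(part)
val joined = left.join(right)
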
On 4 May 2015 15:37, Nick Travers n.e.trav...@gmail.com wrote:

 I'm currently trying to join two large tables (order 1B rows each) using
 Spark SQL (1.3.0) and am running into long GC pauses which bring the job to
 a halt.

 I'm reading in both tables using a HiveContext with the underlying files
 stored as Parquet files. I'm using something along the lines of
 HiveContext.sql("SELECT a.col1, b.col2 FROM a JOIN b ON a.col1 = b.col1")
 to set up the join.

 When I execute this (with an action such as .count) I see the first few
 stages complete, but the job eventually stalls. The GC counts keep
 increasing for each executor.

 Running with 6 workers, each with 2T disk and 100GB RAM.

 Has anyone else run into this issue? I'm thinking I might be running into
 issues with the shuffling of the data, but I'm unsure of how to get around
 this? Is there a way to redistribute the rows based on the join key first,
 and then do the join?

 Thanks in advance.



Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-05-04 Thread Nick Travers
Could you be more specific in how this is done?

The DataFrame class doesn't have that method.

On Sun, May 3, 2015 at 11:07 PM, ayan guha guha.a...@gmail.com wrote:

 You can use a custom partitioner to redistribute the data by the join key
 using partitionBy.
 On 4 May 2015 15:37, Nick Travers n.e.trav...@gmail.com wrote:

 I'm currently trying to join two large tables (order 1B rows each) using
 Spark SQL (1.3.0) and am running into long GC pauses which bring the job
 to
 a halt.

 I'm reading in both tables using a HiveContext with the underlying files
 stored as Parquet files. I'm using something along the lines of
 HiveContext.sql("SELECT a.col1, b.col2 FROM a JOIN b ON a.col1 = b.col1")
 to set up the join.

 When I execute this (with an action such as .count) I see the first few
 stages complete, but the job eventually stalls. The GC counts keep
 increasing for each executor.

 Running with 6 workers, each with 2T disk and 100GB RAM.

 Has anyone else run into this issue? I'm thinking I might be running into
 issues with the shuffling of the data, but I'm unsure of how to get around
 this? Is there a way to redistribute the rows based on the join key first,
 and then do the join?

 Thanks in advance.



Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-05-04 Thread Richard Marscher
In regards to the large GC pauses: assuming you allocated all 100GB of
memory per worker, you may consider running with less memory on your Worker
nodes, or splitting the available memory on the Worker nodes amongst
several worker instances. The JVM's garbage collection becomes very slow as
the heap grows large; at 100GB it would not be unusual to see GC take
minutes at a time. I believe with Yarn or Standalone clusters you should be
able to run multiple smaller JVM instances on your workers, so you can
still use your cluster resources but won't have a single JVM allocating an
unwieldy amount of memory.
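
For the Standalone case, a sketch of what that could look like in
conf/spark-env.sh on each worker node (the 4 x 25GB split is purely
illustrative, not a tuned recommendation):

# run several smaller worker JVMs per node instead of one 100GB JVM
export SPARK_WORKER_INSTANCES=4   # worker JVMs per node (illustrative)
export SPARK_WORKER_MEMORY=25g    # memory each worker can hand to executors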

On Mon, May 4, 2015 at 2:23 AM, Nick Travers n.e.trav...@gmail.com wrote:

 Could you be more specific in how this is done?

 A DataFrame class doesn't have that method.

 On Sun, May 3, 2015 at 11:07 PM, ayan guha guha.a...@gmail.com wrote:

 You can use a custom partitioner to redistribute the data by the join key
 using partitionBy.
 On 4 May 2015 15:37, Nick Travers n.e.trav...@gmail.com wrote:

 I'm currently trying to join two large tables (order 1B rows each) using
 Spark SQL (1.3.0) and am running into long GC pauses which bring the job
 to
 a halt.

 I'm reading in both tables using a HiveContext with the underlying files
 stored as Parquet files. I'm using something along the lines of
 HiveContext.sql("SELECT a.col1, b.col2 FROM a JOIN b ON a.col1 = b.col1")
 to set up the join.

 When I execute this (with an action such as .count) I see the first few
 stages complete, but the job eventually stalls. The GC counts keep
 increasing for each executor.

 Running with 6 workers, each with 2T disk and 100GB RAM.

 Has anyone else run into this issue? I'm thinking I might be running into
 issues with the shuffling of the data, but I'm unsure of how to get
 around
 this? Is there a way to redistribute the rows based on the join key
 first,
 and then do the join?

 Thanks in advance.



Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-05-04 Thread Michael Armbrust
If your data is evenly distributed (i.e. no skewed data points in your join
keys), it can also help to increase spark.sql.shuffle.partitions (the
default is 200).
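
For example (a sketch, assuming a HiveContext/SQLContext named sqlContext;
the value itself should be tuned to your data volume):

// raise the number of post-shuffle partitions before running the join
sqlContext.setConf("spark.sql.shuffle.partitions", "2000")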

On Mon, May 4, 2015 at 8:03 AM, Richard Marscher rmarsc...@localytics.com
wrote:

 In regards to the large GC pauses: assuming you allocated all 100GB of
 memory per worker, you may consider running with less memory on your Worker
 nodes, or splitting the available memory on the Worker nodes amongst
 several worker instances. The JVM's garbage collection becomes very slow as
 the heap grows large; at 100GB it would not be unusual to see GC take
 minutes at a time. I believe with Yarn or Standalone clusters you should be
 able to run multiple smaller JVM instances on your workers, so you can
 still use your cluster resources but won't have a single JVM allocating an
 unwieldy amount of memory.

 On Mon, May 4, 2015 at 2:23 AM, Nick Travers n.e.trav...@gmail.com
 wrote:

 Could you be more specific in how this is done?

 A DataFrame class doesn't have that method.

 On Sun, May 3, 2015 at 11:07 PM, ayan guha guha.a...@gmail.com wrote:

 You can use a custom partitioner to redistribute the data by the join key
 using partitionBy.
 On 4 May 2015 15:37, Nick Travers n.e.trav...@gmail.com wrote:

 I'm currently trying to join two large tables (order 1B rows each) using
 Spark SQL (1.3.0) and am running into long GC pauses which bring the
 job to
 a halt.

 I'm reading in both tables using a HiveContext with the underlying files
 stored as Parquet files. I'm using something along the lines of
 HiveContext.sql("SELECT a.col1, b.col2 FROM a JOIN b ON a.col1 = b.col1")
 to set up the join.

 When I execute this (with an action such as .count) I see the first few
 stages complete, but the job eventually stalls. The GC counts keep
 increasing for each executor.

 Running with 6 workers, each with 2T disk and 100GB RAM.

 Has anyone else run into this issue? I'm thinking I might be running
 into
 issues with the shuffling of the data, but I'm unsure of how to get
 around
 this? Is there a way to redistribute the rows based on the join key
 first,
 and then do the join?

 Thanks in advance.



Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-05-03 Thread Nick Travers
I'm currently trying to join two large tables (order 1B rows each) using
Spark SQL (1.3.0) and am running into long GC pauses which bring the job to
a halt.

I'm reading in both tables using a HiveContext with the underlying files
stored as Parquet files. I'm using something along the lines of
HiveContext.sql("SELECT a.col1, b.col2 FROM a JOIN b ON a.col1 = b.col1") to
set up the join.

When I execute this (with an action such as .count) I see the first few
stages complete, but the job eventually stalls. The GC counts keep
increasing for each executor.

Running with 6 workers, each with 2T disk and 100GB RAM.

Has anyone else run into this issue? I'm thinking I might be running into
issues with the shuffling of the data, but I'm unsure of how to get around
this? Is there a way to redistribute the rows based on the join key first,
and then do the join?

Thanks in advance.


