Re: Spark SQL performance issue.

2015-04-23 Thread ayan guha
Quick questions: why are you caching both the RDD and the table?
Which stage of the job is slow?
On 23 Apr 2015 17:12, Nikolay Tikhonov tikhonovnico...@gmail.com wrote:

 Hi,
 I have a Spark SQL performance issue. My code contains a simple JavaBean:

 public class Person implements Externalizable {
 private int id;
 private String name;
 private double salary;
 
 }


 Apply a schema to an RDD and register the table.

 JavaRDD<Person> rdds = ...
 rdds.cache();

 DataFrame dataFrame = sqlContext.createDataFrame(rdds, Person.class);
 dataFrame.registerTempTable("person");

 sqlContext.cacheTable("person");


 Run the SQL query.

 sqlContext.sql("SELECT id, name, salary FROM person WHERE salary >= YYY AND salary <= XXX").collectAsList();


 I launched a standalone cluster with 4 workers. Each node runs on a
 machine with 8 CPUs and 15 GB of memory. When I run the query in this
 environment over an RDD containing 1 million persons, it takes 1 minute.
 Can somebody tell me how to tune the performance?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-performance-issue-tp22627.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Spark SQL performance issue.

2015-04-23 Thread Nikolay Tikhonov
 why are you caching both the RDD and the table?
I am trying to cache all the data to avoid poor performance on the first
query. Is that right?

 Which stage of the job is slow?
The query is run many times on one sqlContext, and each query execution
takes 1 second.
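
One way to tell a slow first run apart from slow later runs is to time each execution individually. The following is a minimal sketch in plain Java, not code from this thread: `QueryTimer` and `timeRuns` are hypothetical names, and the sleeping task in `main` is only a stand-in for the `sqlContext.sql(...).collectAsList()` call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

public class QueryTimer {

    /** Runs the task `runs` times and returns each wall-clock duration in milliseconds. */
    public static <T> List<Long> timeRuns(Callable<T> task, int runs) throws Exception {
        List<Long> millis = new ArrayList<>();
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.call();  // the result is discarded; only the elapsed time is recorded
            millis.add((System.nanoTime() - start) / 1_000_000L);
        }
        return millis;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in workload (assumption): replace with the real query call.
        List<Long> times = timeRuns(() -> { Thread.sleep(5); return null; }, 3);
        System.out.println("first run: " + times.get(0) + " ms");
        System.out.println("later runs: " + times.subList(1, times.size()) + " ms");
    }
}
```

If the first entry is much larger than the rest, the time is probably going into materializing the cached data rather than into the query itself; if all entries are similar, the per-query cost is the place to look.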


Re: Spark SQL performance issue.

2015-04-23 Thread Arush Kharbanda
Hi

Can you share your Web UI showing the task-level breakdown? I can see
several things that can be improved.

1. The RDD-level caching (JavaRDD<Person> rdds = ...; rdds.cache();) is not
needed, as you are not reading the RDD directly in any action.

2. Instead of collecting the results as a list, it would be better to save
them as a text file, since that avoids moving the results to the driver.

Thanks
Arush


-- 

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com