Re: Spark SQL performance issue.
Quick questions: why are you caching both the RDD and the table? Which stage of the job is slow?

On 23 Apr 2015 17:12, Nikolay Tikhonov tikhonovnico...@gmail.com wrote:

> Hi,
>
> I have a Spark SQL performance issue. My code contains a simple JavaBean:
>
>     public class Person implements Externalizable {
>         private int id;
>         private String name;
>         private double salary;
>     }
>
> Apply a schema to an RDD and register a table:
>
>     JavaRDD<Person> rdds = ...
>     rdds.cache();
>     DataFrame dataFrame = sqlContext.createDataFrame(rdds, Person.class);
>     dataFrame.registerTempTable("person");
>     sqlContext.cacheTable("person");
>
> Run the SQL query:
>
>     sqlContext.sql("SELECT id, name, salary FROM person WHERE salary >= YYY AND salary <= XXX").collectAsList()
>
> I launched a standalone cluster with 4 workers. Each node runs on a machine with 8 CPUs and 15 GB of memory. When I run the query in this environment over an RDD that contains 1 million persons, it takes 1 minute. Can somebody tell me how to tune the performance?
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-performance-issue-tp22627.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
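The Person bean in the original post is shown with fields only. Below is a minimal sketch of what a complete version might look like, assuming the usual JavaBean accessors and a field-by-field Externalizable implementation. The constructors, getters and setters, and the roundTrip helper are assumptions for illustration and do not appear in the post. As far as I know, createDataFrame(rdd, Person.class) infers its columns from the bean's getters by reflection, so the accessors matter for the schema.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

// Sketch only. The post declares this class public; it is package-private here
// so the example compiles from a single source file of any name.
class Person implements Externalizable {
    private int id;
    private String name;
    private double salary;

    // Externalizable needs a public no-argument constructor for deserialization.
    public Person() {
    }

    // Convenience constructor (an assumption, not shown in the post).
    public Person(int id, String name, double salary) {
        this.id = id;
        this.name = name;
        this.salary = salary;
    }

    // JavaBean accessors (assumed; the post lists only the fields).
    public int getId() { return id; }
    public void setId(int id) { this.id = id; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public double getSalary() { return salary; }
    public void setSalary(double salary) { this.salary = salary; }

    // Write the fields in a fixed order.
    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeInt(id);
        out.writeUTF(name); // assumes name is non-null
        out.writeDouble(salary);
    }

    // Read the fields back in the same order they were written.
    @Override
    public void readExternal(ObjectInput in) throws IOException {
        id = in.readInt();
        name = in.readUTF();
        salary = in.readDouble();
    }

    // Hypothetical helper: serialize to memory and deserialize again.
    static Person roundTrip(Person p) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(p);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Person) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }
}
```

Calling Person.roundTrip(new Person(7, "Alice", 1234.5)) should give back an object with the same id, name, and salary, which is a quick way to check that writeExternal and readExternal agree on field order.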
Re: Spark SQL performance issue.
> why are you caching both the RDD and the table?

I'm trying to cache all the data to avoid bad performance on the first query. Is that right?

> Which stage of the job is slow?

The query is run many times on one sqlContext, and each query execution takes 1 second.

2015-04-23 11:33 GMT+03:00 ayan guha guha.a...@gmail.com:

> Quick questions: why are you caching both the RDD and the table? Which stage of the job is slow?
Re: Spark SQL performance issue.
Hi,

Can you share your Web UI, showing the task-level breakdown? I can see several things that could be improved.

1. JavaRDD<Person> rdds = ... rdds.cache();
   This caching is not needed, as you are not reading the RDD in any action.
2. Instead of collecting the results as a list, it would be better to save them as a text file, since that avoids moving the results to the driver.

Thanks,
Arush

On Thu, Apr 23, 2015 at 2:47 PM, Nikolay Tikhonov tikhonovnico...@gmail.com wrote:

> > why are you caching both the RDD and the table?
>
> I'm trying to cache all the data to avoid bad performance on the first query. Is that right?
>
> > Which stage of the job is slow?
>
> The query is run many times on one sqlContext, and each query execution takes 1 second.
--
Arush Kharbanda || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com