> Why are you caching both the RDD and the table?
I am trying to cache all the data to avoid poor performance on the first query. Is that right?
> Which stage of the job is slow?
The query is run many times on one sqlContext, and each query execution takes 1 second.

2015-04-23 11:33 GMT+03:00 ayan guha <guha.a...@gmail.com>:

> Quick questions: why are you caching both the RDD and the table?
> Which stage of the job is slow?
> On 23 Apr 2015 17:12, "Nikolay Tikhonov" <tikhonovnico...@gmail.com>
> wrote:
>
>> Hi,
>> I have a Spark SQL performance issue. My code contains a simple JavaBean:
>>
>>     public class Person implements Externalizable {
>>         private int id;
>>         private String name;
>>         private double salary;
>>         ....................
>>     }
>>
>>
>> Apply a schema to an RDD and register a table.
>>
>>     JavaRDD<Person> rdds = ...
>>     rdds.cache();
>>
>>     DataFrame dataFrame = sqlContext.createDataFrame(rdds, Person.class);
>>     dataFrame.registerTempTable("person");
>>
>>     sqlContext.cacheTable("person");
>>
>>
>> Run the SQL query.
>>
>>     sqlContext.sql("SELECT id, name, salary FROM person WHERE salary >= YYY AND salary <= XXX").collectAsList()
>>
>>
>> I launched a standalone cluster with 4 workers. Each node runs on a
>> machine with 8 CPUs and 15 GB of memory. When I run the query in this
>> environment over an RDD that contains 1 million persons, it takes
>> 1 minute. Can somebody tell me how to tune the performance?
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-performance-issue-tp22627.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
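
For reference, the Person bean in the quoted message is elided after its three fields. Below is a minimal sketch of how such a bean could be completed. It is not the original poster's code: the constructors, the getters and setters, the field order used in writeExternal/readExternal, and the roundTrip helper are all assumptions made for illustration. Two points it tries to show: Externalizable needs a public no-argument constructor for deserialization, and createDataFrame(rdds, Person.class) infers the columns from the bean's properties, so the getters need to be present.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

// Hypothetical completion of the elided Person bean from the thread.
public class Person implements Externalizable {
    private int id;
    private String name;
    private double salary;

    // Externalizable deserialization calls this public no-arg constructor.
    public Person() { }

    // Convenience constructor (assumption, not shown in the original post).
    public Person(int id, String name, double salary) {
        this.id = id;
        this.name = name;
        this.salary = salary;
    }

    // JavaBean accessors; bean-based schema inference reads these properties.
    public int getId() { return id; }
    public void setId(int id) { this.id = id; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public double getSalary() { return salary; }
    public void setSalary(double salary) { this.salary = salary; }

    // Write the fields in a fixed order; readExternal must mirror it exactly.
    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeInt(id);
        out.writeObject(name); // writeObject tolerates a null name
        out.writeDouble(salary);
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
        id = in.readInt();
        name = (String) in.readObject();
        salary = in.readDouble();
    }

    // Illustrative helper: serialize to a byte array and read the copy back.
    static Person roundTrip(Person p) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(p);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Person) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }

    public static void main(String[] args) {
        Person copy = roundTrip(new Person(1, "Ann", 1500.0));
        System.out.println(copy.getId() + " " + copy.getName() + " " + copy.getSalary());
    }
}
```

The round trip in main is only a quick way to check that writeExternal and readExternal agree on field order before the class is shipped to the workers; it says nothing about where the query time goes.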