Re: Spark SQL performance issue.

Nikolay Tikhonov Thu, 23 Apr 2015 02:19:25 -0700

> Why are you caching both the RDD and the table?
I'm trying to cache all the data up front so that the first query isn't
slow. Is that the right approach?


> Which stage of the job is slow?
The query is run many times on one sqlContext, and each execution
takes 1 second.
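
For reference, here is a minimal sketch of how the per-query time could be measured so that the first (cold) run is reported separately from later (warm) runs. `QueryTimer`, `timeRuns`, and `warmAverageMs` are hypothetical helper names, not part of Spark, and the `Supplier` in `main` is only a stand-in for a call such as `sqlContext.sql(...).collectAsList()`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical helper: times repeated executions of a query-like call. */
final class QueryTimer {

    /** Runs the query `runs` times; returns each run's wall-clock time in milliseconds. */
    static <T> long[] timeRuns(Supplier<T> query, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        long[] elapsedMs = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            query.get(); // result is discarded; only the elapsed time matters here
            elapsedMs[i] = (System.nanoTime() - start) / 1_000_000L;
        }
        return elapsedMs;
    }

    /** Average of all runs except the first; with a single run, returns that run. */
    static double warmAverageMs(long[] elapsedMs) {
        if (elapsedMs.length == 0) {
            throw new IllegalArgumentException("no runs recorded");
        }
        if (elapsedMs.length == 1) {
            return elapsedMs[0];
        }
        long sum = 0;
        for (int i = 1; i < elapsedMs.length; i++) {
            sum += elapsedMs[i];
        }
        return (double) sum / (elapsedMs.length - 1);
    }

    public static void main(String[] args) {
        // Assumption: this supplier stands in for
        // () -> sqlContext.sql("SELECT ...").collectAsList()
        Supplier<List<Integer>> query = () -> {
            List<Integer> rows = new ArrayList<>();
            for (int i = 0; i < 1_000_000; i++) {
                rows.add(i);
            }
            return rows;
        };

        long[] times = timeRuns(query, 5);
        System.out.println("first run: " + times[0] + " ms");
        System.out.println("later runs (avg): " + warmAverageMs(times) + " ms");
    }
}
```

Reporting the first run separately should show whether the cost is paid once, while the cache is being populated, or on every execution.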

2015-04-23 11:33 GMT+03:00 ayan guha <[email protected]>:

> Quick questions: why are you caching both the RDD and the table?
> Which stage of the job is slow?
> On 23 Apr 2015 17:12, "Nikolay Tikhonov" <[email protected]>
> wrote:
>
>> Hi,
>> I have a Spark SQL performance issue. My code contains a simple JavaBean:
>>
>>     public class Person implements Externalizable {
>>         private int id;
>>         private String name;
>>         private double salary;
>>         ....................
>>     }
>>
>>
>> Apply a schema to an RDD and register it as a table.
>>
>>     JavaRDD<Person> rdds = ...
>>     rdds.cache();
>>
>>     DataFrame dataFrame = sqlContext.createDataFrame(rdds, Person.class);
>>     dataFrame.registerTempTable("person");
>>
>>     sqlContext.cacheTable("person");
>>
>>
>> Run the SQL query.
>>
>>     sqlContext.sql("SELECT id, name, salary FROM person WHERE salary >= YYY AND salary <= XXX").collectAsList()
>>
>>
>> I launched a standalone cluster with 4 workers. Each node runs on a
>> machine with 8 CPUs and 15 GB of memory. When I run the query in this
>> environment over an RDD containing 1 million persons, it takes 1 minute.
>> Can somebody tell me how to tune the performance?
>>
