Quick questions: why are you caching both the RDD and the table?
Which stage of the job is slow?
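FWIW, if the double caching turns out to be the problem, keeping only the
table cached would look roughly like this. This is just a sketch against the
same 1.3-era Java API as in your snippet (not tested here); `YYY`/`XXX` are
the placeholders from your original query:

```java
// Cache only the temp table, not the source RDD. cacheTable() stores rows in
// Spark SQL's compressed in-memory columnar format; rdds.cache() would keep a
// second, deserialized copy of the same data and compete for executor memory.
JavaRDD<Person> rdds = ...;   // source RDD, elided as in the original post
// note: no rdds.cache() call

DataFrame dataFrame = sqlContext.createDataFrame(rdds, Person.class);
dataFrame.registerTempTable("person");
sqlContext.cacheTable("person");   // single cached copy, columnar format

// The first query pays the cost of materializing the cache;
// time the second run to measure steady-state query latency.
sqlContext.sql("SELECT id, name, salary FROM person "
        + "WHERE salary >= YYY AND salary <= XXX").collectAsList();
```

Also worth checking in the UI whether the time goes into the scan/build stage
or into collecting results back to the driver.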
On 23 Apr 2015 17:12, "Nikolay Tikhonov" <[email protected]> wrote:

> Hi,
> I have Spark SQL performance issue. My code contains a simple JavaBean:
>
>     public class Person implements Externalizable {
>         private int id;
>         private String name;
>         private double salary;
>         ....................
>     }
>
>
> Apply a schema to the RDD and register a table.
>
>     JavaRDD<Person> rdds = ...
>     rdds.cache();
>
>     DataFrame dataFrame = sqlContext.createDataFrame(rdds, Person.class);
>     dataFrame.registerTempTable("person");
>
>     sqlContext.cacheTable("person");
>
>
> Run a SQL query.
>
>     sqlContext.sql("SELECT id, name, salary FROM person WHERE salary >= YYY
> AND salary <= XXX").collectAsList()
>
>
> I launch a standalone cluster with 4 workers. Each node runs on a
> machine with 8 CPUs and 15 GB of memory. When I run the query on this
> environment over an RDD containing 1 million persons, it takes 1 minute.
> Can somebody tell me how to tune the performance?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-performance-issue-tp22627.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
>
