Spark [SQL] performance tuning

2020-11-12 Thread Lakshmi Nivedita
Hi all,

I have a PySpark SQL script that loads one 80 MB table, one 2 MB table, and
three other small tables, and performs lots of joins in the script to fetch
the data.

My system configuration is

4 nodes, 300 GB RAM, 64 cores

Writing a DataFrame of about 24 MB of records into a table takes 4
minutes 2 seconds, with these parameters:

Driver memory: 5 GB
Executor memory: 20 GB
Executor cores: 5
Number of executors: 40
spark.dynamicAllocation.minExecutors: 40
spark.dynamicAllocation.maxExecutors: 40
spark.dynamicAllocation.initialExecutors: 17
Memory overhead: 4 GB

Default shuffle partitions: 200

Could anyone please suggest how I can tune this job?

-- 
k.Lakshmi Nivedita
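
For this shape of job (one ~80 MB table joined with several tiny tables, ~24 MB
of output), the usual first steps are to broadcast the small tables and shrink
spark.sql.shuffle.partitions; note also that minExecutors = maxExecutors = 40
effectively disables the dynamic part of dynamic allocation. A minimal PySpark
sketch, where the table names, join key, and partition counts are illustrative
assumptions rather than details from the original script:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("join-tuning").getOrCreate()

    # 200 shuffle partitions is far too many for ~24 MB of output; a small
    # number keeps tasks from being dominated by scheduling overhead.
    spark.conf.set("spark.sql.shuffle.partitions", "8")

    big = spark.table("big_table")    # ~80 MB table (hypothetical name)
    mid = spark.table("mid_table")    # ~2 MB table (hypothetical name)
    dim = spark.table("dim_table")    # one of the three small tables

    # broadcast() turns these into map-side broadcast hash joins, so the
    # largest table is never shuffled across the network.
    joined = (big.join(broadcast(mid), "key")
                 .join(broadcast(dim), "key"))

    # coalesce() avoids writing dozens of tiny files for a ~24 MB result.
    joined.coalesce(4).write.mode("overwrite").saveAsTable("output_table")

With only ~100 MB of input, a handful of executors would likely be enough; most
of the 4 minutes is probably scheduling and shuffle overhead rather than compute.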


Re: Question on Spark SQL performance of Range Queries on Large Datasets

2015-04-27 Thread ayan guha
The answer is it depends :)

The fact that query runtime increases as you add nodes indicates more
shuffling. You may want to construct your rdds partitioned on the keys you
join on (a sketch follows the quoted message below).

You may also want to specify what kind of nodes you are using and how many
executors, and to play around with the executor memory allocation.

Best
Ayan
On 27 Apr 2015 17:59, "Mani"  wrote:

> Hi,
>
> I am a graduate student from Virginia Tech (USA) pursuing my Masters in
> Computer Science. I've been researching parallel and distributed
> databases and their performance for running some range queries involving
> simple joins and group by on large datasets. As part of my research, I
> tried evaluating the query performance of Spark SQL on the data set that I
> have. It would be really great if you could confirm the numbers that I get
> from Spark SQL. Following is the type of query that I am running:
>
> Table 1 - 22,000,483 records
> Table 2 - 10,173,311 records
>
> Query : SELECT b.x, count(b.y) FROM Table1 a, Table2 b WHERE a.y=b.y AND
> a.z='...' GROUP BY b.x ORDER BY b.x
>
> Total Running Time
> 4 Worker Nodes: 177.68s
> 8 Worker Nodes: 186.72s
>
> I am using Apache Spark 1.3.0 with the default configuration. Is the query
> running time reasonable? Is the lack of indexes what is increasing the query
> run time? Can you please clarify?
>
> Thanks
> Mani
> Graduate Student, Department of Computer Science
> Virginia Tech
>
>
>
>
>
>
>
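
A hedged PySpark illustration of the partition-by-key suggestion above; the
file paths, delimiter, and column positions are assumptions, not details from
Mani's setup:

    from pyspark import SparkContext

    sc = SparkContext(appName="copartitioned-join")
    num_parts = 64  # illustrative: roughly the total number of cores

    # Keyed pair RDDs of (y, full record); paths and layout are hypothetical.
    t1 = sc.textFile("hdfs:///data/table1") \
           .map(lambda line: line.split(",")) \
           .map(lambda f: (f[1], f))  # column y assumed at index 1
    t2 = sc.textFile("hdfs:///data/table2") \
           .map(lambda line: line.split(",")) \
           .map(lambda f: (f[1], f))

    # Hash-partition both sides identically and cache them; the join then
    # sees co-partitioned inputs and does not re-shuffle either side.
    t1p = t1.partitionBy(num_parts).cache()
    t2p = t2.partitionBy(num_parts).cache()

    joined = t1p.join(t2p)
    print(joined.count())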


Question on Spark SQL performance of Range Queries on Large Datasets

2015-04-27 Thread Mani
Hi,

I am a graduate student from Virginia Tech (USA) pursuing my Masters in 
Computer Science. I've been researching parallel and distributed databases 
and their performance for running some range queries involving simple joins and 
group by on large datasets. As part of my research, I tried evaluating the query 
performance of Spark SQL on the data set that I have. It would be really great 
if you could confirm the numbers that I get from Spark SQL. Following 
is the type of query that I am running:

Table 1 - 22,000,483 records
Table 2 - 10,173,311 records

Query : SELECT b.x, count(b.y) FROM Table1 a, Table2 b WHERE a.y=b.y AND 
a.z='...' GROUP BY b.x ORDER BY b.x

Total Running Time
4 Worker Nodes: 177.68s
8 Worker Nodes: 186.72s

I am using Apache Spark 1.3.0 with the default configuration. Is the query 
running time reasonable? Is the lack of indexes what is increasing the query 
run time? Can you please clarify?

Thanks
Mani
Graduate Student, Department of Computer Science
Virginia Tech
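
Spark 1.3 has no secondary indexes, so the a.z predicate is evaluated during a
full scan rather than via an index lookup, and the join and group-by each add a
shuffle; going from 4 to 8 nodes can get slower once shuffle traffic dominates.
A hedged Spark 1.3-era PySpark sketch for inspecting this (the Parquet paths
and the filter value are assumptions):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="range-query-check")
    sqlContext = SQLContext(sc)

    # Hypothetical Parquet locations standing in for Table1 and Table2.
    sqlContext.parquetFile("hdfs:///data/table1").registerTempTable("Table1")
    sqlContext.parquetFile("hdfs:///data/table2").registerTempTable("Table2")

    df = sqlContext.sql("""
        SELECT b.x, COUNT(b.y)
        FROM Table1 a JOIN Table2 b ON a.y = b.y
        WHERE a.z = 'some-value'
        GROUP BY b.x
        ORDER BY b.x
    """)

    # The physical plan shows where Exchange (shuffle) operators appear;
    # comparing the plan and stage times at 4 vs 8 nodes tells you whether
    # shuffle is the bottleneck.
    df.explain()
    print(df.count())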








Re: Spark SQL performance issue.

2015-04-23 Thread Arush Kharbanda
Hi

Can you share your Web UI, depicting your task-level breakup? I can see many
things that can be improved.

1. JavaRDD<Person> rdds = ...; rdds.cache(); -> this caching is not needed, as
you are not reading the rdd for any action

2. Instead of collecting as a list, save the result as a text file if you can;
that avoids moving the results to the driver (a sketch follows at the end of
this message).

Thanks
Arush

On Thu, Apr 23, 2015 at 2:47 PM, Nikolay Tikhonov  wrote:

> > why are you caching both the rdd and the table?
> I cache all the data to avoid bad performance on the first query. Is that
> right?
>
> > Which stage of the job is slow?
> The query is run many times on one sqlContext and each query execution
> takes 1 second.
>
> 2015-04-23 11:33 GMT+03:00 ayan guha :
>
>> Quick questions: why are you caching both the rdd and the table?
>> Which stage of the job is slow?
>> On 23 Apr 2015 17:12, "Nikolay Tikhonov" 
>> wrote:
>>
>>> Hi,
>>> I have a Spark SQL performance issue. My code contains a simple JavaBean:
>>>
>>> public class Person implements Externalizable {
>>> private int id;
>>> private String name;
>>> private double salary;
>>> 
>>> }
>>>
>>>
>>> Apply a schema to an RDD and register table.
>>>
>>> JavaRDD<Person> rdds = ...
>>> rdds.cache();
>>>
>>> DataFrame dataFrame = sqlContext.createDataFrame(rdds, Person.class);
>>> dataFrame.registerTempTable("person");
>>>
>>> sqlContext.cacheTable("person");
>>>
>>>
>>> Run sql query.
>>>
>>> sqlContext.sql("SELECT id, name, salary FROM person WHERE salary >= YYY
>>> AND salary <= XXX").collectAsList()
>>>
>>>
>>> I launch a standalone cluster which contains 4 workers. Each node runs on
>>> a machine with 8 CPUs and 15 GB memory. When I run the query in this
>>> environment over an RDD which contains 1 million persons, it takes 1
>>> minute. Can somebody tell me how to tune the performance?
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-performance-issue-tp22627.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>


-- 


*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
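
A hedged PySpark rendering of both suggestions (cache only the table, and write
the result out instead of collecting it); the generated rows and output path
are stand-ins for Nikolay's data:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext, Row

    sc = SparkContext(appName="person-range-query")
    sqlContext = SQLContext(sc)

    # Stand-in for the JavaRDD<Person>; no rdd.cache() here, since that
    # would just hold a second, unused copy of the same data.
    people = sc.parallelize(range(1000000)) \
               .map(lambda i: Row(id=i, name="name%d" % i,
                                  salary=float(i % 100000)))

    df = sqlContext.createDataFrame(people)
    df.registerTempTable("person")

    # The columnar, compressed table cache is what the queries hit.
    sqlContext.cacheTable("person")

    result = sqlContext.sql("SELECT id, name, salary FROM person "
                            "WHERE salary >= 100 AND salary <= 200")

    # Writing out keeps the work on the executors instead of funnelling
    # every row through the driver the way collectAsList() does.
    result.rdd.saveAsTextFile("hdfs:///tmp/person-query-output")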


Re: Spark SQL performance issue.

2015-04-23 Thread Nikolay Tikhonov
> why are you caching both the rdd and the table?
I cache all the data to avoid bad performance on the first query. Is that
right?

> Which stage of the job is slow?
The query is run many times on one sqlContext and each query execution
takes 1 second.

2015-04-23 11:33 GMT+03:00 ayan guha :

> Quick questions: why are you caching both the rdd and the table?
> Which stage of the job is slow?
> On 23 Apr 2015 17:12, "Nikolay Tikhonov" 
> wrote:
>
>> Hi,
>> I have a Spark SQL performance issue. My code contains a simple JavaBean:
>>
>> public class Person implements Externalizable {
>> private int id;
>> private String name;
>> private double salary;
>> 
>> }
>>
>>
>> Apply a schema to an RDD and register table.
>>
>> JavaRDD<Person> rdds = ...
>> rdds.cache();
>>
>> DataFrame dataFrame = sqlContext.createDataFrame(rdds, Person.class);
>> dataFrame.registerTempTable("person");
>>
>> sqlContext.cacheTable("person");
>>
>>
>> Run sql query.
>>
>> sqlContext.sql("SELECT id, name, salary FROM person WHERE salary >= YYY
>> AND salary <= XXX").collectAsList()
>>
>>
>> I launch a standalone cluster which contains 4 workers. Each node runs on
>> a machine with 8 CPUs and 15 GB memory. When I run the query in this
>> environment over an RDD which contains 1 million persons, it takes 1
>> minute. Can somebody tell me how to tune the performance?
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-performance-issue-tp22627.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>


Re: Spark SQL performance issue.

2015-04-23 Thread ayan guha
Quick questions: why are you caching both the rdd and the table?
Which stage of the job is slow?
On 23 Apr 2015 17:12, "Nikolay Tikhonov"  wrote:

> Hi,
> I have a Spark SQL performance issue. My code contains a simple JavaBean:
>
> public class Person implements Externalizable {
> private int id;
> private String name;
> private double salary;
> 
> }
>
>
> Apply a schema to an RDD and register table.
>
> JavaRDD<Person> rdds = ...
> rdds.cache();
>
> DataFrame dataFrame = sqlContext.createDataFrame(rdds, Person.class);
> dataFrame.registerTempTable("person");
>
> sqlContext.cacheTable("person");
>
>
> Run sql query.
>
> sqlContext.sql("SELECT id, name, salary FROM person WHERE salary >= YYY
> AND salary <= XXX").collectAsList()
>
>
> I launch a standalone cluster which contains 4 workers. Each node runs on
> a machine with 8 CPUs and 15 GB memory. When I run the query in this
> environment over an RDD which contains 1 million persons, it takes 1
> minute. Can somebody tell me how to tune the performance?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-performance-issue-tp22627.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Spark SQL performance issue.

2015-04-23 Thread Nikolay Tikhonov
Hi,
I have a Spark SQL performance issue. My code contains a simple JavaBean:

public class Person implements Externalizable {
private int id;
private String name;
private double salary;

}


Apply a schema to an RDD and register table.

JavaRDD<Person> rdds = ...
rdds.cache();

DataFrame dataFrame = sqlContext.createDataFrame(rdds, Person.class);
dataFrame.registerTempTable("person");

sqlContext.cacheTable("person");


Run sql query.

sqlContext.sql("SELECT id, name, salary FROM person WHERE salary >= YYY
AND salary <= XXX").collectAsList()


I launch a standalone cluster which contains 4 workers. Each node runs on a
machine with 8 CPUs and 15 GB memory. When I run the query in this environment
over an RDD which contains 1 million persons, it takes 1 minute. Can somebody
tell me how to tune the performance?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-performance-issue-tp22627.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark SQL performance issue.

2015-04-22 Thread Nikolay Tikhonov
Hi,
I have a Spark SQL performance issue. My code contains a simple JavaBean:

public class Person implements Externalizable {
> private int id;
> private String name;
> private double salary;
> 
> }
>

Apply a schema to an RDD and register table.

JavaRDD<Person> rdds = ...
> rdds.cache();
>
> DataFrame dataFrame = sqlContext.createDataFrame(rdds, Person.class);
> dataFrame.registerTempTable("person");
>
> sqlContext.cacheTable("person");
>

Run sql query.

sqlContext.sql("SELECT id, name, salary FROM person WHERE salary >= YYY AND
> salary <= XXX").collectAsList()
>

I launch a standalone cluster which contains 4 workers. Each node runs on a
machine with 8 CPUs and 15 GB memory. When I run the query in this
environment over an RDD which contains 1 million persons, it takes 1 minute.
Can somebody tell me how to tune the performance?


RE: spark sql performance

2015-03-13 Thread Udbhav Agarwal
Okay Akhil.
I have a 4-core CPU (2.4 GHz).

Thanks,
Udbhav Agarwal

From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: 13 March, 2015 1:07 PM
To: Udbhav Agarwal
Cc: user@spark.apache.org
Subject: Re: spark sql performance

You can see where it is spending time, and whether there is any GC time, etc.,
from the web UI (running on port 4040). Also, how many cores do you have?

Thanks
Best Regards

On Fri, Mar 13, 2015 at 1:05 PM, Udbhav Agarwal <udbhav.agar...@syncoms.com> wrote:
Additionally I wanted to tell you that presently I am running the query on one
machine with 3 GB RAM, and the join query is taking around 6 seconds.

Thanks,
Udbhav Agarwal

From: Udbhav Agarwal
Sent: 13 March, 2015 12:45 PM
To: 'Akhil Das'
Cc: user@spark.apache.org
Subject: RE: spark sql performance

Okay Akhil! Thanks for the information.

Thanks,
Udbhav Agarwal

From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: 13 March, 2015 12:34 PM

To: Udbhav Agarwal
Cc: user@spark.apache.org
Subject: Re: spark sql performance

Can't say that unless you try it.

Thanks
Best Regards

On Fri, Mar 13, 2015 at 12:32 PM, Udbhav Agarwal <udbhav.agar...@syncoms.com> wrote:
Sounds great!
So can I expect response times in milliseconds from the join query over this
much data (0.5 million records in each table)?


Thanks,
Udbhav Agarwal

From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: 13 March, 2015 12:27 PM

To: Udbhav Agarwal
Cc: user@spark.apache.org
Subject: Re: spark sql performance

So you can cache up to 8 GB of data in memory (hopefully your data size for one
table is < 2 GB); then it should be pretty fast with Spark SQL. Also, I'm
assuming you have around 12-16 cores total.

Thanks
Best Regards

On Fri, Mar 13, 2015 at 12:22 PM, Udbhav Agarwal <udbhav.agar...@syncoms.com> wrote:
Let's say I am using 4 machines with 3 GB RAM each. My data is customer records
with 5 columns each, in two tables with 0.5 million records each. I want to
perform a join query on these two tables.


Thanks,
Udbhav Agarwal

From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: 13 March, 2015 12:16 PM
To: Udbhav Agarwal
Cc: user@spark.apache.org
Subject: Re: spark sql performance

The size/type of your data and your cluster configuration would be fine, I
think.

Thanks
Best Regards

On Fri, Mar 13, 2015 at 12:07 PM, Udbhav Agarwal <udbhav.agar...@syncoms.com> wrote:
Thanks Akhil,
What more info should I give so we can estimate query time in my scenario?

Thanks,
Udbhav Agarwal

From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: 13 March, 2015 12:01 PM
To: Udbhav Agarwal
Cc: user@spark.apache.org
Subject: Re: spark sql performance

That totally depends on your data size and your cluster setup.

Thanks
Best Regards

On Thu, Mar 12, 2015 at 7:32 PM, Udbhav Agarwal <udbhav.agar...@syncoms.com> wrote:
Hi,
What is the query time for a join query on HBase with Spark SQL? Say the tables
in HBase have 0.5 million records each. I am expecting a query time (latency) in
milliseconds with Spark SQL. Is this possible?




Thanks,
Udbhav Agarwal








Re: spark sql performance

2015-03-13 Thread Akhil Das
You can see where it is spending time, and whether there is any GC time, etc.,
from the web UI (running on port 4040). Also, how many cores do you have? (A
sketch for enabling GC logging follows the quoted message.)

Thanks
Best Regards

On Fri, Mar 13, 2015 at 1:05 PM, Udbhav Agarwal 
wrote:

>  Additionally I wanted to tell you that presently I am running the query on
> one machine with 3 GB RAM, and the join query is taking around 6 seconds.
>
>
>
> *Thanks,*
>
> *Udbhav Agarwal*
>
>
>
> *From:* Udbhav Agarwal
> *Sent:* 13 March, 2015 12:45 PM
> *To:* 'Akhil Das'
> *Cc:* user@spark.apache.org
> *Subject:* RE: spark sql performance
>
>
>
> Okay Akhil! Thanks for the information.
>
>
>
> *Thanks,*
>
> *Udbhav Agarwal*
>
>
>
> *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com]
> *Sent:* 13 March, 2015 12:34 PM
>
> *To:* Udbhav Agarwal
> *Cc:* user@spark.apache.org
> *Subject:* Re: spark sql performance
>
>
>
> Can't say that unless you try it.
>
>
>   Thanks
>
> Best Regards
>
>
>
> On Fri, Mar 13, 2015 at 12:32 PM, Udbhav Agarwal <
> udbhav.agar...@syncoms.com> wrote:
>
>  Sounds great!
>
> So can I expect response times in milliseconds from the join query over
> this much data (0.5 million records in each table)?
>
>
>
>
>
> *Thanks,*
>
> *Udbhav Agarwal*
>
>
>
> *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com]
> *Sent:* 13 March, 2015 12:27 PM
>
>
> *To:* Udbhav Agarwal
> *Cc:* user@spark.apache.org
> *Subject:* Re: spark sql performance
>
>
>
> So you can cache up to 8 GB of data in memory (hopefully your data size of one
> table is < 2 GB); then it should be pretty fast with Spark SQL. Also, I'm
> assuming you have around 12-16 cores total.
>
>
>   Thanks
>
> Best Regards
>
>
>
> On Fri, Mar 13, 2015 at 12:22 PM, Udbhav Agarwal <
> udbhav.agar...@syncoms.com> wrote:
>
>  Let's say I am using 4 machines with 3 GB RAM each. My data is customer
> records with 5 columns each, in two tables with 0.5 million records each. I
> want to perform a join query on these two tables.
>
>
>
>
>
> *Thanks,*
>
> *Udbhav Agarwal*
>
>
>
> *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com]
> *Sent:* 13 March, 2015 12:16 PM
> *To:* Udbhav Agarwal
> *Cc:* user@spark.apache.org
> *Subject:* Re: spark sql performance
>
>
>
> The size/type of your data and your cluster configuration would be fine, I
> think.
>
>
>   Thanks
>
> Best Regards
>
>
>
> On Fri, Mar 13, 2015 at 12:07 PM, Udbhav Agarwal <
> udbhav.agar...@syncoms.com> wrote:
>
>  Thanks Akhil,
>
> What more info should I give so we can estimate query time in my scenario?
>
>
>
> *Thanks,*
>
> *Udbhav Agarwal*
>
>
>
> *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com]
> *Sent:* 13 March, 2015 12:01 PM
> *To:* Udbhav Agarwal
> *Cc:* user@spark.apache.org
> *Subject:* Re: spark sql performance
>
>
>
> That totally depends on your data size and your cluster setup.
>
>
>   Thanks
>
> Best Regards
>
>
>
> On Thu, Mar 12, 2015 at 7:32 PM, Udbhav Agarwal <
> udbhav.agar...@syncoms.com> wrote:
>
>  Hi,
>
> What is the query time for a join query on HBase with Spark SQL? Say the
> tables in HBase have 0.5 million records each. I am expecting a query time
> (latency) in milliseconds with Spark SQL. Is this possible?
>
>
>
>
>
>
>
>
>
> *Thanks,*
>
> *Udbhav Agarwal*
>
>
>
>
>
>
>
>
>
>
>
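
A hedged way to make the GC time mentioned above visible without clicking
through the web UI is to turn on GC logging for the executors; the property is
standard Spark configuration, the flag values are illustrative:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("gc-visibility")
            # GC details go to each executor's stderr, readable next to the
            # GC Time column in the web UI on port 4040.
            .set("spark.executor.extraJavaOptions",
                 "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"))
    sc = SparkContext(conf=conf)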


RE: spark sql performance

2015-03-13 Thread Udbhav Agarwal
Additionally I wanted to tell you that presently I am running the query on one
machine with 3 GB RAM, and the join query is taking around 6 seconds.

Thanks,
Udbhav Agarwal

From: Udbhav Agarwal
Sent: 13 March, 2015 12:45 PM
To: 'Akhil Das'
Cc: user@spark.apache.org
Subject: RE: spark sql performance

Okay Akhil! Thanks for the information.

Thanks,
Udbhav Agarwal

From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: 13 March, 2015 12:34 PM
To: Udbhav Agarwal
Cc: user@spark.apache.org
Subject: Re: spark sql performance

Can't say that unless you try it.

Thanks
Best Regards

On Fri, Mar 13, 2015 at 12:32 PM, Udbhav Agarwal <udbhav.agar...@syncoms.com> wrote:
Sounds great!
So can I expect response times in milliseconds from the join query over this
much data (0.5 million records in each table)?


Thanks,
Udbhav Agarwal

From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: 13 March, 2015 12:27 PM

To: Udbhav Agarwal
Cc: user@spark.apache.org
Subject: Re: spark sql performance

So you can cache up to 8 GB of data in memory (hopefully your data size of one
table is < 2 GB); then it should be pretty fast with Spark SQL. Also, I'm
assuming you have around 12-16 cores total.

Thanks
Best Regards

On Fri, Mar 13, 2015 at 12:22 PM, Udbhav Agarwal <udbhav.agar...@syncoms.com> wrote:
Let's say I am using 4 machines with 3 GB RAM each. My data is customer records
with 5 columns each, in two tables with 0.5 million records each. I want to
perform a join query on these two tables.


Thanks,
Udbhav Agarwal

From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: 13 March, 2015 12:16 PM
To: Udbhav Agarwal
Cc: user@spark.apache.org
Subject: Re: spark sql performance

The size/type of your data and your cluster configuration would be fine, I
think.

Thanks
Best Regards

On Fri, Mar 13, 2015 at 12:07 PM, Udbhav Agarwal <udbhav.agar...@syncoms.com> wrote:
Thanks Akhil,
What more info should I give so we can estimate query time in my scenario?

Thanks,
Udbhav Agarwal

From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: 13 March, 2015 12:01 PM
To: Udbhav Agarwal
Cc: user@spark.apache.org
Subject: Re: spark sql performance

That totally depends on your data size and your cluster setup.

Thanks
Best Regards

On Thu, Mar 12, 2015 at 7:32 PM, Udbhav Agarwal <udbhav.agar...@syncoms.com> wrote:
Hi,
What is the query time for a join query on HBase with Spark SQL? Say the tables
in HBase have 0.5 million records each. I am expecting a query time (latency) in
milliseconds with Spark SQL. Is this possible?




Thanks,
Udbhav Agarwal







RE: spark sql performance

2015-03-13 Thread Udbhav Agarwal
Okay Akhil! Thanks for the information.

Thanks,
Udbhav Agarwal

From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: 13 March, 2015 12:34 PM
To: Udbhav Agarwal
Cc: user@spark.apache.org
Subject: Re: spark sql performance

Can't say that unless you try it.

Thanks
Best Regards

On Fri, Mar 13, 2015 at 12:32 PM, Udbhav Agarwal <udbhav.agar...@syncoms.com> wrote:
Sounds great!
So can I expect response times in milliseconds from the join query over this
much data (0.5 million records in each table)?


Thanks,
Udbhav Agarwal

From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: 13 March, 2015 12:27 PM

To: Udbhav Agarwal
Cc: user@spark.apache.org
Subject: Re: spark sql performance

So you can cache up to 8 GB of data in memory (hopefully your data size of one
table is < 2 GB); then it should be pretty fast with Spark SQL. Also, I'm
assuming you have around 12-16 cores total.

Thanks
Best Regards

On Fri, Mar 13, 2015 at 12:22 PM, Udbhav Agarwal <udbhav.agar...@syncoms.com> wrote:
Let's say I am using 4 machines with 3 GB RAM each. My data is customer records
with 5 columns each, in two tables with 0.5 million records each. I want to
perform a join query on these two tables.


Thanks,
Udbhav Agarwal

From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: 13 March, 2015 12:16 PM
To: Udbhav Agarwal
Cc: user@spark.apache.org
Subject: Re: spark sql performance

The size/type of your data and your cluster configuration would be fine, I
think.

Thanks
Best Regards

On Fri, Mar 13, 2015 at 12:07 PM, Udbhav Agarwal <udbhav.agar...@syncoms.com> wrote:
Thanks Akhil,
What more info should I give so we can estimate query time in my scenario?

Thanks,
Udbhav Agarwal

From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: 13 March, 2015 12:01 PM
To: Udbhav Agarwal
Cc: user@spark.apache.org
Subject: Re: spark sql performance

That totally depends on your data size and your cluster setup.

Thanks
Best Regards

On Thu, Mar 12, 2015 at 7:32 PM, Udbhav Agarwal <udbhav.agar...@syncoms.com> wrote:
Hi,
What is the query time for a join query on HBase with Spark SQL? Say the tables
in HBase have 0.5 million records each. I am expecting a query time (latency) in
milliseconds with Spark SQL. Is this possible?




Thanks,
Udbhav Agarwal







Re: spark sql performance

2015-03-13 Thread Akhil Das
Can't say that unless you try it.

Thanks
Best Regards

On Fri, Mar 13, 2015 at 12:32 PM, Udbhav Agarwal  wrote:

>  Sounds great!
>
> So can I expect response times in milliseconds from the join query over
> this much data (0.5 million records in each table)?
>
>
>
>
>
> *Thanks,*
>
> *Udbhav Agarwal*
>
>
>
> *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com]
> *Sent:* 13 March, 2015 12:27 PM
>
> *To:* Udbhav Agarwal
> *Cc:* user@spark.apache.org
> *Subject:* Re: spark sql performance
>
>
>
> So you can cache up to 8 GB of data in memory (hopefully your data size of one
> table is < 2 GB); then it should be pretty fast with Spark SQL. Also, I'm
> assuming you have around 12-16 cores total.
>
>
>   Thanks
>
> Best Regards
>
>
>
> On Fri, Mar 13, 2015 at 12:22 PM, Udbhav Agarwal <
> udbhav.agar...@syncoms.com> wrote:
>
>  Let's say I am using 4 machines with 3 GB RAM each. My data is customer
> records with 5 columns each, in two tables with 0.5 million records each. I
> want to perform a join query on these two tables.
>
>
>
>
>
> *Thanks,*
>
> *Udbhav Agarwal*
>
>
>
> *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com]
> *Sent:* 13 March, 2015 12:16 PM
> *To:* Udbhav Agarwal
> *Cc:* user@spark.apache.org
> *Subject:* Re: spark sql performance
>
>
>
> The size/type of your data and your cluster configuration would be fine, I
> think.
>
>
>   Thanks
>
> Best Regards
>
>
>
> On Fri, Mar 13, 2015 at 12:07 PM, Udbhav Agarwal <
> udbhav.agar...@syncoms.com> wrote:
>
>  Thanks Akhil,
>
> What more info should I give so we can estimate query time in my scenario?
>
>
>
> *Thanks,*
>
> *Udbhav Agarwal*
>
>
>
> *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com]
> *Sent:* 13 March, 2015 12:01 PM
> *To:* Udbhav Agarwal
> *Cc:* user@spark.apache.org
> *Subject:* Re: spark sql performance
>
>
>
> That totally depends on your data size and your cluster setup.
>
>
>   Thanks
>
> Best Regards
>
>
>
> On Thu, Mar 12, 2015 at 7:32 PM, Udbhav Agarwal <
> udbhav.agar...@syncoms.com> wrote:
>
>  Hi,
>
> What is the query time for a join query on HBase with Spark SQL? Say the
> tables in HBase have 0.5 million records each. I am expecting a query time
> (latency) in milliseconds with Spark SQL. Is this possible?
>
>
>
>
>
>
>
>
>
> *Thanks,*
>
> *Udbhav Agarwal*
>
>
>
>
>
>
>
>
>


RE: spark sql performance

2015-03-13 Thread Udbhav Agarwal
Sounds great!
So can I expect response times in milliseconds from the join query over this
much data (0.5 million records in each table)?


Thanks,
Udbhav Agarwal

From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: 13 March, 2015 12:27 PM
To: Udbhav Agarwal
Cc: user@spark.apache.org
Subject: Re: spark sql performance

So you can cache up to 8 GB of data in memory (hopefully your data size of one
table is < 2 GB); then it should be pretty fast with Spark SQL. Also, I'm
assuming you have around 12-16 cores total.

Thanks
Best Regards

On Fri, Mar 13, 2015 at 12:22 PM, Udbhav Agarwal <udbhav.agar...@syncoms.com> wrote:
Let's say I am using 4 machines with 3 GB RAM each. My data is customer records
with 5 columns each, in two tables with 0.5 million records each. I want to
perform a join query on these two tables.


Thanks,
Udbhav Agarwal

From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: 13 March, 2015 12:16 PM
To: Udbhav Agarwal
Cc: user@spark.apache.org
Subject: Re: spark sql performance

The size/type of your data and your cluster configuration would be fine, I
think.

Thanks
Best Regards

On Fri, Mar 13, 2015 at 12:07 PM, Udbhav Agarwal <udbhav.agar...@syncoms.com> wrote:
Thanks Akhil,
What more info should I give so we can estimate query time in my scenario?

Thanks,
Udbhav Agarwal

From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: 13 March, 2015 12:01 PM
To: Udbhav Agarwal
Cc: user@spark.apache.org
Subject: Re: spark sql performance

That totally depends on your data size and your cluster setup.

Thanks
Best Regards

On Thu, Mar 12, 2015 at 7:32 PM, Udbhav Agarwal <udbhav.agar...@syncoms.com> wrote:
Hi,
What is the query time for a join query on HBase with Spark SQL? Say the tables
in HBase have 0.5 million records each. I am expecting a query time (latency) in
milliseconds with Spark SQL. Is this possible?




Thanks,
Udbhav Agarwal






Re: spark sql performance

2015-03-12 Thread Akhil Das
So you can cache up to 8 GB of data in memory (hopefully your data size of one
table is < 2 GB); then it should be pretty fast with Spark SQL. Also, I'm
assuming you have around 12-16 cores total.

Thanks
Best Regards

On Fri, Mar 13, 2015 at 12:22 PM, Udbhav Agarwal  wrote:

>  Let's say I am using 4 machines with 3 GB RAM each. My data is customer
> records with 5 columns each, in two tables with 0.5 million records each. I
> want to perform a join query on these two tables.
>
>
>
>
>
> *Thanks,*
>
> *Udbhav Agarwal*
>
>
>
> *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com]
> *Sent:* 13 March, 2015 12:16 PM
> *To:* Udbhav Agarwal
> *Cc:* user@spark.apache.org
> *Subject:* Re: spark sql performance
>
>
>
> The size/type of your data and your cluster configuration would be fine, I
> think.
>
>
>   Thanks
>
> Best Regards
>
>
>
> On Fri, Mar 13, 2015 at 12:07 PM, Udbhav Agarwal <
> udbhav.agar...@syncoms.com> wrote:
>
>  Thanks Akhil,
>
> What more info should I give so we can estimate query time in my scenario?
>
>
>
> *Thanks,*
>
> *Udbhav Agarwal*
>
>
>
> *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com]
> *Sent:* 13 March, 2015 12:01 PM
> *To:* Udbhav Agarwal
> *Cc:* user@spark.apache.org
> *Subject:* Re: spark sql performance
>
>
>
> That totally depends on your data size and your cluster setup.
>
>
>   Thanks
>
> Best Regards
>
>
>
> On Thu, Mar 12, 2015 at 7:32 PM, Udbhav Agarwal <
> udbhav.agar...@syncoms.com> wrote:
>
>  Hi,
>
> What is the query time for a join query on HBase with Spark SQL? Say the
> tables in HBase have 0.5 million records each. I am expecting a query time
> (latency) in milliseconds with Spark SQL. Is this possible?
>
>
>
>
>
>
>
>
>
> *Thanks,*
>
> *Udbhav Agarwal*
>
>
>
>
>
>
>


RE: spark sql performance

2015-03-12 Thread Udbhav Agarwal
Let's say I am using 4 machines with 3 GB RAM each. My data is customer records
with 5 columns each, in two tables with 0.5 million records each. I want to
perform a join query on these two tables.


Thanks,
Udbhav Agarwal

From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: 13 March, 2015 12:16 PM
To: Udbhav Agarwal
Cc: user@spark.apache.org
Subject: Re: spark sql performance

The size/type of your data and your cluster configuration would be fine, I
think.

Thanks
Best Regards

On Fri, Mar 13, 2015 at 12:07 PM, Udbhav Agarwal <udbhav.agar...@syncoms.com> wrote:
Thanks Akhil,
What more info should I give so we can estimate query time in my scenario?

Thanks,
Udbhav Agarwal

From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: 13 March, 2015 12:01 PM
To: Udbhav Agarwal
Cc: user@spark.apache.org
Subject: Re: spark sql performance

That totally depends on your data size and your cluster setup.

Thanks
Best Regards

On Thu, Mar 12, 2015 at 7:32 PM, Udbhav Agarwal <udbhav.agar...@syncoms.com> wrote:
Hi,
What is the query time for a join query on HBase with Spark SQL? Say the tables
in HBase have 0.5 million records each. I am expecting a query time (latency) in
milliseconds with Spark SQL. Is this possible?




Thanks,
Udbhav Agarwal





Re: spark sql performance

2015-03-12 Thread Akhil Das
The size/type of your data and your cluster configuration would be fine, I
think.

Thanks
Best Regards

On Fri, Mar 13, 2015 at 12:07 PM, Udbhav Agarwal  wrote:

>  Thanks Akhil,
>
> What more info should I give so we can estimate query time in my scenario?
>
>
>
> *Thanks,*
>
> *Udbhav Agarwal*
>
>
>
> *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com]
> *Sent:* 13 March, 2015 12:01 PM
> *To:* Udbhav Agarwal
> *Cc:* user@spark.apache.org
> *Subject:* Re: spark sql performance
>
>
>
> That totally depends on your data size and your cluster setup.
>
>
>   Thanks
>
> Best Regards
>
>
>
> On Thu, Mar 12, 2015 at 7:32 PM, Udbhav Agarwal <
> udbhav.agar...@syncoms.com> wrote:
>
>  Hi,
>
> What is the query time for a join query on HBase with Spark SQL? Say the
> tables in HBase have 0.5 million records each. I am expecting a query time
> (latency) in milliseconds with Spark SQL. Is this possible?
>
>
>
>
>
>
>
>
>
> *Thanks,*
>
> *Udbhav Agarwal*
>
>
>
>
>


RE: spark sql performance

2015-03-12 Thread Udbhav Agarwal
Thanks Akhil,
What more info should I give so we can estimate query time in my scenario?

Thanks,
Udbhav Agarwal

From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: 13 March, 2015 12:01 PM
To: Udbhav Agarwal
Cc: user@spark.apache.org
Subject: Re: spark sql performance

That totally depends on your data size and your cluster setup.

Thanks
Best Regards

On Thu, Mar 12, 2015 at 7:32 PM, Udbhav Agarwal <udbhav.agar...@syncoms.com> wrote:
Hi,
What is the query time for a join query on HBase with Spark SQL? Say the tables
in HBase have 0.5 million records each. I am expecting a query time (latency) in
milliseconds with Spark SQL. Is this possible?




Thanks,
Udbhav Agarwal




Re: spark sql performance

2015-03-12 Thread Akhil Das
That totally depends on your data size and your cluster setup.

Thanks
Best Regards

On Thu, Mar 12, 2015 at 7:32 PM, Udbhav Agarwal 
wrote:

>  Hi,
>
> What is the query time for a join query on HBase with Spark SQL? Say the
> tables in HBase have 0.5 million records each. I am expecting a query time
> (latency) in milliseconds with Spark SQL. Is this possible?
>
>
>
>
>
>
>
>
>
> *Thanks,*
>
> *Udbhav Agarwal*
>
>
>


spark sql performance

2015-03-12 Thread Udbhav Agarwal
Hi,
What is the query time for a join query on HBase with Spark SQL? Say the tables
in HBase have 0.5 million records each. I am expecting a query time (latency) in
milliseconds with Spark SQL. Is this possible?




Thanks,
Udbhav Agarwal
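
Millisecond latencies are hard to get from Spark SQL 1.x here: every query is a
scheduled job, so even over fully cached data there is typically tens to
hundreds of milliseconds of overhead per query. A hedged sketch of the setup
that gets closest (the generated rows below are stand-ins for data read from
HBase, whose connector plumbing is elided):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext, Row

    sc = SparkContext(appName="cached-join-latency")
    sqlContext = SQLContext(sc)

    # Stand-in rows; in the real setup these would come from HBase.
    customers = sqlContext.createDataFrame(
        sc.parallelize(range(500000)).map(lambda i: Row(id=i, name="n%d" % i)))
    orders = sqlContext.createDataFrame(
        sc.parallelize(range(500000)).map(
            lambda i: Row(id=i, city="c%d" % (i % 100))))

    customers.registerTempTable("customers")
    orders.registerTempTable("orders")
    sqlContext.cacheTable("customers")
    sqlContext.cacheTable("orders")

    q = "SELECT c.name, o.city FROM customers c JOIN orders o ON c.id = o.id"
    sqlContext.sql(q).count()  # first run scans and populates the cache
    sqlContext.sql(q).count()  # later runs hit the in-memory columnar cache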



RE: Spark SQL performance and data size constraints

2014-11-26 Thread Cheng, Hao
Spark SQL doesn't handle DISTINCT well currently; in particular, for the case
you described it will lead all of the data to fall onto a single node and be
kept in memory only.
The dev community actually has solutions for this; it will probably be solved
after the release of Spark 1.2.

-Original Message-
From: SK [mailto:skrishna...@gmail.com] 
Sent: Wednesday, November 26, 2014 4:17 PM
To: u...@spark.incubator.apache.org
Subject: Spark SQL performance and data size constraints

Hi,

I use the following code to read in data and extract the unique users using
Spark SQL. The data is 1.2 TB and I am running this on a cluster with 3 TB
memory. It appears that there is enough memory, but the program just freezes
after some time, where it maps the rdd to the case class Play. (If I don't use
the Spark SQL portion (i.e. don't map to the case class and register the table,
etc.) and merely load the data (first 3 lines of the code below), then the
program completes.)

I tried with spark.storage.memoryFraction=0.5 and 0.6 (default) as suggested
in the Tuning guide, but that did not help.
According to the logs, the total # of partitions/tasks is 38688 and the size of
each rdd partition for the mapping to the case class is around 31 MB. So the
total rdd size is 38688*31 MB = 1.2 TB. This is less than the 3 TB memory on the
cluster. At the time the program stops, the total number of tasks is a little
< 38688, with some of them appearing as failed. There are no details for why
the tasks failed.

Are there any maximum data size constraints in Spark SQL or table creation that 
might be causing the program to hang? Are there any performance optimizations I 
could try with Spark SQL that might allow the completion of the task?


 import org.apache.spark.storage.StorageLevel._

 val data = sc.textFile("shared_dir/*.dat")
              .map(_.split("\t"))
              .persist(MEMORY_AND_DISK_SER)

 val play = data.map(f => Play(f(0).trim, f(1).trim, f(2).trim, f(3).trim))
                .persist(MEMORY_AND_DISK_SER)

 // register the RDD as a table
 play.registerTempTable("play")

 val ids = sql_cxt.sql("SELECT DISTINCT id FROM play")

 println("Number of unique account ID = %d".format(ids.count()))
 println("Number of RDDs = %d".format(play.count()))

thanks




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-performance-and-data-size-constraints-tp19843.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
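
Until that lands, a common workaround on 1.x is to do the distinct at the RDD
level with an explicit partition count, which keeps the de-duplication spread
across the cluster instead of concentrating it. A hedged PySpark sketch of
SK's pipeline (the id column position is taken from the snippet; the partition
count is illustrative):

    from pyspark import SparkContext

    sc = SparkContext(appName="distinct-ids")

    # Same load as the snippet above, in PySpark.
    data = sc.textFile("shared_dir/*.dat").map(lambda line: line.split("\t"))

    # RDD-level distinct shuffles the keys across many partitions, so no
    # single node has to hold the whole set of distinct ids.
    ids = data.map(lambda f: f[0]).distinct(2048)
    print("Number of unique account ID = %d" % ids.count())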



Spark SQL performance and data size constraints

2014-11-26 Thread SK
Hi,

I use the following code to read in data and extract the unique users using
Spark SQL. The data is 1.2 TB and I am running this on a cluster with 3 TB
memory. It appears that there is enough memory, but the program just freezes
after some time, where it maps the rdd to the case class Play. (If I don't use
the Spark SQL portion (i.e. don't map to the case class and register the table,
etc.) and merely load the data (first 3 lines of the code below), then the
program completes.)

I tried with spark.storage.memoryFraction=0.5 and 0.6 (default) as suggested
in the Tuning guide, but that did not help.
According to the logs, the total # of partitions/tasks is 38688 and the size of
each rdd partition for the mapping to the case class is around 31 MB. So the
total rdd size is 38688*31 MB = 1.2 TB. This is less than the 3 TB memory on the
cluster. At the time the program stops, the total number of tasks is a little
< 38688, with some of them appearing as failed. There are no details for why
the tasks failed.

Are there any maximum data size constraints in Spark SQL or table creation
that might be causing the program to hang? Are there any performance
optimizations I could try with Spark SQL that might allow the completion of
the task?


 import org.apache.spark.storage.StorageLevel._

 val data = sc.textFile("shared_dir/*.dat")
              .map(_.split("\t"))
              .persist(MEMORY_AND_DISK_SER)

 val play = data.map(f => Play(f(0).trim, f(1).trim, f(2).trim, f(3).trim))
                .persist(MEMORY_AND_DISK_SER)

 // register the RDD as a table
 play.registerTempTable("play")

 val ids = sql_cxt.sql("SELECT DISTINCT id FROM play")

 println("Number of unique account ID = %d".format(ids.count()))
 println("Number of RDDs = %d".format(play.count()))

thanks




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-performance-and-data-size-constraints-tp19843.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org