Re: Question on Spark SQL performance of Range Queries on Large Datasets

2015-04-27 Thread ayan guha
The answer is it depends :)

The fact that the query runtime increases as you add nodes suggests the job
is dominated by shuffle. You may want to pre-partition your RDDs on the join
key you use (see the sketch below).
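For example, something along these lines in spark-shell (a rough sketch only:
the paths, delimiter and column positions are made up, so adjust them to your
data; the point is that joining two RDDs that already share a partitioner
avoids a full shuffle):

    import org.apache.spark.HashPartitioner

    // key both tables by the join column y; field positions are placeholders
    val table1 = sc.textFile("hdfs:///data/table1")
      .map(_.split('\t'))
      .map(f => (f(1), f(2)))              // (y, z)
    val table2 = sc.textFile("hdfs:///data/table2")
      .map(_.split('\t'))
      .map(f => (f(1), f(0)))              // (y, x)

    // hash-partition both sides on y and cache them; 64 is a placeholder,
    // roughly match it to your total core count
    val part = new HashPartitioner(64)
    val t1 = table1.partitionBy(part).persist()
    val t2 = table2.partitionBy(part).persist()

    // both sides share the same partitioner, so the join itself
    // does not reshuffle the data
    val joined = t1.join(t2)               // (y, (z, x))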

It would also help if you specified what kind of nodes you are using and how
many executors you are running. You may also want to play around with the
executor memory allocation, for example:
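Something like this when you create the context (the values are only
placeholders, tune them to your nodes; spark.executor.instances applies when
you run on YARN):

    val conf = new org.apache.spark.SparkConf()
      .setAppName("range-query-benchmark")
      .set("spark.executor.memory", "6g")         // heap per executor
      .set("spark.executor.cores", "4")           // cores per executor
      .set("spark.executor.instances", "8")       // number of executors (YARN)
      .set("spark.sql.shuffle.partitions", "64")  // default of 200 may not fit your cluster
    val sc = new org.apache.spark.SparkContext(conf)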

Best
Ayan
On 27 Apr 2015 17:59, "Mani"  wrote:

> Hi,
>
> I am a graduate student from Virginia Tech (USA) pursuing my Masters in
> Computer Science. I've been researching parallel and distributed
> databases and their performance when running range queries involving
> simple joins and group-by on large datasets. As part of my research, I
> tried evaluating the query performance of Spark SQL on my data set. It
> would be really great if you could confirm the numbers that I get from
> Spark SQL. Following is the type of query that I am running:
>
> Table 1 - 22,000,483 records
> Table 2 - 10,173,311 records
>
> Query: SELECT b.x, count(b.y) FROM Table1 a, Table2 b WHERE a.y=b.y AND
> a.z='' GROUP BY b.x ORDER BY b.x
>
> Total Running Time
> 4 Worker Nodes: 177.68s
> 8 Worker Nodes: 186.72s
>
> I am using Apache Spark 1.3.0 with the default configuration. Is the query
> running time reasonable? Is the lack of indexes what is increasing the
> query running time? Can you please clarify?
>
> Thanks
> Mani
> Graduate Student, Department of Computer Science
> Virginia Tech
>


Question on Spark SQL performance of Range Queries on Large Datasets

2015-04-27 Thread Mani
Hi,

I am a graduate student from Virginia Tech (USA) pursuing my Masters in
Computer Science. I've been researching parallel and distributed databases
and their performance when running range queries involving simple joins and
group-by on large datasets. As part of my research, I tried evaluating the
query performance of Spark SQL on my data set. It would be really great
if you could confirm the numbers that I get from Spark SQL. Following
is the type of query that I am running:

Table 1 - 22,000,483 records
Table 2 - 10,173,311 records

Query: SELECT b.x, count(b.y) FROM Table1 a, Table2 b WHERE a.y=b.y AND
a.z='' GROUP BY b.x ORDER BY b.x
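
(For reference, a simplified sketch of how such a query is issued through
Spark SQL 1.3, with placeholder input paths and the two tables registered as
temp tables; the actual data loading code is different.)

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    // placeholder inputs; the real schemas contain at least x, y and z
    val t1 = sqlContext.parquetFile("hdfs:///data/table1.parquet")
    val t2 = sqlContext.parquetFile("hdfs:///data/table2.parquet")
    t1.registerTempTable("Table1")
    t2.registerTempTable("Table2")

    val result = sqlContext.sql(
      """SELECT b.x, count(b.y) FROM Table1 a, Table2 b
         WHERE a.y = b.y AND a.z = '' GROUP BY b.x ORDER BY b.x""")
    result.collect()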

Total Running Time
4 Worker Nodes: 177.68s
8 Worker Nodes: 186.72s

I am using Apache Spark 1.3.0 with the default configuration. Is the query
running time reasonable? Is the lack of indexes what is increasing the
query running time? Can you please clarify?

Thanks
Mani
Graduate Student, Department of Computer Science
Virginia Tech