The answer is it depends :)
The fact that query runtime increases as you add nodes indicates more shuffle. You may want
to partition your RDDs on the join keys you use.
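As a rough sketch (pseudocode in PySpark style — `table1_rdd`, `table2_rdd` and the partition count are placeholders, not anything from your setup), co-partitioning both sides on the join key `y` before joining avoids re-shuffling during the join itself:

```
# Hypothetical sketch: key both RDDs by the join column y, then
# partition them with the same partitioner so matching keys are
# co-located and the join needs no further shuffle.
num_partitions = 200  # tune to roughly 2-3x your total executor cores

t1_by_y = table1_rdd.map(lambda row: (row.y, row))
t2_by_y = table2_rdd.map(lambda row: (row.y, row))

t1_part = t1_by_y.partitionBy(num_partitions).cache()
t2_part = t2_by_y.partitionBy(num_partitions).cache()

joined = t1_part.join(t2_part)  # shuffle-free given the shared partitioner
```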
You may want to mention what kind of nodes you are using and how many
executors you are running. You may also want to play around with the executor
memory allocation.
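For example, something along these lines (a sketch only — the class, jar and sizes are placeholders, and the numbers need tuning to your actual cluster):

```shell
# Hypothetical spark-submit invocation; adjust cores/memory to your nodes.
# Note: --num-executors applies on YARN; on a standalone cluster use
# --total-executor-cores instead.
spark-submit \
  --class com.example.BenchmarkQuery \
  --master spark://master:7077 \
  --num-executors 8 \
  --executor-cores 4 \
  --executor-memory 8g \
  --driver-memory 4g \
  benchmark-query.jar
```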
Best
Ayan
On 27 Apr 2015 17:59, "Mani" wrote:
> Hi,
>
> I am a graduate student from Virginia Tech (USA) pursuing my Masters in
> Computer Science. I’ve been researching on parallel and distributed
> databases and their performance for running some Range queries involving
> simple joins and group by on large datasets. As part of my research, I
> tried evaluating query performance of Spark SQL on the data set that I
> have. It would be really great if you could please confirm whether the numbers
> I get from Spark SQL are reasonable. Following is the type of query I am running:
>
> Table 1 - 22,000,483 records
> Table 2 - 10,173,311 records
>
> Query: SELECT b.x, count(b.y) FROM Table1 a, Table2 b WHERE a.y=b.y AND
> a.z='' GROUP BY b.x ORDER BY b.x
>
> Total Running Time
> 4 Worker Nodes: 177.68s
> 8 Worker Nodes: 186.72s
>
> I am using Apache Spark 1.3.0 with the default configuration. Is the query
> running time reasonable? Is it because of non-availability of indexes
> increasing the query run time? Can you please clarify?
>
> Thanks
> Mani
> Graduate Student, Department of Computer Science
> Virginia Tech
>