Low Performance of Shark over Spark.

vinay . kashyap Thu, 07 Aug 2014 06:52:58 -0700

Dear all,
I am using Spark 0.9.2 in standalone mode, with Hive and HDFS from CDH 5.1.0.
The cluster has 6 worker nodes, each with 96 GB of memory and 32 cores.
I am using the Shark shell to execute queries on Spark.
I have a raw_table (of size 3 TB with replication 3) which is partitioned
by year, month and day. I am running an ad hoc query on one month of data
with some conditions.
For example:
CREATE TABLE temp_table AS
SELECT field1,field2,field3 FROM raw_table WHERE year=2000 AND month=01
AND field10 > <some_value>;
It is claimed that the same Hive queries can run 100x faster with Shark,
but I don't see such a significant improvement when running the above query.
I am getting almost the same performance as when it is run in Hive, which is
around 45 seconds.
The same query with Impala takes much less time: around 6 seconds, which is
about 7 times faster than Shark. I have tried altering the parameters below
for the Spark jobs but did not see any difference.
spark.local.dir
spark.serializer
spark.kryoserializer.buffer.mb
spark.storage.memoryFraction
spark.io.compression.codec
spark.default.parallelism
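For readers unfamiliar with these properties: in Spark 0.9.x they can be
passed as Java system properties, for example through SPARK_JAVA_OPTS in the
environment script that Shark sources (conf/shark-env.sh is assumed here).
The fragment below is only a sketch. Every value in it is a placeholder
assumption and not a setting reported in this post; the parallelism figure
is simply 6 workers x 32 cores.

```shell
# Sketch only: all values are illustrative placeholders, not the poster's settings.
# Assumes this is placed in the environment script sourced by Shark
# (conf/shark-env.sh is an assumption about the installation layout).

# Scratch directory for shuffle and spill files (placeholder path).
SPARK_JAVA_OPTS="-Dspark.local.dir=/tmp"

# Use Kryo instead of Java serialization, with a larger per-core buffer (in MB).
SPARK_JAVA_OPTS="$SPARK_JAVA_OPTS -Dspark.serializer=org.apache.spark.serializer.KryoSerializer"
SPARK_JAVA_OPTS="$SPARK_JAVA_OPTS -Dspark.kryoserializer.buffer.mb=10"

# Fraction of the heap reserved for cached data.
SPARK_JAVA_OPTS="$SPARK_JAVA_OPTS -Dspark.storage.memoryFraction=0.6"

# Codec used for shuffle outputs and other internal data.
SPARK_JAVA_OPTS="$SPARK_JAVA_OPTS -Dspark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec"

# Default number of tasks for shuffles: 6 workers x 32 cores = 192 (placeholder).
SPARK_JAVA_OPTS="$SPARK_JAVA_OPTS -Dspark.default.parallelism=192"

export SPARK_JAVA_OPTS
```

The Shark shell has to be restarted for changes to these options to take
effect.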
Any suggestions on how I can improve the performance of this query with
Shark over Spark and make it comparable to Impala?
 
Thanks and regards,
Vinay Kashyap