Dear all,
I am using Spark 0.9.2 in standalone mode, with Hive and HDFS from CDH 5.1.0. The cluster has 6 worker nodes, each with 96 GB of memory and 32 cores. I am using the Shark shell to execute queries on Spark.
I have a raw_table (3 TB in size, with replication 3) which is partitioned by year, month and day. I am running an ad hoc query on one month of data with some conditions.
For example:
CREATE TABLE temp_table AS
SELECT field1,field2,field3 FROM raw_table WHERE year=2000 AND month=01
AND field10 > <some_value>;
It is claimed that the same Hive queries can run 100x faster with Shark, but I don't see such a significant improvement when running the above query. I am getting almost the same performance as in Hive, which is around 45 seconds. The same query in Impala takes much less time: around 6 seconds, which is about 7 times faster than Shark. I have tried altering the following parameters for the Spark jobs but did not see any difference:
spark.local.dir
spark.serializer
spark.kryoserializer.buffer.mb
spark.storage.memoryFraction
spark.io.compression.codec
spark.default.parallelism
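To make the list above concrete, here is a minimal sketch of how such properties could be passed as Java system properties through SPARK_JAVA_OPTS, assuming they are set in conf/shark-env.sh. The mechanism and every value shown are assumptions for illustration only (the local directory path is hypothetical, and 192 is simply 6 workers x 32 cores). They are not the values that were actually tried.

```shell
# Sketch only: assumes the properties are passed via SPARK_JAVA_OPTS in conf/shark-env.sh.
# All values below are illustrative placeholders, not the settings used in this post.
SPARK_JAVA_OPTS="-Dspark.local.dir=/tmp/spark-local"    # hypothetical scratch directory
SPARK_JAVA_OPTS="$SPARK_JAVA_OPTS -Dspark.serializer=org.apache.spark.serializer.KryoSerializer"
SPARK_JAVA_OPTS="$SPARK_JAVA_OPTS -Dspark.kryoserializer.buffer.mb=10"
SPARK_JAVA_OPTS="$SPARK_JAVA_OPTS -Dspark.storage.memoryFraction=0.6"
SPARK_JAVA_OPTS="$SPARK_JAVA_OPTS -Dspark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec"
SPARK_JAVA_OPTS="$SPARK_JAVA_OPTS -Dspark.default.parallelism=192"    # 6 workers x 32 cores
export SPARK_JAVA_OPTS
```

The same settings have to be visible to both the Shark shell and the workers, so a file like this would normally be kept identical on every node.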
Any suggestions on how I can improve the performance of this query with Shark on Spark and make it comparable to Impala?
 
Thanks and regards,
Vinay Kashyap
