Performance Tuning | Shark 0.9.1 with Spark 1.0.2

N, Manjunath 3. (EXT - IN/Noida) Wed, 13 Apr 2016 04:57:08 -0700

Hi,

I am trying to improve query performance, but I am not sure how to go about 
this in Shark/Spark. Here is my problem.


When I execute a query, it runs in two steps, summarized below. First 
FileSinkOperator's runJob is executed, and then mapPartitionsWithIndex.

1.      FileSinkOperator always uses only one job. Is there a way to 
parallelize this?
2.      mapPartitionsWithIndex takes 1.2 min. Is there a way to bring this 
time down?

Time     Shuffle Read  Shuffle Write  No of Jobs  Summary
1.3 min  217.4 MB                     1           runJob at FileSinkOperator.scala:157
1.2 min                219.8 MB       292         mapPartitionsWithIndex at Operator.scala:312
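
To put the numbers in perspective, here is a rough sketch of the average shuffle volume per unit of work. It assumes the "No of Jobs" column is the number of parallel tasks in each step, which is my reading of the UI and may be wrong; the helper name mb_per_task is made up for illustration.

```python
def mb_per_task(total_mb: float, num_tasks: int) -> float:
    """Average shuffle data handled by each task, in MB."""
    if num_tasks <= 0:
        raise ValueError("num_tasks must be positive")
    return total_mb / num_tasks


# Figures copied from the summary table above.
# Assumption: the "No of Jobs" column counts parallel tasks.
write_side = mb_per_task(219.8, 292)  # mapPartitionsWithIndex, shuffle write
read_side = mb_per_task(217.4, 1)     # runJob at FileSinkOperator, shuffle read

print(f"shuffle write per task: {write_side:.2f} MB")
print(f"shuffle read per task:  {read_side:.2f} MB")
```

If that reading is right, the first step spreads the shuffle data over many small units, while the FileSinkOperator step reads all of it in a single unit.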

Thanks
Manjunath
