Hi,

I am trying to improve query performance, but I am not sure how to go about
this in Shark/Spark. Here is my problem.

When I execute a query, it runs in two steps; here is the summary. First
FileSinkOperator's runJob is executed, and then mapPartitionsWithIndex.

1.      FileSinkOperator always uses only one job. Is there a way to parallelize this?
2.      mapPartitionsWithIndex is taking 1.2 min. Is there a way to bring this
time down?

Time      Shuffle Read   Shuffle Write   No of Jobs   Summary
1.3 min   217.4 MB       -               1            runJob at FileSinkOperator.scala:157
1.2 min   -              219.8 MB        292          mapPartitionsWithIndex at Operator.scala:312

Thanks
Manjunath
