Hi, I am trying to reduce query execution time in Shark/Spark, and I am not sure how to go about it. Here is my problem.
When I execute a query, it runs in two steps, summarized below. The first is FileSink's runJob, and the next is mapPartitionsWithIndex.

1. FileSink always uses only one job. Is there a way to parallelize this?
2. mapPartitionsWithIndex takes 1.2 min. Is there a way to bring this time down?

Time     Shuffle Read  Shuffle Write  No. of Jobs  Summary
1.3 min  217.4 MB                     1            runJob at FileSinkOperator.scala:157
1.2 min                219.8 MB       292          mapPartitionsWithIndex at Operator.scala:312

Thanks,
Manjunath