>> Hi, Spark support:
>>
>> I am working on a research project which uses Spark on Amazon EC2 as the
>> cluster foundation for distributed computation.
>>
>> The project consumes image and text files, projects them into different
>> feature spaces, and indexes the resulting features for later querying.
>>
>> Currently the image and text data are stored in S3. Our application
>> distributes the projectors (which convert the raw data into features) to the
>> worker nodes via the SparkContext map function, and the result is persisted
>> in memory for later use.
>>
>> In the overall workflow, all the projectors run in parallel, while the
>> indexing runs sequentially (i.e. indexing does not run on the worker nodes).
>> I experimented with different numbers of worker nodes and noticed that
>> performance (measured as time consumption) improves as we run with 2, 3, 4,
>> ... worker nodes, but it degrades at 10 worker nodes and gets worse at 12
>> (time consumption climbs at 10 and 12), then improves again at 14, 16, and
>> 18 worker nodes (time consumption drops again).
>>
>> I tried another task (a similar workload, but a different program) where the
>> intermediate features are used for training, and classification is then
>> applied to some test data against the trained model.
>> In this set of experiments we chose 1, 2, 3, 4, ..., 10 worker nodes. The
>> same pattern appeared, except the performance was bumpier: time consumption
>> was worse on 5 worker nodes than on 4, improved on 6, degraded again on 8,
>> and recovered on 9 and 10 worker nodes.
>>
>> This is confusing, as I was expecting performance to scale near-linearly as
>> we increase the number of worker nodes. Once the overhead exceeds the gain
>> from distribution, performance should degrade and stay degraded, not
>> recover.
>>
>> P. S.
>>
>> I did apply the following tuning steps:
>>
>> System.setProperty("spark.scheduler.mode", "FAIR"), so that each job
>> receives a more evenly assigned share of cluster resources
>> System.setProperty("spark.task.maxFailures", "6"), so that a task is retried
>> more times before it is failed
>> System.setProperty("spark.akka.frameSize", "250"), since the communication
>> between master and workers exceeds the default 10 MB
>>
>> Did I miss any other tuning steps?
>>
>> Thanks,
>> Xiaobing