>> Hi, Spark support:
>> 
>> I am working on a research project that uses Spark on Amazon EC2 as the 
>> cluster foundation for distributed computation.
>> 
>> The project basically consumes image and text files, projects them into 
>> different feature spaces, and indexes the features for later querying.
>> 
>> Currently the image and text data are stored in S3. Our application 
>> distributes the projectors (which convert the data into features) to the 
>> worker nodes via an RDD map, and the result is persisted in memory for 
>> later use.
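>> 
>> A minimal sketch of this stage (the master URL, bucket path, and the 
>> project() stub are placeholders for illustration, not our real code):
>> 
>> import org.apache.spark.SparkContext
>> import org.apache.spark.storage.StorageLevel
>> 
>> val sc = new SparkContext("spark://master:7077", "feature-extraction")
>> 
>> // Stand-in for our real projector, which turns a record into a
>> // feature vector
>> def project(record: String): Array[Double] = Array(record.length.toDouble)
>> 
>> // Read the raw records from S3 (path is illustrative)
>> val raw = sc.textFile("s3n://our-bucket/data/")
>> 
>> // Run the projectors on the workers: each record is mapped to its
>> // feature representation
>> val features = raw.map(record => project(record))
>> 
>> // Keep the features in memory for the later indexing stage
>> features.persist(StorageLevel.MEMORY_ONLY)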
>> 
>> For the whole workflow, all the projectors run in parallel, while the 
>> indexing runs sequentially (i.e. indexing does not run on the worker 
>> nodes; see the sketch after this paragraph). I tried different numbers 
>> of worker nodes in the experiments and noticed that performance (in 
>> terms of time consumption) improves as we run with 2, 3, 4, ... worker 
>> nodes, but it degrades with 10 worker nodes and gets worse with 12 (time 
>> consumption climbs at 10 and 12), then improves again with 14, 16, and 
>> 18 worker nodes (time consumption drops again).
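>> 
>> The sequential indexing stage looks roughly like this (buildIndex here 
>> is a runnable stand-in for our real indexer, not our actual code):
>> 
>> // Stand-in indexer: pairs each feature vector with its position
>> def buildIndex(fs: Array[Array[Double]]): Map[Int, Array[Double]] =
>>   fs.zipWithIndex.map(_.swap).toMap
>> 
>> // Pull all the features back to the driver and index them there;
>> // the workers are idle during this stage
>> val collected = features.collect()
>> val index = buildIndex(collected)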
>> 
>> I tried another task (a similar workload, but a different program) where 
>> the intermediate features are used for training, and classification is 
>> then run on some test data against the trained model.
>> In this set of experiments we chose the number of worker nodes as 1, 2, 
>> 3, 4, ... 10. The same thing happened, except the performance was even 
>> bumpier: time consumption was worse with 5 worker nodes than with 4, 
>> improved with 6, degraded again with 8, and recovered with 9 and 10 
>> worker nodes.
>> 
>> This is confusing, as I expected performance to scale near-linearly as we 
>> increase the number of worker nodes. Once the overhead outweighs the 
>> distributed gain, performance should degrade and stay degraded, not swing 
>> back to good.
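>> 
>> To make that expectation concrete, here is a back-of-the-envelope 
>> Amdahl's-law sketch (the parallel fraction 0.9 is a made-up number, not 
>> a measurement from our job):
>> 
>> // Expected speedup when a fraction p of the job parallelizes perfectly
>> // and the remaining (1 - p), e.g. the sequential indexing, does not
>> def amdahlSpeedup(p: Double, workers: Int): Double =
>>   1.0 / ((1.0 - p) + p / workers)
>> 
>> // The speedup flattens out as workers are added, but it grows
>> // monotonically; it should never get worse
>> (2 to 18 by 2).foreach(n => println(n + " workers: " + amdahlSpeedup(0.9, n)))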
>> 
>> P. S.
>> 
>> I did apply the following tuning steps:
>> 
>> System.setProperty("spark.scheduler.mode", "FAIR"), because I want each 
>> job to receive a more evenly assigned share of the cluster resources.
>> System.setProperty("spark.task.maxFailures", "6"), because I want a 
>> worker node to retry a task more times before failing it.
>> System.setProperty("spark.akka.frameSize", "250"), because the 
>> communication between our master and workers exceeds the default 10 MB.
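>> 
>> For reference, all three are set together before the SparkContext is 
>> created, since as far as I know the properties are only picked up at 
>> context creation (the master URL and app name below are placeholders):
>> 
>> import org.apache.spark.SparkContext
>> 
>> System.setProperty("spark.scheduler.mode", "FAIR")  // default is FIFO
>> System.setProperty("spark.task.maxFailures", "6")   // default is 4
>> System.setProperty("spark.akka.frameSize", "250")   // in MB; default is 10
>> 
>> val sc = new SparkContext("spark://master:7077", "feature-extraction")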
>> 
>> Did I miss any other tuning steps?
>> 
>> Thanks,
>> Xiaobing
