Hi, Spark support:

I am working on a research project that uses Spark on Amazon EC2 as the
cluster foundation for distributed computation.

The project consumes image and text files, projects these files into
different feature spaces, and indexes the features for later querying.

Currently the image and text data are stored in S3. Our application
distributes the projectors (which convert the data to features) to the worker
nodes via the SparkContext/RDD map function, and the results are persisted in
memory for later use.

In the overall workflow, all the projectors run in parallel, while the
indexing runs sequentially (i.e. indexing does not run on the worker nodes).
I tried different numbers of worker nodes in the experiments and noticed that
performance improves (in terms of elapsed time) as we run with 2, 3, 4, ...
worker nodes, but it degrades at 10 workers and gets worse at 12 (elapsed
time climbing at 10 and 12), then improves again at 14, 16, and 18 worker
nodes (elapsed time falling again).

I tried another task (a similar workload, but a different program) where the
intermediate features are used for training, and classification is then
applied to some test data against the trained model.
In this set of experiments we chose 1, 2, 3, 4, ... 10 worker nodes.
The same pattern appeared, except the performance was bumpier: elapsed time
got worse at 5 worker nodes than at 4, improved at 6, degraded again at 8,
and recovered at 9 and 10 worker nodes.

This is confusing, as I expected performance to scale near-linearly as we
increased the number of worker nodes. Once the coordination overhead exceeds
the gain from distribution, performance should degrade and stay degraded, not
recover.

P. S.

I did apply the following tuning steps:

System.setProperty("spark.scheduler.mode", "FAIR"), so that each job receives
a more evenly assigned share of cluster resources
System.setProperty("spark.task.maxFailures", "6"), so that a task is retried
more times before it is marked as failed
System.setProperty("spark.akka.frameSize", "250"), since the communication
between master and workers exceeds the default 10 MB frame size
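For completeness, here are the three settings together as a sketch. One caveat worth double-checking: in this System-property style of configuration, the properties are read when the SparkContext is constructed, so they must be set before creating it (the master URL and app name below are hypothetical).

```scala
// Set before the SparkContext is created, otherwise they are ignored.
System.setProperty("spark.scheduler.mode", "FAIR")   // fairer sharing across concurrent jobs
System.setProperty("spark.task.maxFailures", "6")    // retry a task up to 6 times (default is 4)
System.setProperty("spark.akka.frameSize", "250")    // max driver<->worker message size in MB (default 10)

val sc = new SparkContext("spark://master:7077", "FeatureProjection")
```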

Did I miss any other tuning steps?

Thanks,
Xiaobing
