Hi, Spark support:
I am working on a research project that uses Spark on Amazon EC2 as the
cluster foundation for distributed computation.
The project consumes image and text files, projects these files into
different feature spaces, and indexes the features for later querying.
Currently the image and text data are stored in S3. Our application
distributes the projectors (which convert the data to features) to the worker
nodes via the map function, and the results are persisted in memory for later use.
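Roughly, the projection stage looks like the sketch below (the cluster URL, the manifest path, and projectToFeatures are placeholders standing in for our actual code):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(clusterUrl, "FeatureProjection")

// Read one S3 file path per line, project each file to features on the
// workers in parallel, and keep the results in memory for later queries.
val features = sc
  .textFile("s3n://our-bucket/manifest.txt")  // placeholder manifest of inputs
  .map(path => projectToFeatures(path))       // placeholder projector function
  .persist(StorageLevel.MEMORY_ONLY)
```

The indexing step then iterates over the persisted features on the driver, which is why it runs sequentially.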
For the whole workflow, all the projectors run in parallel, while the
indexing is done sequentially (i.e. indexing does not run on the worker
nodes). I experimented with different numbers of worker nodes and noticed
that performance improved (in terms of time consumption) as we increased the
count to 2, 3, 4, ..., but it degraded at 10 worker nodes and got worse at 12
(time consumption going uphill at 10 and 12), then improved again at 14, 16,
and 18 worker nodes (time consumption going downhill again).
I tried another task (a similar workload, but a different program) where the
intermediate features are used for training, and classification is then
applied to some test data against the trained model.
In this set of experiments we used 1, 2, 3, 4, ..., 10 worker nodes.
The same thing happened, except the performance was bumpier: time consumption
was worse with 5 worker nodes than with 4, improved at 6, degraded again at
8, and recovered at 9 and 10 worker nodes.
This is confusing, as I expected performance to scale near-linearly as we
increased the number of worker nodes. Once the overhead exceeds the gain from
distribution, performance should degrade and not recover.
P.S.
I have already applied the following tuning settings:
System.setProperty("spark.scheduler.mode", "FAIR"), as I want each job to
receive a more evenly assigned share of cluster resources
System.setProperty("spark.task.maxFailures", "6"), as I want a worker node to
retry more times before it fails a task
System.setProperty("spark.akka.frameSize", "250"), as the communication
between master and workers exceeds the default 10 MB
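For completeness, this is how we apply them; the properties are set before the SparkContext is constructed, since they are read at that point (the cluster URL and app name are placeholders):

```scala
// Tuning properties, set before creating the SparkContext.
System.setProperty("spark.scheduler.mode", "FAIR")  // fair sharing across jobs
System.setProperty("spark.task.maxFailures", "6")   // retry a task up to 6 times
System.setProperty("spark.akka.frameSize", "250")   // Akka frame size in MB (default 10)

val sc = new SparkContext(clusterUrl, "FeatureProjection")
```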
Did I miss any other tuning steps?
Thanks,
Xiaobing