There's no canonical way to do this, as far as I understand. For instance, when
running under YARN, you have no idea in advance where your containers will be
started. Moreover, if one of the containers fails, it might be restarted on
another machine, so the machine count might change at runtime.
To
I've wanted similar functionality too: when network-IO bound (in my case,
pulling things from S3 to HDFS) I wished there were a `.mapMachines` API so I
wouldn't have to guess at the proper partitioning of a 'driver' RDD for
`sc.parallelize(1 to N, N).map(i => /* pull the i'th chunk from S3 */)`.
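One workaround I've seen for approximating "one task per machine" is to size the driver RDD by the number of distinct executor hosts rather than guessing N. A minimal sketch, assuming `sc` is your SparkContext and `pullChunk` is a hypothetical function that fetches the i'th chunk from S3 (Spark does not guarantee the tasks land on distinct machines, though with one partition per host they usually spread out):

```scala
// Count distinct executor hosts; getExecutorMemoryStatus is keyed by
// "host:port" block-manager IDs (and includes the driver).
val numHosts = sc.getExecutorMemoryStatus.keys
  .map(_.split(":")(0))
  .toSet
  .size

// One partition per host, so the scheduler can place one task per machine.
sc.parallelize(1 to numHosts, numHosts).map { i =>
  pullChunk(i) // hypothetical: pull the i'th chunk from S3
}
```

This is best-effort only: the scheduler may still run two tasks on the same node, and executors can come and go between the count and the job.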
What's the canonical way to find out the number of physical machines in a
cluster at runtime in Spark? I believe SparkContext.defaultParallelism will
give me the number of cores, but I'm interested in the number of NICs.
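There's no official "number of machines" API that I know of, but one common approximation is to count the distinct hosts the SparkContext currently knows about. A hedged sketch, assuming `sc` is your SparkContext:

```scala
// getExecutorMemoryStatus returns a Map keyed by "host:port" block-manager
// IDs, including the driver's. Distinct hosts approximate physical machines.
val hosts = sc.getExecutorMemoryStatus.keys
  .map(_.split(":")(0))
  .toSet

val numMachines = hosts.size // driver host included; subtract it if unwanted
```

Note this is a point-in-time snapshot: with dynamic allocation (or container restarts under YARN, as mentioned above) the set of hosts can change during the job.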
I'm writing a Spark streaming application to ingest from Kafka with the