> When Spark loads data from S3 (sc.textFile('s3://...')), how will the data
> be spread across the workers?
The data is read by the workers. Just make sure the data is splittable,
either by using a splittable format or by passing a glob of files:
sc.textFile('s3://.../*.txt')
so that you get full parallelism.
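To illustrate why the format matters: plain text can be split by byte range
(a task skips to the next newline and reads complete records), while a gzip
stream can only be decompressed from the beginning, so one .gz file ends up
as a single partition. A minimal stdlib sketch of that difference (not
Spark's actual input-format code, just an illustration):

```python
import gzip

# 100 newline-delimited records, as Spark would see in a text file.
lines = [("record-%03d" % i).encode() for i in range(100)]
data = b"\n".join(lines) + b"\n"
gz = gzip.compress(data)

# Plain text is splittable: a task handed the byte range starting at
# offset 500 skips the partial first line and parses complete records.
chunk = data[500:]
records = chunk.split(b"\n")[1:]   # drop the partial leading record
print(records[0])                  # a complete record, mid-file

# A gzip stream is not: bytes taken from the middle are not a valid
# gzip member, so the whole file must go to a single task.
try:
    gzip.decompress(gz[50:])
    mid_readable = True
except OSError:                    # gzip.BadGzipFile is an OSError
    mid_readable = False
print(mid_readable)                # False
```

This is why a single large .gz object on S3 gives you one partition no
matter how many executors you have, while the same data as plain text (or
many smaller files) parallelizes.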
mode and traced it down to this small script:
https://gist.github.com/sebastian-nagel/310a5a5f39cc668fb71b6ace208706f7
Is this a known problem?
Of course, one may argue that the job would have failed anyway, but a
hang-up isn't nice; on YARN it blocks resources (containers) un