Re: Spark loads data from HDFS or S3

2017-12-13 Thread Sebastian Nagel
> When Spark loads data from S3 (sc.textFile('s3://...')), how will all the data be spread across the workers?

The data is read by the workers themselves. Just make sure the input is splittable, either by using a splittable format or by passing a list of files, e.g. sc.textFile('s3://.../*.txt'), to achieve full parallelism.
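As a rough illustration of why a list of files parallelizes while a single unsplittable file does not, here is a minimal pure-Python sketch (the helper name and round-robin scheme are hypothetical, not Spark's actual partitioner):

```python
def assign_files_to_workers(files, num_workers):
    # Round-robin assignment: each worker receives every num_workers-th file.
    # With a single unsplittable file, all but one worker would stay idle.
    assignments = [[] for _ in range(num_workers)]
    for i, f in enumerate(files):
        assignments[i % num_workers].append(f)
    return assignments

parts = assign_files_to_workers(
    ["a.txt", "b.txt", "c.txt", "d.txt", "e.txt"], 2)
# parts[0] -> ["a.txt", "c.txt", "e.txt"]
# parts[1] -> ["b.txt", "d.txt"]
```

Spark's real behavior additionally splits large files of splittable formats into blocks, but the principle is the same: more independent input splits, more parallel read tasks.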

[Pyspark, Python 2.7] Executor hangup caused by Unicode error while logging uncaught exception in worker

2017-04-27 Thread Sebastian Nagel
mode and traced it down to this small script: https://gist.github.com/sebastian-nagel/310a5a5f39cc668fb71b6ace208706f7 Is this a known problem? Of course, one may argue that the job would have failed anyway, but a hang-up isn't that nice; on YARN it blocks resources (containers) un
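The underlying hazard is that the logging path itself can raise when an exception message contains non-ASCII data (in Python 2, mixing byte strings and unicode while formatting would raise a UnicodeDecodeError inside the handler). A defensive sketch, with a hypothetical helper name, is to make the formatting step itself exception-safe:

```python
def safe_format_exc(exc):
    """Format an exception message without ever raising.

    str() on an exception carrying non-ASCII data could itself fail
    under Python 2; fall back to repr(), which does not raise.
    """
    try:
        return str(exc)
    except UnicodeError:
        return repr(exc)

try:
    raise ValueError(u"caf\u00e9 \u2013 non-ASCII message")
except ValueError as e:
    msg = safe_format_exc(e)
```

This does not fix the worker protocol itself, but it shows the kind of guard an exception-reporting path needs so that a logging failure cannot turn into a hang-up.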