I am using a Spark cluster on Amazon EC2 (launched with the spark-ec2 script from the spark-1.6-prebuilt-with-hadoop-2.6 distribution) to run a Scala driver application that reads S3 object content in parallel.
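For reference, here is roughly what the driver does, plus a sketch of the per-partition multithreading I am trying to achieve. This is only a sketch: the bucket name, key list, node count, and thread count are placeholders, `sc` is the usual SparkContext, and I assume the AWS Java SDK v1 client (`new AmazonS3Client()` with the default credentials chain).

```scala
import com.amazonaws.services.s3.AmazonS3Client
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

val bucket   = "my-bucket"          // placeholder
val keys     = Seq("key1", "key2")  // placeholder: the real list has thousands of keys
val numNodes = 4                    // placeholder: number of slave nodes

// What I tried: one getObject call per record, mapped over an RDD of keys.
val keysRdd = sc.parallelize(keys, 2 * numNodes)
val contents = keysRdd.map { key =>
  val s3 = new AmazonS3Client()     // created on the executor, not serialized from the driver
  scala.io.Source.fromInputStream(
    s3.getObject(bucket, key).getObjectContent).mkString
}
contents.count()                    // action that actually triggers the reads

// What I would like: several concurrent fetches inside each partition,
// sketched here with a fixed thread pool per partition.
val threadsPerPartition = 8         // placeholder: tune to hide S3 latency
val threaded = keysRdd.mapPartitions { iter =>
  val pool = Executors.newFixedThreadPool(threadsPerPartition)
  implicit val ec = ExecutionContext.fromExecutorService(pool)
  val s3 = new AmazonS3Client()
  // Submit all fetches in this partition concurrently, then collect the results.
  val futures = iter.map { key =>
    Future(scala.io.Source.fromInputStream(
      s3.getObject(bucket, key).getObjectContent).mkString)
  }.toList
  val results = futures.map(Await.result(_, Duration.Inf))
  pool.shutdown()
  results.iterator
}
threaded.count()
```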
I have tried "s3n://bucket" with sc.textFile, and I have also set up an RDD of the S3 keys and used the AWS Java SDK to map each key to an s3client.getObject(bucket, key) call to read the content. In both cases it gets quite slow as the number of objects grows to even a few thousand: for 8000 objects it took almost 4 minutes. In contrast, a custom C++ S3 HTTP client fetching the same objects with multiple threads on a single-CPU VM took 1 minute with 8 threads and 12 seconds with 80 threads.

I observed two issues with both the s3n-based reads and the s3client-based calls:

(a) Each slave in the Spark cluster has 2 vCPUs. The job [ref:NOTE] is partitioned into a number of tasks equal either to the number of objects or to 2x the number of nodes. But on each node, Spark does not always start 2 or more executor worker threads as I expected; instead a single worker thread executes the tasks serially, one after the other.

(b) Each executor worker thread seems to go through its object keys serially, and the thread ends up almost always waiting on the socket stream.

Has anyone seen this, or am I doing something wrong? How do I make sure all the CPUs are used? And how can I instruct the executors to start multiple threads to process data in parallel within a partition/node/task?

NOTE: since sc.textFile and map are transformations, I use resultRDD.count to kick off the actual reads, and the times above are for the count job.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/parallelism-of-task-executor-worker-threads-during-s3-reads-tp26935.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.