The number of tasks equals the number of partitions, so you can pass the desired number of partitions to the textFile API. This should result in (a) a better spatial distribution of the RDD across the cluster, and (b) each partition being operated on by a separate task.
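As a sketch of the suggestion above (assuming a running SparkContext named `sc`; the input path and partition count are hypothetical), the second argument to textFile is a minimum-partition hint, and repartition can be used afterwards if the RDD still ends up with too few partitions:

```scala
// Sketch only: assumes an existing SparkContext `sc`.
// The second argument to textFile is a *minimum* partition count hint,
// not a guarantee.
val rdd = sc.textFile("hdfs:///input/part-*", 64)

// If the hint was not honored, repartition explicitly. This triggers a
// shuffle, but the result has exactly 64 partitions, hence 64 tasks.
val spread = rdd.repartition(64)

// Verify how many partitions (and therefore tasks) you will get.
println(spread.partitions.length)
```

Note that repartition involves a full shuffle, so it is worth checking `rdd.partitions.length` first and only repartitioning when the count is actually too low.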
-----Original Message-----
From: Pat Ferrel [mailto:p...@occamsmachete.com]
Sent: Thursday, April 23, 2015 5:51 PM
To: user@spark.apache.org
Subject: Tasks run only on one machine

Using Spark Streaming to create a large volume of small nano-batch input files, ~4k per file, thousands of 'part-xxxxx' files. When reading the nano-batch files and doing a distributed calculation, my tasks run only on the machine where the job was launched. I'm launching in "yarn-client" mode. The RDD is created using sc.textFile("list of thousand files").

What would cause the read to occur only on the machine that launched the driver? Do I need to do something to the RDD after reading? Has some partition factor been applied to all derived RDDs?

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org