Sure.

var columns = mc.textFile(source).map { line => line.split(delimiter) }
Here “source” is a comma-delimited list of files or directories. Both the textFile and .map tasks happen only on the machine they were launched from. Other distributed operations happen later, but I suspect that if I can figure out why the first line runs only on the client machine, the rest will clear up too. Here are some subsequent lines:

if (filterColumn != -1) {
  columns = columns.filter { tokens => tokens(filterColumn) == filterBy }
}

val interactions = columns.map { tokens =>
  tokens(rowIDColumn) -> tokens(columnIDPosition)
}

interactions.cache()

On Apr 23, 2015, at 10:14 AM, Jeetendra Gangele <gangele...@gmail.com> wrote:

> Will you be able to paste code here?
>
> On 23 April 2015 at 22:21, Pat Ferrel <p...@occamsmachete.com> wrote:
>
>> Using Spark Streaming to create a large volume of small nano-batch input files, ~4k per file, thousands of ‘part-xxxxx’ files. When reading the nano-batch files and doing a distributed calculation, my tasks run only on the machine where the job was launched. I’m launching in “yarn-client” mode. The RDD is created using sc.textFile(“list of thousand files”).
>>
>> What would cause the read to occur only on the machine that launched the driver? Do I need to do something to the RDD after reading? Has some partition factor been applied to all derived RDDs?
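For what it’s worth, a common way to rule out a partitioning/locality cause for this symptom is to hint the partition count at read time, or to repartition explicitly after the read. This is a sketch, not code from the thread: it assumes a plain SparkContext `sc`, and the `400` partition count is an arbitrary illustrative value, not a recommendation.

// Sketch only. textFile takes an optional minPartitions hint; with
// thousands of tiny input files the default split computation can
// still leave the work concentrated on one node.
var columns = sc.textFile(source, minPartitions = 400)
  .map { line => line.split(delimiter) }

// Alternatively, force a shuffle after the read so that downstream
// .filter/.map stages are spread across the cluster’s executors:
columns = columns.repartition(sc.defaultParallelism)

If the tasks still land on one machine after a repartition, the cause is more likely executor allocation (e.g. yarn-client settings) than the RDD itself.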