Hello, I am trying to use the RDD pipe method to integrate Spark with external commands to be executed on each partition. My program roughly looks like:
    rdd.pipe(cmd1).pipe(cmd2)

The output of cmd1 and the input of cmd2 are raw binary data. However, the pipe method on RDD requires converting the data to strings, which implies loss of data between the two command calls. I am now thinking of extending RDD.scala and PipedRDD.scala so as to give the end user direct access to the PrintWriter created in PipedRDD. Is there a better solution?
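One workaround I am considering, instead of patching PipedRDD: bypass pipe entirely and stream each partition's raw bytes through the external command myself via mapPartitions. Below is a rough, untested sketch of the per-partition helper; the name pipeBinary and the use of ProcessBuilder are my own, not anything in the Spark API:

```scala
import java.io.BufferedOutputStream

// Hypothetical helper: stream one partition's raw bytes through an
// external command, with no string conversion anywhere.
def pipeBinary(cmd: Seq[String], input: Iterator[Array[Byte]]): Array[Byte] = {
  val proc = new ProcessBuilder(cmd: _*).start()

  // Feed the partition's bytes to the process's stdin on a separate
  // thread, so we can read stdout concurrently and avoid deadlock
  // when the child's output buffer fills up.
  val writer = new Thread(() => {
    val out = new BufferedOutputStream(proc.getOutputStream)
    input.foreach(out.write)
    out.close()
  })
  writer.start()

  // Read the command's raw stdout in full.
  val result = proc.getInputStream.readAllBytes()
  writer.join()
  proc.waitFor()
  result
}
```

With an RDD[Array[Byte]] this would chain as something like rdd.mapPartitions(it => Iterator(pipeBinary(Seq("cmd1"), it))) followed by the same for cmd2, keeping the data binary end to end. I have not verified how this behaves with Spark's task retries or large partitions, so I would still prefer a supported solution if one exists.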