On 1/20/15, 3:45 PM, David Pollak wrote:

> In Tez, is there a concept of shipping data back to the machine (likely not
> part of the Hadoop cluster) that spawned the Tez job?

The standard practice is to write to an HDFS directory and read the data back from there, instead of opening up ports between the containers and the client.

That's really an infrastructure workaround for secure clusters that don't have permissive firewalls.

The added plus is that the Tez containers can be preempted without worrying about data loss.
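To make that concrete, here is a minimal client-side sketch of the read-back half of that pattern. The output directory, the "part-" file naming and the plain-text record format are illustrative assumptions, not anything Tez mandates:

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}
  import scala.io.Source

  object ReadBack {
    // Hypothetical helper: once the DAG has finished, the client pulls
    // the results out of the agreed-upon HDFS output directory instead
    // of listening on a port for the containers to connect back to it.
    def readJobOutput(outputDir: String): Seq[String] = {
      val fs = FileSystem.get(new Configuration())
      fs.listStatus(new Path(outputDir))
        .filter(_.getPath.getName.startsWith("part-")) // skip _SUCCESS etc.
        .toSeq
        .flatMap { status =>
          val in = fs.open(status.getPath)
          try Source.fromInputStream(in).getLines().toList
          finally in.close()
        }
    }
  }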

> (like a Spark foreach). I'm looking for the same thing, except when the job
> is run via Yarn.

Tez works at a slightly different layer than the data formats as such, but the following code should give you a good idea of what happens when a Driver-side call to .collect() is translated into the Tez execution context.

https://github.com/hortonworks/spark-native-yarn/blob/master/src/main/scala/org/apache/spark/tez/TezJobExecutionContext.scala#L178

.forEach() is similar, almost literally a loop over a collection.
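As a rough sketch only (reusing the hypothetical readJobOutput helper above, not the actual TezJobExecutionContext code), the shape on the client ends up being:

  // "collect" == read the finished DAG's output back from HDFS ...
  val collected: Seq[String] = ReadBack.readJobOutput("/tmp/tez-job-output")

  // ... and "foreach" is then literally a loop over that collection,
  // running on the client machine that spawned the job.
  collected.foreach(println)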

Cheers,
Gopal
