On 7/6/14, 3:22 PM, Grandl Robert wrote:

Is it possible to know, for a task/vertex, the input size it needs to transfer
from each input task/vertex on every edge? Similarly, or the same, for output?

Yes.

  <property>
    <name>tez.task.generate.counters.per.io</name>
    <value>true</value>
  </property>
  <!-- ~4x counters due to per-io -->
  <property>
    <name>tez.runtime.job.counters.max</name>
    <value>4096</value>
  </property>
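With per-IO counters enabled, each edge should show up as its own counter group. A minimal sketch of filtering a counters list down to the per-edge groups, assuming group names embed the vertex pair (e.g. something like `TaskCounter_Reducer_2_INPUT_Map_1`) and that each group is a dict with `counterGroupName` and a `counters` list — the naming pattern and field names here are assumptions for illustration, not taken from the Tez source:

```python
def per_edge_counters(counter_groups):
    """Return {group_name: {counter_name: value}} for per-IO groups only.

    Assumes per-edge counter groups carry "_INPUT_" or "_OUTPUT_" in their
    group name, and that each group looks like:
      {"counterGroupName": ..., "counters": [{"counterName": ...,
                                              "counterValue": ...}, ...]}
    Both the naming convention and the JSON layout are assumptions.
    """
    edges = {}
    for group in counter_groups:
        name = group.get("counterGroupName", "")
        if "_INPUT_" in name or "_OUTPUT_" in name:
            edges[name] = {
                c["counterName"]: c["counterValue"]
                for c in group.get("counters", [])
            }
    return edges
```

Running this over the counters pulled from a finished DAG should leave only the edge-level groups, one per input/output pair.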

I know that for each task/vertex you know the input/output vertices, but I
could not find a way to determine the input size on each edge to these
vertices.

If you are not on Hadoop 2.4.x and lack an Application Timeline Server install, you can instead log the same event stream to HDFS using

  <property>
    <name>tez.simple.history.logging.dir</name>
    <value>${fs.default.name}/user/gopal/tez-history/</value>
  </property>

This will log the JSON event stream to whichever HDFS directory you pick.

The default record separator is Ctrl+A ('\01').

The row marked DAG_FINISHED should have all the counters in it. That should be all you need for counters.
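A minimal sketch of pulling those counters out of a SimpleHistory dump, assuming each `'\x01'`-separated record is a standalone JSON object and that the finished-DAG record carries an `events` list with an `eventtype` of `DAG_FINISHED` plus its counters under `otherinfo` — these field names are assumptions about the JSON layout, not verified against Tez:

```python
import json

RECORD_SEPARATOR = "\x01"  # default SimpleHistory record separator

def dag_finished_counters(stream_text):
    """Scan a SimpleHistory log dump and return the counters attached to
    the DAG_FINISHED record, or None if no such record is present.

    The field names (events / eventtype / otherinfo / counters) are
    assumptions about the event-stream layout, not taken from Tez itself.
    """
    for record in stream_text.split(RECORD_SEPARATOR):
        record = record.strip()
        if not record:
            continue
        entity = json.loads(record)
        for event in entity.get("events", []):
            if event.get("eventtype") == "DAG_FINISHED":
                return entity.get("otherinfo", {}).get("counters")
    return None
```

Feeding it the contents of a history file (e.g. fetched with `hdfs dfs -cat`) should yield one counters object per finished DAG.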

I use the same data pulled off ATS to generate a Sankey diagram to analyze slow JOINs.

http://people.apache.org/~gopalv/sankey/

https://gist.github.com/t3rmin4t0r/650d0f0fc9d0cf52b43e

Cheers,
Gopal
