On 7/6/14, 3:22 PM, Grandl Robert wrote:
> Is it possible to know, for a task/vertex, the input size it needs to
> transfer from each input task/vertex on every edge? And similarly, or
> the same, for the output?
Yes.
<property>
<name>tez.task.generate.counters.per.io</name>
<value>true</value>
</property>
<!-- ~4x counters due to per-io -->
<property>
<name>tez.runtime.job.counters.max</name>
<value>4096</value>
</property>
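With per-IO counters enabled, each counter group is keyed by the vertex and the edge it belongs to, so per-edge sizes can be picked out of a counter dump. A minimal sketch, assuming a group-naming scheme like `TaskCounter_<vertex>_INPUT_<source vertex>` and the `SHUFFLE_BYTES` counter (check the actual names your Tez version emits):

```python
def edge_input_bytes(counters):
    """Map (vertex, source-vertex) edges to bytes shuffled across them.

    `counters` is assumed to be {group_name: {counter_name: value}}.
    The "_INPUT_" group-name convention is an assumption; verify it
    against a real counter dump before relying on it.
    """
    edges = {}
    for group, values in counters.items():
        if "_INPUT_" not in group:
            continue  # not a per-edge input counter group
        prefix, source = group.split("_INPUT_", 1)
        vertex = prefix.split("_", 1)[-1]  # drop the counter-class prefix
        edges[(vertex, source)] = values.get("SHUFFLE_BYTES", 0)
    return edges
```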
> I know that for each task/vertex you know the input/output vertices, but
> I could not find a way to determine the input size on each edge to these
> vertices.
If you are not on Hadoop-2.4.x or do not have an Application Timeline
Server installed, you can instead log the same event stream to HDFS using
<property>
<name>tez.simple.history.logging.dir</name>
<value>${fs.default.name}/user/gopal/tez-history/</value>
</property>
This will log the JSON event stream to whichever HDFS directory you pick.
The default record separator is Ctrl+A ('\01').
The record marked DAG_FINISHED carries the full set of counters, which
should be all you need.
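Pulling the counters back out of that log is a matter of splitting on the record separator and finding the DAG_FINISHED record. A minimal sketch; the JSON field names here ("otherinfo" -> "counters") are an assumption about the event layout and may differ across Tez versions:

```python
import json

RECORD_SEP = "\x01"  # default record separator in the simple history log

def dag_finished_counters(log_text):
    """Return the counters object from the first DAG_FINISHED record.

    Returns None if no such record is found. The "otherinfo"/"counters"
    path is an assumed layout; inspect one record from your own log to
    confirm where the counters actually live.
    """
    for record in log_text.split(RECORD_SEP):
        record = record.strip()
        if not record or "DAG_FINISHED" not in record:
            continue
        event = json.loads(record)
        return event.get("otherinfo", {}).get("counters")
    return None
```

The substring check keeps the loop cheap: only the one record that mentions DAG_FINISHED gets parsed as JSON.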
I use the same data pulled off ATS to generate a Sankey diagram to
analyze slow JOINs.
http://people.apache.org/~gopalv/sankey/
https://gist.github.com/t3rmin4t0r/650d0f0fc9d0cf52b43e
Cheers,
Gopal