Thank you guys.
Just to confirm that I understood correctly: I can get the input/output size received/sent for a task with respect to its input/output vertices (not the tasks in those vertices) using the settings Gopal mentioned before. At least, this is what I see.

Another (possibly dumb) question: a vertex can have multiple tasks (not task attempts), one per input block, right? So a vertex entity is a stage abstraction rather than a task abstraction, right?

Robert

On Sunday, July 6, 2014 4:44 PM, Gopal V <[email protected]> wrote:

On 7/6/14, 3:22 PM, Grandl Robert wrote:
> Is it possible to know, for a task/vertex, what input size it needs to
> transfer from each input task/vertex on every edge? Similarly, or the
> same, for output?

Yes.

<property>
  <name>tez.task.generate.counters.per.io</name>
  <value>true</value>
</property>
<!-- ~4x counters due to per-io -->
<property>
  <name>tez.runtime.job.counters.max</name>
  <value>4096</value>
</property>

> I know for each task/vertex you know the input/output vertices, but I
> could not find a way to determine the input size on each edge to these
> vertices.

If you are not on Hadoop-2.4.x and lack an Application Timeline Server
install, you can instead log the same stream to HDFS using

<property>
  <name>tez.simple.history.logging.dir</name>
  <value>${fs.default.name}/user/gopal/tez-history/</value>
</property>

This will log the JSON event stream to whichever HDFS directory you
pick. The default record separator is Ctrl+A ('\01'). The row marked
DAG_FINISHED should have all the counters in it.

That should be all you need for counters. I use the same data pulled
off ATS to generate a Sankey diagram to analyze slow JOINs.

http://people.apache.org/~gopalv/sankey/
https://gist.github.com/t3rmin4t0r/650d0f0fc9d0cf52b43e

Cheers,
Gopal
