Thank you guys.
Just to confirm that I understood correctly: I can get the input/output size received/sent for a task with respect to its input/output vertices (not the tasks in those vertices) using the settings Gopal mentioned before. At least, this is what I see.

Another (possibly dumb) question: a vertex can have multiple tasks (not task attempts), one per input block, right? So a vertex entity is a stage abstraction rather than a task abstraction, right?

Robert

On Sunday, July 6, 2014 4:44 PM, Gopal V <[email protected]> wrote:

On 7/6/14, 3:22 PM, Grandl Robert wrote:
> Is it possible to know, for a task/vertex, what input size it needs to
> transfer from each input task/vertex on every edge? Similarly, or the
> same, for output?

Yes.

<property>
  <name>tez.task.generate.counters.per.io</name>
  <value>true</value>
</property>
<!-- ~4x counters due to per-io -->
<property>
  <name>tez.runtime.job.counters.max</name>
  <value>4096</value>
</property>

> I know for each task/vertex you know the input/output vertices, but I
> could not find a way to determine the input size on each edge to these
> vertices.

If you are not on Hadoop-2.4.x and lack an Application Timeline Server
install, you can instead log the same stream to HDFS using

<property>
  <name>tez.simple.history.logging.dir</name>
  <value>${fs.default.name}/user/gopal/tez-history/</value>
</property>

This will log the JSON event stream to whichever HDFS directory you
pick. The default record separator is Ctrl+A ('\01'). The row marked
DAG_FINISHED should have all the counters in it.

That should be all you need for counters. I use the same data pulled
off ATS to generate a Sankey diagram to analyze slow JOINs.

http://people.apache.org/~gopalv/sankey/
https://gist.github.com/t3rmin4t0r/650d0f0fc9d0cf52b43e

Cheers,
Gopal
