Having the runtime be a lot shorter but the MB-seconds be similar can 
definitely happen.  For example, consider a legacy MapReduce pipeline that 
runs a chain of jobs.  Each job in the chain cannot start until the previous 
job finishes.  Most jobs have at least some skew, where the whole job cannot 
complete until that last, longer-running task completes.  Finally it completes 
and the next job starts.  That job has its own long tail which keeps the job 
after it from starting, and so on.  Note that while the last task of a job is 
running, the incremental MB-seconds is pretty small since only one task of the 
job is running at that time.  The overall latency is large due to all these 
last-task stalls, but the footprint profile isn't a rectangle of total tasks * 
total runtime.

Since Tez executes the DAG directly, there are no artificial "job boundaries" 
between vertices in the DAG, so it can do a better job of pipelining the work. 
In some cases approximately the same amount of work is getting done with 
legacy vs. Tez, so the overall MB-seconds value is going to be similar. 
However, since Tez was able to pipeline the work more efficiently, the overall 
job latency improved significantly.  Therefore job latency is not a very good 
proxy for job footprint, because the width of the job at any point in time can 
be drastically different.
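
To make that concrete, here is a toy sketch.  The task counts, sizes, and 
times are made up purely for illustration (not taken from this job), and it 
assumes the cluster can run a whole stage's worth of tasks at once:

    # Two stages of 100 tasks, each task in a 4096 MB container.
    # 99 tasks finish in 60 s, one straggler takes 300 s.
    stage = [(4096, 60)] * 99 + [(4096, 300)]
    tasks = stage * 2   # both stages do the same total amount of work

    # MB-seconds depends only on per-task memory * runtime, not on scheduling.
    mb_seconds = sum(mem_mb * secs for mem_mb, secs in tasks)

    # MapReduce-style: stage 2 cannot start until stage 1's straggler finishes.
    latency_serialized = 300 + 300      # 600 s of wall clock

    # Tez-style: stage 2's tasks overlap stage 1's tail (illustrative overlap).
    latency_pipelined = 300 + 60        # ~360 s of wall clock

The MB-seconds total is identical in both schedules; only the wall-clock 
latency changes, which is exactly why latency is a poor proxy for footprint.
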
Container reuse in Tez will always result in at least some containers being 
allocated and then quickly released.  It's a natural race inherent with 
container reuse.  When a new task needs to be allocated, a request is sent out 
to the RM to allocate a container for that task.  If the task gets reassigned 
to a reused container then the RM may grant the allocation before the Tez AM 
can cancel the allocation request.  Now the Tez AM has an extra container, and 
if it cannot find another task to run on it then it will end up releasing it. 
Note that Tez will also hold onto containers for a small amount of time as it 
tries to match them first for node locality, then a little later for rack 
locality, and finally for any task.

Without having access to the Tez AM logs, it's going to be hard to know for 
sure what is happening with the job.  Given that the unused container footprint 
was only 5% of the total job footprint, I suspect this is simply a case of 
pipelining the work for better latency while approximately the same amount of 
work gets done.  Calculating the MB-second footprint from the tasks-only 
perspective (i.e., sum of task_size * task_runtime for the entire job) would 
help verify that.  I'd also look for Tez tasks spending a long period of time 
waiting for shuffle data to become ready.  You could try setting the slowstart 
setting to 1.0 so that downstream tasks only launch when all the upstream tasks 
are done; that way tasks avoid sitting on the cluster for a long time waiting 
for their final shuffle input to become available.
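
A minimal sketch of that tasks-only calculation, assuming you can extract each 
task's container size and runtime (for example from the AM logs or the 
timeline data; the function and field names here are just placeholders):

    def tasks_only_mb_seconds(tasks):
        # tasks: iterable of (container_size_mb, task_runtime_seconds) tuples
        return sum(size_mb * runtime_s for size_mb, runtime_s in tasks)

If that number (times 1000, to convert to megabyte-millis) comes out close to 
the reported MB_MILLIS, then the footprint is dominated by task work rather 
than by idle or held containers.  For the slowstart change, setting both 
tez.shuffle-vertex-manager.min-src-fraction and 
tez.shuffle-vertex-manager.max-src-fraction to 1.0 should have that effect, 
assuming the ShuffleVertexManager is what's scheduling those vertices.
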
Jason
 

    On Thursday, March 16, 2017 7:01 PM, Piyush Narang <pnar...@twitter.com> 
wrote:
 

Hi folks,
I'm trying to compare the performance of a Scalding job on Tez vs Hadoop and 
understand a bit about the resource usage overheads on Tez. This job reads 
around 6TB of input data and sets up a Scalding flow with 13 Hadoop jobs (20 
Tez vertices).
The runtime of the job on Tez is around 2-3x better than that on Hadoop. The 
Tez run takes 15 mins (container reuse on) or 25 mins (reuse off), while the 
Hadoop job takes around 48 mins. 
Looking at the resource usage in Megabyte Millis (total container uptime * 
allocatedMemory), it seems like Tez is on par with Hadoop (0.59% worse). This 
seems strange given that Tez is running for 2-3x less time. I'm trying to 
understand why this is the case, and I could use any ideas / suggestions that 
folks might have. 
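
For reference, here's a simplified sketch of that metric as defined above (the 
helper name is just for illustration):

    def megabyte_millis(containers):
        # containers: iterable of (allocated_mb, uptime_ms) tuples
        return sum(mb * ms for mb, ms in containers)

    # e.g. one 4096 MB container held for 15 minutes contributes
    # 4096 * 15 * 60 * 1000 ms = ~3.7e9 megabyte-millis
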
Some things I have checked out / confirmed:
   - For the purpose of these tests, my container sizes in Hadoop and Tez are 
identical (4G).
   - Total number of containers:
      - Hadoop seems to be spinning up around 46K containers over the 13 jobs.
      - Tez spins up 32K tasks (and around 35K containers when reuse=false, 
and 5-10K containers when reuse=true).
   - While looking at the run with reuse=false and comparing container runtime 
durations with the corresponding task's runtime, I noticed that a decent 
number of containers ran for around 1s or more longer than the task they 
hosted. Is the Tez AM holding on to containers for some time before starting 
up tasks on them? This contributes around 5% of the MB_MILLIS in the Tez run.
   - Tez seems to be spinning up around 3K extra containers that are very 
short lived. This doesn't contribute a lot of overhead but seemed interesting.
   - I tried out a couple of minor slowstart setting tweaks 
(tez.shuffle-vertex-manager.max-src-fraction=0.95, min-src-fraction=0.9); this 
made MB_MILLIS a little worse (~10%).
I looked at some of the resource usage related jiras like TEZ-3274 and 
TEZ-3535. We're not hitting TEZ-3274. Wasn't sure based on reading TEZ-3535 if 
that might be a problem here. 

Does anyone have suggestions on what we could look into / explore? Not sure if 
others have run into such scenarios while comparing resource usage numbers on 
Hadoop / Tez, where the runtimes are much better on Tez but usage isn't much 
better. 
Thanks, 
-- 
- Piyush

   
