OK, it turned out that we calculated resources for MapReduce and Tez differently 
and thus over-combined splits in Tez, which reduced the split 
count!
However, MapReduce still outperformed Tez in a lot of runs. After multiple 
iterations over the issue (it is a deployment at a customer where we have limited 
access), things look like this:

- customer has capacity scheduler configured (configuration attached, our 
product uses the productA queue)
- if the cluster is completely free of use, Tez outperforms MapReduce
- when the cluster is in use, MapReduce seems to always outperform Tez

So the question is: is there some difference in how Tez grabs resources from 
the capacity scheduler compared to MapReduce?
Looking at the logs, it looks like Tez is always very slow in starting the 
containers, whereas MapReduce parallelizes very quickly.

Any thoughts on that ?

Johannes 

Attachment: capacity-scheduler.xml
Description: XML document

> On 09 Feb 2015, at 19:24, Siddharth Seth <[email protected]> wrote:
> 
> Johannes,
> How many tasks end up running for this specific vertex ? Is it more than a 
> single wave of tasks (number of containers available on the cluster?).
> Tez ends up allocating already running containers depending on configuration. 
> Tuning these may help - 
> tez.am.container.reuse.locality.delay-allocation-millis - Increase this to a 
> higher value, for re-use to be less aggressive (default is 250 (ms))
> tez.am.container.reuse.rack-fallback.enabled - enable/disable rack fallback 
> re-use
> tez.am.container.reuse.non-local-fallback.enabled - enable/disable non-local 
> re-use
> 
> You could try disabling container re-use completely to see if the situation 
> improves.
> Also - how many tasks are generated for MapReduce vs Tez ?
> 
> Thanks
> - Sid
> 
> On Mon, Feb 9, 2015 at 8:18 AM, Johannes Zillmann <[email protected]> 
> wrote:
> Hey guys,
> 
> have a question about data locality in Tez.
> Same type of input and computation logic.
> Map reduce data locality: 95 %
> Tez data locality: 50 %
> 
> I have a custom InputInitializer where I'm doing this:
> 
>         InputSplit[] splits = inputFormat.getSplits(conf, desiredSplits);
> 
>         List<Event> events = Lists.newArrayList();
>         List<TaskLocationHint> locationHints = Lists.newArrayList();
>         for (InputSplit split : splits) {
>             locationHints.add(
>                 TaskLocationHint.createTaskLocationHint(split.getLocations(), null));
>         }
>         VertexLocationHint locationHint = VertexLocationHint.create(locationHints);
> 
>         InputConfigureVertexTasksEvent configureVertexEvent =
>             InputConfigureVertexTasksEvent.create(splits.size(), locationHint,
>                 InputSpecUpdate.getDefaultSinglePhysicalInputSpecUpdate());
>         events.add(configureVertexEvent);
>         for (TezSplit split : splits) {
>             events.add(InputDataInformationEvent.createWithSerializedPayload(
>                 events.size() - 1, ByteBuffer.wrap(split.toByteArray())));
>         }
> 
> Any obvious flaw here ?
> Or an explanation why data locality is worse ?
> 
> best
> Johannes
> 
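
For reference, a minimal sketch of how the container re-use settings from Sid's mail could be collected before a test run. The three tuning property names are the ones quoted above; tez.am.container.reuse.enabled (the general on/off switch) is not named in this thread and is included here as an assumption. The class TezReuseSettings and its build method are hypothetical helpers for illustration only; in a real job the resulting key/value pairs would be set on the Tez configuration rather than kept in a plain map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: gathers the Tez container re-use properties discussed in this thread. */
final class TezReuseSettings {

    // General re-use switch (assumption: not quoted in the thread itself).
    static final String REUSE_ENABLED = "tez.am.container.reuse.enabled";
    // Property names as quoted in Sid's mail.
    static final String DELAY_MILLIS = "tez.am.container.reuse.locality.delay-allocation-millis";
    static final String RACK_FALLBACK = "tez.am.container.reuse.rack-fallback.enabled";
    static final String NON_LOCAL_FALLBACK = "tez.am.container.reuse.non-local-fallback.enabled";

    /** Returns the settings as string pairs, in a stable insertion order. */
    static Map<String, String> build(boolean reuseEnabled, long delayMillis,
                                     boolean rackFallback, boolean nonLocalFallback) {
        if (delayMillis < 0) {
            throw new IllegalArgumentException("delayMillis must be >= 0");
        }
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put(REUSE_ENABLED, Boolean.toString(reuseEnabled));
        settings.put(DELAY_MILLIS, Long.toString(delayMillis));
        settings.put(RACK_FALLBACK, Boolean.toString(rackFallback));
        settings.put(NON_LOCAL_FALLBACK, Boolean.toString(nonLocalFallback));
        return settings;
    }

    public static void main(String[] args) {
        // Example: keep re-use on, but wait longer for a node-local match
        // and do not fall back to non-local containers.
        for (Map.Entry<String, String> e : build(true, 2000L, true, false).entrySet()) {
            System.out.println(e.getKey() + "=" + e.getValue());
        }
    }
}
```

Calling build(false, 250L, true, true) instead would correspond to the "disable container re-use completely" experiment, with the other values left at arbitrary placeholders.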
