Johannes,
How many tasks end up running for this specific vertex? Is it more than a
single wave of tasks (i.e. more tasks than containers available on the cluster)?
Depending on configuration, Tez may assign new tasks to already-running
containers (container re-use). Tuning these settings may help -
tez.am.container.reuse.locality.delay-allocation-millis - increase this to
a higher value for re-use to be less aggressive about giving up on locality
(default is 250 ms)
tez.am.container.reuse.rack-fallback.enabled - enable/disable rack-local
fallback for re-use
tez.am.container.reuse.non-local-fallback.enabled - enable/disable
non-local fallback for re-use

You could try disabling container re-use completely to see if the situation
improves.
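As a rough sketch, these could be set on the Configuration handed to the
TezClient - the property strings are the ones above plus
tez.am.container.reuse.enabled, and the values are only illustrative:

    import org.apache.hadoop.conf.Configuration;

    Configuration tezConf = new Configuration();
    // higher value = re-use is less aggressive about giving up locality (default 250 ms)
    tezConf.setLong("tez.am.container.reuse.locality.delay-allocation-millis", 1000);
    // turn off rack-local / non-local fallback for re-use
    tezConf.setBoolean("tez.am.container.reuse.rack-fallback.enabled", false);
    tezConf.setBoolean("tez.am.container.reuse.non-local-fallback.enabled", false);
    // or disable container re-use entirely for the comparison
    tezConf.setBoolean("tez.am.container.reuse.enabled", false);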
Also - how many tasks are generated for MapReduce vs. Tez?

Thanks
- Sid

On Mon, Feb 9, 2015 at 8:18 AM, Johannes Zillmann <[email protected]>
wrote:

> Hey guys,
>
> I have a question about data locality in Tez.
> With the same type of input and computation logic:
> MapReduce data locality: 95 %
> Tez data locality: 50 %
>
> I have a custom InputInitializer where I'm doing something like this:
>
>         InputSplit[] splits = inputFormat.getSplits(conf, desiredSplits);
>
>         List<Event> events = Lists.newArrayList();
>         List<TaskLocationHint> locationHints = Lists.newArrayList();
>         for (InputSplit split : splits) {
>             locationHints.add(
>                 TaskLocationHint.createTaskLocationHint(split.getLocations(), null));
>         }
>         VertexLocationHint locationHint = VertexLocationHint.create(locationHints);
>
>         InputConfigureVertexTasksEvent configureVertexEvent =
>             InputConfigureVertexTasksEvent.create(splits.size(), locationHint,
>                 InputSpecUpdate.getDefaultSinglePhysicalInputSpecUpdate());
>         events.add(configureVertexEvent);
>
>         for (TezSplit split : splits) {
>             events.add(InputDataInformationEvent.createWithSerializedPayload(
>                 events.size() - 1, ByteBuffer.wrap(split.toByteArray())));
>         }
>
> Any obvious flaw here?
> Or an explanation of why data locality is worse?
>
> best
> Johannes
