Johannes,

How many tasks end up running for this specific vertex? Is it more than a single wave of tasks (i.e. more than the number of containers available on the cluster)? Depending on configuration, Tez assigns new tasks to already-running containers. Tuning these may help:

tez.am.container.reuse.locality.delay-allocation-millis - increase this to make re-use less aggressive (default is 250 ms)
tez.am.container.reuse.rack-fallback.enabled - enable/disable rack-local fallback for re-use
tez.am.container.reuse.non-local-fallback.enabled - enable/disable non-local fallback for re-use
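These can be set on the TezConfiguration used to build the client/DAG. A rough sketch (untested, values are illustrative):

    TezConfiguration tezConf = new TezConfiguration();
    // Wait longer for a node-local container before falling back (default 250 ms).
    tezConf.setLong("tez.am.container.reuse.locality.delay-allocation-millis", 1000L);
    // Disable rack-local and non-local fallback when re-using containers.
    tezConf.setBoolean("tez.am.container.reuse.rack-fallback.enabled", false);
    tezConf.setBoolean("tez.am.container.reuse.non-local-fallback.enabled", false);
    // Or switch container re-use off entirely:
    // tezConf.setBoolean("tez.am.container.reuse.enabled", false);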
You could try disabling container re-use completely (the tez.am.container.reuse.enabled line in the snippet above) to see if the situation improves. Also - how many tasks are generated for MapReduce vs Tez?

Thanks
- Sid

On Mon, Feb 9, 2015 at 8:18 AM, Johannes Zillmann <[email protected]> wrote:

> Hey guys,
>
> I have a question about data locality in Tez.
> Same type of input and computation logic.
> MapReduce data locality: 95%
> Tez data locality: 50%
>
> Having a custom InputInitializer where I'm doing something like this:
>
>   InputSplit[] splits = inputFormat.getSplits(conf, desiredSplits);
>
>   List<Event> events = Lists.newArrayList();
>   List<TaskLocationHint> locationHints = Lists.newArrayList();
>   for (InputSplit split : splits) {
>     locationHints.add(TaskLocationHint.createTaskLocationHint(split.getLocations(), null));
>   }
>   VertexLocationHint locationHint = VertexLocationHint.create(locationHints);
>
>   InputConfigureVertexTasksEvent configureVertexEvent =
>       InputConfigureVertexTasksEvent.create(splits.size(), locationHint,
>           InputSpecUpdate.getDefaultSinglePhysicalInputSpecUpdate());
>   events.add(configureVertexEvent);
>   for (TezSplit split : splits) {
>     events.add(InputDataInformationEvent.createWithSerializedPayload(events.size() - 1,
>         ByteBuffer.wrap(split.toByteArray())));
>   }
>
> Any obvious flaw here?
> Or an explanation for why data locality is worse?
>
> best
> Johannes
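
P.S. One small thing in the snippet: the events.size() - 1 indexing for the InputDataInformationEvents only lines up because the configure event is added to the list first. An explicit counter is less fragile - a sketch, keeping your custom TezSplit wrapper:

    int taskIndex = 0;
    for (TezSplit split : splits) {
      // The event index must match the index of the task the split is destined for.
      events.add(InputDataInformationEvent.createWithSerializedPayload(taskIndex++,
          ByteBuffer.wrap(split.toByteArray())));
    }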
