OK, it turned out that we calculated resources for MapReduce and Tez differently and thus over-combined splits in Tez, which led to a lower split count! However, MapReduce still outperformed Tez in a lot of runs. After multiple iterations on the issue (a deployment at a customer site to which we have limited access), things look like this:
- the customer has the capacity scheduler configured (configuration attached; our product uses the productA queue)
- if the cluster is completely free, Tez outperforms MapReduce
- when the cluster is in use, MapReduce seems to always outperform Tez

So the question is: is there some difference in how Tez grabs resources from the capacity scheduler compared to MapReduce? Looking at the logs, Tez is always very slow to start its containers, whereas MapReduce parallelizes very quickly. Any thoughts on that?

Johannes
Attachment: capacity-scheduler.xml (XML document)
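For context, both engines are (presumably) submitted to the same productA queue from the attached capacity-scheduler.xml; a minimal sketch of the relevant client-side settings (property names only, the rest is illustrative, not our exact code):

    import org.apache.hadoop.conf.Configuration;

    Configuration conf = new Configuration();
    // MapReduce jobs pick their capacity-scheduler queue via this property.
    conf.set("mapreduce.job.queuename", "productA");
    // The Tez AM picks its queue via this property.
    conf.set("tez.queue.name", "productA");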
> On 09 Feb 2015, at 19:24, Siddharth Seth <[email protected]> wrote:
>
> Johannes,
> How many tasks end up running for this specific vertex? Is it more than a
> single wave of tasks (the number of containers available on the cluster)?
> Tez ends up allocating already running containers depending on configuration.
> Tuning these may help:
>
> tez.am.container.reuse.locality.delay-allocation-millis - increase this to a
>   higher value for re-use to be less aggressive (default is 250 ms)
> tez.am.container.reuse.rack-fallback.enabled - enable/disable rack-fallback re-use
> tez.am.container.reuse.non-local-fallback.enabled - enable/disable non-local re-use
>
> You could try disabling container re-use completely to see if the situation
> improves.
> Also - how many tasks are generated for MapReduce vs Tez?
>
> Thanks
> - Sid
>
> On Mon, Feb 9, 2015 at 8:18 AM, Johannes Zillmann <[email protected]> wrote:
>
> Hey guys,
>
> I have a question about data locality in Tez.
> Same type of input and computation logic:
> MapReduce data locality: 95 %
> Tez data locality: 50 %
>
> I have a custom InputInitializer where I'm doing something like this:
>
>     InputSplit[] splits = inputFormat.getSplits(conf, desiredSplits);
>
>     List<Event> events = Lists.newArrayList();
>     List<TaskLocationHint> locationHints = Lists.newArrayList();
>     for (InputSplit split : splits) {
>         locationHints.add(TaskLocationHint.createTaskLocationHint(split.getLocations(), null));
>     }
>     VertexLocationHint locationHint = VertexLocationHint.create(locationHints);
>
>     InputConfigureVertexTasksEvent configureVertexEvent =
>         InputConfigureVertexTasksEvent.create(splits.size(), locationHint,
>             InputSpecUpdate.getDefaultSinglePhysicalInputSpecUpdate());
>     events.add(configureVertexEvent);
>     for (TezSplit split : splits) {
>         events.add(InputDataInformationEvent.createWithSerializedPayload(
>             events.size() - 1, ByteBuffer.wrap(split.toByteArray())));
>     }
>
> Any obvious flaw here?
> Or an explanation why data locality is worse?
>
> best
> Johannes
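For reference, a minimal sketch of how the re-use knobs quoted above could be applied to the TezConfiguration used for the DAG (the delay value is illustrative, and tez.am.container.reuse.enabled is assumed as the overall re-use switch rather than taken from the reply):

    import org.apache.tez.dag.api.TezConfiguration;

    TezConfiguration tezConf = new TezConfiguration();
    // Make locality-based re-use less aggressive (default is 250 ms).
    tezConf.setLong("tez.am.container.reuse.locality.delay-allocation-millis", 1000L);
    // Disable rack-local and non-local fallback re-use so tasks wait for node-local containers.
    tezConf.setBoolean("tez.am.container.reuse.rack-fallback.enabled", false);
    tezConf.setBoolean("tez.am.container.reuse.non-local-fallback.enabled", false);
    // Or, as an experiment, switch container re-use off entirely (assumed property name):
    // tezConf.setBoolean("tez.am.container.reuse.enabled", false);

With re-use fully disabled, every task has to request a fresh container from the capacity scheduler, which should make the comparison with MapReduce's allocation behaviour more direct.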
