Thanks for the feedback, Johannes. Would it be possible for you to file a JIRA for the performance issue that you are seeing, with logs attached? Please strip out any data needed to hide the customer info, etc. The logs that would be most useful are:
- comparison logs of an MR job vs a Tez job showing the container launch slowness/delay
- Tez logs from a job submitted to a busy cluster vs a free cluster
To confirm, you are using 0.5.2? Also, some questions on the env/job runs, if you can help answer them (before I jump to any possible conclusions :) ):
- Was the performance difference seen in the case where you were running multiple jobs concurrently, submitted to the same queue? (Given that all jobs are submitted to the same queue, the question of capacity-scheduler preemption is moot.)
- Are all jobs running as the same user (from a YARN perspective, i.e. ignoring impersonation)?
- When you mention that Tez was slow at launching containers, do you know whether the queue had sufficient resources to launch the new containers at the time this was observed?
- If the answer to the first question was a concurrency test, were containers being held up by other idle AMs and therefore starving out the AMs that were doing useful work?
- What was Tez configured to in terms of how long it should hold on to containers, and how many of them (the container.idle.release-timeout* and session.min.held-containers properties)?

thanks
— Hitesh

On Feb 23, 2015, at 8:52 AM, Johannes Zillmann <[email protected]> wrote:

> Ok, turned out that we calculated resources for MapReduce and Tez differently,
> and thus over-combined splits in Tez, which led to a sacrifice in the split
> count!
> However, MapReduce still outperformed Tez in a lot of runs. After multiple
> iterations over the issue (a deployment at a customer we have limited access
> to), things look like this:
>
> - the customer has the capacity scheduler configured (configuration attached;
>   our product uses the productA queue)
> - if the cluster is completely free, Tez outperforms MapReduce
> - when the cluster is in use, MapReduce seems to always outperform Tez
>
> So the question is: is there some difference in how Tez grabs resources
> from the capacity scheduler compared to MapReduce?
> Looking at the logs, it looks like Tez is always very slow in starting
> containers, whereas MapReduce parallelizes very quickly.
>
> Any thoughts on that?
>
> Johannes
>
> <capacity-scheduler.xml>

>> On 09 Feb 2015, at 19:24, Siddharth Seth <[email protected]> wrote:
>>
>> Johannes,
>> How many tasks end up running for this specific vertex? Is it more than a
>> single wave of tasks (i.e. more than the number of containers available on
>> the cluster)?
>> Tez ends up re-allocating already running containers depending on
>> configuration. Tuning these may help:
>> tez.am.container.reuse.locality.delay-allocation-millis - increase this to a
>>   higher value for re-use to be less aggressive (the default is 250 ms)
>> tez.am.container.reuse.rack-fallback.enabled - enable/disable rack-fallback
>>   re-use
>> tez.am.container.reuse.non-local-fallback.enabled - enable/disable non-local
>>   re-use
>>
>> You could try disabling container re-use completely to see if the situation
>> improves.
>> Also - how many tasks are generated for MapReduce vs Tez?
>>
>> Thanks
>> - Sid
>>
>> On Mon, Feb 9, 2015 at 8:18 AM, Johannes Zillmann <[email protected]>
>> wrote:
>> Hey guys,
>>
>> I have a question about data locality in Tez.
>> Same type of input and computation logic.
>> MapReduce data locality: 95%
>> Tez data locality: 50%
>>
>> I have a custom InputInitializer where I'm doing something like this:
>>
>> InputSplit[] splits = inputFormat.getSplits(conf, desiredSplits);
>>
>> List<Event> events = Lists.newArrayList();
>> List<TaskLocationHint> locationHints = Lists.newArrayList();
>> for (InputSplit split : splits) {
>>   locationHints.add(TaskLocationHint.createTaskLocationHint(split.getLocations(), null));
>> }
>> VertexLocationHint locationHint = VertexLocationHint.create(locationHints);
>>
>> InputConfigureVertexTasksEvent configureVertexEvent =
>>     InputConfigureVertexTasksEvent.create(splits.size(), locationHint,
>>         InputSpecUpdate.getDefaultSinglePhysicalInputSpecUpdate());
>> events.add(configureVertexEvent);
>> for (TezSplit split : splits) {
>>   events.add(InputDataInformationEvent.createWithSerializedPayload(events.size() - 1,
>>       ByteBuffer.wrap(split.toByteArray())));
>> }
>>
>> Any obvious flaw here?
>> Or an explanation why data locality is worse?
>>
>> best
>> Johannes
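
P.S. In case it helps while gathering the info above: the container-reuse and container-holding knobs mentioned in this thread are plain tez-site.xml (or per-DAG) properties. A sketch of the relevant ones is below — the values shown are only the usual defaults for illustration, not recommendations for your cluster:

```xml
<!-- tez-site.xml fragment: container reuse / holding properties
     referenced in this thread (values are illustrative defaults) -->
<configuration>
  <!-- master switch for reusing finished task containers -->
  <property>
    <name>tez.am.container.reuse.enabled</name>
    <value>true</value>
  </property>
  <!-- how long the AM waits for a local task before giving a held
       container to a non-local one; raising it makes reuse less
       aggressive about sacrificing locality -->
  <property>
    <name>tez.am.container.reuse.locality.delay-allocation-millis</name>
    <value>250</value>
  </property>
  <property>
    <name>tez.am.container.reuse.rack-fallback.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>tez.am.container.reuse.non-local-fallback.enabled</name>
    <value>false</value>
  </property>
  <!-- how long an idle container is held before being released
       back to YARN (min/max bounds) -->
  <property>
    <name>tez.am.container.idle.release-timeout-min.millis</name>
    <value>5000</value>
  </property>
  <property>
    <name>tez.am.container.idle.release-timeout-max.millis</name>
    <value>10000</value>
  </property>
  <!-- in session mode: number of containers the AM keeps holding
       even when idle; relevant to the "idle AMs starving others"
       question above -->
  <property>
    <name>tez.am.session.min.held-containers</name>
    <value>0</value>
  </property>
</configuration>
```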
