Re: Running tez jobs with data in memory

Rajesh Balamohan Mon, 30 Nov 2015 16:35:49 -0800

Adding more to #2: alternatively, you may want to consider placing the input
paths in the HDFS in-memory tier (
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/MemoryStorage.html
).
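
A minimal sketch of what that could look like, assuming the DataNodes
already have a RAM_DISK volume configured as described in the page above.
The directory /tmp/tez-input is a hypothetical placeholder, not a path from
this thread. The policy applies to files written into the directory after it
is set, so the input would need to be (re)written there:

```shell
# Hypothetical HDFS directory for the job input; replace with a real path.
DATA_DIR="/tmp/tez-input"

# LAZY_PERSIST is the storage policy name used by the HDFS memory storage docs.
CMD="hdfs storagepolicies -setStoragePolicy -path ${DATA_DIR} -policy LAZY_PERSIST"

# Run the command only where the hdfs CLI is installed; otherwise just print it.
if command -v hdfs >/dev/null 2>&1; then
  $CMD
else
  echo "$CMD"
fi
```

The policy on a path can be checked afterwards with
hdfs storagepolicies -getStoragePolicy -path <path>.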


~Rajesh.B

On Tue, Dec 1, 2015 at 5:41 AM, Rajesh Balamohan <[email protected]>
wrote:

> 1. Is it possible to determine from the tez history logs, what the
> bottleneck for a task/vertex is? Whether it is compute, disk or network?
>
> - Vertex counters and task counters for the vertex can be examined to
> determine this. If you have enabled ATS, these are available in the Tez UI
> itself; otherwise they should be available in the job logs. However, the
> cause is not always directly related to compute/disk/network. Sometimes the
> vertex is delayed because it has to wait for data from the source vertex
> (think of it as a data dependency), sometimes because tasks in the source
> vertex are re-executed after failures such as bad disks, and sometimes
> because of cluster slot unavailability, and so on. You can also look at
> CriticalPathAnalyzer (early version available in 0.8.x), which can help in
> determining the critical path of the DAG (to determine whether the vertex
> was slow due to any of these conditions). E.g.:
>
> HADOOP_CLASSPATH=$TEZ_HOME/*:$TEZ_HOME/lib/*:$HADOOP_CLASSPATH \
>   yarn jar $TEZ_HOME/tez-job-analyzer-0.8.2-SNAPSHOT.jar \
>   CriticalPath --outputDir=/tmp/ --dagId=dag_1443665985063_58064_1
>
> 2. What are the common ways to get Tez to work on data in memory, as
> opposed to reading from HDFS? This is to minimize the duration mappers
> spend in reading from HDFS or disk.
>
> - Not sure if you are trying to compare with the Spark way of loading data
> into memory and working on it. Tez does not have a direct equivalent for
> this, but Tez has ObjectRegistry (look for BroadcastAndOneToOneExample
> <https://github.com/apache/tez/blob/b153035b076d4603eb6bc771d675d64181eb02e9/tez-tests/src/main/java/org/apache/tez/mapreduce/examples/BroadcastAndOneToOneExample.java>
> in the Tez codebase), where data can be stored in memory and shared between
> tasks.
>
> ~Rajesh.B
>
> On Tue, Dec 1, 2015 at 12:33 AM, Raajay <[email protected]> wrote:
>
>> Hello,
>>
>> Two questions
>>
>> 1. Is it possible to determine from the tez history logs, what the
>> bottleneck for a task/vertex is? Whether it is compute, disk or network?
>>
>> 2. What are the common ways to get Tez to work on data in memory, as
>> opposed to reading from HDFS? This is to minimize the duration mappers
>> spend in reading from HDFS or disk.
>>
>> Thanks
>> Raajay
>>
>
>


-- 
~Rajesh.B
