Hi Rajesh,

In RawPigLoader we are just loading files from HDFS and creating a map of
elements, just like a normal PigLoader.
In the MapSignallingPreProcessor step we are just reading elements from that
map and creating tuples out of them.
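
For context, the general shape of that step is roughly the following. This is
only a bare-bones sketch with made-up names (MapSignallingSketch, KEYS), not
the real MapSignallingPreProcessor, which carries additional per-field logic:

import java.io.IOException;
import java.util.Map;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Bare-bones sketch only; the real MapSignallingPreProcessor does more.
public class MapSignallingSketch extends EvalFunc<Tuple> {
    private static final TupleFactory TF = TupleFactory.getInstance();
    // Hypothetical key names; the dummy script groups on fields a..o plus txn_time.
    private static final String[] KEYS = { "a", "b", "c", "txn_time" };

    @Override
    public Tuple exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        @SuppressWarnings("unchecked")
        Map<String, Object> record = (Map<String, Object>) input.get(0);
        Tuple out = TF.newTuple(KEYS.length);
        for (int i = 0; i < KEYS.length; i++) {
            out.set(i, record.get(KEYS[i]));  // a missing key simply becomes null
        }
        return out;
    }
}

Since the UDF returns a single tuple here, the flatten() around it in the
script just expands that tuple's fields into top-level columns for the GROUP.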

PFA the DAG created by Tez for our job.

While reading records from the HDFS files there are occasions when some fields
are missing, and these lead to a NumberFormatException. Could there be any
performance issues because of these exceptions? In this run there were 286832
such exceptions.
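
If the exceptions themselves turn out to be costly (each one allocates an
exception object and fills in a stack trace), something along these lines in
our parsing code would avoid raising one for the common missing-field case.
Again, just a rough sketch (the class name SafeParse is made up), not our
actual loader/preprocessor code:

// Rough sketch only, not the actual RawPigLoader/MapSignallingPreProcessor code.
public final class SafeParse {

    private SafeParse() {}

    // Returns defaultValue for an absent or malformed field instead of letting
    // a NumberFormatException be raised for every bad record.
    public static long parseLongOrDefault(String field, long defaultValue) {
        if (field == null || field.trim().isEmpty()) {
            return defaultValue;   // the common "missing field" case, no exception at all
        }
        try {
            return Long.parseLong(field.trim());
        } catch (NumberFormatException e) {
            return defaultValue;   // genuinely malformed value
        }
    }
}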

We are reading 4GB of data split into 20 files of 200MB each. I tried
different file sizes, e.g. 64MB, 100MB and 128MB, but there is not much
difference in performance.

Regards,
Sandeep

On Thu, Sep 3, 2015 at 5:28 PM, Rajesh Balamohan <[email protected]>
wrote:

> Attaching the job swimlane based on the log you provided.  It appears that
> "SCOPE_37" itself takes a lot of time per task attempt. Almost 80% of the
> time is spent processing "SCOPE_37", and there is not much the other vertex
> can do apart from waiting for the data from the previous vertex.
>
> Can you please check if there is anything expensive in
> MapSignallingPreProcessor / RawPigLoader?
>
> ~Rajesh.B
>
> On Thu, Sep 3, 2015 at 5:15 PM, Sandeep Kumar <[email protected]>
> wrote:
>
>> @Rohini, following is a step-by-step description of my Pig script.
>>
>>
>> 1. Loading Data from HDFS.
>> 2. Flattening Map into tuples.
>> 3. Grouping data over 15 fields.
>> 4. Flatten grouped data with some additional information.
>> 5. Store into HDFS.
>>
>> Following is a dummy version of my Pig script:
>>
>> r360map = LOAD 'input_200MB_each/' using
>> com.RawPigLoader('conf/R360MapSignalling_new.xml','conf/R360MapSignalling_new.json','csv');
>>
>> normalized_map_data = foreach r360map generate
>> flatten(com.MapSignallingPreProcessor($0..));
>>
>> normalized_aggr_data = GROUP normalized_map_data by
>> (a,b,c,d,e,f,g,h,i,j,k,l,m,n,o);
>>
>> normalized_sum_data = foreach normalized_aggr_data generate
>> flatten(group), COUNT(normalized_map_data),
>> SUM(normalized_map_data.txn_time);
>>
>> store normalized_sum_data into 'tmp/abc' using
>> com.MapSignallingStorageModel();
>>
>> @Rajesh, PFA the output of this command: "yarn logs -applicationId appId |
>> grep "HISTORY" > history.log". Unfortunately the container logs have been
>> removed by YARN itself, hence I could not find the history log files, as
>> they are created inside the container logs. Let me know if you need
>> anything more that I can provide.
>>
>> Can you please tell me how to get the AM logs? If possible, I'll get them.
>>
>>
>>
>> Regards,
>> Sandeep
>>
>>
>> On Thu, Sep 3, 2015 at 4:37 PM, Rohini Palaniswamy <
>> [email protected]> wrote:
>>
>>> Sandeep,
>>>    What does your pig script do? If the pig script is just launching one
>>> mapreduce or map-only job doing a simple group by, there might not be much
>>> difference except for container reuse reducing launch overhead, and even
>>> that may not apply if parallelism is low, since containers might not need
>>> to be reused. Can you attach a dummy version of your pig script,
>>> removing/changing all sensitive information like paths or field names?
>>>
>>> Regards,
>>> Rohini
>>>
>>> On Thu, Sep 3, 2015 at 4:01 AM, Rajesh Balamohan <[email protected]>
>>> wrote:
>>>
>>>> Is it possible to upload the AM logs alone? That would be helpful.
>>>>
>>>> It appears to be a problem with "scope_38_INPUT_scope_37". But without
>>>> the logs and without knowing the DAG, it would be hard to locate the issue.
>>>>
>>>> Otherwise, try "yarn logs -applicationId appId | grep "HISTORY" >
>>>> history.log".  If you have SimpleHistoryLoggingService (which is the
>>>> default), check if the "history.txt" logs are available, which can be
>>>> shared. If you are not sure about the location, check "yarn logs
>>>> -applicationId appId | grep 'Initializing SimpleHistoryLoggingService,
>>>> logFileLocation='".
>>>>
>>>> ~Rajesh.B
>>>>
>>>> On Thu, Sep 3, 2015 at 3:30 PM, Sandeep Kumar <[email protected]
>>>> > wrote:
>>>>
>>>>> @Rohini, I used the new version of Pig, i.e. 0.15.0; unfortunately the
>>>>> performance of my script degraded.
>>>>> 2015-09-03 15:15:24,698 [main] INFO  org.apache.pig.Main - Pig script
>>>>> completed in 4 minutes, 1 second and 22 milliseconds (241022 ms)
>>>>>
>>>>> whereas earlier it took hardly 3 minutes and 27 seconds.
>>>>>
>>>>> PFA the task counters. Following are the versions of the software being
>>>>> used:
>>>>>
>>>>> HadoopVersion:
>>>>> 2.6.0-cdh5.4.4
>>>>>
>>>>> PigVersion:
>>>>> 0.15.1-SNAPSHOT
>>>>>
>>>>> TezVersion:
>>>>> 0.7.0
>>>>>
>>>>>
>>>>> Regards,
>>>>> Sandeep
>>>>>
>>>>> On Thu, Sep 3, 2015 at 2:46 PM, Sandeep Kumar <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> @Rajesh, PFA the required statistics. It's difficult to share the
>>>>>> application logs because they are huge in size (167MB). In case you
>>>>>> want anything specific from those logs, please let me know.
>>>>>>
>>>>>> @Rohini,
>>>>>> Thanks for the suggestion regarding the new version of Pig. I'll give
>>>>>> it a try for sure.
>>>>>>
>>>>>> Regards,
>>>>>> Sandeep
>>>>>>
>>>>>> On Thu, Sep 3, 2015 at 2:31 PM, Rohini Palaniswamy <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Sandeep,
>>>>>>>    Can you try with Pig 0.15 first? There are a ton of fixes for Pig
>>>>>>> on Tez that have gone into that release, and many of them are
>>>>>>> performance fixes.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Rohini
>>>>>>>
>>>>>>> On Thu, Sep 3, 2015 at 1:05 AM, Rajesh Balamohan <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Can you post the application logs?  It would be helpful if you
>>>>>>>> could run with "tez.task.generate.counters.per.io=true". This
>>>>>>>> would generate the per-IO statistics, which can be useful for debugging.
>>>>>>>>
>>>>>>>>
>>>>>>>> ~Rajesh.B
>>>>>>>>
>>>>>>>> On Thu, Sep 3, 2015 at 1:20 PM, Sandeep Kumar <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>> I'm using Pig-0.14.0 over Tez-0.7.0 for running some basic Pig
>>>>>>>>> scripts. I'm not able to see any performance gain using Tez. My Pig
>>>>>>>>> scripts take the same amount of time with the mapred executionType
>>>>>>>>> as well.
>>>>>>>>>
>>>>>>>>> Following are the parameters that are in mapred-site.xml and are
>>>>>>>>> being read by Tez; I'm not able to override them even if I mention
>>>>>>>>> them in my tez-site.xml:
>>>>>>>>>
>>>>>>>>>  tez.runtime.shuffle.merge.percent=0.66
>>>>>>>>>  tez.runtime.shuffle.fetch.buffer.percent=0.70
>>>>>>>>>  tez.runtime.io.sort.mb=256
>>>>>>>>>  tez.runtime.shuffle.memory.limit.percent=0.25
>>>>>>>>>  tez.runtime.io.sort.factor=64
>>>>>>>>>  tez.runtime.shuffle.connect.timeout=180000
>>>>>>>>>  tez.runtime.internal.sorter.class=org.apache.hadoop.util.QuickSort
>>>>>>>>>  tez.runtime.merge.progress.records=10000
>>>>>>>>>  tez.runtime.compress=true
>>>>>>>>>  tez.runtime.sort.spill.percent=0.8
>>>>>>>>>  tez.runtime.shuffle.ssl.enable=false
>>>>>>>>>  tez.runtime.ifile.readahead=true
>>>>>>>>>  tez.runtime.shuffle.parallel.copies=10
>>>>>>>>>  tez.runtime.ifile.readahead.bytes=4194304
>>>>>>>>>  tez.runtime.task.input.post-merge.buffer.percent=0.0
>>>>>>>>>  tez.runtime.shuffle.read.timeout=180000
>>>>>>>>>
>>>>>>>>>  tez.runtime.compress.codec=org.apache.hadoop.io.compress.SnappyCodec
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> PFA the list of task counters. I can see that a lot of data is being
>>>>>>>>> spilled, but if I try to increase tez.runtime.io.sort.mb through
>>>>>>>>> mapred-site.xml then my script terminates with an OOM exception.
>>>>>>>>>
>>>>>>>>> Can you please suggest what parameters I should change to improve
>>>>>>>>> the performance of Pig on Tez?
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Sandeep
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
