Re: Tuning parameters in Tez for improving performance of PIG script

Rohini Palaniswamy Thu, 03 Sep 2015 04:09:07 -0700

Sandeep,
   What does your pig script do? If the pig script was just launching one
mapreduce or map-only job doing a simple group by, there might not be much
difference, except for container reuse reducing launch overhead, and even
then, if parallelism is low, containers might not have to be reused. Can you
attach a dummy version of your pig script, removing or changing all sensitive
information like paths or field names?


Regards,
Rohini

On Thu, Sep 3, 2015 at 4:01 AM, Rajesh Balamohan <[email protected]>
wrote:

> Is it possible to upload the AM logs alone? That would be helpful.
>
> It appears to be a problem with "scope_38_INPUT_scope_37". But without the
> logs and without knowing the DAG, it would be hard to locate the issue.
>
> Otherwise, try "yarn logs -applicationId appId | grep 'HISTORY' >
> history.log".  If you have SimpleHistoryLoggingService (which is the
> default), check if "history.txt" logs are available which can be shared. If
> not sure about the location, check "yarn logs -applicationId appId |
> grep 'Initializing SimpleHistoryLoggingService, logFileLocation='".
>
> ~Rajesh.B
>
> On Thu, Sep 3, 2015 at 3:30 PM, Sandeep Kumar <[email protected]>
> wrote:
>
>> @Rohini, I used the new version of Pig, i.e. 0.15.0, but unfortunately the
>> performance of my script degraded.
>> 2015-09-03 15:15:24,698 [main] INFO  org.apache.pig.Main - Pig script
>> completed in 4 minutes, 1 second and 22 milliseconds (241022 ms)
>>
>> whereas earlier it was taking hardly 3 minutes and 27 seconds.
>>
>> PFA the task counters. Following are the versions of the software being used:
>>
>> HadoopVersion:
>> 2.6.0-cdh5.4.4
>>
>> PigVersion:
>> 0.15.1-SNAPSHOT
>>
>> TezVersion:
>> 0.7.0
>>
>>
>> Regards,
>> Sandeep
>>
>> On Thu, Sep 3, 2015 at 2:46 PM, Sandeep Kumar <[email protected]>
>> wrote:
>>
>>> @Rajesh, PFA the required statistics. It's difficult to share the
>>> application logs because they are huge (167MB). If you want anything
>>> specific from those logs, please let me know.
>>>
>>> @Rohini,
>>> Thanks for suggesting the new version of Pig. I'll give it a try
>>> for sure.
>>>
>>> Regards,
>>> Sandeep
>>>
>>> On Thu, Sep 3, 2015 at 2:31 PM, Rohini Palaniswamy <
>>> [email protected]> wrote:
>>>
>>>> Sandeep,
>>>>    Can you try with Pig 0.15 first? A ton of fixes for Pig on Tez have
>>>> gone into that release, and many of them are performance fixes.
>>>>
>>>> Regards,
>>>> Rohini
>>>>
>>>> On Thu, Sep 3, 2015 at 1:05 AM, Rajesh Balamohan <[email protected]>
>>>> wrote:
>>>>
>>>>> Can you post the application logs?  It would be helpful if you could
>>>>> run with "tez.task.generate.counters.per.io=true". This would
>>>>> generate the per IO statistics which can be useful for debugging.
>>>>>
>>>>>
>>>>> ~Rajesh.B
>>>>>
>>>>> On Thu, Sep 3, 2015 at 1:20 PM, Sandeep Kumar <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I'm using Pig-0.14.0 over Tez-0.7.0 for running some basic pig
>>>>>> scripts. I'm not able to see any performance gain using Tez. My pig
>>>>>> scripts take the same amount of time on the mapred executionType as
>>>>>> well.
>>>>>>
>>>>>> Following are the parameters which are in mapred-site.xml and being
>>>>>> read by Tez. I'm not able to override them even if I mention them in
>>>>>> my tez-site.xml:
>>>>>>
>>>>>>  tez.runtime.shuffle.merge.percent=0.66
>>>>>>  tez.runtime.shuffle.fetch.buffer.percent=0.70
>>>>>>  tez.runtime.io.sort.mb=256
>>>>>>  tez.runtime.shuffle.memory.limit.percent=0.25
>>>>>>  tez.runtime.io.sort.factor=64
>>>>>>  tez.runtime.shuffle.connect.timeout=180000
>>>>>>  tez.runtime.internal.sorter.class=org.apache.hadoop.util.QuickSort
>>>>>>  tez.runtime.merge.progress.records=10000
>>>>>>  tez.runtime.compress=true
>>>>>>  tez.runtime.sort.spill.percent=0.8
>>>>>>  tez.runtime.shuffle.ssl.enable=false
>>>>>>  tez.runtime.ifile.readahead=true
>>>>>>  tez.runtime.shuffle.parallel.copies=10
>>>>>>  tez.runtime.ifile.readahead.bytes=4194304
>>>>>>  tez.runtime.task.input.post-merge.buffer.percent=0.0
>>>>>>  tez.runtime.shuffle.read.timeout=180000
>>>>>>  tez.runtime.compress.codec=org.apache.hadoop.io.compress.SnappyCodec
>>>>>>
>>>>>>
>>>>>>
>>>>>> PFA the list of task counters. I can see a lot of data is being
>>>>>> spilled, but if I try to increase tez.runtime.io.sort.mb through
>>>>>> mapred-site.xml, my script terminates with an OOM exception.
>>>>>>
>>>>>> Can you please suggest what parameters I should change to improve the
>>>>>> performance of pig using Tez?
>>>>>>
>>>>>> Regards,
>>>>>> Sandeep
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
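
The two log-extraction commands Rajesh suggests in the quoted thread above can be wrapped in small shell filters, so the multi-megabyte application log is reduced to the shareable lines. This is only a minimal sketch and not part of the original thread: the function names extract_history and find_history_location are hypothetical, and the usage lines assume the yarn CLI is on PATH and that APP_ID holds the real application id.

```shell
#!/bin/sh

# Hypothetical helper: keep only lines containing "HISTORY" from a log
# stream read on stdin (the same filter as the grep in the thread).
extract_history() {
    grep -F 'HISTORY'
}

# Hypothetical helper: print the line in which SimpleHistoryLoggingService
# reports where it writes its history file, if such a line is present.
find_history_location() {
    grep -F 'Initializing SimpleHistoryLoggingService, logFileLocation='
}

# Usage (assumption: yarn is installed and APP_ID is set by the caller):
#   yarn logs -applicationId "$APP_ID" | extract_history > history.log
#   yarn logs -applicationId "$APP_ID" | find_history_location
```

Keeping the filters as functions that read stdin means they can be tried on a saved copy of the log first, without contacting the cluster again.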
