[jira] [Updated] (TEZ-1274) Remove Key/Value type checks in IFile

2015-01-08 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated TEZ-1274:
--
Fix Version/s: 0.7.0

 Remove Key/Value type checks in IFile
 -

 Key: TEZ-1274
 URL: https://issues.apache.org/jira/browse/TEZ-1274
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Siddharth Seth
Assignee: Rajesh Balamohan
 Fix For: 0.7.0

 Attachments: TEZ-1274.1.patch


 We check key and value types for each record - this should be removed from 
 the tight loop. Maybe an assertion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-1931) Publish tez version info to Timeline

2015-01-08 Thread Hitesh Shah (JIRA)
Hitesh Shah created TEZ-1931:


 Summary: Publish tez version info to Timeline
 Key: TEZ-1931
 URL: https://issues.apache.org/jira/browse/TEZ-1931
 Project: Apache Tez
  Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Hitesh Shah
Priority: Critical


We are not publishing any version info to Timeline. Having it would be useful 
for comparing different dags/apps over time and for catching issues when needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-1931) Publish tez version info to Timeline

2015-01-08 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270392#comment-14270392
 ] 

Hitesh Shah commented on TEZ-1931:
--

[~jeagles] Should have a patch soon. Are you ok with putting this into 0.6.0?

 Publish tez version info to Timeline
 

 Key: TEZ-1931
 URL: https://issues.apache.org/jira/browse/TEZ-1931
 Project: Apache Tez
  Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Hitesh Shah
Priority: Critical

 We are not publishing any version info to Timeline. This will be useful to 
 compare different dags/apps over time and also to catch issues if needed. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-1932) Add Prakash Ramachandran to team list

2015-01-08 Thread Prakash Ramachandran (JIRA)
Prakash Ramachandran created TEZ-1932:
-

 Summary: Add Prakash Ramachandran to team list
 Key: TEZ-1932
 URL: https://issues.apache.org/jira/browse/TEZ-1932
 Project: Apache Tez
  Issue Type: Bug
Reporter: Prakash Ramachandran
Assignee: Prakash Ramachandran
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TEZ-1932) Add Prakash Ramachandran to team list

2015-01-08 Thread Prakash Ramachandran (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prakash Ramachandran updated TEZ-1932:
--
Attachment: TEZ-1932.1.patch

[~rajesh.balamohan] can you review 

 Add Prakash Ramachandran to team list
 -

 Key: TEZ-1932
 URL: https://issues.apache.org/jira/browse/TEZ-1932
 Project: Apache Tez
  Issue Type: Bug
Reporter: Prakash Ramachandran
Assignee: Prakash Ramachandran
Priority: Minor
 Attachments: TEZ-1932.1.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-1923) FetcherOrderedGrouped gets into infinite loop due to memory pressure

2015-01-08 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270281#comment-14270281
 ] 

Siddharth Seth commented on TEZ-1923:
-

+1. Looks good. Thanks [~rajesh.balamohan]
Minor
{code}+  LOG.info("Starting inMemoryMerger's merge since commitMemory=" +
+  commitMemory + " > mergeThreshold=" + mergeThreshold +
+  ". Current usedMemory=" + usedMemory);
{code}
This log line can be misleading, since startMemToDiskMerge may not start the 
merge if another one is already running. It should be moved after the condition 
check in startMemToDiskMerge.
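The placement point can be sketched like this (hypothetical class and method names, not the actual MergeManager code): with the log statement after the in-progress check, the message is emitted only when a merge really starts.

```java
// Sketch with hypothetical names: the "Starting ... merge" log line
// belongs after the in-progress check inside startMemToDiskMerge, so it
// is only emitted when a merge is actually kicked off.
public class MergeLogSketch {
    private boolean mergeInProgress = false;
    private final StringBuilder log = new StringBuilder();

    public boolean startMemToDiskMerge(long commitMemory, long mergeThreshold,
                                       long usedMemory) {
        if (mergeInProgress) {
            return false; // no merge started, so nothing is logged
        }
        mergeInProgress = true;
        // Logging here, after the condition, keeps the message truthful.
        log.append("Starting inMemoryMerger's merge since commitMemory=")
           .append(commitMemory).append(" > mergeThreshold=").append(mergeThreshold)
           .append(". Current usedMemory=").append(usedMemory).append('\n');
        return true;
    }

    public String getLog() { return log.toString(); }
}
```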

 FetcherOrderedGrouped gets into infinite loop due to memory pressure
 

 Key: TEZ-1923
 URL: https://issues.apache.org/jira/browse/TEZ-1923
 Project: Apache Tez
  Issue Type: Bug
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan
 Attachments: TEZ-1923.1.patch, TEZ-1923.2.patch, TEZ-1923.3.patch


 - Ran a comparatively large job (temp table creation) at 10 TB scale.
 - Turned on intermediate mem-to-mem 
 (tez.runtime.shuffle.memory-to-memory.enable=true and 
 tez.runtime.shuffle.memory-to-memory.segments=4)
 - Some reducers get lots of data and quickly get into an infinite loop
 {code}
 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] 
 orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 3ms
 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for 
 url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true
  sent hash and receievd reply 0 ms
 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] 
 orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 1ms
 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for 
 url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true
  sent hash and receievd reply 0 ms
 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] 
 orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 2ms
 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for 
 url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true
  sent hash and receievd reply 0 ms
 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] 
 orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 5ms
 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for 
 url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true
  sent hash and receievd reply 0 ms
 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 {code}
 Additional debug/patch statements revealed that InMemoryMerge is not invoked 
 when it should be, and so does not release memory back for the fetchers to 
 proceed. Example debug/patch messages are given below:
 {code}
 syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:48,332 INFO 
 [fetcher [Map_1] #2] orderedgrouped.MergeManager: 
 Patch..usedMemory=1551867234, memoryLimit=1073741824, commitMemory=883028388, 
 mergeThreshold=708669632  === InMemoryMerge would be started in this case 
 as commitMemory >= mergeThreshold
 syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:52,900 INFO 
 [fetcher [Map_1] #2] orderedgrouped.MergeManager: 
 Patch..usedMemory=1273349784, memoryLimit=1073741824, commitMemory=347296632, 
 mergeThreshold=708669632 === InMemoryMerge would *NOT* be started in this 
 case as commitMemory < mergeThreshold.  But the usedMemory is higher than 
 memoryLimit.  Fetchers would keep waiting indefinitely until memory is 
 released. InMemoryMerge will not kick in and not release memory.
 syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:53,163 INFO 
 [fetcher [Map_1] #1] orderedgrouped.MergeManager: 
 Patch..usedMemory=1191994052, memoryLimit=1073741824, commitMemory=523155206, 
 mergeThreshold=708669632 === InMemoryMerge would *NOT* be started in this 
 case as commitMemory < mergeThreshold.  

[jira] [Commented] (TEZ-1932) Add Prakash Ramachandran to team list

2015-01-08 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270516#comment-14270516
 ] 

Rajesh Balamohan commented on TEZ-1932:
---

+1

 Add Prakash Ramachandran to team list
 -

 Key: TEZ-1932
 URL: https://issues.apache.org/jira/browse/TEZ-1932
 Project: Apache Tez
  Issue Type: Bug
Reporter: Prakash Ramachandran
Assignee: Prakash Ramachandran
Priority: Minor
 Attachments: TEZ-1932.1.patch






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-1421) MRCombiner throws NPE in MapredWordCount on master branch

2015-01-08 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270711#comment-14270711
 ] 

Tsuyoshi OZAWA commented on TEZ-1421:
-

Sure, wait a moment.

 MRCombiner throws NPE in MapredWordCount on master branch
 -

 Key: TEZ-1421
 URL: https://issues.apache.org/jira/browse/TEZ-1421
 Project: Apache Tez
  Issue Type: Bug
Reporter: Tsuyoshi OZAWA
Priority: Blocker

 I tested MapredWordCount against 70GB generated by RandomTextWriter. When a 
 Combiner runs, it throws an NPE. It looks like setCombinerClass doesn't work 
 correctly.
 {quote}
 Caused by: java.lang.RuntimeException: java.lang.NullPointerException
 at 
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:131)
 at 
 org.apache.tez.mapreduce.combine.MRCombiner.runOldCombiner(MRCombiner.java:122)
 at org.apache.tez.mapreduce.combine.MRCombiner.combine(MRCombiner.java:112)
 at 
 org.apache.tez.runtime.library.common.shuffle.impl.MergeManager.runCombineProcessor(MergeManager.java:472)
 at 
 org.apache.tez.runtime.library.common.shuffle.impl.MergeManager$InMemoryMerger.merge(MergeManager.java:605)
 at 
 org.apache.tez.runtime.library.common.shuffle.impl.MergeThread.run(MergeThread.java:89)
 {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TEZ-1923) FetcherOrderedGrouped gets into infinite loop due to memory pressure

2015-01-08 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated TEZ-1923:
--
Attachment: TEZ-1923.4.patch

Right; incorporated it in the latest patch. Will check it in to master ASAP.

 FetcherOrderedGrouped gets into infinite loop due to memory pressure
 

 Key: TEZ-1923
 URL: https://issues.apache.org/jira/browse/TEZ-1923
 Project: Apache Tez
  Issue Type: Bug
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan
 Attachments: TEZ-1923.1.patch, TEZ-1923.2.patch, TEZ-1923.3.patch, 
 TEZ-1923.4.patch


 - Ran a comparatively large job (temp table creation) at 10 TB scale.
 - Turned on intermediate mem-to-mem 
 (tez.runtime.shuffle.memory-to-memory.enable=true and 
 tez.runtime.shuffle.memory-to-memory.segments=4)
 - Some reducers get lots of data and quickly get into an infinite loop
 {code}
 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] 
 orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 3ms
 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for 
 url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true
  sent hash and receievd reply 0 ms
 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] 
 orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 1ms
 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for 
 url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true
  sent hash and receievd reply 0 ms
 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] 
 orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 2ms
 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for 
 url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true
  sent hash and receievd reply 0 ms
 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] 
 orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 5ms
 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for 
 url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true
  sent hash and receievd reply 0 ms
 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 {code}
 Additional debug/patch statements revealed that InMemoryMerge is not invoked 
 when it should be, and so does not release memory back for the fetchers to 
 proceed. Example debug/patch messages are given below:
 {code}
 syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:48,332 INFO 
 [fetcher [Map_1] #2] orderedgrouped.MergeManager: 
 Patch..usedMemory=1551867234, memoryLimit=1073741824, commitMemory=883028388, 
 mergeThreshold=708669632  === InMemoryMerge would be started in this case 
 as commitMemory >= mergeThreshold
 syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:52,900 INFO 
 [fetcher [Map_1] #2] orderedgrouped.MergeManager: 
 Patch..usedMemory=1273349784, memoryLimit=1073741824, commitMemory=347296632, 
 mergeThreshold=708669632 === InMemoryMerge would *NOT* be started in this 
 case as commitMemory < mergeThreshold.  But the usedMemory is higher than 
 memoryLimit.  Fetchers would keep waiting indefinitely until memory is 
 released. InMemoryMerge will not kick in and not release memory.
 syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:53,163 INFO 
 [fetcher [Map_1] #1] orderedgrouped.MergeManager: 
 Patch..usedMemory=1191994052, memoryLimit=1073741824, commitMemory=523155206, 
 mergeThreshold=708669632 === InMemoryMerge would *NOT* be started in this 
 case as commitMemory < mergeThreshold.  But the usedMemory is higher than 
 memoryLimit.  Fetchers would keep waiting indefinitely until memory is 
 released.  InMemoryMerge will not kick in and not release memory.
 {code}
 In MergeManager, in memory merging is invoked under the following condition
 {code}
 if (!inMemoryMerger.isInProgress() && commitMemory >= mergeThreshold)

[jira] [Updated] (TEZ-1923) FetcherOrderedGrouped gets into infinite loop due to memory pressure

2015-01-08 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated TEZ-1923:
--
Fix Version/s: 0.7.0

 FetcherOrderedGrouped gets into infinite loop due to memory pressure
 

 Key: TEZ-1923
 URL: https://issues.apache.org/jira/browse/TEZ-1923
 Project: Apache Tez
  Issue Type: Bug
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan
 Fix For: 0.7.0

 Attachments: TEZ-1923.1.patch, TEZ-1923.2.patch, TEZ-1923.3.patch, 
 TEZ-1923.4.patch


 - Ran a comparatively large job (temp table creation) at 10 TB scale.
 - Turned on intermediate mem-to-mem 
 (tez.runtime.shuffle.memory-to-memory.enable=true and 
 tez.runtime.shuffle.memory-to-memory.segments=4)
 - Some reducers get lots of data and quickly get into an infinite loop
 {code}
 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] 
 orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 3ms
 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for 
 url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true
  sent hash and receievd reply 0 ms
 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] 
 orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 1ms
 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for 
 url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true
  sent hash and receievd reply 0 ms
 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] 
 orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 2ms
 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for 
 url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true
  sent hash and receievd reply 0 ms
 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] 
 orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 5ms
 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for 
 url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true
  sent hash and receievd reply 0 ms
 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 {code}
 Additional debug/patch statements revealed that InMemoryMerge is not invoked 
 when it should be, and so does not release memory back for the fetchers to 
 proceed. Example debug/patch messages are given below:
 {code}
 syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:48,332 INFO 
 [fetcher [Map_1] #2] orderedgrouped.MergeManager: 
 Patch..usedMemory=1551867234, memoryLimit=1073741824, commitMemory=883028388, 
 mergeThreshold=708669632  === InMemoryMerge would be started in this case 
 as commitMemory >= mergeThreshold
 syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:52,900 INFO 
 [fetcher [Map_1] #2] orderedgrouped.MergeManager: 
 Patch..usedMemory=1273349784, memoryLimit=1073741824, commitMemory=347296632, 
 mergeThreshold=708669632 === InMemoryMerge would *NOT* be started in this 
 case as commitMemory < mergeThreshold.  But the usedMemory is higher than 
 memoryLimit.  Fetchers would keep waiting indefinitely until memory is 
 released. InMemoryMerge will not kick in and not release memory.
 syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:53,163 INFO 
 [fetcher [Map_1] #1] orderedgrouped.MergeManager: 
 Patch..usedMemory=1191994052, memoryLimit=1073741824, commitMemory=523155206, 
 mergeThreshold=708669632 === InMemoryMerge would *NOT* be started in this 
 case as commitMemory < mergeThreshold.  But the usedMemory is higher than 
 memoryLimit.  Fetchers would keep waiting indefinitely until memory is 
 released.  InMemoryMerge will not kick in and not release memory.
 {code}
 In MergeManager, in memory merging is invoked under the following condition
 {code}
 if (!inMemoryMerger.isInProgress() && commitMemory >= mergeThreshold)
 {code}
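The stall can be sketched as follows (hypothetical helper, not the actual Tez patch): with the 02:05:52 values from the Patch.. log lines, the quoted condition is false even though usedMemory exceeds memoryLimit, so no merge starts and fetchers wait forever. One possible guard is to also consult usedMemory:

```java
// Sketch, not the actual Tez fix: mirrors the merge-trigger condition
// quoted in the issue and adds a guard for the stalled case where
// usedMemory exceeds memoryLimit while commitMemory < mergeThreshold.
public class MergeTriggerSketch {
    public static boolean shouldStartMerge(boolean mergeInProgress,
                                           long commitMemory, long usedMemory,
                                           long mergeThreshold, long memoryLimit) {
        if (mergeInProgress) {
            return false; // an in-flight merge will release memory itself
        }
        // Original condition: only commitMemory is consulted.
        boolean byCommit = commitMemory >= mergeThreshold;
        // Extra guard: when total usedMemory is past the limit, fetchers
        // sit in Status.WAIT until a merge frees memory, so force one.
        return byCommit || usedMemory > memoryLimit;
    }
}
```

Under this sketch the 02:05:48 values start a merge via the original condition, and the 02:05:52 values start one via the usedMemory guard instead of stalling.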
 Attaching the sample hive command just for 

[jira] [Commented] (TEZ-1931) Publish tez version info to Timeline

2015-01-08 Thread Jonathan Eagles (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270637#comment-14270637
 ] 

Jonathan Eagles commented on TEZ-1931:
--

This will be good to go into 0.6.0, [~hitesh]

 Publish tez version info to Timeline
 

 Key: TEZ-1931
 URL: https://issues.apache.org/jira/browse/TEZ-1931
 Project: Apache Tez
  Issue Type: Bug
Reporter: Hitesh Shah
Assignee: Hitesh Shah
Priority: Critical

 We are not publishing any version info to Timeline. This will be useful to 
 compare different dags/apps over time and also to catch issues if needed. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TEZ-1923) FetcherOrderedGrouped gets into infinite loop due to memory pressure

2015-01-08 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated TEZ-1923:
--
Attachment: TEZ-1923.3.patch

Addressing review comments.  

 FetcherOrderedGrouped gets into infinite loop due to memory pressure
 

 Key: TEZ-1923
 URL: https://issues.apache.org/jira/browse/TEZ-1923
 Project: Apache Tez
  Issue Type: Bug
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan
 Attachments: TEZ-1923.1.patch, TEZ-1923.2.patch, TEZ-1923.3.patch


 - Ran a comparatively large job (temp table creation) at 10 TB scale.
 - Turned on intermediate mem-to-mem 
 (tez.runtime.shuffle.memory-to-memory.enable=true and 
 tez.runtime.shuffle.memory-to-memory.segments=4)
 - Some reducers get lots of data and quickly get into an infinite loop
 {code}
 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] 
 orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 3ms
 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for 
 url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true
  sent hash and receievd reply 0 ms
 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] 
 orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 1ms
 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for 
 url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true
  sent hash and receievd reply 0 ms
 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] 
 orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 2ms
 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for 
 url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true
  sent hash and receievd reply 0 ms
 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] 
 orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 5ms
 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for 
 url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true
  sent hash and receievd reply 0 ms
 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 {code}
 Additional debug/patch statements revealed that InMemoryMerge is not invoked 
 when it should be, and so does not release memory back for the fetchers to 
 proceed. Example debug/patch messages are given below:
 {code}
 syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:48,332 INFO 
 [fetcher [Map_1] #2] orderedgrouped.MergeManager: 
 Patch..usedMemory=1551867234, memoryLimit=1073741824, commitMemory=883028388, 
 mergeThreshold=708669632  === InMemoryMerge would be started in this case 
 as commitMemory >= mergeThreshold
 syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:52,900 INFO 
 [fetcher [Map_1] #2] orderedgrouped.MergeManager: 
 Patch..usedMemory=1273349784, memoryLimit=1073741824, commitMemory=347296632, 
 mergeThreshold=708669632 === InMemoryMerge would *NOT* be started in this 
 case as commitMemory < mergeThreshold.  But the usedMemory is higher than 
 memoryLimit.  Fetchers would keep waiting indefinitely until memory is 
 released. InMemoryMerge will not kick in and not release memory.
 syslog_attempt_142126204_0201_1_01_34_0:2015-01-07 02:05:53,163 INFO 
 [fetcher [Map_1] #1] orderedgrouped.MergeManager: 
 Patch..usedMemory=1191994052, memoryLimit=1073741824, commitMemory=523155206, 
 mergeThreshold=708669632 === InMemoryMerge would *NOT* be started in this 
 case as commitMemory < mergeThreshold.  But the usedMemory is higher than 
 memoryLimit.  Fetchers would keep waiting indefinitely until memory is 
 released.  InMemoryMerge will not kick in and not release memory.
 {code}
 In MergeManager, in memory merging is invoked under the following condition
 {code}
 if (!inMemoryMerger.isInProgress() && commitMemory >= mergeThreshold)
 {code}
 Attaching the sample hive command just for reference
 

[jira] [Resolved] (TEZ-1573) Exception from InputInitializer and VertexManagerPlugin is not propagated to client

2015-01-08 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang resolved TEZ-1573.
-
Resolution: Won't Fix

Fixed in other JIRAs.

 Exception from InputInitializer and VertexManagerPlugin is not propagated to 
 client
 ---

 Key: TEZ-1573
 URL: https://issues.apache.org/jira/browse/TEZ-1573
 Project: Apache Tez
  Issue Type: Sub-task
Reporter: Jeff Zhang
Assignee: Jeff Zhang





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-1929) AM intermittently sending kill signal to running task in heartbeat

2015-01-08 Thread Rajesh Balamohan (JIRA)
Rajesh Balamohan created TEZ-1929:
-

 Summary: AM intermittently sending kill signal to running task in 
heartbeat
 Key: TEZ-1929
 URL: https://issues.apache.org/jira/browse/TEZ-1929
 Project: Apache Tez
  Issue Type: Bug
Reporter: Rajesh Balamohan


Observed this behavior 3 or 4 times

- Ran a hive query with tez (query_17 at 10 TB scale)
- Occasionally, the Map_7 task gets into a failed state in the middle of fetching 
data from other sources (only one task is available in Map_7).

{code}
2015-01-08 00:19:10,289 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: 
Completed fetch for attempt: InputAttemptIdentifier 
[inputIdentifier=InputIdentifier [inputIndex=0], attemptNumber=0, 
pathComponent=attempt_142126204_0233_1_06_00_0_10003] to MEMORY, 
CompressedSize=6757, DecompressedSize=16490,EndTime=1420705150289, TimeTaken=5, 
Rate=1.29 MB/s
2015-01-08 00:19:10,290 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: All 
inputs fetched for input vertex : Map 6
2015-01-08 00:19:10,290 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: copy(0 
of 1. Transfer rate (CumulativeDataFetched/TimeSinceInputStarted)) 0.01 MB/s)
2015-01-08 00:19:10,290 INFO [ShuffleRunner [Map_6]] impl.ShuffleManager: 
Shutting down FetchScheduler, Was Interrupted: false
2015-01-08 00:19:10,290 INFO [ShuffleRunner [Map_6]] impl.ShuffleManager: 
Scheduler thread completed
2015-01-08 00:19:41,986 INFO [TaskHeartbeatThread] task.TaskReporter: Received 
should die response from AM
2015-01-08 00:19:41,986 INFO [TaskHeartbeatThread] task.TaskReporter: Asked to 
die via task heartbeat
2015-01-08 00:19:41,987 INFO [main] task.TezTaskRunner: Interrupted while 
waiting for task to complete. Interrupting task
2015-01-08 00:19:41,987 INFO [main] task.TezTaskRunner: Shutdown requested... 
returning
2015-01-08 00:19:41,987 INFO [main] task.TezChild: Got a shouldDie notification 
via hearbeats. Shutting down
2015-01-08 00:19:41,990 ERROR [TezChild] tez.TezProcessor: 
java.lang.InterruptedException
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
at 
org.apache.tez.runtime.InputReadyTracker$InputReadyMonitor.awaitCondition(InputReadyTracker.java:120)
at 
org.apache.tez.runtime.InputReadyTracker.waitForAnyInputReady(InputReadyTracker.java:83)
at 
org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAnyInputReady(TezProcessorContextImpl.java:106)
at 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:153)
at 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:138)
at 
org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:328)
at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
{code}

From the initial look, it appears that TaskAttemptListenerImpTezDag.heartbeat 
is unable to identify the containerId from registeredContainers.  Need to 
verify this.
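The suspected path can be sketched as follows (hypothetical names; the real TaskAttemptListenerImpTezDag logic is more involved): a heartbeat whose containerId is missing from registeredContainers would get a shouldDie answer, which would kill a live task if registration was lost.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch with hypothetical names, not the actual Tez AM code: a heartbeat
// handler that answers shouldDie when the reporting container is absent
// from its registered-container map.
public class HeartbeatSketch {
    private final Map<String, Boolean> registeredContainers = new ConcurrentHashMap<>();

    public void register(String containerId) {
        registeredContainers.put(containerId, Boolean.TRUE);
    }

    public void unregister(String containerId) {
        registeredContainers.remove(containerId);
    }

    // Returns true when the AM would tell the task to die: an unknown
    // containerId is treated as stale and gets a shouldDie response.
    public boolean heartbeatShouldDie(String containerId) {
        return !registeredContainers.containsKey(containerId);
    }
}
```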

I will attach the sample task log and the tez-ui details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TEZ-1274) Remove Key/Value type checks in IFile

2015-01-08 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated TEZ-1274:
--
Attachment: TEZ-1274.1.patch

[~sseth] Please review when you find time.

Removed unwanted checks.  It is left to the caller to ensure that proper type 
and length checks are done (since it's all in Tez code, it should be fine to 
remove these checks).  We also don't need the length checks: if negative 
key/value lengths are passed, the output stream would throw 
IndexOutOfBoundsException anyway.
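The change can be sketched like this (hypothetical writer class, not the real IFile.Writer): the per-record getClass() comparisons move from unconditional checks in the tight append loop to asserts, which the JVM skips unless started with -ea.

```java
// Sketch with hypothetical names, not the actual IFile.Writer code: the
// per-record key/value type checks become Java asserts, so the tight
// append loop no longer pays for them on every record in normal runs.
public class IFileWriterSketch {
    private final Class<?> keyClass;
    private final Class<?> valueClass;
    private long numRecords = 0;
    private long rawLength = 0;

    public IFileWriterSketch(Class<?> keyClass, Class<?> valueClass) {
        this.keyClass = keyClass;
        this.valueClass = valueClass;
    }

    public void append(Object key, Object value) {
        // Formerly unconditional checks that threw on a type mismatch;
        // now assertion-only, as the issue description suggests.
        assert key.getClass() == keyClass : "wrong key class: " + key.getClass();
        assert value.getClass() == valueClass : "wrong value class: " + value.getClass();
        rawLength += key.toString().length() + value.toString().length();
        numRecords++;
    }

    public long getNumRecords() { return numRecords; }
    public long getRawLength() { return rawLength; }
}
```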

 Remove Key/Value type checks in IFile
 -

 Key: TEZ-1274
 URL: https://issues.apache.org/jira/browse/TEZ-1274
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Siddharth Seth
Assignee: Rajesh Balamohan
 Attachments: TEZ-1274.1.patch


 We check key and value types for each record - this should be removed from 
 the tight loop. Maybe an assertion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TEZ-1929) AM intermittently sending kill signal to running task in heartbeat

2015-01-08 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated TEZ-1929:
--
Attachment: tasklog.txt
Screen Shot 2015-01-08 at 2.09.11 PM.png
Screen Shot 2015-01-08 at 2.28.04 PM.png

 AM intermittently sending kill signal to running task in heartbeat
 --

 Key: TEZ-1929
 URL: https://issues.apache.org/jira/browse/TEZ-1929
 Project: Apache Tez
  Issue Type: Bug
Reporter: Rajesh Balamohan
 Attachments: Screen Shot 2015-01-08 at 2.09.11 PM.png, Screen Shot 
 2015-01-08 at 2.28.04 PM.png, tasklog.txt


 Observed this behavior 3 or 4 times
 - Ran a hive query with tez (query_17 at 10 TB scale)
 - Occasionally, the Map_7 task gets into a failed state in the middle of 
 fetching data from other sources (only one task is available in Map_7).
 {code}
 2015-01-08 00:19:10,289 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: 
 Completed fetch for attempt: InputAttemptIdentifier 
 [inputIdentifier=InputIdentifier [inputIndex=0], attemptNumber=0, 
 pathComponent=attempt_142126204_0233_1_06_00_0_10003] to MEMORY, 
 CompressedSize=6757, DecompressedSize=16490,EndTime=1420705150289, 
 TimeTaken=5, Rate=1.29 MB/s
 2015-01-08 00:19:10,290 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: All 
 inputs fetched for input vertex : Map 6
 2015-01-08 00:19:10,290 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: copy(0 
 of 1. Transfer rate (CumulativeDataFetched/TimeSinceInputStarted)) 0.01 MB/s)
 2015-01-08 00:19:10,290 INFO [ShuffleRunner [Map_6]] impl.ShuffleManager: 
 Shutting down FetchScheduler, Was Interrupted: false
 2015-01-08 00:19:10,290 INFO [ShuffleRunner [Map_6]] impl.ShuffleManager: 
 Scheduler thread completed
 2015-01-08 00:19:41,986 INFO [TaskHeartbeatThread] task.TaskReporter: 
 Received should die response from AM
 2015-01-08 00:19:41,986 INFO [TaskHeartbeatThread] task.TaskReporter: Asked 
 to die via task heartbeat
 2015-01-08 00:19:41,987 INFO [main] task.TezTaskRunner: Interrupted while 
 waiting for task to complete. Interrupting task
 2015-01-08 00:19:41,987 INFO [main] task.TezTaskRunner: Shutdown requested... 
 returning
 2015-01-08 00:19:41,987 INFO [main] task.TezChild: Got a shouldDie 
 notification via hearbeats. Shutting down
 2015-01-08 00:19:41,990 ERROR [TezChild] tez.TezProcessor: 
 java.lang.InterruptedException
   at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
   at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
   at 
 org.apache.tez.runtime.InputReadyTracker$InputReadyMonitor.awaitCondition(InputReadyTracker.java:120)
   at 
 org.apache.tez.runtime.InputReadyTracker.waitForAnyInputReady(InputReadyTracker.java:83)
   at 
 org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAnyInputReady(TezProcessorContextImpl.java:106)
   at 
 org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:153)
   at 
 org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:138)
   at 
 org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:328)
   at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
   at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
 {code}
 From the initial look, it appears that TaskAttemptListenerImpTezDag.heartbeat 
 is unable to identify the containerId from registeredContainers.  Need to 
 verify this.
 I will attach the sample task log and the tez-ui details.





[jira] [Commented] (TEZ-1904) Fix findbugs warnings in tez-runtime-library

2015-01-08 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269485#comment-14269485
 ] 

Rajesh Balamohan commented on TEZ-1904:
---

Minor comments:
- In PipelinedSorter, the comparator isn't used in SpanMerger's constructor.  
Should we remove it?
- In SecureShuffleUtils, toHex() is not used anywhere.  Should we remove it, if 
relevant?
- IFileInputStream.getChecksum() is not used anywhere. Should we remove it, if 
relevant?


 Fix findbugs warnings in tez-runtime-library
 

 Key: TEZ-1904
 URL: https://issues.apache.org/jira/browse/TEZ-1904
 Project: Apache Tez
  Issue Type: Sub-task
Reporter: Hitesh Shah
Assignee: Siddharth Seth
 Attachments: TEZ-1904.1.txt


 https://builds.apache.org/job/PreCommit-Tez-Build/8/artifact/patchprocess/newPatchFindbugsWarningstez-runtime-library.html





[jira] [Updated] (TEZ-1904) Fix findbugs warnings in tez-runtime-library

2015-01-08 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated TEZ-1904:

Attachment: TEZ-1904.2.txt

Thanks for taking a look. Updated patch with comments addressed.

 Fix findbugs warnings in tez-runtime-library
 

 Key: TEZ-1904
 URL: https://issues.apache.org/jira/browse/TEZ-1904
 Project: Apache Tez
  Issue Type: Sub-task
Reporter: Hitesh Shah
Assignee: Siddharth Seth
 Attachments: TEZ-1904.1.txt, TEZ-1904.2.txt


 https://builds.apache.org/job/PreCommit-Tez-Build/8/artifact/patchprocess/newPatchFindbugsWarningstez-runtime-library.html





[jira] [Updated] (TEZ-1904) Fix findbugs warnings in tez-runtime-library

2015-01-08 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated TEZ-1904:

Attachment: TEZ-1904.3.txt

Updated to fix some more inconsistent_sync warnings.

 Fix findbugs warnings in tez-runtime-library
 

 Key: TEZ-1904
 URL: https://issues.apache.org/jira/browse/TEZ-1904
 Project: Apache Tez
  Issue Type: Sub-task
Reporter: Hitesh Shah
Assignee: Siddharth Seth
 Attachments: TEZ-1904.1.txt, TEZ-1904.2.txt, TEZ-1904.3.txt


 https://builds.apache.org/job/PreCommit-Tez-Build/8/artifact/patchprocess/newPatchFindbugsWarningstez-runtime-library.html





[jira] [Commented] (TEZ-1923) FetcherOrderedGrouped gets into infinite loop due to memory pressure

2015-01-08 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269939#comment-14269939
 ] 

Siddharth Seth commented on TEZ-1923:
-

{code}
+  if (usedMemory > memoryLimit) {
+    LOG.info("Starting inMemoryMerger's merge since usedMemory=" + memoryLimit +
+        " > memoryLimit=" + memoryLimit +
+        ". commitMemory=" + commitMemory + ", mergeThreshold=" + mergeThreshold);
+    startMemToDiskMerge();
+  }
{code}
This will, at best, attempt to start the memToDiskMerger - there's no guarantee 
that it'll actually run since one may already be in progress. It ends up not 
waiting for the MemToMemMerger to complete - which would free up some memory - 
and potentially trigger another merge based on thresholds. The usedMemory at 
this point will be determined by a race between the current thread and the 
MemToMemMerger thread (whether the unconditional reserve has been done yet or 
not).  Meanwhile, fetchers block in any case, since memory isn't available. I 
think it's better to leave this section of the patch out, to be fixed in the 
MemToMem merger jiras.
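
To illustrate why a "start merge" call is only an attempt (a hypothetical sketch, not MergeManager's actual code): if a merge is already in progress, the compare-and-set fails and the new request is simply dropped rather than queued.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch of an "attempt-only" merge trigger; names are made
// up for this example. A second caller arriving while a merge runs gets
// false back, and no additional merge is scheduled.
class MergeTrigger {
    private final AtomicBoolean mergeInProgress = new AtomicBoolean(false);

    boolean tryStartMerge(Runnable merge) {
        if (mergeInProgress.compareAndSet(false, true)) {
            try {
                merge.run(); // stand-in for scheduling the mem-to-disk merge
            } finally {
                mergeInProgress.set(false);
            }
            return true;
        }
        return false; // a merge was already running; this request is dropped
    }
}
```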

{code}
+  if ((usedMemory + mergeOutputSize) > memoryLimit) {
+    LOG.info("Not enough memory to carry out mem-to-mem merging. usedMemory=" + usedMemory +
+        " memoryLimit=" + memoryLimit);
+    return;
+  }
{code}
usedMemory may not be visible correctly - since it isn't inside the main 
MergeManager lock. This could also be part of the MemToMemMerger fixes.
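
A minimal sketch of the visibility point (field and class names are invented for this example, not the actual MergeManager members): a counter mutated by fetcher threads must be read under the same lock the writers use, or be volatile, for the reader to be guaranteed a fresh value.

```java
// Illustrative sketch only. usedMemory is guarded by 'lock'; reading it
// outside that lock risks a stale value under the Java memory model.
class MemoryAccounting {
    private final Object lock = new Object();
    private long usedMemory; // guarded by 'lock'

    void reserve(long bytes) {
        synchronized (lock) {
            usedMemory += bytes;
        }
    }

    boolean canMergeInMemory(long mergeOutputSize, long memoryLimit) {
        synchronized (lock) { // read under the same lock the writers use
            return usedMemory + mergeOutputSize <= memoryLimit;
        }
    }
}
```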

{code}merger.waitForShuffleToMergeMemory();{code}
Would this be a problem in terms of connection timeouts, since this wait happens 
while the connection is established? This could instead be done in the run() 
method, similar to merger.waitForInMemoryMerge().

 FetcherOrderedGrouped gets into infinite loop due to memory pressure
 

 Key: TEZ-1923
 URL: https://issues.apache.org/jira/browse/TEZ-1923
 Project: Apache Tez
  Issue Type: Bug
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan
 Attachments: TEZ-1923.1.patch, TEZ-1923.2.patch


 - Ran a comparatively large job (temp table creation) at 10 TB scale.
 - Turned on intermediate mem-to-mem 
 (tez.runtime.shuffle.memory-to-memory.enable=true and 
 tez.runtime.shuffle.memory-to-memory.segments=4)
 - Some reducers get lots of data and quickly gets into infinite loop
 {code}
 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] 
 orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 3ms
 2015-01-07 02:36:56,644 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for 
 url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true
  sent hash and receievd reply 0 ms
 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] 
 orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 1ms
 2015-01-07 02:36:56,645 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for 
 url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true
  sent hash and receievd reply 0 ms
 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 2015-01-07 02:36:56,647 INFO [fetcher [Map_1] #2] 
 orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 2ms
 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for 
 url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true
  sent hash and receievd reply 0 ms
 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 2015-01-07 02:36:56,653 INFO [fetcher [Map_1] #2] 
 orderedgrouped.ShuffleScheduler: m1:13562 freed by fetcher [Map_1] #2 in 5ms
 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] shuffle.HttpConnection: for 
 url=http://m1:13562/mapOutput?job=job_142126204_0201&reduce=34&map=attempt_142126204_0201_1_00_000420_0_10027&keepAlive=true
  sent hash and receievd reply 0 ms
 2015-01-07 02:36:56,654 INFO [fetcher [Map_1] #2] 
 orderedgrouped.FetcherOrderedGrouped: fetcher#2 - MergerManager returned 
 Status.WAIT ...
 {code}
 Additional debug/patch statements revealed that InMemoryMerge is not invoked 
 appropriately and does not release memory back for the fetchers to proceed, 
 e.g. debug/patch messages are

[jira] [Commented] (TEZ-1929) AM intermittently sending kill signal to running task in heartbeat

2015-01-08 Thread Siddharth Seth (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269943#comment-14269943
 ] 

Siddharth Seth commented on TEZ-1929:
-

[~rajesh.balamohan] - do you have the AM logs as well? This could be a result 
of pre-emption - either because the wrong task is running, or because YARN 
decided the application is over its resource limit.

 AM intermittently sending kill signal to running task in heartbeat
 --

 Key: TEZ-1929
 URL: https://issues.apache.org/jira/browse/TEZ-1929
 Project: Apache Tez
  Issue Type: Bug
Reporter: Rajesh Balamohan
 Attachments: Screen Shot 2015-01-08 at 2.09.11 PM.png, Screen Shot 
 2015-01-08 at 2.28.04 PM.png, tasklog.txt


 Observed this behavior 3 or 4 times
 - Ran a hive query with tez (query_17 at 10 TB scale)
 - Occasionally, Map_7 task will get into failed state in the middle of 
 fetching data from other sources (only one task is available in Map_7).  
 {code}
 2015-01-08 00:19:10,289 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: 
 Completed fetch for attempt: InputAttemptIdentifier 
 [inputIdentifier=InputIdentifier [inputIndex=0], attemptNumber=0, 
 pathComponent=attempt_142126204_0233_1_06_00_0_10003] to MEMORY, 
 CompressedSize=6757, DecompressedSize=16490,EndTime=1420705150289, 
 TimeTaken=5, Rate=1.29 MB/s
 2015-01-08 00:19:10,290 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: All 
 inputs fetched for input vertex : Map 6
 2015-01-08 00:19:10,290 INFO [Fetcher [Map_6] #0] impl.ShuffleManager: copy(0 
 of 1. Transfer rate (CumulativeDataFetched/TimeSinceInputStarted)) 0.01 MB/s)
 2015-01-08 00:19:10,290 INFO [ShuffleRunner [Map_6]] impl.ShuffleManager: 
 Shutting down FetchScheduler, Was Interrupted: false
 2015-01-08 00:19:10,290 INFO [ShuffleRunner [Map_6]] impl.ShuffleManager: 
 Scheduler thread completed
 2015-01-08 00:19:41,986 INFO [TaskHeartbeatThread] task.TaskReporter: 
 Received should die response from AM
 2015-01-08 00:19:41,986 INFO [TaskHeartbeatThread] task.TaskReporter: Asked 
 to die via task heartbeat
 2015-01-08 00:19:41,987 INFO [main] task.TezTaskRunner: Interrupted while 
 waiting for task to complete. Interrupting task
 2015-01-08 00:19:41,987 INFO [main] task.TezTaskRunner: Shutdown requested... 
 returning
 2015-01-08 00:19:41,987 INFO [main] task.TezChild: Got a shouldDie 
 notification via hearbeats. Shutting down
 2015-01-08 00:19:41,990 ERROR [TezChild] tez.TezProcessor: 
 java.lang.InterruptedException
   at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
   at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
   at 
 org.apache.tez.runtime.InputReadyTracker$InputReadyMonitor.awaitCondition(InputReadyTracker.java:120)
   at 
 org.apache.tez.runtime.InputReadyTracker.waitForAnyInputReady(InputReadyTracker.java:83)
   at 
 org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAnyInputReady(TezProcessorContextImpl.java:106)
   at 
 org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:153)
   at 
 org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:138)
   at 
 org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:328)
   at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
   at 
 org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
 {code}
 From the initial look, it appears that TaskAttemptListenerImpTezDag.heartbeat 
 is unable to identify the containerId from registeredContainers.  Need to 
 verify this.
 I will attach the sample task log and the tez-ui details.


