[jira] [Updated] (TEZ-3291) Optimize splits grouping when locality information is not available

2016-06-12 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated TEZ-3291:
--
Attachment: TEZ-3291.4.patch

Attaching the patch, which explicitly checks whether all splits have 
"localhost" as the location (for S3). Added an additional test case.

> Optimize splits grouping when locality information is not available
> ---
>
> Key: TEZ-3291
> URL: https://issues.apache.org/jira/browse/TEZ-3291
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Priority: Minor
> Attachments: TEZ-3291.2.patch, TEZ-3291.3.patch, TEZ-3291.4.patch, 
> TEZ-3291.WIP.patch
>
>
> There are scenarios where splits might not contain location details. S3 
> is an example, where all splits would have "localhost" as the location. 
> In such cases, the current split computation does not go through the 
> rack-local and allow-small-groups optimizations and ends up creating a 
> small number of splits. Depending on the cluster, this can end up creating 
> long-running map jobs.
> Example with Hive:
> ==
> 1. The inventory table in the TPC-DS dataset is partitioned and is relatively 
> small.
> 2. With query-22, Hive sends a grouping request with an original split count 
> of 52, and the overall length of the splits is around 12061817 bytes. 
> {{tez.grouping.min-size}} was set to 16 MB.
> 3. In Tez splits grouping, this ends up creating a single split with 52+ 
> files to be processed in that split.  In clusters with split locations, this 
> would have ended up with multiple splits since {{allowSmallGroups}} would 
> have kicked in.
> But in S3, since everything has "localhost", all splits get added to a 
> single group. This makes things a lot worse.
> 4. Depending on the dataset and the format, this can be problematic. For 
> instance, file open calls and random seeks can be expensive in S3.
> 5. In this case, 52 files have to be opened and processed by a single task in 
> sequential fashion. Had they been processed by multiple tasks, the response 
> time would have been drastically lower.
> E.g. log details:
> {noformat}
> 2016-06-01 13:48:08,353 [INFO] [InputInitializer {Map 2} #0] 
> |split.TezMapredSplitsGrouper|: Grouping splits in Tez
> 2016-06-01 13:48:08,353 [INFO] [InputInitializer {Map 2} #0] 
> |split.TezMapredSplitsGrouper|: Desired splits: 110 too large.  Desired 
> splitLength: 109652 Min splitLength: 16777216 New desired splits: 1 Total 
> length: 12061817 Original splits: 52
> 2016-06-01 13:48:08,354 [INFO] [InputInitializer {Map 2} #0] 
> |split.TezMapredSplitsGrouper|: Desired numSplits: 1 lengthPerGroup: 12061817 
> numLocations: 1 numSplitsPerLocation: 52 numSplitsInGroup: 52 totalLength: 
> 12061817 numOriginalSplits: 52 . Grouping by length: true count: false
> 2016-06-01 13:48:08,354 [INFO] [InputInitializer {Map 2} #0] 
> |split.TezMapredSplitsGrouper|: Number of splits desired: 1 created: 1 
> splitsProcessed: 52
> {noformat}
> Alternate options:
> ==
> 1. Force Hadoop to provide bogus locations for S3. But it is not clear whether 
> that would be accepted anytime soon. Ref: HADOOP-12878
> 2. Set {{tez.grouping.min-size}} to a very low value. But should the end 
> user always have to do this on a query-by-query basis?
> 3. When {{(lengthPerGroup < "tez.grouping.min-size")}}, recompute 
> desiredNumSplits only when the number of distinct locations in the splits is > 1. 
> This would force more splits to be generated.
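
For illustration, option 3 could be sketched roughly as below (hypothetical code; 
names such as computeDesiredSplits and distinctLocationCount are illustrative and 
not taken from the actual grouper):
{noformat}
// Illustrative sketch of option 3 only; not the actual TezSplitGrouper code.
public class GroupingSketch {
  static int computeDesiredSplits(long totalLength, int desiredNumSplits,
      long minLengthPerGroup, int distinctLocationCount) {
    long lengthPerGroup = totalLength / Math.max(1, desiredNumSplits);
    if (lengthPerGroup < minLengthPerGroup && distinctLocationCount > 1) {
      // Real locality info is present: honor tez.grouping.min-size by
      // shrinking the split count so each group is at least min-size.
      return (int) Math.max(1, totalLength / minLengthPerGroup);
    }
    // All splits report one (possibly fake) location, e.g. "localhost" on S3:
    // keep the requested parallelism instead of collapsing to a single group.
    return desiredNumSplits;
  }

  public static void main(String[] args) {
    // Numbers from the example above: 110 desired splits, 12061817 bytes, 16 MB min.
    System.out.println(computeDesiredSplits(12061817L, 110, 16777216L, 1));  // 110
    System.out.println(computeDesiredSplits(12061817L, 110, 16777216L, 10)); // 1
  }
}
{noformat}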



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-3297) Deadlock scenario in AM during ShuffleVertexManager auto reduce

2016-06-12 Thread TezQA (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15326688#comment-15326688
 ] 

TezQA commented on TEZ-3297:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  
http://issues.apache.org/jira/secure/attachment/12809732/TEZ-3297.2.branch-0.7.patch
  against master revision 1d11ad2.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1793//console

This message is automatically generated.

> Deadlock scenario in AM during ShuffleVertexManager auto reduce
> ---
>
> Key: TEZ-3297
> URL: https://issues.apache.org/jira/browse/TEZ-3297
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Zhiyuan Yang
>Priority: Critical
> Attachments: TEZ-3297.1.patch, TEZ-3297.2.branch-0.7.patch, 
> TEZ-3297.2.patch, am_log, thread_dump
>
>
> Here is what's happening in the attached thread dump.
> App Pool thread #9 does the auto reduce on V2 and initializes the new edge 
> manager; it holds the V2 write lock and wants the read lock of source vertex V1. 
> At the same time, another App Pool thread #2 schedules a task of V1 and gets 
> the output spec, so it holds the V1 read lock and wants the V2 read lock. 
> Also, the dispatcher thread wants the V1 write lock to begin the state machine 
> transition. Since the dispatcher thread is at the head of the V1 ReadWriteLock 
> queue, thread #9 cannot get the V1 read lock even though thread #2 is holding 
> the V1 read lock. 
> This is a circular lock scenario: #2 blocks the dispatcher, the dispatcher 
> blocks #9, and #9 blocks #2.
> There is no problem with the ReadWriteLock behavior in this case. Please see 
> this Java bug report: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6816565.
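
For readers unfamiliar with the failure mode, a stripped-down repro of the same 
circular wait with two ReentrantReadWriteLocks follows (hypothetical code, not the 
Tez sources; whether thread #9 actually blocks behind the queued writer depends on 
the lock's anti-barging heuristics, which is exactly the behavior in the JDK bug 
linked above):
{noformat}
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch only: with this ordering the three threads block forever.
public class DeadlockSketch {
  static final ReentrantReadWriteLock v1 = new ReentrantReadWriteLock();
  static final ReentrantReadWriteLock v2 = new ReentrantReadWriteLock();

  static void sleep(long ms) {
    try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
  }

  public static void main(String[] args) {
    // App Pool #2: holds the V1 read lock, later wants the V2 read lock.
    new Thread(() -> {
      v1.readLock().lock();
      sleep(300);               // let the other threads queue up
      v2.readLock().lock();     // blocks: #9 holds the V2 write lock
    }, "app-pool-2").start();

    // Dispatcher: wants the V1 write lock, queues behind #2's read lock.
    new Thread(() -> {
      sleep(100);
      v1.writeLock().lock();    // blocks: #2 holds the V1 read lock
    }, "dispatcher").start();

    // App Pool #9: holds the V2 write lock, later wants the V1 read lock.
    new Thread(() -> {
      v2.writeLock().lock();
      sleep(200);
      v1.readLock().lock();     // blocks: a writer (dispatcher) is queued on V1
    }, "app-pool-9").start();
  }
}
{noformat}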



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TEZ-3297) Deadlock scenario in AM during ShuffleVertexManager auto reduce

2016-06-12 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated TEZ-3297:
--
Attachment: TEZ-3297.2.branch-0.7.patch

Thanks a lot [~sseth], [~bikassaha]. Will commit it shortly.

[~jeagles] - Attaching patch for branch-0.7 as well. Will commit it to 
branch-0.7

> Deadlock scenario in AM during ShuffleVertexManager auto reduce
> ---
>
> Key: TEZ-3297
> URL: https://issues.apache.org/jira/browse/TEZ-3297
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Zhiyuan Yang
>Priority: Critical
> Attachments: TEZ-3297.1.patch, TEZ-3297.2.branch-0.7.patch, 
> TEZ-3297.2.patch, am_log, thread_dump
>
>
> Here is what's happening in the attached thread dump.
> App Pool thread #9 does the auto reduce on V2 and initializes the new edge 
> manager; it holds the V2 write lock and wants the read lock of source vertex V1. 
> At the same time, another App Pool thread #2 schedules a task of V1 and gets 
> the output spec, so it holds the V1 read lock and wants the V2 read lock. 
> Also, the dispatcher thread wants the V1 write lock to begin the state machine 
> transition. Since the dispatcher thread is at the head of the V1 ReadWriteLock 
> queue, thread #9 cannot get the V1 read lock even though thread #2 is holding 
> the V1 read lock. 
> This is a circular lock scenario: #2 blocks the dispatcher, the dispatcher 
> blocks #9, and #9 blocks #2.
> There is no problem with the ReadWriteLock behavior in this case. Please see 
> this Java bug report: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6816565.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-3291) Optimize splits grouping when locality information is not available

2016-06-12 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15326689#comment-15326689
 ] 

Bikas Saha commented on TEZ-3291:
-

The comment could be more explicit, e.g. "this is a workaround for systems like 
S3 that pass the same fake hostname for all splits".
The log could include the newDesiredSplits and also the final value of the desired 
splits, so that we get all the info in one log line.
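
Roughly what that could look like in the grouper (illustrative only; the method and 
variable names are assumptions, not the patch):
{noformat}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class GroupingLogSketch {
  private static final Logger LOG = LoggerFactory.getLogger(GroupingLogSketch.class);

  static void logSplitDecision(int newDesiredSplits, int finalDesiredSplits,
      long lengthPerGroup, int numDistinctLocations) {
    // Workaround for systems like S3 that pass the same fake hostname for all
    // splits; keep every relevant number in a single log line.
    LOG.info("Split grouping decision: newDesiredSplits={} finalDesiredSplits={}"
        + " lengthPerGroup={} numDistinctLocations={}",
        newDesiredSplits, finalDesiredSplits, lengthPerGroup, numDistinctLocations);
  }
}
{noformat}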

> Optimize splits grouping when locality information is not available
> ---
>
> Key: TEZ-3291
> URL: https://issues.apache.org/jira/browse/TEZ-3291
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Priority: Minor
> Attachments: TEZ-3291.2.patch, TEZ-3291.WIP.patch
>
>
> There are scenarios where splits might not contain location details. S3 
> is an example, where all splits would have "localhost" as the location. 
> In such cases, the current split computation does not go through the 
> rack-local and allow-small-groups optimizations and ends up creating a 
> small number of splits. Depending on the cluster, this can end up creating 
> long-running map jobs.
> Example with Hive:
> ==
> 1. The inventory table in the TPC-DS dataset is partitioned and is relatively 
> small.
> 2. With query-22, Hive sends a grouping request with an original split count 
> of 52, and the overall length of the splits is around 12061817 bytes. 
> {{tez.grouping.min-size}} was set to 16 MB.
> 3. In Tez splits grouping, this ends up creating a single split with 52+ 
> files to be processed in that split.  In clusters with split locations, this 
> would have ended up with multiple splits since {{allowSmallGroups}} would 
> have kicked in.
> But in S3, since everything has "localhost", all splits get added to a 
> single group. This makes things a lot worse.
> 4. Depending on the dataset and the format, this can be problematic. For 
> instance, file open calls and random seeks can be expensive in S3.
> 5. In this case, 52 files have to be opened and processed by a single task in 
> sequential fashion. Had they been processed by multiple tasks, the response 
> time would have been drastically lower.
> E.g. log details:
> {noformat}
> 2016-06-01 13:48:08,353 [INFO] [InputInitializer {Map 2} #0] 
> |split.TezMapredSplitsGrouper|: Grouping splits in Tez
> 2016-06-01 13:48:08,353 [INFO] [InputInitializer {Map 2} #0] 
> |split.TezMapredSplitsGrouper|: Desired splits: 110 too large.  Desired 
> splitLength: 109652 Min splitLength: 16777216 New desired splits: 1 Total 
> length: 12061817 Original splits: 52
> 2016-06-01 13:48:08,354 [INFO] [InputInitializer {Map 2} #0] 
> |split.TezMapredSplitsGrouper|: Desired numSplits: 1 lengthPerGroup: 12061817 
> numLocations: 1 numSplitsPerLocation: 52 numSplitsInGroup: 52 totalLength: 
> 12061817 numOriginalSplits: 52 . Grouping by length: true count: false
> 2016-06-01 13:48:08,354 [INFO] [InputInitializer {Map 2} #0] 
> |split.TezMapredSplitsGrouper|: Number of splits desired: 1 created: 1 
> splitsProcessed: 52
> {noformat}
> Alternate options:
> ==
> 1. Force Hadoop to provide bogus locations for S3. But it is not clear whether 
> that would be accepted anytime soon. Ref: HADOOP-12878
> 2. Set {{tez.grouping.min-size}} to a very low value. But should the end 
> user always have to do this on a query-by-query basis?
> 3. When {{(lengthPerGroup < "tez.grouping.min-size")}}, recompute 
> desiredNumSplits only when the number of distinct locations in the splits is > 1. 
> This would force more splits to be generated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (TEZ-3297) Deadlock scenario in AM during ShuffleVertexManager auto reduce

2016-06-12 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan reassigned TEZ-3297:
-

Assignee: Rajesh Balamohan

> Deadlock scenario in AM during ShuffleVertexManager auto reduce
> ---
>
> Key: TEZ-3297
> URL: https://issues.apache.org/jira/browse/TEZ-3297
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Zhiyuan Yang
>Assignee: Rajesh Balamohan
>Priority: Critical
> Attachments: TEZ-3297.1.patch, TEZ-3297.2.branch-0.7.patch, 
> TEZ-3297.2.patch, am_log, thread_dump
>
>
> Here is what's happening in the attached thread dump.
> App Pool thread #9 does the auto reduce on V2 and initializes the new edge 
> manager; it holds the V2 write lock and wants the read lock of source vertex V1. 
> At the same time, another App Pool thread #2 schedules a task of V1 and gets 
> the output spec, so it holds the V1 read lock and wants the V2 read lock. 
> Also, the dispatcher thread wants the V1 write lock to begin the state machine 
> transition. Since the dispatcher thread is at the head of the V1 ReadWriteLock 
> queue, thread #9 cannot get the V1 read lock even though thread #2 is holding 
> the V1 read lock. 
> This is a circular lock scenario: #2 blocks the dispatcher, the dispatcher 
> blocks #9, and #9 blocks #2.
> There is no problem with the ReadWriteLock behavior in this case. Please see 
> this Java bug report: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6816565.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TEZ-3291) Optimize splits grouping when locality information is not available

2016-06-12 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated TEZ-3291:
--
Attachment: TEZ-3291.3.patch

Attaching patch to address review comments.

S3 URLs can be explicitly checked in splits (by casting to FileSplit and 
checking getPath). But it is not clear whether we would want to restrict this 
only to FileSplits in the future.
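
A rough illustration of that check (hypothetical helper, not from the patch; a real 
implementation would need to decide what to do for non-file splits):
{noformat}
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;

class SplitSchemeSketch {
  // Returns true only when the split is a FileSplit whose path scheme looks like S3.
  static boolean looksLikeS3(InputSplit split) {
    if (!(split instanceof FileSplit)) {
      return false;                     // cannot tell for non-file splits
    }
    Path path = ((FileSplit) split).getPath();
    String scheme = path.toUri().getScheme();
    return scheme != null && scheme.startsWith("s3");   // s3, s3a, s3n
  }
}
{noformat}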

> Optimize splits grouping when locality information is not available
> ---
>
> Key: TEZ-3291
> URL: https://issues.apache.org/jira/browse/TEZ-3291
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Priority: Minor
> Attachments: TEZ-3291.2.patch, TEZ-3291.3.patch, TEZ-3291.WIP.patch
>
>
> There are scenarios where splits might not contain location details. S3 
> is an example, where all splits would have "localhost" as the location. 
> In such cases, the current split computation does not go through the 
> rack-local and allow-small-groups optimizations and ends up creating a 
> small number of splits. Depending on the cluster, this can end up creating 
> long-running map jobs.
> Example with Hive:
> ==
> 1. The inventory table in the TPC-DS dataset is partitioned and is relatively 
> small.
> 2. With query-22, Hive sends a grouping request with an original split count 
> of 52, and the overall length of the splits is around 12061817 bytes. 
> {{tez.grouping.min-size}} was set to 16 MB.
> 3. In Tez splits grouping, this ends up creating a single split with 52+ 
> files to be processed in that split.  In clusters with split locations, this 
> would have ended up with multiple splits since {{allowSmallGroups}} would 
> have kicked in.
> But in S3, since everything has "localhost", all splits get added to a 
> single group. This makes things a lot worse.
> 4. Depending on the dataset and the format, this can be problematic. For 
> instance, file open calls and random seeks can be expensive in S3.
> 5. In this case, 52 files have to be opened and processed by a single task in 
> sequential fashion. Had they been processed by multiple tasks, the response 
> time would have been drastically lower.
> E.g. log details:
> {noformat}
> 2016-06-01 13:48:08,353 [INFO] [InputInitializer {Map 2} #0] 
> |split.TezMapredSplitsGrouper|: Grouping splits in Tez
> 2016-06-01 13:48:08,353 [INFO] [InputInitializer {Map 2} #0] 
> |split.TezMapredSplitsGrouper|: Desired splits: 110 too large.  Desired 
> splitLength: 109652 Min splitLength: 16777216 New desired splits: 1 Total 
> length: 12061817 Original splits: 52
> 2016-06-01 13:48:08,354 [INFO] [InputInitializer {Map 2} #0] 
> |split.TezMapredSplitsGrouper|: Desired numSplits: 1 lengthPerGroup: 12061817 
> numLocations: 1 numSplitsPerLocation: 52 numSplitsInGroup: 52 totalLength: 
> 12061817 numOriginalSplits: 52 . Grouping by length: true count: false
> 2016-06-01 13:48:08,354 [INFO] [InputInitializer {Map 2} #0] 
> |split.TezMapredSplitsGrouper|: Number of splits desired: 1 created: 1 
> splitsProcessed: 52
> {noformat}
> Alternate options:
> ==
> 1. Force Hadoop to provide bogus locations for S3. But it is not clear whether 
> that would be accepted anytime soon. Ref: HADOOP-12878
> 2. Set {{tez.grouping.min-size}} to a very low value. But should the end 
> user always have to do this on a query-by-query basis?
> 3. When {{(lengthPerGroup < "tez.grouping.min-size")}}, recompute 
> desiredNumSplits only when the number of distinct locations in the splits is > 1. 
> This would force more splits to be generated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-3296) Tez job can hang if two vertices at the same root distance have different task requirements

2016-06-12 Thread Jonathan Eagles (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15326738#comment-15326738
 ] 

Jonathan Eagles commented on TEZ-3296:
--

bq. Now - (1*24*3) + 20*3 = 150 = (2*24*3) + 2*3
The formula is set up so that all vertices with a distance of _h_ from the root 
have a logically higher priority than all vertices with a distance of _h + 1_. 
In the example above, the calculation on the LHS should be 132.
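
Spelling out the arithmetic behind the correction (the formula is as quoted in the 
earlier comment; the helper below is purely illustrative):
{noformat}
// (distanceFromRoot * totalVertices * 3) + vertexId * 3, per the quoted formula.
public class PrioritySketch {
  static int priority(int distanceFromRoot, int totalVertices, int vertexId) {
    return (distanceFromRoot * totalVertices * 3) + vertexId * 3;
  }

  public static void main(String[] args) {
    System.out.println(priority(1, 24, 20)); // 72 + 60 = 132, not 150
    System.out.println(priority(2, 24, 2));  // 144 + 6 = 150
  }
}
{noformat}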


> Tez job can hang if two vertices at the same root distance have different 
> task requirements
> ---
>
> Key: TEZ-3296
> URL: https://issues.apache.org/jira/browse/TEZ-3296
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.7.1
>Reporter: Jason Lowe
>Assignee: Jason Lowe
>Priority: Critical
> Attachments: TEZ-3296.001.patch
>
>
> When two vertices have the same distance from the root Tez will schedule 
> containers with the same priority.  However those vertices could have 
> different task requirements and therefore different capabilities.  As 
> documented in YARN-314, YARN currently doesn't support requests for multiple 
> sizes at the same priority.  In practice this leads to one vertex's allocation 
> requests clobbering the other's, and that can result in a situation where the 
> Tez AM is waiting on containers it will never receive from the RM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-3216) Support for more precise partition stats in VertexManagerEvent

2016-06-12 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15326650#comment-15326650
 ] 

Bikas Saha commented on TEZ-3216:
-

/cc [~rajesh.balamohan] in case he is interested in this optimization.

> Support for more precise partition stats in VertexManagerEvent
> --
>
> Key: TEZ-3216
> URL: https://issues.apache.org/jira/browse/TEZ-3216
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Ming Ma
>Assignee: Ming Ma
> Attachments: TEZ-3216.patch
>
>
> Following up on the TEZ-3206 discussion: at least for some use cases, more 
> accurate partition stats will be useful for DataMovementEvent routing. Maybe we can 
> provide a config option to allow apps to choose the more accurate partition 
> stats over RoaringBitmap.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-3297) Deadlock scenario in AM during ShuffleVertexManager auto reduce

2016-06-12 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15326660#comment-15326660
 ] 

Bikas Saha commented on TEZ-3297:
-

Looking at the code further, it looks like the crucial change is not holding one's 
own vertex lock while trying to take the src/dest vertex lock. That makes sense, 
and the old pattern seems like a lock-ordering issue waiting to happen. Perhaps a 
quick scan for such nested locking is in order, in case it has not already been done.

The removal of the overall lock is fine since each internal method invocation 
like getTotalTasks() already handles its own locking. 

lgtm.

Moving VM-invoked sync calls onto the dispatcher is a good idea but would need 
the addition of new callbacks into the VM to notify it of completion of the 
requested vertex state change operation. Since most current VMs don't do much 
after changing parallelism, the change might be simpler to implement now. Not 
sure about Hive custom VMs.
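
The pattern being described is roughly the following (illustrative only, not the 
actual Vertex code): compute whatever needs your own lock, release it, and only 
then touch the other vertex's lock.
{noformat}
import java.util.concurrent.locks.ReentrantReadWriteLock;

class LockOrderingSketch {
  private final ReentrantReadWriteLock myLock = new ReentrantReadWriteLock();

  int readOtherVertex(LockOrderingSketch other) {
    int snapshot;
    myLock.writeLock().lock();
    try {
      snapshot = computeLocalState();   // work that needs our own lock
    } finally {
      myLock.writeLock().unlock();      // released before taking the other lock
    }
    other.myLock.readLock().lock();     // no nested locking: nothing held here
    try {
      return snapshot + other.computeLocalState();
    } finally {
      other.myLock.readLock().unlock();
    }
  }

  private int computeLocalState() { return 42; }  // stand-in for real state
}
{noformat}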

> Deadlock scenario in AM during ShuffleVertexManager auto reduce
> ---
>
> Key: TEZ-3297
> URL: https://issues.apache.org/jira/browse/TEZ-3297
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Zhiyuan Yang
>Priority: Critical
> Attachments: TEZ-3297.1.patch, TEZ-3297.2.patch, am_log, thread_dump
>
>
> Here is what's happening in the attached thread dump.
> App Pool thread #9 does the auto reduce on V2 and initializes the new edge 
> manager; it holds the V2 write lock and wants the read lock of source vertex V1. 
> At the same time, another App Pool thread #2 schedules a task of V1 and gets 
> the output spec, so it holds the V1 read lock and wants the V2 read lock. 
> Also, the dispatcher thread wants the V1 write lock to begin the state machine 
> transition. Since the dispatcher thread is at the head of the V1 ReadWriteLock 
> queue, thread #9 cannot get the V1 read lock even though thread #2 is holding 
> the V1 read lock. 
> This is a circular lock scenario: #2 blocks the dispatcher, the dispatcher 
> blocks #9, and #9 blocks #2.
> There is no problem with the ReadWriteLock behavior in this case. Please see 
> this Java bug report: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6816565.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TEZ-3302) Add a version of processorContext.waitForAllInputsReady and waitForAnyInputReady with a timeout

2016-06-12 Thread Siddharth Seth (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated TEZ-3302:

Description: This is useful when a Processor needs to check on whether it 
has been aborted or not, and the interrupt that is sent in as part of the 'Task 
kill' process has been swallowed by some other entity.
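
One possible shape for such timeout variants (purely speculative; the issue does 
not define signatures, and the real ProcessorContext API may differ):
{noformat}
import java.util.Collection;
import java.util.concurrent.TimeUnit;

// Hypothetical interface sketching the proposed timeout overloads.
interface InputReadyWaiter {
  /** Returns true if all inputs became ready before the timeout elapsed. */
  boolean waitForAllInputsReady(Collection<String> inputNames,
      long timeout, TimeUnit unit) throws InterruptedException;

  /** Returns the name of a ready input, or null if the timeout elapsed first. */
  String waitForAnyInputReady(Collection<String> inputNames,
      long timeout, TimeUnit unit) throws InterruptedException;
}
{noformat}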

> Add a version of processorContext.waitForAllInputsReady and 
> waitForAnyInputReady with a timeout
> ---
>
> Key: TEZ-3302
> URL: https://issues.apache.org/jira/browse/TEZ-3302
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Siddharth Seth
>
> This is useful when a Processor needs to check on whether it has been aborted 
> or not, and the interrupt that is sent in as part of the 'Task kill' process 
> has been swallowed by some other entity.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TEZ-3302) Add a version of processorContext.waitForAllInputsReady and waitForAnyInputReady with a timeout

2016-06-12 Thread Siddharth Seth (JIRA)
Siddharth Seth created TEZ-3302:
---

 Summary: Add a version of processorContext.waitForAllInputsReady 
and waitForAnyInputReady with a timeout
 Key: TEZ-3302
 URL: https://issues.apache.org/jira/browse/TEZ-3302
 Project: Apache Tez
  Issue Type: Improvement
Reporter: Siddharth Seth






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-3300) Tez UI: A wiki must be created with info about each page in Tez UI

2016-06-12 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15326638#comment-15326638
 ] 

Bikas Saha commented on TEZ-3300:
-

Could pages on the wiki be linked directly from the UI page for quick access?

> Tez UI: A wiki must be created with info about each page in Tez UI
> --
>
> Key: TEZ-3300
> URL: https://issues.apache.org/jira/browse/TEZ-3300
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sreenath Somarajapuram
>
> - It would be a page under the Tez Confluence.
> - Must be flexible enough to support different versions of Tez UI, and give 
> context-based help.
> - Add a section on understanding the various errors displayed in the error-bar.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TEZ-3300) Tez UI: A wiki must be created with info about each page in Tez UI

2016-06-12 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15326638#comment-15326638
 ] 

Bikas Saha edited comment on TEZ-3300 at 6/12/16 9:22 PM:
--

Could pages to the wiki be linked directly from the corresponding UI pages for 
quick access?


was (Author: bikassaha):
Could pages on the wiki be linked directly from the UI page for quick access?

> Tez UI: A wiki must be created with info about each page in Tez UI
> --
>
> Key: TEZ-3300
> URL: https://issues.apache.org/jira/browse/TEZ-3300
> Project: Apache Tez
>  Issue Type: Bug
>Reporter: Sreenath Somarajapuram
>
> - It would be a page under the Tez Confluence.
> - Must be flexible enough to support different versions of Tez UI, and give 
> context-based help.
> - Add a section on understanding the various errors displayed in the error-bar.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-3296) Tez job can hang if two vertices at the same root distance have different task requirements

2016-06-12 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15326673#comment-15326673
 ] 

Bikas Saha commented on TEZ-3296:
-

bq. Today each vertex uses a set of three priority values, the low, the high, 
and the mean of those two. (Oddly containers for high are never requested in 
practice, just the low and mean.)
The middle priority is the default. The lower value (higher pri) is for failed 
task reruns. The higher value (lower pri) was intended for speculative tasks but 
may not have ended up being used for that.

Wondering why the app hung. IIRC, YARN keeps the higher resource request when 
there are multiple at the same priority because that's the safer thing to do. So 
when two vertices have the same priority but different resources, we would 
expect to get containers for both, but with the higher resource value across 
the board.
If the above is correct, then perhaps there is a bug in the task scheduler code 
that needs to get fixed, which we might miss if we change the vertex priorities 
to be unique as a workaround. The vertex priority change is good in its own 
right. But it would be good to make sure we don't have some pending bug in the 
task scheduler that may have other side effects. Could you please attach the 
task scheduler log for the job that hung, in case it has some clues?

On the patch itself the formula looks like
(Height*Total*3) + V*3.
Now - (1*24*3) + 20*3 = 150 = (2*24*3) + 2*3
So we could still have collisions depending on the manner in which vertexIds 
get assigned, right? Unless currently we are getting lucky in the vId 
assignment such that vertices close to the root also happen to get low ids.


> Tez job can hang if two vertices at the same root distance have different 
> task requirements
> ---
>
> Key: TEZ-3296
> URL: https://issues.apache.org/jira/browse/TEZ-3296
> Project: Apache Tez
>  Issue Type: Bug
>Affects Versions: 0.7.1
>Reporter: Jason Lowe
>Assignee: Jason Lowe
>Priority: Critical
> Attachments: TEZ-3296.001.patch
>
>
> When two vertices have the same distance from the root Tez will schedule 
> containers with the same priority.  However those vertices could have 
> different task requirements and therefore different capabilities.  As 
> documented in YARN-314, YARN currently doesn't support requests for multiple 
> sizes at the same priority.  In practice this leads to one vertex's allocation 
> requests clobbering the other's, and that can result in a situation where the 
> Tez AM is waiting on containers it will never receive from the RM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-3291) Optimize splits grouping when locality information is not available

2016-06-12 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15326649#comment-15326649
 ] 

Bikas Saha commented on TEZ-3291:
-

Why is the numLoc=1 check only in the size < min case?

A comment before the code explaining the above workaround would be useful. 
Also a log statement.

This may affect single-node cases because numLoc=1 in that case too. Is there 
any way we can find out whether the splits are coming from an S3-like source and 
use that information instead? E.g. something similar to splitSizeEstimator that 
can look at the split and return whether its locations are potentially fake.

> Optimize splits grouping when locality information is not available
> ---
>
> Key: TEZ-3291
> URL: https://issues.apache.org/jira/browse/TEZ-3291
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Priority: Minor
> Attachments: TEZ-3291.WIP.patch
>
>
> There are scenarios where splits might not contain location details. S3 
> is an example, where all splits would have "localhost" as the location. 
> In such cases, the current split computation does not go through the 
> rack-local and allow-small-groups optimizations and ends up creating a 
> small number of splits. Depending on the cluster, this can end up creating 
> long-running map jobs.
> Example with Hive:
> ==
> 1. The inventory table in the TPC-DS dataset is partitioned and is relatively 
> small.
> 2. With query-22, Hive sends a grouping request with an original split count 
> of 52, and the overall length of the splits is around 12061817 bytes. 
> {{tez.grouping.min-size}} was set to 16 MB.
> 3. In Tez splits grouping, this ends up creating a single split with 52+ 
> files to be processed in that split.  In clusters with split locations, this 
> would have ended up with multiple splits since {{allowSmallGroups}} would 
> have kicked in.
> But in S3, since everything has "localhost", all splits get added to a 
> single group. This makes things a lot worse.
> 4. Depending on the dataset and the format, this can be problematic. For 
> instance, file open calls and random seeks can be expensive in S3.
> 5. In this case, 52 files have to be opened and processed by a single task in 
> sequential fashion. Had they been processed by multiple tasks, the response 
> time would have been drastically lower.
> E.g. log details:
> {noformat}
> 2016-06-01 13:48:08,353 [INFO] [InputInitializer {Map 2} #0] 
> |split.TezMapredSplitsGrouper|: Grouping splits in Tez
> 2016-06-01 13:48:08,353 [INFO] [InputInitializer {Map 2} #0] 
> |split.TezMapredSplitsGrouper|: Desired splits: 110 too large.  Desired 
> splitLength: 109652 Min splitLength: 16777216 New desired splits: 1 Total 
> length: 12061817 Original splits: 52
> 2016-06-01 13:48:08,354 [INFO] [InputInitializer {Map 2} #0] 
> |split.TezMapredSplitsGrouper|: Desired numSplits: 1 lengthPerGroup: 12061817 
> numLocations: 1 numSplitsPerLocation: 52 numSplitsInGroup: 52 totalLength: 
> 12061817 numOriginalSplits: 52 . Grouping by length: true count: false
> 2016-06-01 13:48:08,354 [INFO] [InputInitializer {Map 2} #0] 
> |split.TezMapredSplitsGrouper|: Number of splits desired: 1 created: 1 
> splitsProcessed: 52
> {noformat}
> Alternate options:
> ==
> 1. Force Hadoop to provide bogus locations for S3. But it is not clear whether 
> that would be accepted anytime soon. Ref: HADOOP-12878
> 2. Set {{tez.grouping.min-size}} to a very low value. But should the end 
> user always have to do this on a query-by-query basis?
> 3. When {{(lengthPerGroup < "tez.grouping.min-size")}}, recompute 
> desiredNumSplits only when the number of distinct locations in the splits is > 1. 
> This would force more splits to be generated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TEZ-3291) Optimize splits grouping when locality information is not available

2016-06-12 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/TEZ-3291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated TEZ-3291:
--
Attachment: TEZ-3291.2.patch

Thanks for the review [~bikassaha].  Attaching the revised patch.

It is hard to find out whether a split is coming from localhost or from S3. The 
number of nodes in the cluster could serve as a hint (to rule out single-node 
clusters), but that info is not available in the split grouper.

When {{lengthPerGroup > maxLengthPerGroup}}, it goes via the normal code path 
and gets more splits. It is also possible that a couple of groups end up with 
more splits than others. But enforcing a maximum number of splits per group 
when "tez.grouping.by-length" is turned on would be tricky. 
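
For context, the min/max behavior being referenced has roughly this shape 
(illustrative sketch only, not the actual grouper logic):
{noformat}
class ClampSketch {
  static int clampDesiredSplits(long totalLength, int desiredNumSplits,
      long minLengthPerGroup, long maxLengthPerGroup) {
    long lengthPerGroup = totalLength / Math.max(1, desiredNumSplits);
    if (lengthPerGroup > maxLengthPerGroup) {
      // Groups would be too large: raise the split count (the "normal" path
      // mentioned above, which naturally yields more splits).
      return (int) Math.max(1, totalLength / maxLengthPerGroup);
    }
    if (lengthPerGroup < minLengthPerGroup) {
      // Groups would be too small: lower the split count.
      return (int) Math.max(1, totalLength / minLengthPerGroup);
    }
    return desiredNumSplits;
  }
}
{noformat}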



> Optimize splits grouping when locality information is not available
> ---
>
> Key: TEZ-3291
> URL: https://issues.apache.org/jira/browse/TEZ-3291
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Priority: Minor
> Attachments: TEZ-3291.2.patch, TEZ-3291.WIP.patch
>
>
> There are scenarios where splits might not contain location details. S3 
> is an example, where all splits would have "localhost" as the location. 
> In such cases, the current split computation does not go through the 
> rack-local and allow-small-groups optimizations and ends up creating a 
> small number of splits. Depending on the cluster, this can end up creating 
> long-running map jobs.
> Example with Hive:
> ==
> 1. The inventory table in the TPC-DS dataset is partitioned and is relatively 
> small.
> 2. With query-22, Hive sends a grouping request with an original split count 
> of 52, and the overall length of the splits is around 12061817 bytes. 
> {{tez.grouping.min-size}} was set to 16 MB.
> 3. In Tez splits grouping, this ends up creating a single split with 52+ 
> files to be processed in that split.  In clusters with split locations, this 
> would have ended up with multiple splits since {{allowSmallGroups}} would 
> have kicked in.
> But in S3, since everything has "localhost", all splits get added to a 
> single group. This makes things a lot worse.
> 4. Depending on the dataset and the format, this can be problematic. For 
> instance, file open calls and random seeks can be expensive in S3.
> 5. In this case, 52 files have to be opened and processed by a single task in 
> sequential fashion. Had they been processed by multiple tasks, the response 
> time would have been drastically lower.
> E.g. log details:
> {noformat}
> 2016-06-01 13:48:08,353 [INFO] [InputInitializer {Map 2} #0] 
> |split.TezMapredSplitsGrouper|: Grouping splits in Tez
> 2016-06-01 13:48:08,353 [INFO] [InputInitializer {Map 2} #0] 
> |split.TezMapredSplitsGrouper|: Desired splits: 110 too large.  Desired 
> splitLength: 109652 Min splitLength: 16777216 New desired splits: 1 Total 
> length: 12061817 Original splits: 52
> 2016-06-01 13:48:08,354 [INFO] [InputInitializer {Map 2} #0] 
> |split.TezMapredSplitsGrouper|: Desired numSplits: 1 lengthPerGroup: 12061817 
> numLocations: 1 numSplitsPerLocation: 52 numSplitsInGroup: 52 totalLength: 
> 12061817 numOriginalSplits: 52 . Grouping by length: true count: false
> 2016-06-01 13:48:08,354 [INFO] [InputInitializer {Map 2} #0] 
> |split.TezMapredSplitsGrouper|: Number of splits desired: 1 created: 1 
> splitsProcessed: 52
> {noformat}
> Alternate options:
> ==
> 1. Force Hadoop to provide bogus locations for S3. But it is not clear whether 
> that would be accepted anytime soon. Ref: HADOOP-12878
> 2. Set {{tez.grouping.min-size}} to a very low value. But should the end 
> user always have to do this on a query-by-query basis?
> 3. When {{(lengthPerGroup < "tez.grouping.min-size")}}, recompute 
> desiredNumSplits only when the number of distinct locations in the splits is > 1. 
> This would force more splits to be generated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Failed: TEZ-3297 PreCommit Build #1793

2016-06-12 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/TEZ-3297
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/1793/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 24 lines...]

==
==
Testing patch for TEZ-3297.
==
==


HEAD is now at 1d11ad2 TEZ-3296. Tez fails to compile against hadoop 2.8 after 
MAPREDUCE-5870 (jeagles)
Previous HEAD position was 1d11ad2... TEZ-3296. Tez fails to compile against 
hadoop 2.8 after MAPREDUCE-5870 (jeagles)
Switched to branch 'master'
Your branch is behind 'origin/master' by 1 commit, and can be fast-forwarded.
  (use "git pull" to update your local branch)
First, rewinding head to replay your work on top of it...
Fast-forwarded master to 1d11ad275548031c68b2b360f2b8b7111ecd91fd.
TEZ-3297 patch is being downloaded at Sun Jun 12 23:38:50 UTC 2016 from
http://issues.apache.org/jira/secure/attachment/12809732/TEZ-3297.2.branch-0.7.patch
The patch does not appear to apply with p0 to p2
PATCH APPLICATION FAILED




{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment
  
http://issues.apache.org/jira/secure/attachment/12809732/TEZ-3297.2.branch-0.7.patch
  against master revision 1d11ad2.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1793//console

This message is automatically generated.


==
==
Adding comment to Jira.
==
==


Comment added.
f15871f38d47b02ff0ce71f1d18d1873987d847a logged out


==
==
Finished build.
==
==


Build step 'Execute shell' marked build as failure
Archiving artifacts
[description-setter] Description set: MAPREDUCE-5870
Recording test results
ERROR: Step 'Publish JUnit test result report' failed: No test report files 
were found. Configuration error?
Email was triggered for: Failure - Any
Sending email for trigger: Failure - Any



###
## FAILED TESTS (if any) 
##
No tests ran.

[jira] [Commented] (TEZ-3291) Optimize splits grouping when locality information is not available

2016-06-12 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15326684#comment-15326684
 ] 

Bikas Saha commented on TEZ-3291:
-

Would the split not have the URLs with S3 in them? Wondering how the ORC split 
estimator works. If it casts the split into OrcSplit and inspects internal 
members, then perhaps the S3 split could also be cast into the correct object to 
look at the URLs?

> Optimize splits grouping when locality information is not available
> ---
>
> Key: TEZ-3291
> URL: https://issues.apache.org/jira/browse/TEZ-3291
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Priority: Minor
> Attachments: TEZ-3291.2.patch, TEZ-3291.WIP.patch
>
>
> There are scenarios where splits might not contain location details. S3 
> is an example, where all splits would have "localhost" as the location. 
> In such cases, the current split computation does not go through the 
> rack-local and allow-small-groups optimizations and ends up creating a 
> small number of splits. Depending on the cluster, this can end up creating 
> long-running map jobs.
> Example with Hive:
> ==
> 1. The inventory table in the TPC-DS dataset is partitioned and is relatively 
> small.
> 2. With query-22, Hive sends a grouping request with an original split count 
> of 52, and the overall length of the splits is around 12061817 bytes. 
> {{tez.grouping.min-size}} was set to 16 MB.
> 3. In Tez splits grouping, this ends up creating a single split with 52+ 
> files to be processed in that split.  In clusters with split locations, this 
> would have ended up with multiple splits since {{allowSmallGroups}} would 
> have kicked in.
> But in S3, since everything has "localhost", all splits get added to a 
> single group. This makes things a lot worse.
> 4. Depending on the dataset and the format, this can be problematic. For 
> instance, file open calls and random seeks can be expensive in S3.
> 5. In this case, 52 files have to be opened and processed by a single task in 
> sequential fashion. Had they been processed by multiple tasks, the response 
> time would have been drastically lower.
> E.g. log details:
> {noformat}
> 2016-06-01 13:48:08,353 [INFO] [InputInitializer {Map 2} #0] 
> |split.TezMapredSplitsGrouper|: Grouping splits in Tez
> 2016-06-01 13:48:08,353 [INFO] [InputInitializer {Map 2} #0] 
> |split.TezMapredSplitsGrouper|: Desired splits: 110 too large.  Desired 
> splitLength: 109652 Min splitLength: 16777216 New desired splits: 1 Total 
> length: 12061817 Original splits: 52
> 2016-06-01 13:48:08,354 [INFO] [InputInitializer {Map 2} #0] 
> |split.TezMapredSplitsGrouper|: Desired numSplits: 1 lengthPerGroup: 12061817 
> numLocations: 1 numSplitsPerLocation: 52 numSplitsInGroup: 52 totalLength: 
> 12061817 numOriginalSplits: 52 . Grouping by length: true count: false
> 2016-06-01 13:48:08,354 [INFO] [InputInitializer {Map 2} #0] 
> |split.TezMapredSplitsGrouper|: Number of splits desired: 1 created: 1 
> splitsProcessed: 52
> {noformat}
> Alternate options:
> ==
> 1. Force Hadoop to provide bogus locations for S3. But it is not clear whether 
> that would be accepted anytime soon. Ref: HADOOP-12878
> 2. Set {{tez.grouping.min-size}} to a very low value. But should the end 
> user always have to do this on a query-by-query basis?
> 3. When {{(lengthPerGroup < "tez.grouping.min-size")}}, recompute 
> desiredNumSplits only when the number of distinct locations in the splits is > 1. 
> This would force more splits to be generated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-3291) Optimize splits grouping when locality information is not available

2016-06-12 Thread Gopal V (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15326714#comment-15326714
 ] 

Gopal V commented on TEZ-3291:
--

The numDistinctLocations worries me since this impl leaks into HDFS runs as 
well.

S3 and WASB return "localhost" for the hostnames (causing much damage with YARN 
container allocation), unlike other implementations, which either provide actual 
locality information or, when providing a dummy entry, use the actual "127.0.0.1" 
IP address instead of a hostname.

The text entry of "localhost" could be special-cased, so that this change 
cannot impact HDFS installs.
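
A sketch of that special-casing (illustrative; the actual patch may differ):
{noformat}
import java.io.IOException;
import org.apache.hadoop.mapred.InputSplit;

class FakeLocalitySketch {
  // Treat locality as "fake" only when every split reports exactly "localhost".
  static boolean allSplitsOnLocalhostOnly(InputSplit[] splits) throws IOException {
    for (InputSplit split : splits) {
      String[] locations = split.getLocations();
      if (locations == null || locations.length != 1
          || !"localhost".equals(locations[0])) {
        return false;   // real hostnames (or 127.0.0.1): leave HDFS behavior alone
      }
    }
    return splits.length > 0;
  }
}
{noformat}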

> Optimize splits grouping when locality information is not available
> ---
>
> Key: TEZ-3291
> URL: https://issues.apache.org/jira/browse/TEZ-3291
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Priority: Minor
> Attachments: TEZ-3291.2.patch, TEZ-3291.3.patch, TEZ-3291.WIP.patch
>
>
> There are scenarios where splits might not contain location details. S3 
> is an example, where all splits would have "localhost" as the location. 
> In such cases, the current split computation does not go through the 
> rack-local and allow-small-groups optimizations and ends up creating a 
> small number of splits. Depending on the cluster, this can end up creating 
> long-running map jobs.
> Example with Hive:
> ==
> 1. The inventory table in the TPC-DS dataset is partitioned and is relatively 
> small.
> 2. With query-22, Hive sends a grouping request with an original split count 
> of 52, and the overall length of the splits is around 12061817 bytes. 
> {{tez.grouping.min-size}} was set to 16 MB.
> 3. In Tez splits grouping, this ends up creating a single split with 52+ 
> files to be processed in that split.  In clusters with split locations, this 
> would have ended up with multiple splits since {{allowSmallGroups}} would 
> have kicked in.
> But in S3, since everything has "localhost", all splits get added to a 
> single group. This makes things a lot worse.
> 4. Depending on the dataset and the format, this can be problematic. For 
> instance, file open calls and random seeks can be expensive in S3.
> 5. In this case, 52 files have to be opened and processed by a single task in 
> sequential fashion. Had they been processed by multiple tasks, the response 
> time would have been drastically lower.
> E.g. log details:
> {noformat}
> 2016-06-01 13:48:08,353 [INFO] [InputInitializer {Map 2} #0] 
> |split.TezMapredSplitsGrouper|: Grouping splits in Tez
> 2016-06-01 13:48:08,353 [INFO] [InputInitializer {Map 2} #0] 
> |split.TezMapredSplitsGrouper|: Desired splits: 110 too large.  Desired 
> splitLength: 109652 Min splitLength: 16777216 New desired splits: 1 Total 
> length: 12061817 Original splits: 52
> 2016-06-01 13:48:08,354 [INFO] [InputInitializer {Map 2} #0] 
> |split.TezMapredSplitsGrouper|: Desired numSplits: 1 lengthPerGroup: 12061817 
> numLocations: 1 numSplitsPerLocation: 52 numSplitsInGroup: 52 totalLength: 
> 12061817 numOriginalSplits: 52 . Grouping by length: true count: false
> 2016-06-01 13:48:08,354 [INFO] [InputInitializer {Map 2} #0] 
> |split.TezMapredSplitsGrouper|: Number of splits desired: 1 created: 1 
> splitsProcessed: 52
> {noformat}
> Alternate options:
> ==
> 1. Force Hadoop to provide bogus locations for S3. But it is not clear whether 
> that would be accepted anytime soon. Ref: HADOOP-12878
> 2. Set {{tez.grouping.min-size}} to a very low value. But should the end 
> user always have to do this on a query-by-query basis?
> 3. When {{(lengthPerGroup < "tez.grouping.min-size")}}, recompute 
> desiredNumSplits only when the number of distinct locations in the splits is > 1. 
> This would force more splits to be generated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TEZ-3291) Optimize splits grouping when locality information is not available

2016-06-12 Thread Rajesh Balamohan (JIRA)

[ 
https://issues.apache.org/jira/browse/TEZ-3291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15326302#comment-15326302
 ] 

Rajesh Balamohan commented on TEZ-3291:
---

It is ready for review, [~bikassaha]. I haven't renamed the patch.

> Optimize splits grouping when locality information is not available
> ---
>
> Key: TEZ-3291
> URL: https://issues.apache.org/jira/browse/TEZ-3291
> Project: Apache Tez
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Priority: Minor
> Attachments: TEZ-3291.WIP.patch
>
>
> There are scenarios where splits might not contain location details. S3 
> is an example, where all splits would have "localhost" as the location. 
> In such cases, the current split computation does not go through the 
> rack-local and allow-small-groups optimizations and ends up creating a 
> small number of splits. Depending on the cluster, this can end up creating 
> long-running map jobs.
> Example with Hive:
> ==
> 1. The inventory table in the TPC-DS dataset is partitioned and is relatively 
> small.
> 2. With query-22, Hive sends a grouping request with an original split count 
> of 52, and the overall length of the splits is around 12061817 bytes. 
> {{tez.grouping.min-size}} was set to 16 MB.
> 3. In Tez splits grouping, this ends up creating a single split with 52+ 
> files to be processed in that split.  In clusters with split locations, this 
> would have ended up with multiple splits since {{allowSmallGroups}} would 
> have kicked in.
> But in S3, since everything has "localhost", all splits get added to a 
> single group. This makes things a lot worse.
> 4. Depending on the dataset and the format, this can be problematic. For 
> instance, file open calls and random seeks can be expensive in S3.
> 5. In this case, 52 files have to be opened and processed by a single task in 
> sequential fashion. Had they been processed by multiple tasks, the response 
> time would have been drastically lower.
> E.g. log details:
> {noformat}
> 2016-06-01 13:48:08,353 [INFO] [InputInitializer {Map 2} #0] 
> |split.TezMapredSplitsGrouper|: Grouping splits in Tez
> 2016-06-01 13:48:08,353 [INFO] [InputInitializer {Map 2} #0] 
> |split.TezMapredSplitsGrouper|: Desired splits: 110 too large.  Desired 
> splitLength: 109652 Min splitLength: 16777216 New desired splits: 1 Total 
> length: 12061817 Original splits: 52
> 2016-06-01 13:48:08,354 [INFO] [InputInitializer {Map 2} #0] 
> |split.TezMapredSplitsGrouper|: Desired numSplits: 1 lengthPerGroup: 12061817 
> numLocations: 1 numSplitsPerLocation: 52 numSplitsInGroup: 52 totalLength: 
> 12061817 numOriginalSplits: 52 . Grouping by length: true count: false
> 2016-06-01 13:48:08,354 [INFO] [InputInitializer {Map 2} #0] 
> |split.TezMapredSplitsGrouper|: Number of splits desired: 1 created: 1 
> splitsProcessed: 52
> {noformat}
> Alternate options:
> ==
> 1. Force Hadoop to provide bogus locations for S3. But it is not clear whether 
> that would be accepted anytime soon. Ref: HADOOP-12878
> 2. Set {{tez.grouping.min-size}} to a very low value. But should the end 
> user always have to do this on a query-by-query basis?
> 3. When {{(lengthPerGroup < "tez.grouping.min-size")}}, recompute 
> desiredNumSplits only when the number of distinct locations in the splits is > 1. 
> This would force more splits to be generated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)