[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job

2017-10-06 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16195123#comment-16195123
 ] 

Hudson commented on HBASE-12590:


Results for branch HBASE-18467, done in 4 hr 24 min and counting
[build #136 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-18467/136/]: 
FAILURE

details (if available):

(x) *{color:red}-1 overall{color}*
Committer, please check your recent inclusion of a patch for this issue.

(x) {color:red}-1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-18467/136//General_Nightly_Build_Report/]

(/) {color:green}+1 jdk8 checks{color}
-- For more information [see jdk8 
report|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-18467/136//JDK8_Nightly_Build_Report/]


(x) {color:red}-1 source release artifact{color}
-- See build output for details.



> A solution for data skew in HBase-Mapreduce Job
> ---
>
> Key: HBASE-12590
> URL: https://issues.apache.org/jira/browse/HBASE-12590
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Weichen Ye
>Assignee: Weichen Ye
> Fix For: 2.0.0
>
> Attachments: A Solution for Data Skew in HBase-MapReduce Job 
> (Version2).pdf, A Solution for Data Skew in HBase-MapReduce Job 
> (Version3).pdf, HBase-12590-v1.patch, HBase-12590-v2.patch, 
> HBASE-12590-v3.patch, HBASE-12590-v4.patch
>
>
> 1, Motivation
> In production environments, data skew is very common. An HBase table may
> contain a lot of small regions and several large regions. Small regions
> waste a lot of computing resources: if we use a job to scan a table with 3000
> small regions, we need a job with 3000 mappers. Large regions block
> the job: if one region in a 100-region table is far larger than the other 99
> regions, then when we run a job with the table as input, 99 mappers
> complete very quickly and we then wait a long time for the last mapper.
> 2, Configuration
> Add three new configuration properties:
> hbase.mapreduce.input.autobalance = true enables “auto balance” in
> HBase-MapReduce jobs. The default value is false.
> hbase.mapreduce.input.autobalance.maxskewratio = 3 (default is 3). If a
> region's size is larger than 3x the average region size, treat the region as
> “proportionately too large”.
> hbase.table.row.textkey = true means the row key is text; false means a binary
> row key. It is used to find the middle row key in a large region. The default
> value is true.
> If (region size >= average size*ratio): cut the region into two MR input
> splits.
> If (average size <= region size < average size*ratio): use the region as one
> MR input split.
> If (sum of the sizes of several contiguous regions < average size): combine
> these regions into one MR input split.
> Example:
> In attachment
> Welcome to the Review Board.
> https://reviews.apache.org/r/28494/diff/#
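
The three sizing rules in the description can be illustrated with a small, self-contained Python sketch. This is not the code from the attached patches: the names `Region` and `plan_splits` are hypothetical, region sizes are plain numbers in an arbitrary unit, and cutting a large region at its middle row key is only represented by two labelled halves. The sketch also assumes that a run of small neighbouring regions is closed as soon as adding the next region would bring the combined size up to the average, which is one possible reading of the third rule.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Region:
    """Hypothetical stand-in for an HBase region: a name and a size."""
    name: str
    size: float  # any consistent unit, e.g. MB


def plan_splits(regions: List[Region],
                max_skew_ratio: float = 3.0) -> List[Tuple[str, ...]]:
    """Group regions (in key order) into input splits using the three rules.

    Each returned tuple lists the region labels that one mapper would read.
    """
    if not regions:
        return []

    average = sum(r.size for r in regions) / len(regions)
    splits: List[Tuple[str, ...]] = []
    pending: List[Region] = []  # current run of contiguous small regions

    def flush() -> None:
        # Emit the pending run of small regions as one combined split.
        if pending:
            splits.append(tuple(r.name for r in pending))
            pending.clear()

    for region in regions:
        if region.size >= average * max_skew_ratio:
            # Rule 1: too large, cut into two splits (halves are only labelled).
            flush()
            splits.append((region.name + "[1/2]",))
            splits.append((region.name + "[2/2]",))
        elif region.size >= average:
            # Rule 2: ordinary size, one region becomes one split.
            flush()
            splits.append((region.name,))
        else:
            # Rule 3: small region, combine with its neighbours while the
            # combined size stays below the average (assumed interpretation).
            if pending and sum(r.size for r in pending) + region.size >= average:
                flush()
            pending.append(region)

    flush()
    return splits


example = [Region("a", 10), Region("b", 10), Region("c", 10),
           Region("d", 100), Region("e", 400), Region("f", 10)]
for split in plan_splits(example):
    print(split)
```

With these six regions the average size is 90 and the "too large" threshold is 270, so the sketch combines a, b and c into one split, keeps d as its own split, cuts e into two splits and leaves f alone: five mappers instead of six, with the largest region shared between two of them.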



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job

2017-10-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16190787#comment-16190787
 ] 

Hudson commented on HBASE-12590:


SUCCESS: Integrated in Jenkins build HBase-Trunk_matrix #3823 (See 
[https://builds.apache.org/job/HBase-Trunk_matrix/3823/])
HBASE-16894 Create more than 1 split per region, generalize HBASE-12590 
(apurtell: rev 16d483f9003ddee71404f37ce7694003d1a18ac4)
* (edit) 
hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.java
* (edit) 
hbase-mapreduce/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormatScanBase.java
* (edit) 
hbase-mapreduce/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormatScan1.java



[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job

2017-10-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16190768#comment-16190768
 ] 

Hudson commented on HBASE-12590:


SUCCESS: Integrated in Jenkins build HBase-1.4 #940 (See 
[https://builds.apache.org/job/HBase-1.4/940/])
HBASE-16894 Create more than 1 split per region, generalize HBASE-12590 
(apurtell: rev cbbcb2db2f0a94382cb33fef826cbf1a00b5de6e)
* (edit) 
hbase-server/src/test/java/org/apache/hadoop/hbase/namespace/TestNamespaceAuditor.java
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.java
* (edit) 
hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormatScanBase.java
* (edit) 
hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormatScan1.java



[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job

2017-10-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16190744#comment-16190744
 ] 

Hudson commented on HBASE-12590:


FAILURE: Integrated in Jenkins build HBase-1.5 #84 (See 
[https://builds.apache.org/job/HBase-1.5/84/])
HBASE-16894 Create more than 1 split per region, generalize HBASE-12590 
(apurtell: rev fc783ef04505eab7e58c6abc3ac1f7d7ecce465b)
* (edit) 
hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormatScan1.java
* (edit) 
hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormatScanBase.java
* (edit) 
hbase-server/src/test/java/org/apache/hadoop/hbase/namespace/TestNamespaceAuditor.java
* (edit) 
hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.java



[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job

2017-10-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16190661#comment-16190661
 ] 

Hudson commented on HBASE-12590:


FAILURE: Integrated in Jenkins build HBase-2.0 #622 (See 
[https://builds.apache.org/job/HBase-2.0/622/])
HBASE-16894 Create more than 1 split per region, generalize HBASE-12590 
(apurtell: rev 4475ba88c15886bd15c113f2dbd5214600686cfe)
* (edit) 
hbase-mapreduce/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormatScanBase.java
* (edit) 
hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.java
* (edit) 
hbase-mapreduce/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormatScan1.java



[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job

2015-03-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356365#comment-14356365
 ] 

Hudson commented on HBASE-12590:


FAILURE: Integrated in HBase-0.98 #890 (See 
[https://builds.apache.org/job/HBase-0.98/890/])
HBASE-13168 Backport HBASE-12590 A solution for data skew in HBase-Mapreduce 
Job (tedyu: rev 1b4f8afaec8cd4dfef46154bdceb31ce7ddf5982)
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.java
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormatScan1.java
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormatScanBase.java



[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job

2015-03-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356415#comment-14356415
 ] 

Hudson commented on HBASE-12590:


FAILURE: Integrated in HBase-0.98-on-Hadoop-1.1 #847 (See 
[https://builds.apache.org/job/HBase-0.98-on-Hadoop-1.1/847/])
HBASE-13168 Backport HBASE-12590 A solution for data skew in HBase-Mapreduce 
Job (tedyu: rev 1b4f8afaec8cd4dfef46154bdceb31ce7ddf5982)
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormatScan1.java
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormatScanBase.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.java



[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job

2015-03-10 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356236#comment-14356236
 ] 

Hudson commented on HBASE-12590:


FAILURE: Integrated in HBase-1.0 #795 (See 
[https://builds.apache.org/job/HBase-1.0/795/])
HBASE-13168 Backport HBASE-12590 A solution for data skew in HBase-Mapreduce 
Job (tedyu: rev 89112e84957558f31c161256aa2d7054f165ca02)
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.java
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormatScanBase.java
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormatScan1.java



[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job

2015-03-10 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14356251#comment-14356251
 ] 

Hudson commented on HBASE-12590:


SUCCESS: Integrated in HBase-1.1 #276 (See 
[https://builds.apache.org/job/HBase-1.1/276/])
HBASE-13168 Backport HBASE-12590 A solution for data skew in HBase-Mapreduce 
Job (tedyu: rev 05aef46d942a0196c6c655ab19a160cd7dc56789)
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormatScan1.java
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormatScanBase.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.java



[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job

2014-12-24 Thread Jonathan Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14258216#comment-14258216
 ] 

Jonathan Hsieh commented on HBASE-12590:


Nice catches. It would be nice to port a correct algorithm over into these 
places.


[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job

2014-12-24 Thread Jonathan Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14258224#comment-14258224
 ] 

Jonathan Hsieh commented on HBASE-12590:


Thanks [~yeweichen]!


[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job

2014-12-24 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14258282#comment-14258282
 ] 

Hudson commented on HBASE-12590:


SUCCESS: Integrated in HBase-TRUNK #5965 (See 
[https://builds.apache.org/job/HBase-TRUNK/5965/])
HBASE-12590 A solution for data skew in HBase-Mapreduce jobs (Weichen Ye) 
(jmhsieh: rev a912a56b38fca6aada68dab5ef73613c073cbc6a)
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormatScan1.java
* 
hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableInputFormatScanBase.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.java




[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job

2014-12-24 Thread Weichen Ye (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258610#comment-14258610
 ] 

Weichen Ye commented on HBASE-12590:


Thank you [~jmhsieh] for your help and comments! I'll continue working on 
HBASE-12716.

And Merry Christmas:)



[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job

2014-12-18 Thread Weichen Ye (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14251686#comment-14251686
 ] 

Weichen Ye commented on HBASE-12590:


[~j...@cloudera.com]
Hi~
I tried the algorithm in RegionSplitter, but I found a small bug. 
If the start key is the same length as the end key, and their last bytes are 
adjacent in alphabetical order, the algorithm does not calculate a split 
point with an additional byte.

This split algorithm is not closely related to the data skew in HBase-MapReduce 
jobs, so I created two new issues for it:
https://issues.apache.org/jira/browse/HBASE-12716
https://issues.apache.org/jira/browse/HBASE-12717
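
The situation described in this comment can be illustrated with a small standalone sketch. It is neither the RegionSplitter code nor the fix tracked in the two linked issues. The class MidKeySketch and the extraBytes parameter are hypothetical, and the sketch assumes keys are compared as unsigned big-endian byte strings. It shows that when two equal-length keys differ only by adjacent last bytes, a midpoint computed at the same width collapses onto the start key, while one additional byte yields a key strictly between the two.

```java
import java.math.BigInteger;
import java.util.Arrays;

// Hypothetical illustration; not the RegionSplitter implementation.
class MidKeySketch {

    // Midpoint of two keys read as unsigned big-endian numbers, after
    // right-padding both with zero bytes to a common width plus extraBytes.
    static byte[] midKey(byte[] start, byte[] end, int extraBytes) {
        int width = Math.max(start.length, end.length) + extraBytes;
        BigInteger lo = new BigInteger(1, Arrays.copyOf(start, width));
        BigInteger hi = new BigInteger(1, Arrays.copyOf(end, width));
        BigInteger mid = lo.add(hi).shiftRight(1);

        // toByteArray() may add a leading sign byte or drop leading zeros,
        // so copy the low-order bytes into a fixed-width result.
        byte[] raw = mid.toByteArray();
        byte[] out = new byte[width];
        int n = Math.min(raw.length, width);
        System.arraycopy(raw, raw.length - n, out, width - n, n);
        return out;
    }

    public static void main(String[] args) {
        byte[] start = {'a', 'a'};
        byte[] end = {'a', 'b'};
        // Same width: the midpoint rounds down to the start key itself.
        System.out.println(Arrays.toString(midKey(start, end, 0))); // [97, 97]
        // One additional byte: a key strictly between "aa" and "ab".
        System.out.println(Arrays.toString(midKey(start, end, 1))); // [97, 97, -128]
    }
}
```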




[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job

2014-12-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249605#comment-14249605
 ] 

Hadoop QA commented on HBASE-12590:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12687680/HBASE-12590-v4.patch
  against master branch at commit 99a11390b4758c211af04af2ca0696ac6e3e0aeb.
  ATTACHMENT ID: 12687680

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 6 new 
or modified tests.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:red}-1 checkstyle{color}.  The applied patch generated 
2086 checkstyle errors (more than the master's current 2084 errors).

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

  {color:green}+1 site{color}.  The mvn site goal succeeds with this patch.

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
   
org.apache.hadoop.hbase.regionserver.TestPerColumnFamilyFlush

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12105//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12105//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12105//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12105//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12105//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12105//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12105//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12105//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12105//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12105//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12105//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12105//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12105//artifact/patchprocess/checkstyle-aggregate.html

Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12105//console

This message is automatically generated.


[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job

2014-12-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249751#comment-14249751
 ] 

Hadoop QA commented on HBASE-12590:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12687706/HBASE-12590-v4.patch
  against master branch at commit 99a11390b4758c211af04af2ca0696ac6e3e0aeb.
  ATTACHMENT ID: 12687706

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 6 new 
or modified tests.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:red}-1 checkstyle{color}.  The applied patch generated 
2086 checkstyle errors (more than the master's current 2084 errors).

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

  {color:green}+1 site{color}.  The mvn site goal succeeds with this patch.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12107//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12107//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12107//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12107//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12107//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12107//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12107//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12107//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12107//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12107//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12107//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12107//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12107//artifact/patchprocess/checkstyle-aggregate.html

Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12107//console

This message is automatically generated.


[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job

2014-12-17 Thread Jonathan Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14250659#comment-14250659
 ] 

Jonathan Hsieh commented on HBASE-12590:


FYI, while working in other code I found this, which handles the uniform region 
split case. It might make sense to fold the ASCII splitter into that form 
and use this existing, long-tested code path.

https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/util/RegionSplitter.java#L1032



[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job

2014-12-16 Thread Weichen Ye (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14248199#comment-14248199
 ] 

Weichen Ye commented on HBASE-12590:


Latest diff on review board: https://reviews.apache.org/r/28494/diff/


 A solution for data skew in HBase-Mapreduce Job
 ---

 Key: HBASE-12590
 URL: https://issues.apache.org/jira/browse/HBASE-12590
 Project: HBase
  Issue Type: Improvement
  Components: mapreduce
Reporter: Weichen Ye
 Attachments: A Solution for Data Skew in HBase-MapReduce Job 
 (Version2).pdf, A Solution for Data Skew in HBase-MapReduce Job 
 (Version3).pdf, HBASE-12590-v3.patch, HBase-12590-v1.patch, 
 HBase-12590-v2.patch


 1, Motivation
 In production environments, data skew is very common. An HBase table 
 often contains many small regions and several large regions. Small 
 regions waste a lot of computing resources. If we use a job to scan a table 
 with 3000 small regions, we need a job with 3000 mappers. Large regions 
 always block the job. If, in a 100-region table, one region is far larger than 
 the other 99 regions, then when we run a job with the table as input, 99 mappers 
 will complete very quickly, and we have to wait a long time for the last 
 mapper.
 2, Configuration
 Add two new configuration properties:
 hbase.mapreduce.split.autobalance = true enables “auto balance” in 
 HBase-MapReduce jobs. The default value is false. 
 hbase.mapreduce.split.targetsize = 1073741824 (default 1GB). The target size 
 of MapReduce splits. 
 If a region is larger than the target size, cut the region into two 
 splits. If the combined size of several small contiguous regions is less than 
 the target size, combine these regions into one split.
 Example:
 See the attachment.
 Reviews are welcome on Review Board:
 https://reviews.apache.org/r/28494/diff/#





[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job

2014-12-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14248276#comment-14248276
 ] 

Hadoop QA commented on HBASE-12590:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12687480/HBASE-12590-v3.patch
  against master branch at commit 96c6b9815ddbc9f2589655df4ad2381af04ac9f8.
  ATTACHMENT ID: 12687480

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 6 new 
or modified tests.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:red}-1 checkstyle{color}.  The applied patch generated 
2091 checkstyle errors (more than the master's current 2089 errors).

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

  {color:green}+1 site{color}.  The mvn site goal succeeds with this patch.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12096//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12096//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12096//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12096//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12096//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12096//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12096//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12096//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12096//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12096//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12096//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12096//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12096//artifact/patchprocess/checkstyle-aggregate.html

Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/12096//console

This message is automatically generated.


[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job

2014-12-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232972#comment-14232972
 ] 

Hadoop QA commented on HBASE-12590:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12684872/HBase-12590-v2.patch
  against master branch at commit 13a1eaec09a467153adc1ee0b46df9f457da6115.
  ATTACHMENT ID: 12684872

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 6 new 
or modified tests.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 checkstyle{color}.  The applied patch does not increase the 
total number of checkstyle errors

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

  {color:green}+1 site{color}.  The mvn site goal succeeds with this patch.

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
 

 {color:red}-1 core zombie tests{color}.  There are 1 zombie test(s):   
at 
org.apache.hadoop.hbase.master.balancer.TestDefaultLoadBalancer.testBalanceCluster(TestDefaultLoadBalancer.java:119)

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11927//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11927//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11927//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11927//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11927//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11927//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11927//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11927//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11927//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11927//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11927//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11927//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11927//artifact/patchprocess/checkstyle-aggregate.html

  Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11927//console

This message is automatically generated.


[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job

2014-12-03 Thread Jonathan Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14233278#comment-14233278
 ] 

Jonathan Hsieh commented on HBASE-12590:


{quote}
2) This is a difficult issue in this patch. It is hard (at least for me) to 
split a large region into several small MR input splits of the target size (we 
have only the start rowkey, the end rowkey, and the region size). So my 
approach is just to find a mid rowkey between the start rowkey and the end 
rowkey. Do you have any ideas about this? For instance, if we split a 5GB 
region into five 1GB MR input splits, how do we find the split points 
(rowkeys) that make each of these MR input splits equal to 1GB?
{quote}

Internally, the region split operation tries to read the cell closest to the 
midpoint of the HFiles and doesn't make rowkey distribution assumptions 
[1,2,3]. These values, however, are not exposed for the MR input format to 
use. The v1 and v2 patches here calculate a split point assuming an 
ASCII-centric, uniform distribution of rowkeys in the input split. You should 
at least note that in the docs. Since you are generating the split point based 
on the uniform distribution assumption, you can probably calculate more split 
points relatively easily.

Thanks for posting on Review Board; I've added more comments there.

[1] 
https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java#L6023
[2] 
https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RegionSplitPolicy.java#L67
[3] 
https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java#L670
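
To illustrate the suggestion above about calculating more split points, here 
is a minimal Java sketch under the same uniform-distribution assumption. It is 
not code from the v1 or v2 patch: the class UniformSplitPoints and its methods 
are hypothetical, rowkeys are treated as unsigned big-endian integers (shorter 
keys are right-padded with zero bytes), and both keys are assumed to be 
non-empty, so the open-ended first and last regions of a table would need 
separate handling.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration; not part of the patch or of HBase. */
class UniformSplitPoints {

    /**
     * Returns n - 1 keys that cut [start, end) into n ranges of equal width,
     * treating keys as unsigned big-endian integers (uniform-distribution
     * assumption).
     */
    static List<byte[]> splitPoints(byte[] start, byte[] end, int n) {
        int len = Math.max(start.length, end.length);
        // Arrays.copyOf right-pads the shorter key with zero bytes.
        BigInteger lo = new BigInteger(1, Arrays.copyOf(start, len));
        BigInteger hi = new BigInteger(1, Arrays.copyOf(end, len));
        BigInteger width = hi.subtract(lo);
        List<byte[]> points = new ArrayList<>();
        for (int i = 1; i < n; i++) {
            BigInteger offset = width.multiply(BigInteger.valueOf(i))
                                     .divide(BigInteger.valueOf(n));
            points.add(toFixedLength(lo.add(offset), len));
        }
        return points;
    }

    /** Converts a non-negative value back into exactly len bytes. */
    static byte[] toFixedLength(BigInteger value, int len) {
        byte[] raw = value.toByteArray(); // may carry a leading sign byte or be shorter
        byte[] out = new byte[len];
        int copy = Math.min(raw.length, len);
        System.arraycopy(raw, raw.length - copy, out, len - copy, copy);
        return out;
    }
}
```

For a start key of 0x00 and an end key of 0x50, asking for five ranges yields 
the split points 0x10, 0x20, 0x30, and 0x40. If the real rowkeys are not 
uniformly distributed, the resulting input splits will not be equal in size.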

 A solution for data skew in HBase-Mapreduce Job
 ---

 Key: HBASE-12590
 URL: https://issues.apache.org/jira/browse/HBASE-12590
 Project: HBase
  Issue Type: Improvement
  Components: mapreduce
Reporter: Weichen Ye
 Attachments: A Solution for Data Skew in HBase-MapReduce Job 
 (Version2).pdf, HBase-12590-v1.patch, HBase-12590-v2.patch


 1, Motivation
 In production environments, data skew is very common. An HBase table often 
 contains many small regions and a few large ones. Small regions waste 
 computing resources: a job that scans a table with 3000 small regions needs 
 3000 mappers. Large regions hold up the job: if one region in a 100-region 
 table is far larger than the other 99, then when we run a job with the table 
 as input, 99 mappers complete very quickly and we wait a long time for the 
 last one.
 2, Configuration
 Add two new configuration properties. 
 hbase.mapreduce.split.autobalance = true enables “auto balance” in 
 HBase-MapReduce jobs. The default value is false. 
 hbase.mapreduce.split.targetsize = 1073741824 (default 1GB) is the target 
 size of MapReduce input splits. 
 If a region is larger than the target size, cut the region into two 
 splits. If the combined size of several small contiguous regions is less than 
 the target size, combine these regions into one split.
 Example:
 See the attachment.
 Comments are welcome on Review Board:
 https://reviews.apache.org/r/28494/diff/#
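
The split-and-combine rule described under Configuration can be sketched 
roughly as follows. This is an illustration only, not the code in the attached 
patches: SplitPlanner and plan are hypothetical names, region sizes are passed 
in as plain numbers, and a large region is assumed to be cut into two halves 
of about equal size.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical planner illustrating the rule above; not the patch's code. */
class SplitPlanner {

    /**
     * Returns the estimated size of each MR input split, in region order.
     * A region larger than targetSize becomes two halves; runs of adjacent
     * smaller regions are merged while their total stays below targetSize.
     */
    static List<Long> plan(long[] regionSizes, long targetSize) {
        List<Long> splits = new ArrayList<>();
        long pending = 0;           // combined size of the current run of small regions
        boolean hasPending = false; // whether a run of small regions is open
        for (long size : regionSizes) {
            if (size > targetSize) {
                // Close any open run, then cut the large region in two.
                if (hasPending) {
                    splits.add(pending);
                    pending = 0;
                    hasPending = false;
                }
                splits.add(size / 2);
                splits.add(size - size / 2);
            } else if (hasPending && pending + size < targetSize) {
                // Still below the target: keep combining.
                pending += size;
            } else {
                // Start a new run with this region.
                if (hasPending) {
                    splits.add(pending);
                }
                pending = size;
                hasPending = true;
            }
        }
        if (hasPending) {
            splits.add(pending);
        }
        return splits;
    }
}
```

With a target size of 10 and region sizes 2, 3, 4, 25, 1, 1, 12, 9, 3, this 
sketch plans splits of sizes 9, 12, 13, 2, 6, 6, 9, 3.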



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job

2014-12-03 Thread Weichen Ye (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14233780#comment-14233780
 ] 

Weichen Ye commented on HBASE-12590:


[~jmhsieh] Thank you for your review! It really helps me a lot! I'll continue 
improving the patch based on your comments.



[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job

2014-12-01 Thread Jonathan Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230024#comment-14230024
 ] 

Jonathan Hsieh commented on HBASE-12590:


Nice description of the problem in the slide deck.  I did a quick scan of the 
docs and the code and had a few questions.

1) The word "split" is ambiguous.  The javadoc needs to make clear that this 
is only an MR input split and not an HBase region split operation, which would 
trigger a lot of I/O.
2) Why do we only split by 2? Why not split further so that we have n MR input 
splits of 1GB each (in your example) instead of 2x 3GB, 2x 2.5GB, and 2x 1GB 
artificial MR input splits?
3) To make this easier for users, do you think it might make sense to use 
something other than a constant size (which assumes the user knows the 
server-side region size property)? Can we look at all of the region sizes (we 
have the info already from the RegionSizeCalculator) and just add new MR 
input splits for the regions that are proportionately too large?  Maybe the 
setting could be a ratio (maybe 5x-10x) over the median region size?  That way 
the job won't have to change if the server-side setting changes.
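
The ratio idea in point 3 might look something like the sketch below. 
Everything here is an assumption made for illustration: the class 
SkewDetector, its method names, and the choice to cut an oversized region into 
just enough pieces that each is at most ratio times the median. In a real job 
the sizes would presumably come from the RegionSizeCalculator mentioned above.

```java
import java.util.Arrays;

/** Hypothetical sketch of the ratio idea; names and policy are assumptions. */
class SkewDetector {

    /** Median region size (mean of the two middle values for even counts). */
    static long median(long[] sizes) {
        if (sizes.length == 0) {
            return 0;
        }
        long[] sorted = sizes.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
            ? sorted[mid]
            : (sorted[mid - 1] + sorted[mid]) / 2;
    }

    /**
     * Number of MR input splits to create for one region: 1 unless the region
     * is larger than ratio * median, otherwise enough pieces that each piece
     * is at most ratio * median.
     */
    static int splitsFor(long regionSize, long median, double ratio) {
        double limit = ratio * median;
        if (median <= 0 || regionSize <= limit) {
            return 1;
        }
        return (int) Math.ceil(regionSize / limit);
    }
}
```

For region sizes 1, 1, 1, 2, 12 the median is 1; with a ratio of 5, the region 
of size 12 would get 3 input splits and every other region would get 1. 
Because the threshold is derived from the table itself, the job configuration 
would not need to change when the server-side region size setting changes.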

 A solution for data skew in HBase-Mapreduce Job
 ---

 Key: HBASE-12590
 URL: https://issues.apache.org/jira/browse/HBASE-12590
 Project: HBase
  Issue Type: Improvement
  Components: mapreduce
Reporter: Weichen Ye
 Attachments: A Solution for Data Skew in HBase-MapReduce Job.pdf, 
 HBase-12590-v1.patch




[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job

2014-12-01 Thread Weichen Ye (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230866#comment-14230866
 ] 

Weichen Ye commented on HBASE-12590:


[~jmhsieh] Thank you for your review and your advice! 

1) The word "split" may be confusing or misleading here. I'll change the code 
and docs accordingly.

2) This is a difficult issue in this patch. It is hard (at least for me) to 
split a large region into several small MR input splits of the target size (we 
have only the start rowkey, the end rowkey, and the region size). So my 
approach is just to find a mid rowkey between the start rowkey and the end 
rowkey. Do you have any ideas about this? For instance, if we split a 5GB 
region into five 1GB MR input splits, how do we find the split points 
(rowkeys) that make each of these MR input splits equal to 1GB?

3) You gave me a great idea! I totally agree with setting a ratio rather than 
a constant size in the configuration. This week I'll make a new patch that 
takes this approach.




[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job

2014-11-26 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227150#comment-14227150
 ] 

Hadoop QA commented on HBASE-12590:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12683981/HBase-12590-v1.patch
  against master branch at commit f0d95e7f11403d67b4fc3f1fd4ef048047b6842a.
  ATTACHMENT ID: 12683981

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 6 new 
or modified tests.

{color:red}-1 javac{color}.  The patch appears to cause mvn compile goal to 
fail.

Compilation errors resume:
[ERROR] COMPILATION ERROR : 
[ERROR] 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.java:[45,48]
 cannot find symbol
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-compiler-plugin:3.2:compile (default-compile) on 
project hbase-server: Compilation failure
[ERROR] 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.java:[45,48]
 cannot find symbol
[ERROR] symbol:   class HLog
[ERROR] location: package org.apache.hadoop.hbase.regionserver.wal
[ERROR] - [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please 
read the following articles:
[ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR] 
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn goals -rf :hbase-server


Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11848//console

This message is automatically generated.

 A solution for data skew in HBase-Mapreduce Job 
 

 Key: HBASE-12590
 URL: https://issues.apache.org/jira/browse/HBASE-12590
 Project: HBase
  Issue Type: Improvement
  Components: mapreduce
Affects Versions: 2.0.0
Reporter: Weichen Ye
 Attachments: A Solution for Data Skew in HBase-MapReduce Job.pdf, 
 HBase-12590-v1.patch




[jira] [Commented] (HBASE-12590) A solution for data skew in HBase-Mapreduce Job

2014-11-26 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227259#comment-14227259
 ] 

Hadoop QA commented on HBASE-12590:
---

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12683988/HBase-12590-v1.patch
  against master branch at commit f0d95e7f11403d67b4fc3f1fd4ef048047b6842a.
  ATTACHMENT ID: 12683988

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 6 new 
or modified tests.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 checkstyle{color}.  The applied patch does not increase the 
total number of checkstyle errors

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

  {color:green}+1 site{color}.  The mvn site goal succeeds with this patch.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11849//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11849//artifact/patchprocess/newPatchFindbugsWarningshbase-client.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11849//artifact/patchprocess/newPatchFindbugsWarningshbase-annotations.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11849//artifact/patchprocess/newPatchFindbugsWarningshbase-thrift.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11849//artifact/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11849//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11849//artifact/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11849//artifact/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11849//artifact/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11849//artifact/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11849//artifact/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11849//artifact/patchprocess/newPatchFindbugsWarningshbase-rest.html
Checkstyle Errors: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11849//artifact/patchprocess/checkstyle-aggregate.html

  Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/11849//console

This message is automatically generated.

 A solution for data skew in HBase-Mapreduce Job 
 

 Key: HBASE-12590
 URL: https://issues.apache.org/jira/browse/HBASE-12590
 Project: HBase
  Issue Type: Improvement
  Components: mapreduce
Reporter: Weichen Ye
 Attachments: A Solution for Data Skew in HBase-MapReduce Job.pdf, 
 HBase-12590-v1.patch


 1, Motivation
 In production environments, data skew is very common. An HBase table often 
 contains many small regions and a few large ones. Small regions waste 
 computing resources: a job that scans a table with 3000 small regions needs 
 3000 mappers. Large regions hold up the job: if one region in a 100-region 
 table is far larger than the other 99, then when we run a job with the table 
 as input, 99 mappers complete very quickly and we wait a long time for the 
 last one.
 2, Configuration
 Add two new configuration properties. 
 hbase.mapreduce.split.autobalance = true enables “auto balance” in 
 HBase-MapReduce jobs. The default value is false. 
 hbase.mapreduce.split.targetsize = 1073741824 (default 1GB) is the target 
 size of MapReduce input splits. 
 If a region is larger than the target size, cut the region into two 
 splits. If the combined size of several small contiguous regions is less