[jira] [Commented] (HBASE-15171) Avoid counting duplicate kv and generating lots of small hfiles in PutSortReducer

2016-02-10 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15141148#comment-15141148
 ] 

Hudson commented on HBASE-15171:


FAILURE: Integrated in HBase-0.98-matrix #295 (See [https://builds.apache.org/job/HBase-0.98-matrix/295/])
HBASE-15171 Avoid counting duplicate kv and generating lots of small (apurtell: rev 38cd179bb540f0d38c5810a17097c5727947ca73)
* hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/PutSortReducer.java
HBASE-15171 Addendum removes extra loop (Yu Li) (apurtell: rev de149d0bc4eda960e7246c79a1ad85c9cbe50de0)
* hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/PutSortReducer.java


> Avoid counting duplicate kv and generating lots of small hfiles in 
> PutSortReducer
> -
>
> Key: HBASE-15171
> URL: https://issues.apache.org/jira/browse/HBASE-15171
> Project: HBase
> Issue Type: Sub-task
> Affects Versions: 2.0.0, 1.1.2, 0.98.17
> Reporter: Yu Li
> Assignee: Yu Li
> Fix For: 2.0.0, 1.3.0, 0.98.18
>
> Attachments: HBASE-15171.addendum.patch, HBASE-15171.patch,
> HBASE-15171.patch, HBASE-15171.patch
>
>
> One of our online users once wrote a huge number of duplicated kvs during
> bulkload, and we found that it generated lots of small hfiles and slowed
> down the whole process.
> After debugging, we found that PutSortReducer#reduce already tries to handle
> the pathological case by setting a threshold for single-row size and using a
> TreeSet to avoid writing out duplicated kvs, but it forgot to exclude
> duplicated kvs from the accumulated size, as shown in the code segment below:
> {code}
> while (iter.hasNext() && curSize < threshold) {
>   Put p = iter.next();
>   for (List<Cell> cells: p.getFamilyCellMap().values()) {
>     for (Cell cell: cells) {
>       KeyValue kv = KeyValueUtil.ensureKeyValue(cell);
>       map.add(kv);
>       // counted even when kv is a duplicate that map.add() ignored
>       curSize += kv.heapSize();
>     }
>   }
> }
> {code}
> We should move the {{curSize += kv.heapSize();}} line out of the outer for
> loop.
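
To make the effect described above concrete, here is a minimal stand-alone sketch, not the actual PutSortReducer code. It assumes the reducer writes out one batch each time the inner loop exits; String stands in for KeyValue, length() for heapSize(), and the class and method names (FlushCountSketch, batches) are hypothetical.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.TreeSet;

// Hypothetical stand-alone sketch: String stands in for KeyValue and
// String.length() for KeyValue.heapSize().
final class FlushCountSketch {

    // Returns how many batches would be written for the given kvs.
    // countDuplicates = true mimics the reported behaviour, where the size of
    // every kv is accumulated even if the sorted set already contains it.
    static int batches(List<String> kvs, long threshold, boolean countDuplicates) {
        int batches = 0;
        Iterator<String> iter = kvs.iterator();
        while (iter.hasNext()) {
            TreeSet<String> map = new TreeSet<>();
            long curSize = 0;
            while (iter.hasNext() && curSize < threshold) {
                String kv = iter.next();
                boolean added = map.add(kv);
                if (added || countDuplicates) {
                    curSize += kv.length();
                }
            }
            batches++; // assumption: the reducer writes out the sorted set here
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> dup = Collections.nCopies(1000, "row1/cf:a");
        System.out.println("counting duplicates: " + batches(dup, 90, true));
        System.out.println("ignoring duplicates: " + batches(dup, 90, false));
    }
}
```

With 1000 copies of the same kv, counting duplicates reaches the threshold again and again even though each batch holds a single unique kv, which is the "lots of small hfiles" pattern the report describes.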



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-15171) Avoid counting duplicate kv and generating lots of small hfiles in PutSortReducer

2016-02-10 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15141224#comment-15141224
 ] 

Hudson commented on HBASE-15171:


FAILURE: Integrated in HBase-0.98-on-Hadoop-1.1 #1169 (See [https://builds.apache.org/job/HBase-0.98-on-Hadoop-1.1/1169/])
HBASE-15171 Avoid counting duplicate kv and generating lots of small (apurtell: rev 38cd179bb540f0d38c5810a17097c5727947ca73)
* hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/PutSortReducer.java
HBASE-15171 Addendum removes extra loop (Yu Li) (apurtell: rev de149d0bc4eda960e7246c79a1ad85c9cbe50de0)
* hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/PutSortReducer.java




[jira] [Commented] (HBASE-15171) Avoid counting duplicate kv and generating lots of small hfiles in PutSortReducer

2016-01-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15122138#comment-15122138
 ] 

Hudson commented on HBASE-15171:


SUCCESS: Integrated in HBase-1.3-IT #466 (See [https://builds.apache.org/job/HBase-1.3-IT/466/])
HBASE-15171 Addendum removes extra loop (Yu Li) (tedyu: rev dfa94841374f78422d4e44a5623cc8b601966b1d)
* hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/PutSortReducer.java


> Avoid counting duplicate kv and generating lots of small hfiles in 
> PutSortReducer
> -
>
> Key: HBASE-15171
> URL: https://issues.apache.org/jira/browse/HBASE-15171
> Project: HBase
> Issue Type: Sub-task
> Affects Versions: 2.0.0, 1.1.2, 0.98.17
> Reporter: Yu Li
> Assignee: Yu Li
> Fix For: 2.0.0, 1.3.0
>
> Attachments: HBASE-15171.addendum.patch, HBASE-15171.patch,
> HBASE-15171.patch, HBASE-15171.patch


[jira] [Commented] (HBASE-15171) Avoid counting duplicate kv and generating lots of small hfiles in PutSortReducer

2016-01-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15121958#comment-15121958
 ] 

Hudson commented on HBASE-15171:


SUCCESS: Integrated in HBase-1.3 #519 (See [https://builds.apache.org/job/HBase-1.3/519/])
HBASE-15171 Addendum removes extra loop (Yu Li) (tedyu: rev dfa94841374f78422d4e44a5623cc8b601966b1d)
* hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/PutSortReducer.java




[jira] [Commented] (HBASE-15171) Avoid counting duplicate kv and generating lots of small hfiles in PutSortReducer

2016-01-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15122196#comment-15122196
 ] 

Hudson commented on HBASE-15171:


FAILURE: Integrated in HBase-Trunk_matrix #665 (See [https://builds.apache.org/job/HBase-Trunk_matrix/665/])
HBASE-15171 Addendum removes extra loop (Yu Li) (tedyu: rev 37ed0f6d0815389e0b368bc98b3a01dd02f193ac)
* hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/PutSortReducer.java




[jira] [Commented] (HBASE-15171) Avoid counting duplicate kv and generating lots of small hfiles in PutSortReducer

2016-01-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15120988#comment-15120988
 ] 

Hadoop QA commented on HBASE-15171:
---

| (x) *-1 overall* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 0s | Docker mode activated. |
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| +1 | mvninstall | 2m 39s | master passed |
| +1 | compile | 0m 33s | master passed with JDK v1.8.0_66 |
| +1 | compile | 0m 35s | master passed with JDK v1.7.0_91 |
| +1 | checkstyle | 3m 57s | master passed |
| +1 | mvneclipse | 0m 17s | master passed |
| -1 | findbugs | 1m 52s | hbase-server in master has 1 extant Findbugs warnings. |
| +1 | javadoc | 0m 26s | master passed with JDK v1.8.0_66 |
| +1 | javadoc | 0m 33s | master passed with JDK v1.7.0_91 |
| +1 | mvninstall | 0m 46s | the patch passed |
| +1 | compile | 0m 34s | the patch passed with JDK v1.8.0_66 |
| +1 | javac | 0m 34s | the patch passed |
| +1 | compile | 0m 34s | the patch passed with JDK v1.7.0_91 |
| +1 | javac | 0m 34s | the patch passed |
| +1 | checkstyle | 4m 27s | the patch passed |
| +1 | mvneclipse | 0m 20s | the patch passed |
| +1 | whitespace | 0m 0s | Patch has no whitespace issues. |
| +1 | hadoopcheck | 24m 59s | Patch does not cause any errors with Hadoop 2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.1 2.6.2 2.6.3 2.7.1. |
| +1 | findbugs | 2m 26s | the patch passed |
| +1 | javadoc | 0m 31s | the patch passed with JDK v1.8.0_66 |
| +1 | javadoc | 0m 38s | the patch passed with JDK v1.7.0_91 |
| -1 | unit | 93m 18s | hbase-server in the patch failed with JDK v1.8.0_66. |
| +1 | unit | 88m 38s | hbase-server in the patch passed with JDK v1.7.0_91. |
| +1 | asflicense | 0m 15s | Patch does not generate ASF License warnings. |
| | | 228m 47s | |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_66 Failed junit tests | hadoop.hbase.replication.multiwal.TestReplicationSyncUpToolWithMultipleWAL |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=1.9.1 Server=1.9.1 Image:yetus/hbase:date2016-01-28 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12784815/HBASE-15171.addendum.patch |
| JIRA Issue | HBASE-15171 |
| Optional 

[jira] [Commented] (HBASE-15171) Avoid counting duplicate kv and generating lots of small hfiles in PutSortReducer

2016-01-27 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15120292#comment-15120292
 ] 

Hudson commented on HBASE-15171:


FAILURE: Integrated in HBase-1.3 #517 (See [https://builds.apache.org/job/HBase-1.3/517/])
HBASE-15171 Avoid counting duplicate kv and generating lots of small (tedyu: rev 630ad95c923f642d006274b9b1a14397a6713412)
* hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/PutSortReducer.java


> Avoid counting duplicate kv and generating lots of small hfiles in 
> PutSortReducer
> -
>
> Key: HBASE-15171
> URL: https://issues.apache.org/jira/browse/HBASE-15171
> Project: HBase
> Issue Type: Sub-task
> Affects Versions: 2.0.0, 1.1.2, 0.98.17
> Reporter: Yu Li
> Assignee: Yu Li
> Fix For: 2.0.0, 1.3.0
>
> Attachments: HBASE-15171.patch, HBASE-15171.patch, HBASE-15171.patch


[jira] [Commented] (HBASE-15171) Avoid counting duplicate kv and generating lots of small hfiles in PutSortReducer

2016-01-27 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15120442#comment-15120442
 ] 

Hudson commented on HBASE-15171:


FAILURE: Integrated in HBase-Trunk_matrix #663 (See [https://builds.apache.org/job/HBase-Trunk_matrix/663/])
HBASE-15171 Avoid counting duplicate kv and generating lots of small (tedyu: rev 47c41479401ea0aadfa3c3776fe2930bb8e9710d)
* hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/PutSortReducer.java




[jira] [Commented] (HBASE-15171) Avoid counting duplicate kv and generating lots of small hfiles in PutSortReducer

2016-01-27 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15119883#comment-15119883
 ] 

Ted Yu commented on HBASE-15171:


Yu:
Mind attaching an addendum that addresses Ram's comment?



[jira] [Commented] (HBASE-15171) Avoid counting duplicate kv and generating lots of small hfiles in PutSortReducer

2016-01-27 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15119865#comment-15119865
 ] 

ramkrishna.s.vasudevan commented on HBASE-15171:


Instead of iterating over the map again, can we just check the return value of
map.add(kv) and, if it is false, not add to curSize?
The add() javadoc says this:
{code}
add
public boolean add(E e)

Adds the specified element to this set if it is not already present. More 
formally, adds the specified element e to this set if the set contains no 
element e2 such that (e==null ? e2==null : e.equals(e2)). If this set already 
contains the element, the call leaves the set unchanged and returns false.
{code}
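
The suggestion above could look roughly like the following stand-alone sketch. It is an illustration only, not the committed patch: String stands in for KeyValue, length() for heapSize(), and the names AddReturnValueSketch and uniqueSize are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical stand-alone sketch: String stands in for KeyValue and
// String.length() for KeyValue.heapSize().
final class AddReturnValueSketch {

    // Accumulates the size of an element only when add() reports it as new.
    static long uniqueSize(List<String> kvs) {
        TreeSet<String> map = new TreeSet<>();
        long curSize = 0;
        for (String kv : kvs) {
            if (map.add(kv)) { // false means the set already held this element
                curSize += kv.length();
            }
        }
        return curSize;
    }

    public static void main(String[] args) {
        List<String> kvs = Arrays.asList("row1/cf:a", "row1/cf:a", "row1/cf:b");
        System.out.println(uniqueSize(kvs));
    }
}
```

Note that a TreeSet decides whether an element is already present using its ordering (the comparator, or natural ordering), so in the reducer the check would follow whatever comparator the set was created with.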



[jira] [Commented] (HBASE-15171) Avoid counting duplicate kv and generating lots of small hfiles in PutSortReducer

2016-01-27 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15120006#comment-15120006
 ] 

Hudson commented on HBASE-15171:


SUCCESS: Integrated in HBase-1.3-IT #464 (See [https://builds.apache.org/job/HBase-1.3-IT/464/])
HBASE-15171 Avoid counting duplicate kv and generating lots of small (tedyu: rev 630ad95c923f642d006274b9b1a14397a6713412)
* hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/PutSortReducer.java

