[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2013-03-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13593192#comment-13593192
 ] 

Hudson commented on HBASE-5564:
---

Integrated in HBase-0.94-security-on-Hadoop-23 #12 (See 
[https://builds.apache.org/job/HBase-0.94-security-on-Hadoop-23/12/])
HBASE-7793 Port HBASE-5564 Bulkload is discarding duplicate records to 0.94 
(Ted Yu) (Revision 1443842)

 Result = FAILURE
tedyu : 
Files : 
* 
/hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java
* 
/hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/mapreduce/TsvImporterMapper.java
* 
/hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/mapreduce/TestImportTsv.java


 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Fix For: 0.95.0

 Attachments: 5564.lint, 5564v5.txt, HBASE-5564_1.patch, 
 HBASE-5564.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, 
 HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2013-02-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13574279#comment-13574279
 ] 

Hudson commented on HBASE-5564:
---

Integrated in HBase-0.94 #834 (See 
[https://builds.apache.org/job/HBase-0.94/834/])
HBASE-7793 Port HBASE-5564 Bulkload is discarding duplicate records to 0.94 
(Ted Yu) (Revision 1443842)

 Result = ABORTED
tedyu : 
Files : 
* 
/hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java
* 
/hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/mapreduce/TsvImporterMapper.java
* 
/hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/mapreduce/TestImportTsv.java


 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Fix For: 0.96.0

 Attachments: 5564.lint, 5564v5.txt, HBASE-5564_1.patch, 
 HBASE-5564.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, 
 HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2013-02-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13574291#comment-13574291
 ] 

Hudson commented on HBASE-5564:
---

Integrated in HBase-0.94-security #109 (See 
[https://builds.apache.org/job/HBase-0.94-security/109/])
HBASE-7793 Port HBASE-5564 Bulkload is discarding duplicate records to 0.94 
(Ted Yu) (Revision 1443842)

 Result = SUCCESS
tedyu : 
Files : 
* 
/hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java
* 
/hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/mapreduce/TsvImporterMapper.java
* 
/hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/mapreduce/TestImportTsv.java


 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Fix For: 0.96.0

 Attachments: 5564.lint, 5564v5.txt, HBASE-5564_1.patch, 
 HBASE-5564.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, 
 HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2013-02-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13574295#comment-13574295
 ] 

Hudson commented on HBASE-5564:
---

Integrated in HBase-0.94 #835 (See 
[https://builds.apache.org/job/HBase-0.94/835/])
HBASE-7793 Port HBASE-5564 Bulkload is discarding duplicate records to 0.94 
(Ted Yu) (Revision 1443842)

 Result = SUCCESS
tedyu : 
Files : 
* 
/hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java
* 
/hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/mapreduce/TsvImporterMapper.java
* 
/hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/mapreduce/TestImportTsv.java


 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Fix For: 0.96.0

 Attachments: 5564.lint, 5564v5.txt, HBASE-5564_1.patch, 
 HBASE-5564.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, 
 HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-06-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13295775#comment-13295775
 ] 

Hudson commented on HBASE-5564:
---

Integrated in HBase-TRUNK #3030 (See 
[https://builds.apache.org/job/HBase-TRUNK/3030/])
HBASE-5564 Bulkload is discarding duplicate records

Submitted by:Laxman 
Reviewed by:iStack, Ted, Ram (Revision 1350691)

 Result = FAILURE
ramkrishna : 
Files : 
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/TsvImporterMapper.java
* 
/hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestImportTsv.java


 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Fix For: 0.96.0

 Attachments: 5564.lint, 5564v5.txt, HBASE-5564.patch, 
 HBASE-5564_1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, 
 HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-06-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13296053#comment-13296053
 ] 

Hudson commented on HBASE-5564:
---

Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #55 (See 
[https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/55/])
HBASE-5564 Bulkload is discarding duplicate records

Submitted by:Laxman 
Reviewed by:iStack, Ted, Ram (Revision 1350691)

 Result = FAILURE
ramkrishna : 
Files : 
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java
* 
/hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/TsvImporterMapper.java
* 
/hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/mapreduce/TestImportTsv.java


 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Fix For: 0.96.0

 Attachments: 5564.lint, 5564v5.txt, HBASE-5564.patch, 
 HBASE-5564_1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, 
 HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-06-14 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13295345#comment-13295345
 ] 

stack commented on HBASE-5564:
--

+1 on commit.

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Fix For: 0.96.0

 Attachments: 5564.lint, 5564v5.txt, HBASE-5564.patch, 
 HBASE-5564_1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, 
 HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-06-12 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13293469#comment-13293469
 ] 

ramkrishna.s.vasudevan commented on HBASE-5564:
---

Ok.. I will make that change and reupload the patch..Thanks Ted.

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Fix For: 0.96.0

 Attachments: 5564.lint, 5564v5.txt, HBASE-5564.patch, 
 HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, 
 HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, 
 HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-06-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13293815#comment-13293815
 ] 

Hadoop QA commented on HBASE-5564:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12531854/HBASE-5564_1.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 hadoop2.0.  The patch compiles against the hadoop 2.0 profile.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 5 new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
   org.apache.hadoop.hbase.coprocessor.TestClassLoading

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2153//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2153//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2153//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2153//console

This message is automatically generated.

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Fix For: 0.96.0

 Attachments: 5564.lint, 5564v5.txt, HBASE-5564.patch, 
 HBASE-5564_1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, 
 HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-06-11 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13292746#comment-13292746
 ] 

ramkrishna.s.vasudevan commented on HBASE-5564:
---

We got the problem.  It was because there was a space created in the latest 
patch in the testcase
' = org.apache.hadoop.hbase.mapreduce.TsvImporterCustomTestMapper,'.  There 
should not be any space before and after '='.

Will rebase the patch so that it can be recommitted.

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Fix For: 0.96.0

 Attachments: 5564.lint, 5564v5.txt, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, 
 HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-06-11 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13293238#comment-13293238
 ] 

Hadoop QA commented on HBASE-5564:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12531698/HBASE-5564.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 hadoop2.0.  The patch compiles against the hadoop 2.0 profile.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 5 new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2136//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2136//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2136//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/2136//console

This message is automatically generated.

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Fix For: 0.96.0

 Attachments: 5564.lint, 5564v5.txt, HBASE-5564.patch, 
 HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, 
 HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, 
 HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-06-11 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13293367#comment-13293367
 ] 

ramkrishna.s.vasudevan commented on HBASE-5564:
---

All the tests are passing.. Will integrate tomorrow if there are no objections.

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Fix For: 0.96.0

 Attachments: 5564.lint, 5564v5.txt, HBASE-5564.patch, 
 HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, 
 HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, 
 HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-06-11 Thread Zhihong Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13293375#comment-13293375
 ] 

Zhihong Ted Yu commented on HBASE-5564:
---

Minor comment:
{code}
+  throw new BadTsvLineException(Invalid timestamp);
{code}
Can the timestamp string be included ?

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Fix For: 0.96.0

 Attachments: 5564.lint, 5564v5.txt, HBASE-5564.patch, 
 HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, 
 HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, 
 HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-04-24 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13260282#comment-13260282
 ] 

stack commented on HBASE-5564:
--

@Laxman Any luck?

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Fix For: 0.96.0

 Attachments: 5564.lint, 5564v5.txt, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, 
 HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-04-03 Thread Laxman (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13245999#comment-13245999
 ] 

Laxman commented on HBASE-5564:
---

Yes Stack. I will take a look. Changes in this patch are in Default Mapper. IMO 
these changes shouldn't cause failures in custom mapper.

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Fix For: 0.96.0

 Attachments: 5564.lint, 5564v5.txt, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, 
 HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-30 Thread Laxman (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13242189#comment-13242189
 ] 

Laxman commented on HBASE-5564:
---

Thanks for the commit stack.

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Fix For: 0.96.0

 Attachments: 5564.lint, 5564v5.txt, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, 
 HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-30 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13242477#comment-13242477
 ] 

Hudson commented on HBASE-5564:
---

Integrated in HBase-TRUNK #2698 (See 
[https://builds.apache.org/job/HBase-TRUNK/2698/])
HBASE-5564 Bulkload is discarding duplicate records (Revision 1306907)

 Result = FAILURE
stack : 
Files : 
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/mapreduce/TsvImporterMapper.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/mapreduce/TestImportTsv.java


 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Fix For: 0.96.0

 Attachments: 5564.lint, 5564v5.txt, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, 
 HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-30 Thread stack (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13242785#comment-13242785
 ] 

stack commented on HBASE-5564:
--

Thanks Ted.  I reverted the patch for now.  Laxman, mind taking a looksee at 
the failures Ted found in TestImportTsv#testMROnTableWithCustomMapper?

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Fix For: 0.96.0

 Attachments: 5564.lint, 5564v5.txt, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, 
 HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-30 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13242825#comment-13242825
 ] 

Hudson commented on HBASE-5564:
---

Integrated in HBase-TRUNK #2701 (See 
[https://builds.apache.org/job/HBase-TRUNK/2701/])
HBASE-5564 Bulkload is discarding duplicate records (Revision 1307629)

 Result = SUCCESS
stack : 
Files : 
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/mapreduce/TsvImporterMapper.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/mapreduce/TestImportTsv.java


 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Fix For: 0.96.0

 Attachments: 5564.lint, 5564v5.txt, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, 
 HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-30 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13243043#comment-13243043
 ] 

Hudson commented on HBASE-5564:
---

Integrated in HBase-TRUNK-security #155 (See 
[https://builds.apache.org/job/HBase-TRUNK-security/155/])
HBASE-5564 Bulkload is discarding duplicate records (Revision 1307629)

 Result = SUCCESS
stack : 
Files : 
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/mapreduce/TsvImporterMapper.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/mapreduce/TestImportTsv.java


 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Fix For: 0.96.0

 Attachments: 5564.lint, 5564v5.txt, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, 
 HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-29 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13241021#comment-13241021
 ] 

Hadoop QA commented on HBASE-5564:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12520371/5564v5.txt
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
   org.apache.hadoop.hbase.mapreduce.TestImportTsv
  org.apache.hadoop.hbase.mapred.TestTableMapReduce
  org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/1336//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/1336//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/1336//console

This message is automatically generated.

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Fix For: 0.96.0

 Attachments: 5564.lint, 5564v5.txt, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, 
 HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-29 Thread ramkrishna.s.vasudevan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13241076#comment-13241076
 ] 

ramkrishna.s.vasudevan commented on HBASE-5564:
---

+1 on v5.  

Thanks for the patch Lakshman and Stack.  
@Stack
So any new patches that we give should not have any findbugs even if in the old 
existing code? Ok i will take care of this and ensure people submitting patches 
over here also do that. Thanks.

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Fix For: 0.96.0

 Attachments: 5564.lint, 5564v5.txt, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, 
 HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-29 Thread Laxman (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13241078#comment-13241078
 ] 

Laxman commented on HBASE-5564:
---

@stack, thanks for your review and clearing the findbugs.
I was avoiding these changes as these are unrelated to this JIRA.

@ram, thanks for reviewing the patch.

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Fix For: 0.96.0

 Attachments: 5564.lint, 5564v5.txt, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, 
 HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-29 Thread Uma Maheswara Rao G (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13241167#comment-13241167
 ] 

Uma Maheswara Rao G commented on HBASE-5564:


Also don't forget to update the count in test-patch.properties according to the 
present count if we fix any existing findbugs.

+Uma

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Fix For: 0.96.0

 Attachments: 5564.lint, 5564v5.txt, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, 
 HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-29 Thread stack (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13241307#comment-13241307
 ] 

stack commented on HBASE-5564:
--

Committed to trunk.  Thanks for the patch Laxman.  Thanks for the reminder on 
updating the count Uma. It seems that my minor addition only stopped the count 
rising so I didn't have to change the findbugs count (the test build was seeing 
two new findbug warnings when in fact there were none -- a variable name change 
was making it think the findbugs count had gone up).

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Fix For: 0.96.0

 Attachments: 5564.lint, 5564v5.txt, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, 
 HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-29 Thread Jonathan Hsieh (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13241341#comment-13241341
 ] 

Jonathan Hsieh commented on HBASE-5564:
---

maybe not worry about find bugs for normal patches? (ideally it does go up 
though) the find bugs number isn't the focus of this patch.

Sent from my iPhone




 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Fix For: 0.96.0

 Attachments: 5564.lint, 5564v5.txt, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, 
 HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-29 Thread Jonathan Hsieh (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13241485#comment-13241485
 ] 

Jonathan Hsieh commented on HBASE-5564:
---

meant to say ideally it does not go up.  I think stack's action (he didn't 
lower findbugs number on normal patch) captured the same idea.

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Fix For: 0.96.0

 Attachments: 5564.lint, 5564v5.txt, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, 
 HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-29 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13242065#comment-13242065
 ] 

Hudson commented on HBASE-5564:
---

Integrated in HBase-TRUNK-security #154 (See 
[https://builds.apache.org/job/HBase-TRUNK-security/154/])
HBASE-5564 Bulkload is discarding duplicate records (Revision 1306907)

 Result = FAILURE
stack : 
Files : 
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/mapreduce/ImportTsv.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/mapreduce/TsvImporterMapper.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/mapreduce/TestImportTsv.java


 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Fix For: 0.96.0

 Attachments: 5564.lint, 5564v5.txt, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, 
 HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-28 Thread Laxman (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13240241#comment-13240241
 ] 

Laxman commented on HBASE-5564:
---

Another problem found in my testing. Invalid timestamp is not respecting 
skip.bad.lines configuration.
I will update the patch for this as well. Adding some unit tests too.

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Fix For: 0.96.0

 Attachments: 5564.lint, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, 
 HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-28 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13240345#comment-13240345
 ] 

Hadoop QA commented on HBASE-5564:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12520255/HBASE-5564_trunk.4_final.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 2 new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
   org.apache.hadoop.hbase.mapreduce.TestImportTsv
  org.apache.hadoop.hbase.mapred.TestTableMapReduce
  org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/1326//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/1326//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/1326//console

This message is automatically generated.

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Fix For: 0.96.0

 Attachments: 5564.lint, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, 
 HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-27 Thread Laxman (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13239222#comment-13239222
 ] 

Laxman commented on HBASE-5564:
---

Findbugs reported by QA bot are about usage of default encoding. This behavior 
is inline with existing code.


bug #1
{noformat}
TESTUnknown bug pattern DM_DEFAULT_ENCODING in 
org.apache.hadoop.hbase.mapreduce.ImportTsv$TsvParser$ParsedLine.getTimestamp()
{noformat}

bug #2
{noformat}
TESTUnknown bug pattern DM_DEFAULT_ENCODING in 
org.apache.hadoop.hbase.mapreduce.ImportTsv.createSubmittableJob(Configuration, 
String[])
{noformat}

bug #2 already existing in code. just included in patch file with no changes.

And test case failures are not because of this patch. Test failures to be 
addressed as part of HBASE-5608

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Fix For: 0.96.0

 Attachments: 5564.lint, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-27 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13239392#comment-13239392
 ] 

Hadoop QA commented on HBASE-5564:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12520096/HBASE-5564_trunk.3.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 1 new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
   org.apache.hadoop.hbase.mapreduce.TestImportTsv
  org.apache.hadoop.hbase.mapred.TestTableMapReduce
  org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/1319//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/1319//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/1319//console

This message is automatically generated.

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Fix For: 0.96.0

 Attachments: 5564.lint, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, 
 HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-26 Thread Laxman (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13238452#comment-13238452
 ] 

Laxman commented on HBASE-5564:
---

@Stack, updated the patch after fixing your comments. Thanks for the review.

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Fix For: 0.96.0

 Attachments: 5564.lint, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-26 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13238506#comment-13238506
 ] 

Hadoop QA commented on HBASE-5564:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12519960/HBASE-5564_trunk.2.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 2 new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
   org.apache.hadoop.hbase.mapreduce.TestImportTsv
  org.apache.hadoop.hbase.mapred.TestTableMapReduce
  org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/1305//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/1305//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/1305//console

This message is automatically generated.

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Fix For: 0.96.0

 Attachments: 5564.lint, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-25 Thread Laxman (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13238095#comment-13238095
 ] 

Laxman commented on HBASE-5564:
---

@Anoop, thanks for clarification.

@Stack, thanks for the review. I will update the patch.

bq. need curlies
bq. NO_TIMESTAMP_KEYCOLUMN_INDEX 

I will update the patch for above 2 comments.

bq. Can you confirm that current behavior – setting ts to 
System.currentTimeMillis – is default? It seems to be ... we set 
System.currentTimeMillis as time to use setting up the job.

Before patch, we are setting ts to System.currentTimeMillis in 
TsvImporterMapper.doSetup. This setup methos will be called for each mapper, 
i.e, for each input split. That means it uses a new timestamp for each map task.

After patch, we are setting ts to conf.getLong which is same in all map tasks.

Hope, I understood your question correctly.


 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Fix For: 0.96.0

 Attachments: 5564.lint, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.1.patch, HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-24 Thread stack (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13237781#comment-13237781
 ] 

stack commented on HBASE-5564:
--

Patch seems reasonable.

Add curlies here:

{code}
+  if (parser.getTimestampKeyColumnIndex() != -1)
+ts = parsed.getTimestamp();
{code}

Convention is you can do w/o curlies if all in one line (as you do later in 
this file) but if not on one line, need curlies.

Can you confirm that current behavior -- setting ts to System.currentTimeMillis 
-- is default?  It seems to be ... we set System.currentTimeMillis as time to 
use setting up the job.

A define for NO_TIMESTAMP_KEYCOLUMN_INDEX instead of using -1 directly might 
help for timestampKeyColumnIndex == -1?  Or put this test into a method whose 
name makes it obvious what the test is about ... e.g. hasTimeStampColumn()

Patch adds nice usage commentary explaining new facility.

Looks good.

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Fix For: 0.96.0

 Attachments: 5564.lint, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.1.patch, HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-23 Thread Anoop Sam John (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13236441#comment-13236441
 ] 

Anoop Sam John commented on HBASE-5564:
---

{quote}
In bulkload, if multiple records are having same timestamp, then the last KV 
entry processed by reducer only will be persisted (TreeSet in Reducer)
{quote}

The 1st KV processed by the Reducer right...

Yes agree with you which one is the latest might not be possible to be 
predicted in the reducer side...

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Fix For: 0.96.0

 Attachments: 5564.lint, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.1.patch, HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-21 Thread Laxman (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13234157#comment-13234157
 ] 

Laxman commented on HBASE-5564:
---

These tests are passing in my dev environment.

{noformat}
Running org.apache.hadoop.hbase.mapreduce.TestImportTsv
Tests run: 9, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 168.578 sec

Results :

Tests run: 9, Failures: 0, Errors: 0, Skipped: 0

[INFO]
[INFO] --- maven-surefire-plugin:2.12-TRUNK-HBASE-2:test 
(secondPartTestsExecution) @ hbase ---
[INFO] Tests are skipped.
[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
{noformat}

Also, I can see these MR tests are failing in previous builds as well 
[HBase-5529].

Will check more. 

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Attachments: HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-21 Thread ramkrishna.s.vasudevan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13234244#comment-13234244
 ] 

ramkrishna.s.vasudevan commented on HBASE-5564:
---

The test cases that fail is common in HadoopQA.  As your patch is changing the 
ImportTsv part people will be worried.
But as you have run it locally and ensured that it is passing the main build 
should be able to pass it.

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Attachments: HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-21 Thread Laxman (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13234281#comment-13234281
 ] 

Laxman commented on HBASE-5564:
---

thanks for the info Ram.

I had spent sometime in analyzing these failures. But couldn't get a clue.
Filed a separate JIRA HBASE-5608 to fix these test failures.

As mentioned earlier all these test are passing in my local environment.

Should we wait for HBASE-5608 or proceed with review  commit?

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Attachments: HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-21 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13234457#comment-13234457
 ] 

Hadoop QA commented on HBASE-5564:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12519254/5564.lint
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 1 new or modified tests.

-1 patch.  The patch command could not apply the patch.

Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/1235//console

This message is automatically generated.

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Attachments: 5564.lint, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.1.patch, HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-21 Thread Laxman (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13235320#comment-13235320
 ] 

Laxman commented on HBASE-5564:
---

Ted, all these comments are related to line wrapping.
IMO, 80 characters length is too low  it makes the code bit ugly.

If you strongly feel we need to stick this 80-length, I will fix these comments.


 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Fix For: 0.96.0

 Attachments: 5564.lint, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.1.patch, HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-21 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13235327#comment-13235327
 ] 

Zhihong Yu commented on HBASE-5564:
---

We have been using 80 characters as line length for a while.

At Google, line length is enforced, though the limit is bit longer.

Feel free to start discussion on dev@hbase about the acceptable limit.

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Fix For: 0.96.0

 Attachments: 5564.lint, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.1.patch, HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-21 Thread Laxman (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13235342#comment-13235342
 ] 

Laxman commented on HBASE-5564:
---

Thanks Ted, for taking pain in getting the lint comments.
As you suggested, I will start a discussion on dev@hbase.

I just wanted to quote one example from this patch here.
{code}
long timstamp = conf.getLong(TIMESTAMP_CONF_KEY, 
System.currentTimeMillis());
{code}

Above code snippet after formatting, it turned to
{code}
long timstamp = conf
.getLong(TIMESTAMP_CONF_KEY, System.currentTimeMillis());
{code}


 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Fix For: 0.96.0

 Attachments: 5564.lint, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.1.patch, HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-20 Thread Anoop Sam John (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13233238#comment-13233238
 ] 

Anoop Sam John commented on HBASE-5564:
---

@Laxman
ImportTsv
{code}
+// If timestamp option is not specified, use current system time.
+long timstamp = conf.getLong(TIMESTAMP_CONF_KEY, 
System.currentTimeMillis());
+
+// Set it back to replace invalid timestamp (non-numeric) with current 
system time
+conf.setLong(TIMESTAMP_CONF_KEY, timstamp);
{code}

Doing this will use the same TS across all the mappers. Is this the intention 
for this change? So in TsvImporterMapper, 
conf.getLong(ImportTsv.TIMESTAMP_CONF_KEY, 0) will always have value to get 
from conf.

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Attachments: HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-20 Thread Laxman (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13233283#comment-13233283
 ] 

Laxman commented on HBASE-5564:
---

bq. Doing this will use the same TS across all the mappers. Is this the 
intention for this change? So in TsvImporterMapper, 
conf.getLong(ImportTsv.TIMESTAMP_CONF_KEY, 0) will always have value to get 
from conf.

Yes Anoop. we should have same timestamp for all mappers.
Please check my previous comments on the scope of the issue.

https://issues.apache.org/jira/browse/HBASE-5564?focusedCommentId=13228297page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13228297

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Attachments: HBASE-5564_trunk.1.patch, HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-20 Thread Laxman (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13233356#comment-13233356
 ] 

Laxman commented on HBASE-5564:
---

Any idea why QA bot is not testing this patch?
Can someone trigger this explicitly?

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Attachments: HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-20 Thread Anoop Sam John (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13233417#comment-13233417
 ] 

Anoop Sam John commented on HBASE-5564:
---

Comment from Jesse Yates
{quote}
The question is, if you have a TSV file with the same row key, which value 
should be considered the most recent version? Should any of them - maybe that 
is actually a problem and we want to have a warning/error when that occurs?
{quote}

Do we need to handle this? The issue is TreeSet used by PutSortReducer and 
KeyValueSortReducer as mentioned by Laxman. 
In normal data insertion using Puts, all the duplicate values will go into the 
memstore (and finally to HFiles) and while scan the last entered one will get 
retrieved. In this bulk load case the 1st data only will get inserted as DS 
avoid the duplicates. Is this a behaviour mismatch?  But this depends on which 
entry in the TSV file needs to be considered as the recent version.If we say 
that last entry coming in the file is the recent version.


 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Attachments: HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-20 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13233424#comment-13233424
 ] 

Zhihong Yu commented on HBASE-5564:
---

@Laxman:
Please take a look at 
https://builds.apache.org/job/PreCommit-HBASE-Build/1229/console and see which 
test timed out.

I have sent an email to bui...@apache.org, informing them of the issue for 
Hadoop QA.

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Attachments: HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-20 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13233910#comment-13233910
 ] 

Hadoop QA commented on HBASE-5564:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12519054/HBASE-5564_trunk.1.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 165 new Findbugs (version 
1.3.9) warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
   
org.apache.hadoop.hbase.regionserver.TestServerCustomProtocol
  org.apache.hadoop.hbase.mapreduce.TestImportTsv
  org.apache.hadoop.hbase.mapred.TestTableMapReduce
  org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/1231//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/1231//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/1231//console

This message is automatically generated.

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Attachments: HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-20 Thread Laxman (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13234082#comment-13234082
 ] 

Laxman commented on HBASE-5564:
---

All MR tests seems to be failing. Failures are not because of the patch.
I will check these failures.

@anoop
In bulkload, if multiple records are having same timestamp, then the last KV 
entry processed by reducer only will be persisted (TreeSet in Reducer). I don't 
see this as behavior inconsistency. Bulkload can't judge which KV entry to be 
retained (Considering duplicate records exists across input splits/files). So, 
in this case, user can develop custom MR to achieve this functionality.

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Attachments: HBASE-5564_trunk.1.patch, HBASE-5564_trunk.1.patch, 
 HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-19 Thread Ted Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13233179#comment-13233179
 ] 

Ted Yu commented on HBASE-5564:
---

{code}
+public int getTimestapKeyColumnIndex() {
{code}
Please fix typo in the above method name.
{code}
+-D + TIMESTAMP_CONF_KEY + =currentTimeAsLong - use the specified 
timestamp for the import. This option is ignored if HBASE_TS_KEY is specfied in 
'importtsv.columns'\n +
{code}
Please wrap the long line above.
{code}
+// Should never get 0.
+ts = conf.getLong(ImportTsv.TIMESTAMP_CONF_KEY, 0);
{code}
Please explain why 0 wouldn't be returned.
{code}
+  if (parser.getTimestapKeyColumnIndex() != -1)
+ts = parsed.getTimestamp();
{code}
Please use curly braces around the assignment.

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Attachments: HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-19 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13233204#comment-13233204
 ] 

Hadoop QA commented on HBASE-5564:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12519017/HBASE-5564_trunk.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 165 new Findbugs (version 
1.3.9) warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
 

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/1229//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/1229//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/1229//console

This message is automatically generated.

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader
 Attachments: HBASE-5564_trunk.patch


 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-13 Thread Laxman (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228297#comment-13228297
 ] 

Laxman commented on HBASE-5564:
---

Scope of this issue.

1) Avoid the behavioral inconsistency with timestamp parameter.

{noformat}
Currently in code,
a) If timstamp parameter is configured, duplicate records will be overwritten.
b) If not configured, some duplicate records are maintained as different 
version.
{noformat}

This fix should be inline with the expectation Todd has mentioned.

bq. The whole point is that, in a bulk-load-only workflow, you can identify 
each bulk load exactly, and correlate it to the MR job that inserted it.

2) Provide an option to look up timestamp column value from input data. (Like 
ROWKEY column)
Example : importtsv.columns='HBASE_ROW_KEY, HBASE_TS_KEY, 
emp:name,emp:sal,dept:code'

I will submit the patch with the above mentioned approach.

Any other addons?

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader

 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-13 Thread Laxman (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228406#comment-13228406
 ] 

Laxman commented on HBASE-5564:
---

While testing the patch in local, I'm getting the following error in trunk.
Any hints on this please?

{noformat}
java.lang.RuntimeException: java.io.IOException: Call to localhost/127.0.0.1:0 
failed on local exception: java.net.BindException: Cannot assign requested 
address: no further information
at 
org.apache.hadoop.mapred.MiniMRCluster.waitUntilIdle(MiniMRCluster.java:323)
at org.apache.hadoop.mapred.MiniMRCluster.init(MiniMRCluster.java:524)
at org.apache.hadoop.mapred.MiniMRCluster.init(MiniMRCluster.java:462)
at org.apache.hadoop.mapred.MiniMRCluster.init(MiniMRCluster.java:454)
at org.apache.hadoop.mapred.MiniMRCluster.init(MiniMRCluster.java:446)
at org.apache.hadoop.mapred.MiniMRCluster.init(MiniMRCluster.java:436)
at org.apache.hadoop.mapred.MiniMRCluster.init(MiniMRCluster.java:426)
at org.apache.hadoop.mapred.MiniMRCluster.init(MiniMRCluster.java:417)
at 
org.apache.hadoop.hbase.HBaseTestingUtility.startMiniMapReduceCluster(HBaseTestingUtility.java:1269)
at 
org.apache.hadoop.hbase.HBaseTestingUtility.startMiniMapReduceCluster(HBaseTestingUtility.java:1255)
at 
org.apache.hadoop.hbase.mapreduce.TestImportTsv.doMROnTableTest(TestImportTsv.java:189)
at 
org.apache.hadoop.hbase.mapreduce.TestImportTsv.testMROnTable(TestImportTsv.java:162)
{noformat}

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader

 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-13 Thread stack (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228467#comment-13228467
 ] 

stack commented on HBASE-5564:
--

Googling it, its either something is already listening on the port of your 
127.0.0.1 has been removed?   See 
http://www-01.ibm.com/support/docview.wss?uid=swg21233733

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader

 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-13 Thread Laxman (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228936#comment-13228936
 ] 

Laxman commented on HBASE-5564:
---

Thanks Stack. Let me give a try.

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader

 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-12 Thread Laxman (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13227597#comment-13227597
 ] 

Laxman commented on HBASE-5564:
---

I think this is a bug and its not any intentional behavior. 

Usage of TreeSet in the below code snippet is causing the issue.

PutSortReducer.reduce()
==
  TreeSetKeyValue map = new TreeSetKeyValue(KeyValue.COMPARATOR);
  long curSize = 0;
  // stop at the end or the RAM threshold
  while (iter.hasNext()  curSize  threshold) {
Put p = iter.next();
for (ListKeyValue kvs : p.getFamilyMap().values()) {
  for (KeyValue kv : kvs) {
map.add(kv);
curSize += kv.getLength();
  }
}

Changing this back to List and then sort explicitly will solve the issue.

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader

 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-12 Thread Laxman (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13227678#comment-13227678
 ] 

Laxman commented on HBASE-5564:
---

I tested again with the proposed patch.
  Changing this back to List and then sort explicitly will solve the issue.

Still the same problem persists making this issue bit more complicated. 
I think the usage of same timestamp for all records in split causing the issue.

Currently in code,
a) If configured, we are using static timestamp for all mappers.
b) If not configured, we are using current system time generated for each split.

TsvImporterMapper.doSetup

{code}
ts = conf.getLong(ImportTsv.TIMESTAMP_CONF_KEY, System.currentTimeMillis());
{code}

Should we think of an approach to generate a unique sequence number and use it 
as a timestamp?

Any other thoughts?

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader

 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-12 Thread Jesse Yates (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13227696#comment-13227696
 ] 

Jesse Yates commented on HBASE-5564:


Hmm, I think your right with this being a problem. It would be totally 
reasonable to change 
{code}
   KeyValue kv = new KeyValue(
lineBytes, parsed.getRowKeyOffset(), parsed.getRowKeyLength(),
parser.getFamily(i), 0, parser.getFamily(i).length,
parser.getQualifier(i), 0, parser.getQualifier(i).length,
ts,
KeyValue.Type.Put,
lineBytes, parsed.getColumnOffset(i), parsed.getColumnLength(i));
{code}

to use something like: {code}ts++{code}

The question is, if you have a TSV file with the same row key, which value 
should be considered the most recent version? Should any of them - maybe that 
is actually a problem and we want to have a warning/error when that occurs?

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader

 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-12 Thread stack (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13227740#comment-13227740
 ] 

stack commented on HBASE-5564:
--

The TreeSet is whats going to be used once the edits make it into the server so 
losing them in the reducer is probably optimal?  The Jesse ts++, or ts--, could 
be an option?

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader

 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-12 Thread Todd Lipcon (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13227742#comment-13227742
 ] 

Todd Lipcon commented on HBASE-5564:


I think it's a feature, not a bug, that the timestamps are all identical. The 
whole point is that, in a bulk-load-only workflow, you can identify each bulk 
load exactly, and correlate it to the MR job that inserted it. If you want to 
use custom timestamps, you should specify a timestamp column in your data, or 
write your own MR job (ImportTsv is just an example which use useful for some 
cases, but for anything advanced I would expect users to write their own code)

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader

 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-12 Thread Lars Hofhansl (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13227926#comment-13227926
 ] 

Lars Hofhansl commented on HBASE-5564:
--

So this is only about ImportTsv? Should change the title in that case.

I agree with Todd, at least for ImportTsv.
Import/Export should not (and hopefully do not) exhibit this behavior (since we 
want to be able to import/export KVs with multiple versions).


 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader

 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-12 Thread Laxman (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228212#comment-13228212
 ] 

Laxman commented on HBASE-5564:
---

bq. ts++, or ts--, could be an option?

ts++ or ts-- will not solve this problem. Reason being each mapper spawns a new 
JVM and ts will be reset to initial value. so, still there is a chance of ts 
collision.

bq. that the timestamps are all identical. The whole point is that, in a 
bulk-load-only workflow, you can identify each bulk load exactly, and correlate 
it to the MR job that inserted it.

No Todd. At least the implementation is buggy enough and not matching with this 
expected behavior.
New timestamp is generated for each map task (i.e., for each split) in 
TsvImporterMapper.doSetup.
Please check my previous comments.

bq. So this is only about ImportTsv? Should change the title in that case.
I'm not aware what other tools comes under bulkload. Bulkload documentation 
talks only about importtsv.
http://hbase.apache.org/bulk-loads.html

But if you feel we should change the title, feel free to modify the title.

bq. If you want to use custom timestamps, you should specify a timestamp column 
in your data, or write your own MR job (ImportTsv is just an example which use 
useful for some cases, but for anything advanced I would expect users to write 
their own code)

I think we can provide the provision to specify the timestamp column (Like 
ROWKEY column) as arguments.
Example : importtsv.columns='HBASE_ROW_KEY, HBASE_TS_KEY, 
emp:name,emp:sal,dept:code'

This makes importtsv more usable. Otherwise, user has to copy paste entire 
importtsv code and do this minor modification.

Please let me know your suggestions on this.


 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader

 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-12 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228219#comment-13228219
 ] 

Zhihong Yu commented on HBASE-5564:
---

bq. I think we can provide the provision to specify the timestamp column (Like 
ROWKEY column) as arguments.
The above is reasonable.

 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader

 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records

2012-03-12 Thread Lars Hofhansl (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228221#comment-13228221
 ] 

Lars Hofhansl commented on HBASE-5564:
--

@Laxman: so what you have in your CSV file is entries like:
rowA, colA, val1
rowA, colA, val2

And the expectation is that HBase should create two versions:
(rowA, colA, ts1) - val1
(rowA, colA, ts2) - val2
?

Seems like a pretty constructed case to me :)
How would know ahead of time how many versions you'd need to configure for your 
column family? 3 is the default, but what if you have 100 versions of the same 
row/col combo in your CSV file?

But anyway, having an option to specify a column for the TS is a good idea.
Do you want to take a stab at it Laxman?


 Bulkload is discarding duplicate records
 

 Key: HBASE-5564
 URL: https://issues.apache.org/jira/browse/HBASE-5564
 Project: HBase
  Issue Type: Bug
  Components: mapreduce
Affects Versions: 0.90.7, 0.92.2, 0.94.0, 0.96.0
 Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
  Labels: bulkloader

 Duplicate records are getting discarded when duplicate records exists in same 
 input file and more specifically if they exists in same split.
 Duplicate records are considered if the records are from diffrent different 
 splits.
 Version under test: HBase 0.92

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira