[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245999#comment-13245999 ]

Laxman commented on HBASE-5564:
---
Yes, Stack. I will take a look. The changes in this patch are in the default mapper; IMO, they shouldn't cause failures in a custom mapper.

Bulkload is discarding duplicate records

Key: HBASE-5564
URL: https://issues.apache.org/jira/browse/HBASE-5564
Project: HBase
Issue Type: Bug
Components: mapreduce
Affects Versions: 0.96.0
Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
Labels: bulkloader
Fix For: 0.96.0
Attachments: 5564.lint, 5564v5.txt, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch

Duplicate records are discarded when they exist in the same input file and, more specifically, when they exist in the same split. Duplicate records are considered if the records are from different splits.

Version under test: HBase 0.92

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242189#comment-13242189 ]

Laxman commented on HBASE-5564:
---
Thanks for the commit, Stack.
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241078#comment-13241078 ]

Laxman commented on HBASE-5564:
---
@stack, thanks for your review and for clearing the findbugs warnings. I had been avoiding these changes as they are unrelated to this JIRA.
@ram, thanks for reviewing the patch.
[jira] [Commented] (HBASE-1697) Discretionary access control
[ https://issues.apache.org/jira/browse/HBASE-1697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241232#comment-13241232 ]

Laxman commented on HBASE-1697:
---
No updates here in a long time. From my understanding, making HBase secure will need huge contributions in this area. It also involves many challenges (architectural changes, maintaining/breaking compatibility, ...). In spite of these challenges, it adds a lot of value to HBase. Is anyone interested in looking into these security issues?

Discretionary access control

Key: HBASE-1697
URL: https://issues.apache.org/jira/browse/HBASE-1697
Project: HBase
Issue Type: Improvement
Components: security
Reporter: Andrew Purtell
Assignee: Andrew Purtell

Consider implementing discretionary access control for HBase. Access control has three aspects: authentication, authorization and audit.
- Authentication: Access is controlled by insisting on an authentication procedure to establish the identity of the user. The authentication procedure should minimally require a non-plaintext authentication factor (e.g. encrypted password with salt) and should ideally, or at least optionally, provide cryptographically strong confidence via public key certification.
- Authorization: Access is controlled by specifying rights to resources via an access control list (ACL). An ACL is a list of permissions attached to an object. The list specifies who or what is allowed to access the object and what operations are allowed to be performed on it, e.g. create, update, read, or delete.
- Audit: Important actions taken by subjects should be logged for accountability: a chronological record which enables the full reconstruction and examination of a sequence of events, e.g. schema changes or data mutations. Logging activity should be protected from all subjects except for a restricted set with administrative privilege, perhaps only a single super-user.
Discretionary access control means the access policy for an object is determined by the owner of the object. Every object in the system must have a valid owner. Owners can assign access rights and permissions to other users. The initial owner of an object is the subject who created it. If subjects are deleted from a system, ownership of objects owned by them should revert to some super-user or other valid default.

HBase can enforce access policy at table, column family, or cell granularity. Cell granularity does not make much sense. An implementation which controls access at both the table and column family levels is recommended, though a first cut could consider control at the table level only. The initial set of permissions can be: create (table schema or column family), update (table schema or column family), read (column family), delete (table or column family), execute (filters), and transfer ownership.

The subject identities and access tokens could be stored in a new administrative table. ACLs on tables and column families can be stored in META. Access other than read access to catalog and administrative tables should be restricted to a set of administrative users, or perhaps a single super-user. A data mutation on a user table by a subject without administrative or super-user privilege which results in a table split is an implicit temporary privilege elevation, where the regionserver or master updates the catalog tables as necessary to support the split. Audit logging should be configurable on a per-table basis to avoid this overhead where it is not wanted.

Consider supporting external authentication and subject identification mechanisms with Java library support: RADIUS/TACACS, Kerberos, LDAP.
Consider logging audit trails to an HBase table (bigtable-type schemas are natural for this) and optionally external logging options with Java library support -- syslog, etc. -- or maybe commons-logging is sufficient; punt to the administrator to set up appropriate commons-logging/log4j configurations for their needs.

If HBASE-1002 is considered, and the option to support filtering via upload of (perhaps complex) bytecode produced by some little-language compiler is implemented, the execute privilege could be extended in a manner similar to how stored procedures in SQL land execute either with the privilege of the current user or of the (table/procedure) creator.
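The table/column-family permission model proposed above can be sketched roughly as follows. This is a minimal, illustrative sketch only: every name here (AclSketch, Permission, grant, allowed) is hypothetical, not an actual HBase API, and the fallback from column-family to table-level grants is one possible reading of the "control at both levels" recommendation.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the table / column-family ACL model described above.
public class AclSketch {
    enum Permission { CREATE, UPDATE, READ, DELETE, EXEC, TRANSFER_OWNERSHIP }

    // ACL entries keyed by "table" or "table:family", mirroring the proposal
    // to attach rights at both the table and column-family levels.
    static final Map<String, Map<String, EnumSet<Permission>>> acls = new HashMap<>();

    static void grant(String resource, String user, Permission... perms) {
        acls.computeIfAbsent(resource, r -> new HashMap<>())
            .computeIfAbsent(user, u -> EnumSet.noneOf(Permission.class))
            .addAll(Arrays.asList(perms));
    }

    // A column-family check falls back to the table-level grant, so a first
    // cut that only grants at the table level still works.
    static boolean allowed(String table, String family, String user, Permission p) {
        Map<String, EnumSet<Permission>> cf = acls.get(table + ":" + family);
        if (cf != null && cf.getOrDefault(user, EnumSet.noneOf(Permission.class)).contains(p)) {
            return true;
        }
        Map<String, EnumSet<Permission>> tbl = acls.get(table);
        return tbl != null && tbl.getOrDefault(user, EnumSet.noneOf(Permission.class)).contains(p);
    }

    public static void main(String[] args) {
        grant("orders", "alice", Permission.READ, Permission.UPDATE);
        grant("orders:cf1", "bob", Permission.READ);
        System.out.println(allowed("orders", "cf1", "alice", Permission.READ)); // via table-level grant
        System.out.println(allowed("orders", "cf1", "bob", Permission.UPDATE)); // never granted
    }
}
```

In a real implementation the `acls` map would live in the administrative table and META, as the proposal suggests, rather than in process memory.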
[jira] [Commented] (HBASE-1697) Discretionary access control
[ https://issues.apache.org/jira/browse/HBASE-1697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242081#comment-13242081 ]

Laxman commented on HBASE-1697:
---
Thanks, Gary, for the info on security. I'm going through the current implementation and will soon take up some JIRAs.
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13240241#comment-13240241 ]

Laxman commented on HBASE-5564:
---
Another problem found in my testing: an invalid timestamp does not respect the skip.bad.lines configuration. I will update the patch for this as well, adding some unit tests too.
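The intended skip.bad.lines behavior the comment refers to can be sketched like this. The names and semantics here are assumptions for illustration, not the actual ImportTsv code: with the flag set, a record with an unparseable timestamp is counted and dropped; without it, the task fails.

```java
// Illustrative stand-in for the skip.bad.lines handling described above;
// names here are hypothetical, not the actual ImportTsv implementation.
public class BadLineSketch {
    static long badLineCount = 0;

    static Long parseTimestamp(String field, boolean skipBadLines) {
        try {
            return Long.parseLong(field);
        } catch (NumberFormatException e) {
            if (skipBadLines) {
                badLineCount++;   // count it and move on, as with other bad lines
                return null;      // caller drops this record
            }
            // without the flag, an invalid timestamp should fail the task
            throw new IllegalArgumentException("Invalid timestamp: " + field, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseTimestamp("1000", true)); // valid timestamp
        System.out.println(parseTimestamp("oops", true)); // skipped, yields null
    }
}
```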
[jira] [Commented] (HBASE-4565) Maven HBase build broken on cygwin with copynativelib.sh call.
[ https://issues.apache.org/jira/browse/HBASE-4565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13240331#comment-13240331 ]

Laxman commented on HBASE-4565:
---
Is it OK if I rebase this patch to trunk? I need it to build in my Windows environment.

Maven HBase build broken on cygwin with copynativelib.sh call.

Key: HBASE-4565
URL: https://issues.apache.org/jira/browse/HBASE-4565
Project: HBase
Issue Type: Bug
Components: build
Affects Versions: 0.92.0
Environment: cygwin (on XP and Win7)
Reporter: Suraj Varma
Assignee: Suraj Varma
Labels: build, maven
Fix For: 0.96.0
Attachments: HBASE-4565-0.92.patch, HBASE-4565-v2.patch, HBASE-4565-v3-0.92.patch, HBASE-4565-v3.patch, HBASE-4565.patch

This is broken in both 0.92 as well as trunk pom.xml. Here's a sample maven log snippet from trunk (from Mayuresh on the user mailing list):

[INFO] [antrun:run {execution: package}]
[INFO] Executing tasks
main:
[mkdir] Created dir: D:\workspace\mkshirsa\hbase-trunk\target\hbase-0.93-SNAPSHOT\hbase-0.93-SNAPSHOT\lib\native\${build.platform}
[exec] ls: cannot access D:workspacemkshirsahbase-trunktarget/nativelib: No such file or directory
[exec] tar (child): Cannot connect to D: resolve failed
[INFO] [ERROR] BUILD ERROR
[INFO] An Ant BuildException has occured: exec returned: 3328

There are two issues:

1) The antrun task below doesn't resolve the Windows file separator returned by ${project.build.directory}; this causes the "resolve failed" above.

<!-- Using Unix cp to preserve symlinks, using script to handle wildcards -->
<echo file="${project.build.directory}/copynativelibs.sh">
if [ `ls ${project.build.directory}/nativelib | wc -l` -ne 0 ]; then

2) The tar argument value below also has a similar issue in that the path arg doesn't resolve right.
<!-- Using Unix tar to preserve symlinks -->
<exec executable="tar" failonerror="yes" dir="${project.build.directory}/${project.artifactId}-${project.version}">
  <arg value="czf"/>
  <arg value="/cygdrive/c/workspaces/hbase-0.92-svn/target/${project.artifactId}-${project.version}.tar.gz"/>
  <arg value="./"/>
</exec>

In both cases, the fix would probably be to use a cross-platform way to handle the directory locations.
[jira] [Commented] (HBASE-5640) bulk load runs slowly than before
[ https://issues.apache.org/jira/browse/HBASE-5640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13240977#comment-13240977 ]

Laxman commented on HBASE-5640:
---
bq. There are many prints of the form. This is possibly a regression caused by a recent patch.
bq. on different filesystem than destination store - moving to this filesystem

@Dhruba, can you please provide more details?

bulk load runs slowly than before

Key: HBASE-5640
URL: https://issues.apache.org/jira/browse/HBASE-5640
Project: HBase
Issue Type: Bug
Reporter: dhruba borthakur
Assignee: dhruba borthakur
Priority: Minor

I am loading data from an external system into hbase. There are many prints of the form "on different filesystem than destination store - moving to this filesystem". This is possibly a regression caused by a recent patch.
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239222#comment-13239222 ]

Laxman commented on HBASE-5564:
---
The findbugs issues reported by the QA bot are about usage of the default encoding. This behavior is in line with existing code.

bug #1
{noformat}
Unknown bug pattern DM_DEFAULT_ENCODING in org.apache.hadoop.hbase.mapreduce.ImportTsv$TsvParser$ParsedLine.getTimestamp()
{noformat}

bug #2
{noformat}
Unknown bug pattern DM_DEFAULT_ENCODING in org.apache.hadoop.hbase.mapreduce.ImportTsv.createSubmittableJob(Configuration, String[])
{noformat}

bug #2 already exists in the code; it is just included in the patch file with no changes. And the test case failures are not because of this patch; they are to be addressed as part of HBASE-5608.
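For context on the DM_DEFAULT_ENCODING pattern cited above: FindBugs flags byte/char conversions that rely on the JVM's platform default encoding, since results then differ between environments. A minimal illustration of the flagged pattern and its usual remedy (this is a generic example, not the patch's code):

```java
import java.nio.charset.StandardCharsets;

public class EncodingExample {
    // Flagged by DM_DEFAULT_ENCODING: String.getBytes() uses whatever the
    // platform default encoding happens to be, so output is environment-dependent.
    static byte[] flagged(String s) {
        return s.getBytes();
    }

    // Remedy: name the charset explicitly so behavior is identical everywhere.
    static byte[] fixed(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(fixed("row1|cf:a|1000").length);
    }
}
```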
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238452#comment-13238452 ]

Laxman commented on HBASE-5564:
---
@Stack, updated the patch after addressing your comments. Thanks for the review.
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238095#comment-13238095 ]

Laxman commented on HBASE-5564:
---
@Anoop, thanks for the clarification.
@Stack, thanks for the review. I will update the patch.

bq. need curlies
bq. NO_TIMESTAMP_KEYCOLUMN_INDEX

I will update the patch for the above 2 comments.

bq. Can you confirm that current behavior -- setting ts to System.currentTimeMillis -- is default? It seems to be ... we set System.currentTimeMillis as time to use setting up the job.

Before the patch, we set ts to System.currentTimeMillis in TsvImporterMapper.doSetup. This setup method is called for each mapper, i.e., for each input split. That means a new timestamp is used for each map task. After the patch, we set ts from conf.getLong, which is the same in all map tasks. Hope I understood your question correctly.
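The before/after difference described in that comment can be sketched as follows. This is an illustrative sketch only, with a plain Map standing in for the Hadoop Configuration; the key name and method names are assumptions, not the patch's actual code.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the timestamp behavior change described above.
public class TimestampSketch {
    static final String TIMESTAMP_CONF_KEY = "importtsv.timestamp"; // illustrative key name

    // Before the patch: each mapper's doSetup() took its own currentTimeMillis,
    // so every map task (one per input split) stamped its KVs differently.
    static long perMapperTs() {
        return System.currentTimeMillis();
    }

    // After the patch: the job driver records one timestamp in the config and
    // every mapper reads back that same value.
    static long jobWideTs(Map<String, Long> conf) {
        return conf.getOrDefault(TIMESTAMP_CONF_KEY, System.currentTimeMillis());
    }

    public static void main(String[] args) {
        Map<String, Long> conf = new HashMap<>();
        conf.put(TIMESTAMP_CONF_KEY, System.currentTimeMillis()); // set once at job-submit time
        long mapper1 = jobWideTs(conf);
        long mapper2 = jobWideTs(conf); // a second "mapper" sees the identical value
        System.out.println(mapper1 == mapper2);
    }
}
```

With a job-wide timestamp, duplicate rows from different splits produce KVs with equal timestamps instead of split-dependent ones, which is the behavior the patch is after.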
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234157#comment-13234157 ]

Laxman commented on HBASE-5564:
---
These tests are passing in my dev environment.
{noformat}
Running org.apache.hadoop.hbase.mapreduce.TestImportTsv
Tests run: 9, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 168.578 sec

Results :

Tests run: 9, Failures: 0, Errors: 0, Skipped: 0

[INFO] --- maven-surefire-plugin:2.12-TRUNK-HBASE-2:test (secondPartTestsExecution) @ hbase ---
[INFO] Tests are skipped.
[INFO] BUILD SUCCESS
{noformat}
Also, I can see these MR tests failing in previous builds as well (HBASE-5529). Will check more.
[jira] [Commented] (HBASE-5608) MR testcases are failing in QA builds
[ https://issues.apache.org/jira/browse/HBASE-5608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234279#comment-13234279 ]

Laxman commented on HBASE-5608:
---
Failing builds for logs:
https://builds.apache.org/job/PreCommit-HBASE-Build/1231/
https://builds.apache.org/job/PreCommit-HBASE-Build/1112/
https://builds.apache.org/job/PreCommit-HBASE-Build/1108/

I had gone through the logs available in these builds, but I couldn't get any clue as to why these testcases are failing. In the case of TestImportTsv, the MR job is failing quietly.

MR testcases are failing in QA builds

Key: HBASE-5608
URL: https://issues.apache.org/jira/browse/HBASE-5608
Project: HBase
Issue Type: Bug
Components: build, mapreduce, test
Affects Versions: 0.92.2
Environment: Hadoop QA - precommit builds
Reporter: Laxman
Priority: Blocker
Labels: build-failure, mapreduce, test-fail

Many of the MR testcases are failing in PreCommit builds (triggered by Hadoop QA). Failing testcases are:
a) TestImportTsv
b) TestHFileOutputFormat
c) TestTableMapReduce
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234281#comment-13234281 ]

Laxman commented on HBASE-5564:
---
Thanks for the info, Ram. I had spent some time analyzing these failures but couldn't get a clue. Filed a separate JIRA, HBASE-5608, to fix these test failures. As mentioned earlier, all these tests are passing in my local environment. Should we wait for HBASE-5608 or proceed with review and commit?
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235320#comment-13235320 ]

Laxman commented on HBASE-5564:
---
Ted, all these comments are related to line wrapping. IMO, the 80-character limit is too low and makes the code a bit ugly. If you strongly feel we need to stick to the 80-character limit, I will fix these comments.
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235342#comment-13235342 ]

Laxman commented on HBASE-5564:
---
Thanks, Ted, for taking the pains to get the lint comments. As you suggested, I will start a discussion on dev@hbase. I just wanted to quote one example from this patch here.
{code}
long timstamp = conf.getLong(TIMESTAMP_CONF_KEY, System.currentTimeMillis());
{code}
After formatting, the snippet above turned into:
{code}
long timstamp = conf
    .getLong(TIMESTAMP_CONF_KEY, System.currentTimeMillis());
{code}
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233283#comment-13233283 ]

Laxman commented on HBASE-5564:
---
bq. Doing this will use the same TS across all the mappers. Is this the intention for this change? So in TsvImporterMapper, conf.getLong(ImportTsv.TIMESTAMP_CONF_KEY, 0) will always have a value to get from conf.

Yes, Anoop. We should have the same timestamp for all mappers. Please check my previous comments on the scope of the issue:
https://issues.apache.org/jira/browse/HBASE-5564?focusedCommentId=13228297&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13228297
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13233356#comment-13233356 ] Laxman commented on HBASE-5564: --- Any idea why the QA bot is not testing this patch? Can someone trigger it explicitly?
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13234082#comment-13234082 ] Laxman commented on HBASE-5564: --- All MR tests seem to be failing. The failures are not because of the patch; I will check them. @anoop In bulkload, if multiple records have the same timestamp, only the last KV entry processed by the reducer will be persisted (due to the TreeSet in the reducer). I don't see this as a behavioral inconsistency. Bulkload can't judge which KV entry should be retained (considering duplicate records can exist across input splits/files). So, in this case, the user can develop a custom MR job to achieve this functionality.
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228297#comment-13228297 ] Laxman commented on HBASE-5564: --- Scope of this issue: 1) Avoid the behavioral inconsistency with the timestamp parameter. {noformat} Currently in code, a) If the timestamp parameter is configured, duplicate records will be overwritten. b) If not configured, some duplicate records are maintained as different versions. {noformat} This fix should be in line with the expectation Todd has mentioned. bq. The whole point is that, in a bulk-load-only workflow, you can identify each bulk load exactly, and correlate it to the MR job that inserted it. 2) Provide an option to look up the timestamp column value from the input data (like the ROWKEY column). Example: importtsv.columns='HBASE_ROW_KEY, HBASE_TS_KEY, emp:name,emp:sal,dept:code' I will submit the patch with the above-mentioned approach. Any other add-ons?
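A hypothetical sketch of how a timestamp column marker could be located in the importtsv.columns specification, mirroring the existing HBASE_ROW_KEY handling. Note that TS_COLUMN_SPEC and parseTimestampIndex are invented names for illustration, not part of the actual ImportTsv API:

```java
public class ColumnSpecDemo {
    // Illustrative marker, mirroring the existing HBASE_ROW_KEY convention.
    static final String TS_COLUMN_SPEC = "HBASE_TS_KEY";

    // Returns the index of the timestamp column in the importtsv.columns
    // specification, or -1 if the user did not configure one.
    static int parseTimestampIndex(String columnsSpec) {
        String[] columns = columnsSpec.split(",");
        for (int i = 0; i < columns.length; i++) {
            if (TS_COLUMN_SPEC.equals(columns[i].trim())) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String spec = "HBASE_ROW_KEY,HBASE_TS_KEY,emp:name,emp:sal,dept:code";
        System.out.println(parseTimestampIndex(spec)); // 1
    }
}
```

With an index resolved this way, the mapper could read the timestamp from that field of each input line instead of using a single job-wide value.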
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228406#comment-13228406 ] Laxman commented on HBASE-5564: --- While testing the patch locally, I'm getting the following error on trunk. Any hints on this, please? {noformat} java.lang.RuntimeException: java.io.IOException: Call to localhost/127.0.0.1:0 failed on local exception: java.net.BindException: Cannot assign requested address: no further information at org.apache.hadoop.mapred.MiniMRCluster.waitUntilIdle(MiniMRCluster.java:323) at org.apache.hadoop.mapred.MiniMRCluster.<init>(MiniMRCluster.java:524) at org.apache.hadoop.mapred.MiniMRCluster.<init>(MiniMRCluster.java:462) at org.apache.hadoop.mapred.MiniMRCluster.<init>(MiniMRCluster.java:454) at org.apache.hadoop.mapred.MiniMRCluster.<init>(MiniMRCluster.java:446) at org.apache.hadoop.mapred.MiniMRCluster.<init>(MiniMRCluster.java:436) at org.apache.hadoop.mapred.MiniMRCluster.<init>(MiniMRCluster.java:426) at org.apache.hadoop.mapred.MiniMRCluster.<init>(MiniMRCluster.java:417) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniMapReduceCluster(HBaseTestingUtility.java:1269) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniMapReduceCluster(HBaseTestingUtility.java:1255) at org.apache.hadoop.hbase.mapreduce.TestImportTsv.doMROnTableTest(TestImportTsv.java:189) at org.apache.hadoop.hbase.mapreduce.TestImportTsv.testMROnTable(TestImportTsv.java:162) {noformat}
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228936#comment-13228936 ] Laxman commented on HBASE-5564: --- Thanks Stack. Let me give it a try.
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13227597#comment-13227597 ] Laxman commented on HBASE-5564: --- I think this is a bug and not any intentional behavior. The usage of TreeSet in the below code snippet from PutSortReducer.reduce() is causing the issue. {code} TreeSet<KeyValue> map = new TreeSet<KeyValue>(KeyValue.COMPARATOR); long curSize = 0; // stop at the end or the RAM threshold while (iter.hasNext() && curSize < threshold) { Put p = iter.next(); for (List<KeyValue> kvs : p.getFamilyMap().values()) { for (KeyValue kv : kvs) { map.add(kv); curSize += kv.getLength(); } } {code} Changing this back to a List and then sorting explicitly will solve the issue.
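The TreeSet behavior described in this comment can be reproduced without any HBase classes. A minimal sketch, assuming a simplified Cell stand-in for KeyValue (the class and method names here are illustrative, not HBase APIs):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Minimal stand-in for KeyValue: compared by row key and timestamp only,
// so two cells with the same row and timestamp compare as equal.
class Cell {
    final String row;
    final long ts;
    final String value;
    Cell(String row, long ts, String value) { this.row = row; this.ts = ts; this.value = value; }
    static final Comparator<Cell> COMPARATOR =
        Comparator.comparing((Cell c) -> c.row).thenComparingLong(c -> c.ts);
}

public class TreeSetDedupDemo {
    // Mimics the reducer's TreeSet: cells that compare as equal are silently dropped.
    static int collectWithTreeSet(List<Cell> cells) {
        TreeSet<Cell> set = new TreeSet<>(Cell.COMPARATOR);
        set.addAll(cells); // duplicates (same row + ts) are discarded here
        return set.size();
    }

    // Proposed alternative: keep everything in a List and sort explicitly.
    static int collectWithList(List<Cell> cells) {
        List<Cell> list = new ArrayList<>(cells);
        list.sort(Cell.COMPARATOR); // sorting does not remove duplicates
        return list.size();
    }

    public static void main(String[] args) {
        List<Cell> input = Arrays.asList(
            new Cell("row1", 100L, "a"),
            new Cell("row1", 100L, "b"), // duplicate row + timestamp
            new Cell("row2", 100L, "c"));
        System.out.println(collectWithTreeSet(input)); // 2 -- one duplicate lost
        System.out.println(collectWithList(input));    // 3 -- all cells retained
    }
}
```

The sketch shows why the List-plus-sort approach alone is not sufficient once all cells in a split share one timestamp: the List keeps both cells, but they are still identical from HBase's point of view, which leads to the timestamp discussion in the following comments.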
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13227678#comment-13227678 ] Laxman commented on HBASE-5564: --- I tested again with the proposed patch. bq. Changing this back to a List and then sorting explicitly will solve the issue. Still the same problem persists, making this issue a bit more complicated. I think the usage of the same timestamp for all records in a split is causing the issue. Currently in the code: a) If configured, we use a static timestamp for all mappers. b) If not configured, we use the current system time generated for each split. TsvImporterMapper.doSetup {code} ts = conf.getLong(ImportTsv.TIMESTAMP_CONF_KEY, System.currentTimeMillis()); {code} Should we think of an approach to generate a unique sequence number and use it as the timestamp? Any other thoughts?
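The two configuration cases above can be sketched with a plain map standing in for Hadoop's Configuration. The getLong helper below mimics Configuration.getLong's fallback behavior; the key name "importtsv.timestamp" follows the importtsv convention but should be treated as an assumption here:

```java
import java.util.HashMap;
import java.util.Map;

public class TimestampConfigDemo {
    // Minimal stand-in for Hadoop's Configuration.getLong(key, defaultValue).
    static long getLong(Map<String, String> conf, String key, long defaultValue) {
        String v = conf.get(key);
        return v == null ? defaultValue : Long.parseLong(v);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // (a) configured: every mapper sees the same static timestamp
        conf.put("importtsv.timestamp", "42");
        System.out.println(getLong(conf, "importtsv.timestamp", System.currentTimeMillis())); // 42
        // (b) not configured: each mapper falls back to its own current time,
        // evaluated in its own JVM, so different splits get different timestamps
        conf.remove("importtsv.timestamp");
        long ts = getLong(conf, "importtsv.timestamp", System.currentTimeMillis());
        System.out.println(ts > 0);
    }
}
```

Case (b) is the source of the inconsistency: the default expression is evaluated per map task, so duplicates within a split collide on one timestamp while duplicates across splits do not.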
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228212#comment-13228212 ] Laxman commented on HBASE-5564: --- bq. ts++, or ts--, could be an option? ts++ or ts-- will not solve this problem. The reason is that each mapper spawns a new JVM and ts will be reset to its initial value, so there is still a chance of ts collision. bq. that the timestamps are all identical. The whole point is that, in a bulk-load-only workflow, you can identify each bulk load exactly, and correlate it to the MR job that inserted it. No Todd. At least the implementation is buggy enough that it does not match this expected behavior. A new timestamp is generated for each map task (i.e., for each split) in TsvImporterMapper.doSetup. Please check my previous comments. bq. So this is only about ImportTsv? Should change the title in that case. I'm not aware of what other tools come under bulkload. The bulkload documentation talks only about importtsv. http://hbase.apache.org/bulk-loads.html But if you feel we should change the title, feel free to modify it. bq. If you want to use custom timestamps, you should specify a timestamp column in your data, or write your own MR job (ImportTsv is just an example which is useful for some cases, but for anything advanced I would expect users to write their own code) I think we can provide a provision to specify the timestamp column (like the ROWKEY column) as an argument. Example: importtsv.columns='HBASE_ROW_KEY, HBASE_TS_KEY, emp:name,emp:sal,dept:code' This makes importtsv more usable. Otherwise, the user has to copy-paste the entire importtsv code and make this minor modification. Please let me know your suggestions on this.
[jira] [Commented] (HBASE-5531) Maven hadoop profile (version 23) needs to be updated with latest 23 snapshot
[ https://issues.apache.org/jira/browse/HBASE-5531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13223223#comment-13223223 ] Laxman commented on HBASE-5531: --- This patch involves build XML (pom.xml) changes only. The above -1s are irrelevant to the changes. Maven hadoop profile (version 23) needs to be updated with latest 23 snapshot - Key: HBASE-5531 URL: https://issues.apache.org/jira/browse/HBASE-5531 Project: HBase Issue Type: Bug Components: build Affects Versions: 0.92.2 Reporter: Laxman Labels: build Fix For: 0.92.2, 0.96.0 Attachments: HBASE-5531-trunk.patch, HBASE-5531.patch The current profile is still pointing to 0.23.1-SNAPSHOT. The build is failing because 0.23.1 has already been released and its snapshot is not available anymore. We can update this to 0.23.2-SNAPSHOT.