[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245999#comment-13245999 ]

Laxman commented on HBASE-5564:
---
Yes, Stack. I will take a look. The changes in this patch are in the default mapper; IMO, they shouldn't cause failures in a custom mapper.

Bulkload is discarding duplicate records

Key: HBASE-5564
URL: https://issues.apache.org/jira/browse/HBASE-5564
Project: HBase
Issue Type: Bug
Components: mapreduce
Affects Versions: 0.96.0
Environment: HBase 0.92
Reporter: Laxman
Assignee: Laxman
Labels: bulkloader
Fix For: 0.96.0
Attachments: 5564.lint, 5564v5.txt, HBASE-5564_trunk.1.patch, HBASE-5564_trunk.2.patch, HBASE-5564_trunk.3.patch, HBASE-5564_trunk.4_final.patch, HBASE-5564_trunk.patch

Duplicate records are discarded when they exist in the same input file and, more specifically, when they exist in the same split. Duplicate records are considered if the records are from different splits.

Version under test: HBase 0.92

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242189#comment-13242189 ]

Laxman commented on HBASE-5564:
---
Thanks for the commit, Stack.
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241078#comment-13241078 ]

Laxman commented on HBASE-5564:
---
@stack, thanks for your review and for clearing the findbugs warnings. I had been avoiding these changes as they are unrelated to this JIRA.
@ram, thanks for reviewing the patch.
[jira] [Commented] (HBASE-1697) Discretionary access control
[ https://issues.apache.org/jira/browse/HBASE-1697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241232#comment-13241232 ]

Laxman commented on HBASE-1697:
---
No updates here in a long time. From my understanding, making HBase secure will need huge contributions in this area. It also involves many challenges (architectural changes, maintaining/breaking compatibility, ...). In spite of these challenges, it adds a lot of value to HBase. Is anyone interested in looking into these security issues?

Discretionary access control

Key: HBASE-1697
URL: https://issues.apache.org/jira/browse/HBASE-1697
Project: HBase
Issue Type: Improvement
Components: security
Reporter: Andrew Purtell
Assignee: Andrew Purtell

Consider implementing discretionary access control for HBase. Access control has three aspects: authentication, authorization and audit.
- Authentication: Access is controlled by insisting on an authentication procedure to establish the identity of the user. The authentication procedure should minimally require a non-plaintext authentication factor (e.g. encrypted password with salt) and should ideally, or at least optionally, provide cryptographically strong confidence via public key certification.
- Authorization: Access is controlled by specifying rights to resources via an access control list (ACL). An ACL is a list of permissions attached to an object. The list specifies who or what is allowed to access the object and what operations are allowed to be performed on it, e.g. create, update, read, or delete.
- Audit: Important actions taken by subjects should be logged for accountability: a chronological record which enables the full reconstruction and examination of a sequence of events, e.g. schema changes or data mutations. Logging activity should be protected from all subjects except for a restricted set with administrative privilege, perhaps only a single super-user.
Discretionary access control means the access policy for an object is determined by the owner of the object. Every object in the system must have a valid owner. Owners can assign access rights and permissions to other users. The initial owner of an object is the subject who created it. If subjects are deleted from a system, ownership of objects owned by them should revert to some super-user or other valid default.

HBase can enforce access policy at table, column family, or cell granularity. Cell granularity does not make much sense. An implementation which controls access at both the table and column family levels is recommended, though a first cut could consider control at the table level only. The initial set of permissions can be: create (table schema or column family), update (table schema or column family), read (column family), delete (table or column family), execute (filters), and transfer ownership.

The subject identities and access tokens could be stored in a new administrative table. ACLs on tables and column families can be stored in META. Access other than read access to catalog and administrative tables should be restricted to a set of administrative users, or perhaps a single super-user. A data mutation on a user table by a subject without administrative or super-user privilege which results in a table split is an implicit temporary privilege elevation, where the regionserver or master updates the catalog tables as necessary to support the split. Audit logging should be configurable on a per-table basis to avoid this overhead where it is not wanted.

Consider supporting external authentication and subject identification mechanisms with Java library support: RADIUS/TACACS, Kerberos, LDAP.
Consider logging audit trails to an HBase table (bigtable-type schemas are natural for this) and optionally external logging options with Java library support -- syslog, etc. -- or maybe commons-logging is sufficient; punt to the administrator to set up appropriate commons-logging/log4j configurations for their needs.

If HBASE-1002 is considered, and the option to support filtering via upload of (perhaps complex) bytecode produced by some little-language compiler is implemented, the execute privilege could be extended in a manner similar to how stored procedures in SQL land execute either with the privilege of the current user or of the (table/procedure) creator.
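The table/column-family permission model proposed above can be sketched roughly as follows. This is a minimal, illustrative sketch only: every name here (AclSketch, Permission, grant, allowed) is hypothetical, not an actual HBase API, and the fallback from column-family to table-level grants is one possible reading of the "control at both levels" recommendation.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the table / column-family ACL model described above.
public class AclSketch {
    enum Permission { CREATE, UPDATE, READ, DELETE, EXEC, TRANSFER_OWNERSHIP }

    // ACL entries keyed by "table" or "table:family", mirroring the proposal
    // to attach rights at both the table and column-family levels.
    static final Map<String, Map<String, EnumSet<Permission>>> acls = new HashMap<>();

    static void grant(String resource, String user, Permission... perms) {
        acls.computeIfAbsent(resource, r -> new HashMap<>())
            .computeIfAbsent(user, u -> EnumSet.noneOf(Permission.class))
            .addAll(Arrays.asList(perms));
    }

    // A column-family check falls back to the table-level grant, so a first
    // cut that only grants at the table level still works.
    static boolean allowed(String table, String family, String user, Permission p) {
        Map<String, EnumSet<Permission>> cf = acls.get(table + ":" + family);
        if (cf != null && cf.getOrDefault(user, EnumSet.noneOf(Permission.class)).contains(p)) {
            return true;
        }
        Map<String, EnumSet<Permission>> tbl = acls.get(table);
        return tbl != null && tbl.getOrDefault(user, EnumSet.noneOf(Permission.class)).contains(p);
    }

    public static void main(String[] args) {
        grant("orders", "alice", Permission.READ, Permission.UPDATE);
        grant("orders:cf1", "bob", Permission.READ);
        System.out.println(allowed("orders", "cf1", "alice", Permission.READ)); // via table-level grant
        System.out.println(allowed("orders", "cf1", "bob", Permission.UPDATE)); // never granted
    }
}
```

In a real implementation the `acls` map would live in the administrative table and META, as the proposal suggests, rather than in process memory.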
[jira] [Commented] (HBASE-1697) Discretionary access control
[ https://issues.apache.org/jira/browse/HBASE-1697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242081#comment-13242081 ]

Laxman commented on HBASE-1697:
---
Thanks, Gary, for the info on security. I'm going through the current implementation and will soon take up some JIRAs.
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13240241#comment-13240241 ]

Laxman commented on HBASE-5564:
---
Another problem found in my testing: an invalid timestamp does not respect the skip.bad.lines configuration. I will update the patch for this as well, adding some unit tests too.
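The intended skip.bad.lines behavior the comment refers to can be sketched like this. The names and semantics here are assumptions for illustration, not the actual ImportTsv code: with the flag set, a record with an unparseable timestamp is counted and dropped; without it, the task fails.

```java
// Illustrative stand-in for the skip.bad.lines handling described above;
// names here are hypothetical, not the actual ImportTsv implementation.
public class BadLineSketch {
    static long badLineCount = 0;

    static Long parseTimestamp(String field, boolean skipBadLines) {
        try {
            return Long.parseLong(field);
        } catch (NumberFormatException e) {
            if (skipBadLines) {
                badLineCount++;   // count it and move on, as with other bad lines
                return null;      // caller drops this record
            }
            // without the flag, an invalid timestamp should fail the task
            throw new IllegalArgumentException("Invalid timestamp: " + field, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseTimestamp("1000", true)); // valid timestamp
        System.out.println(parseTimestamp("oops", true)); // skipped, yields null
    }
}
```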
[jira] [Commented] (HBASE-4565) Maven HBase build broken on cygwin with copynativelib.sh call.
[ https://issues.apache.org/jira/browse/HBASE-4565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13240331#comment-13240331 ]

Laxman commented on HBASE-4565:
---
Is it OK if I rebase this patch to trunk? I need it to build in my Windows environment.

Maven HBase build broken on cygwin with copynativelib.sh call.

Key: HBASE-4565
URL: https://issues.apache.org/jira/browse/HBASE-4565
Project: HBase
Issue Type: Bug
Components: build
Affects Versions: 0.92.0
Environment: cygwin (on XP and Win7)
Reporter: Suraj Varma
Assignee: Suraj Varma
Labels: build, maven
Fix For: 0.96.0
Attachments: HBASE-4565-0.92.patch, HBASE-4565-v2.patch, HBASE-4565-v3-0.92.patch, HBASE-4565-v3.patch, HBASE-4565.patch

This is broken in both 0.92 as well as trunk pom.xml. Here's a sample maven log snippet from trunk (from Mayuresh on the user mailing list):

[INFO] [antrun:run {execution: package}]
[INFO] Executing tasks
main:
[mkdir] Created dir: D:\workspace\mkshirsa\hbase-trunk\target\hbase-0.93-SNAPSHOT\hbase-0.93-SNAPSHOT\lib\native\${build.platform}
[exec] ls: cannot access D:workspacemkshirsahbase-trunktarget/nativelib: No such file or directory
[exec] tar (child): Cannot connect to D: resolve failed
[INFO] [ERROR] BUILD ERROR
[INFO] An Ant BuildException has occured: exec returned: 3328

There are two issues:

1) The antrun task below doesn't resolve the Windows file separator returned by ${project.build.directory}; this causes the "resolve failed" above.

<!-- Using Unix cp to preserve symlinks, using script to handle wildcards -->
<echo file="${project.build.directory}/copynativelibs.sh">
if [ `ls ${project.build.directory}/nativelib | wc -l` -ne 0 ]; then

2) The tar argument value below also has a similar issue in that the path arg doesn't resolve right.
<!-- Using Unix tar to preserve symlinks -->
<exec executable="tar" failonerror="yes" dir="${project.build.directory}/${project.artifactId}-${project.version}">
  <arg value="czf"/>
  <arg value="/cygdrive/c/workspaces/hbase-0.92-svn/target/${project.artifactId}-${project.version}.tar.gz"/>
  <arg value="./"/>
</exec>

In both cases, the fix would probably be to use a cross-platform way to handle the directory locations.
[jira] [Commented] (HBASE-5640) bulk load runs slowly than before
[ https://issues.apache.org/jira/browse/HBASE-5640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13240977#comment-13240977 ]

Laxman commented on HBASE-5640:
---
bq. There are many prints of the form. This is possibly a regression caused by a recent patch.
bq. on different filesystem than destination store - moving to this filesystem

@Dhruba, can you please provide more details?

bulk load runs slowly than before

Key: HBASE-5640
URL: https://issues.apache.org/jira/browse/HBASE-5640
Project: HBase
Issue Type: Bug
Reporter: dhruba borthakur
Assignee: dhruba borthakur
Priority: Minor

I am loading data from an external system into hbase. There are many prints of the form "on different filesystem than destination store - moving to this filesystem". This is possibly a regression caused by a recent patch.
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239222#comment-13239222 ]

Laxman commented on HBASE-5564:
---
The findbugs issues reported by the QA bot are about usage of the default encoding. This behavior is in line with existing code.

bug #1
{noformat}
Unknown bug pattern DM_DEFAULT_ENCODING in org.apache.hadoop.hbase.mapreduce.ImportTsv$TsvParser$ParsedLine.getTimestamp()
{noformat}

bug #2
{noformat}
Unknown bug pattern DM_DEFAULT_ENCODING in org.apache.hadoop.hbase.mapreduce.ImportTsv.createSubmittableJob(Configuration, String[])
{noformat}

bug #2 already exists in the code; it is just included in the patch file with no changes. And the test case failures are not because of this patch; they are to be addressed as part of HBASE-5608.
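For context on the DM_DEFAULT_ENCODING pattern cited above: FindBugs flags byte/char conversions that rely on the JVM's platform default encoding, since results then differ between environments. A minimal illustration of the flagged pattern and its usual remedy (this is a generic example, not the patch's code):

```java
import java.nio.charset.StandardCharsets;

public class EncodingExample {
    // Flagged by DM_DEFAULT_ENCODING: String.getBytes() uses whatever the
    // platform default encoding happens to be, so output is environment-dependent.
    static byte[] flagged(String s) {
        return s.getBytes();
    }

    // Remedy: name the charset explicitly so behavior is identical everywhere.
    static byte[] fixed(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(fixed("row1|cf:a|1000").length);
    }
}
```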
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238452#comment-13238452 ]

Laxman commented on HBASE-5564:
---
@Stack, updated the patch after addressing your comments. Thanks for the review.
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238095#comment-13238095 ]

Laxman commented on HBASE-5564:
---
@Anoop, thanks for the clarification.
@Stack, thanks for the review. I will update the patch.

bq. need curlies
bq. NO_TIMESTAMP_KEYCOLUMN_INDEX

I will update the patch for the above 2 comments.

bq. Can you confirm that current behavior -- setting ts to System.currentTimeMillis -- is default? It seems to be ... we set System.currentTimeMillis as time to use setting up the job.

Before the patch, we set ts to System.currentTimeMillis in TsvImporterMapper.doSetup. This setup method is called for each mapper, i.e., for each input split. That means a new timestamp is used for each map task. After the patch, we set ts from conf.getLong, which is the same in all map tasks. Hope I understood your question correctly.
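The before/after difference described in that comment can be sketched as follows. This is an illustrative sketch only, with a plain Map standing in for the Hadoop Configuration; the key name and method names are assumptions, not the patch's actual code.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the timestamp behavior change described above.
public class TimestampSketch {
    static final String TIMESTAMP_CONF_KEY = "importtsv.timestamp"; // illustrative key name

    // Before the patch: each mapper's doSetup() took its own currentTimeMillis,
    // so every map task (one per input split) stamped its KVs differently.
    static long perMapperTs() {
        return System.currentTimeMillis();
    }

    // After the patch: the job driver records one timestamp in the config and
    // every mapper reads back that same value.
    static long jobWideTs(Map<String, Long> conf) {
        return conf.getOrDefault(TIMESTAMP_CONF_KEY, System.currentTimeMillis());
    }

    public static void main(String[] args) {
        Map<String, Long> conf = new HashMap<>();
        conf.put(TIMESTAMP_CONF_KEY, System.currentTimeMillis()); // set once at job-submit time
        long mapper1 = jobWideTs(conf);
        long mapper2 = jobWideTs(conf); // a second "mapper" sees the identical value
        System.out.println(mapper1 == mapper2);
    }
}
```

With a job-wide timestamp, duplicate rows from different splits produce KVs with equal timestamps instead of split-dependent ones, which is the behavior the patch is after.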
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234157#comment-13234157 ]

Laxman commented on HBASE-5564:
---
These tests are passing in my dev environment.
{noformat}
Running org.apache.hadoop.hbase.mapreduce.TestImportTsv
Tests run: 9, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 168.578 sec

Results :

Tests run: 9, Failures: 0, Errors: 0, Skipped: 0

[INFO] --- maven-surefire-plugin:2.12-TRUNK-HBASE-2:test (secondPartTestsExecution) @ hbase ---
[INFO] Tests are skipped.
[INFO] BUILD SUCCESS
{noformat}
Also, I can see these MR tests failing in previous builds as well (HBASE-5529). Will check more.
[jira] [Commented] (HBASE-5608) MR testcases are failing in QA builds
[ https://issues.apache.org/jira/browse/HBASE-5608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234279#comment-13234279 ]

Laxman commented on HBASE-5608:
---
Failing builds for logs:
https://builds.apache.org/job/PreCommit-HBASE-Build/1231/
https://builds.apache.org/job/PreCommit-HBASE-Build/1112/
https://builds.apache.org/job/PreCommit-HBASE-Build/1108/

I had gone through the logs available in these builds, but I couldn't get any clue as to why these testcases are failing. In the case of TestImportTsv, the MR job is failing quietly.

MR testcases are failing in QA builds

Key: HBASE-5608
URL: https://issues.apache.org/jira/browse/HBASE-5608
Project: HBase
Issue Type: Bug
Components: build, mapreduce, test
Affects Versions: 0.92.2
Environment: Hadoop QA - precommit builds
Reporter: Laxman
Priority: Blocker
Labels: build-failure, mapreduce, test-fail

Many of the MR testcases are failing in PreCommit builds (triggered by Hadoop QA). Failing testcases are:
a) TestImportTsv
b) TestHFileOutputFormat
c) TestTableMapReduce
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234281#comment-13234281 ]

Laxman commented on HBASE-5564:
---
Thanks for the info, Ram. I had spent some time analyzing these failures but couldn't get a clue. Filed a separate JIRA, HBASE-5608, to fix these test failures. As mentioned earlier, all these tests are passing in my local environment. Should we wait for HBASE-5608 or proceed with review and commit?
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235320#comment-13235320 ]

Laxman commented on HBASE-5564:
---
Ted, all these comments are related to line wrapping. IMO, the 80-character limit is too low and makes the code a bit ugly. If you strongly feel we need to stick to the 80-character limit, I will fix these comments.
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235342#comment-13235342 ]

Laxman commented on HBASE-5564:
---
Thanks, Ted, for taking the pains to get the lint comments. As you suggested, I will start a discussion on dev@hbase. I just wanted to quote one example from this patch here.
{code}
long timstamp = conf.getLong(TIMESTAMP_CONF_KEY, System.currentTimeMillis());
{code}
After formatting, the snippet above turned into:
{code}
long timstamp = conf
    .getLong(TIMESTAMP_CONF_KEY, System.currentTimeMillis());
{code}
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233283#comment-13233283 ]

Laxman commented on HBASE-5564:
---
bq. Doing this will use the same TS across all the mappers. Is this the intention for this change? So in TsvImporterMapper, conf.getLong(ImportTsv.TIMESTAMP_CONF_KEY, 0) will always have a value to get from conf.

Yes, Anoop. We should have the same timestamp for all mappers. Please check my previous comments on the scope of the issue:
https://issues.apache.org/jira/browse/HBASE-5564?focusedCommentId=13228297&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13228297
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13233356#comment-13233356 ] Laxman commented on HBASE-5564: --- Any idea why the QA bot is not testing this patch? Can someone trigger it explicitly?
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13234082#comment-13234082 ] Laxman commented on HBASE-5564: --- All MR tests seem to be failing. The failures are not because of the patch; I will check them. @anoop In bulkload, if multiple records have the same timestamp, only the last KV entry processed by the reducer will be persisted (due to the TreeSet in the reducer). I don't see this as a behavioral inconsistency. Bulkload can't judge which KV entry should be retained (considering duplicate records can exist across input splits/files). So, in this case, the user can develop a custom MR job to achieve this functionality.
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228297#comment-13228297 ] Laxman commented on HBASE-5564: --- Scope of this issue: 1) Avoid the behavioral inconsistency with the timestamp parameter. {noformat} Currently in code, a) If the timestamp parameter is configured, duplicate records will be overwritten. b) If not configured, some duplicate records are maintained as different versions. {noformat} This fix should be in line with the expectation Todd has mentioned. bq. The whole point is that, in a bulk-load-only workflow, you can identify each bulk load exactly, and correlate it to the MR job that inserted it. 2) Provide an option to look up the timestamp column value from the input data (like the ROWKEY column). Example: importtsv.columns='HBASE_ROW_KEY, HBASE_TS_KEY, emp:name,emp:sal,dept:code' I will submit the patch with the above-mentioned approach. Any other add-ons?
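A hypothetical sketch of how a timestamp column marker could be located in the importtsv.columns specification, mirroring the existing HBASE_ROW_KEY handling. Note that TS_COLUMN_SPEC and parseTimestampIndex are invented names for illustration, not part of the actual ImportTsv API:

```java
public class ColumnSpecDemo {
    // Illustrative marker, mirroring the existing HBASE_ROW_KEY convention.
    static final String TS_COLUMN_SPEC = "HBASE_TS_KEY";

    // Returns the index of the timestamp column in the importtsv.columns
    // specification, or -1 if the user did not configure one.
    static int parseTimestampIndex(String columnsSpec) {
        String[] columns = columnsSpec.split(",");
        for (int i = 0; i < columns.length; i++) {
            if (TS_COLUMN_SPEC.equals(columns[i].trim())) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String spec = "HBASE_ROW_KEY,HBASE_TS_KEY,emp:name,emp:sal,dept:code";
        System.out.println(parseTimestampIndex(spec)); // 1
    }
}
```

With an index resolved this way, the mapper could read the timestamp from that field of each input line instead of using a single job-wide value.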
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228406#comment-13228406 ] Laxman commented on HBASE-5564: --- While testing the patch locally, I'm getting the following error on trunk. Any hints on this, please? {noformat} java.lang.RuntimeException: java.io.IOException: Call to localhost/127.0.0.1:0 failed on local exception: java.net.BindException: Cannot assign requested address: no further information at org.apache.hadoop.mapred.MiniMRCluster.waitUntilIdle(MiniMRCluster.java:323) at org.apache.hadoop.mapred.MiniMRCluster.<init>(MiniMRCluster.java:524) at org.apache.hadoop.mapred.MiniMRCluster.<init>(MiniMRCluster.java:462) at org.apache.hadoop.mapred.MiniMRCluster.<init>(MiniMRCluster.java:454) at org.apache.hadoop.mapred.MiniMRCluster.<init>(MiniMRCluster.java:446) at org.apache.hadoop.mapred.MiniMRCluster.<init>(MiniMRCluster.java:436) at org.apache.hadoop.mapred.MiniMRCluster.<init>(MiniMRCluster.java:426) at org.apache.hadoop.mapred.MiniMRCluster.<init>(MiniMRCluster.java:417) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniMapReduceCluster(HBaseTestingUtility.java:1269) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniMapReduceCluster(HBaseTestingUtility.java:1255) at org.apache.hadoop.hbase.mapreduce.TestImportTsv.doMROnTableTest(TestImportTsv.java:189) at org.apache.hadoop.hbase.mapreduce.TestImportTsv.testMROnTable(TestImportTsv.java:162) {noformat}
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228936#comment-13228936 ] Laxman commented on HBASE-5564: --- Thanks Stack. Let me give it a try.
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13227597#comment-13227597 ] Laxman commented on HBASE-5564: --- I think this is a bug and not any intentional behavior. The usage of TreeSet in the below code snippet from PutSortReducer.reduce() is causing the issue. {code} TreeSet<KeyValue> map = new TreeSet<KeyValue>(KeyValue.COMPARATOR); long curSize = 0; // stop at the end or the RAM threshold while (iter.hasNext() && curSize < threshold) { Put p = iter.next(); for (List<KeyValue> kvs : p.getFamilyMap().values()) { for (KeyValue kv : kvs) { map.add(kv); curSize += kv.getLength(); } } {code} Changing this back to a List and then sorting explicitly will solve the issue.
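The TreeSet behavior described in this comment can be reproduced without any HBase classes. A minimal sketch, assuming a simplified Cell stand-in for KeyValue (the class and method names here are illustrative, not HBase APIs):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Minimal stand-in for KeyValue: compared by row key and timestamp only,
// so two cells with the same row and timestamp compare as equal.
class Cell {
    final String row;
    final long ts;
    final String value;
    Cell(String row, long ts, String value) { this.row = row; this.ts = ts; this.value = value; }
    static final Comparator<Cell> COMPARATOR =
        Comparator.comparing((Cell c) -> c.row).thenComparingLong(c -> c.ts);
}

public class TreeSetDedupDemo {
    // Mimics the reducer's TreeSet: cells that compare as equal are silently dropped.
    static int collectWithTreeSet(List<Cell> cells) {
        TreeSet<Cell> set = new TreeSet<>(Cell.COMPARATOR);
        set.addAll(cells); // duplicates (same row + ts) are discarded here
        return set.size();
    }

    // Proposed alternative: keep everything in a List and sort explicitly.
    static int collectWithList(List<Cell> cells) {
        List<Cell> list = new ArrayList<>(cells);
        list.sort(Cell.COMPARATOR); // sorting does not remove duplicates
        return list.size();
    }

    public static void main(String[] args) {
        List<Cell> input = Arrays.asList(
            new Cell("row1", 100L, "a"),
            new Cell("row1", 100L, "b"), // duplicate row + timestamp
            new Cell("row2", 100L, "c"));
        System.out.println(collectWithTreeSet(input)); // 2 -- one duplicate lost
        System.out.println(collectWithList(input));    // 3 -- all cells retained
    }
}
```

The sketch shows why the List-plus-sort approach alone is not sufficient once all cells in a split share one timestamp: the List keeps both cells, but they are still identical from HBase's point of view, which leads to the timestamp discussion in the following comments.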
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13227678#comment-13227678 ] Laxman commented on HBASE-5564: --- I tested again with the proposed patch. bq. Changing this back to a List and then sorting explicitly will solve the issue. Still the same problem persists, making this issue a bit more complicated. I think the usage of the same timestamp for all records in a split is causing the issue. Currently in the code: a) If configured, we use a static timestamp for all mappers. b) If not configured, we use the current system time generated for each split. TsvImporterMapper.doSetup {code} ts = conf.getLong(ImportTsv.TIMESTAMP_CONF_KEY, System.currentTimeMillis()); {code} Should we think of an approach to generate a unique sequence number and use it as the timestamp? Any other thoughts?
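The two configuration cases above can be sketched with a plain map standing in for Hadoop's Configuration. The getLong helper below mimics Configuration.getLong's fallback behavior; the key name "importtsv.timestamp" follows the importtsv convention but should be treated as an assumption here:

```java
import java.util.HashMap;
import java.util.Map;

public class TimestampConfigDemo {
    // Minimal stand-in for Hadoop's Configuration.getLong(key, defaultValue).
    static long getLong(Map<String, String> conf, String key, long defaultValue) {
        String v = conf.get(key);
        return v == null ? defaultValue : Long.parseLong(v);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // (a) configured: every mapper sees the same static timestamp
        conf.put("importtsv.timestamp", "42");
        System.out.println(getLong(conf, "importtsv.timestamp", System.currentTimeMillis())); // 42
        // (b) not configured: each mapper falls back to its own current time,
        // evaluated in its own JVM, so different splits get different timestamps
        conf.remove("importtsv.timestamp");
        long ts = getLong(conf, "importtsv.timestamp", System.currentTimeMillis());
        System.out.println(ts > 0);
    }
}
```

Case (b) is the source of the inconsistency: the default expression is evaluated per map task, so duplicates within a split collide on one timestamp while duplicates across splits do not.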
[jira] [Commented] (HBASE-5564) Bulkload is discarding duplicate records
[ https://issues.apache.org/jira/browse/HBASE-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228212#comment-13228212 ] Laxman commented on HBASE-5564: --- bq. ts++, or ts--, could be an option? ts++ or ts-- will not solve this problem. The reason is that each mapper spawns a new JVM and ts will be reset to its initial value, so there is still a chance of ts collision. bq. that the timestamps are all identical. The whole point is that, in a bulk-load-only workflow, you can identify each bulk load exactly, and correlate it to the MR job that inserted it. No Todd. At least the implementation is buggy enough that it does not match this expected behavior. A new timestamp is generated for each map task (i.e., for each split) in TsvImporterMapper.doSetup. Please check my previous comments. bq. So this is only about ImportTsv? Should change the title in that case. I'm not aware of what other tools come under bulkload. The bulkload documentation talks only about importtsv. http://hbase.apache.org/bulk-loads.html But if you feel we should change the title, feel free to modify it. bq. If you want to use custom timestamps, you should specify a timestamp column in your data, or write your own MR job (ImportTsv is just an example which is useful for some cases, but for anything advanced I would expect users to write their own code) I think we can provide a provision to specify the timestamp column (like the ROWKEY column) as an argument. Example: importtsv.columns='HBASE_ROW_KEY, HBASE_TS_KEY, emp:name,emp:sal,dept:code' This makes importtsv more usable. Otherwise, the user has to copy-paste the entire importtsv code and make this minor modification. Please let me know your suggestions on this.
[jira] [Commented] (HBASE-5531) Maven hadoop profile (version 23) needs to be updated with latest 23 snapshot
[ https://issues.apache.org/jira/browse/HBASE-5531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13223223#comment-13223223 ] Laxman commented on HBASE-5531: --- This patch involves build XML (pom.xml) changes only. The above -1s are irrelevant to the changes. Maven hadoop profile (version 23) needs to be updated with latest 23 snapshot - Key: HBASE-5531 URL: https://issues.apache.org/jira/browse/HBASE-5531 Project: HBase Issue Type: Bug Components: build Affects Versions: 0.92.2 Reporter: Laxman Labels: build Fix For: 0.92.2, 0.96.0 Attachments: HBASE-5531-trunk.patch, HBASE-5531.patch The current profile is still pointing to 0.23.1-SNAPSHOT. The build is failing because 0.23.1 has already been released and its snapshot is not available anymore. We can update this to 0.23.2-SNAPSHOT.