[jira] [Commented] (HBASE-19731) TestFromClientSide#testCheckAndDeleteWithCompareOp and testNullQualifier are flakey

2018-01-09 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16319561#comment-16319561
 ] 

Duo Zhang commented on HBASE-19731:
---

OK, then I think it is worth making a temporary workaround first. I will open a 
new issue for it.

> TestFromClientSide#testCheckAndDeleteWithCompareOp and testNullQualifier are 
> flakey
> ---
>
> Key: HBASE-19731
> URL: https://issues.apache.org/jira/browse/HBASE-19731
> Project: HBase
>  Issue Type: Sub-task
>  Components: test
>Reporter: stack
>Assignee: Duo Zhang
>Priority: Critical
> Fix For: 2.0.0-beta-1
>
> Attachments: HBASE-19731.patch
>
>
> These two tests fail frequently locally; rarely does this suite pass.
> The failures are in one or the other of these two tests. Unfortunately, running a test 
> standalone does not bring on the issue; you need to run the whole suite.
> In both cases, we have a Delete followed by a Put and then a checkAnd*-type 
> operation which does a Get expecting to find the just-written Put, but it fails on 
> occasion.
> Looks to be an mvcc issue or the Put going in at the same timestamp as the Delete. 
> It's hard to debug given that any added logging seems to make it all pass again.
> Seems this too is new in beta-1. Running tests against alpha-4 seems to pass.
> Doing a compare



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-19731) TestFromClientSide#testCheckAndDeleteWithCompareOp and testNullQualifier are flakey

2018-01-09 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16319551#comment-16319551
 ] 

stack commented on HBASE-19731:
---

bq. So what is the plan for HLC? In which version will it land?

Unfortunately, no one is working on it at the moment. It was stalled by the need to 
improve perf (crossing a synchronize to get a unique timestamp), and then our intern 
who was working on it moved on.



[jira] [Commented] (HBASE-19731) TestFromClientSide#testCheckAndDeleteWithCompareOp and testNullQualifier are flakey

2018-01-08 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16317528#comment-16317528
 ] 

Duo Zhang commented on HBASE-19731:
---

So what is the plan for HLC? In which version will it land? If it is not too 
far away I think it is OK to keep the code as is.



[jira] [Commented] (HBASE-19731) TestFromClientSide#testCheckAndDeleteWithCompareOp and testNullQualifier are flakey

2018-01-08 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16317512#comment-16317512
 ] 

stack commented on HBASE-19731:
---

bq. What do you think stack?

Each edit having its own unique timestamp would solve a bunch of mysterious 
issues we see in tests but also in prod.

Your suggestion is nice and straightforward.

I hate that we already have a running counter that is close to what is needed here -- 
MVCC -- and a system that is almost done -- HLC -- that would fix this 
issue in a manner that could be depended upon cluster-wide, never mind inside 
region scope only. HLC is so close... 

What do you think, [~Apache9]?




[jira] [Commented] (HBASE-19731) TestFromClientSide#testCheckAndDeleteWithCompareOp and testNullQualifier are flakey

2018-01-08 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16317495#comment-16317495
 ] 

Duo Zhang commented on HBASE-19731:
---

{quote}
Failing that, a timestamper like that in patch would be a limit of about 1k ops 
a second?
{quote}
It is region level and only for writes, so isn't 1k ops per second fast enough? 
Since the problem only appears at the row level, we could simply shard: for 
example, with 1024 AtomicLongs the QPS limit goes up to about 1M.

What do you think [~stack]?

Thanks.
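
The sharding idea above can be sketched roughly as follows. This is only an illustration, not code from the patch: the class name `ShardedTimestamper`, the method `nextTimestamp`, and the choice of `Arrays.hashCode` to pick a shard are all assumptions. Each shard hands out at most one timestamp per millisecond without running ahead of the wall clock, so 1024 shards give roughly 1024 x 1000, or about 1M, unique timestamps per second in aggregate.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch: one non-repeating millisecond clock per row shard. */
final class ShardedTimestamper {
  private final AtomicLong[] shards;

  ShardedTimestamper(int shardCount) {
    shards = new AtomicLong[shardCount];
    for (int i = 0; i < shardCount; i++) {
      shards[i] = new AtomicLong(0L);
    }
  }

  /** Returns a timestamp strictly greater than any earlier one issued for this row's shard. */
  long nextTimestamp(byte[] row) {
    // floorMod keeps the index non-negative even when the hash is negative.
    int idx = Math.floorMod(Arrays.hashCode(row), shards.length);
    long now = System.currentTimeMillis();
    // Use the wall clock when it has advanced; otherwise bump the previous value by 1 ms.
    return shards[idx].accumulateAndGet(now, (prev, cur) -> cur > prev ? cur : prev + 1);
  }
}
```

Two rows that hash to the same shard share a counter, so uniqueness is stronger than strictly needed, and a hot shard written faster than once per millisecond would still drift ahead of real time.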



[jira] [Commented] (HBASE-19731) TestFromClientSide#testCheckAndDeleteWithCompareOp and testNullQualifier are flakey

2018-01-08 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16317141#comment-16317141
 ] 

Hudson commented on HBASE-19731:


FAILURE: Integrated in Jenkins build HBase-Trunk_matrix #4365 (See 
[https://builds.apache.org/job/HBase-Trunk_matrix/4365/])
HBASE-19731 TestFromClientSide#testCheckAndDeleteWithCompareOp and (stack: rev 
2509a150c0792e914429264453510b9028250c29)
* (add) hbase-common/src/test/java/org/apache/hadoop/hbase/util/NonRepeatedEnvironmentEdge.java
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/client/TestFromClientSide.java




[jira] [Commented] (HBASE-19731) TestFromClientSide#testCheckAndDeleteWithCompareOp and testNullQualifier are flakey

2018-01-08 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16316734#comment-16316734
 ] 

stack commented on HBASE-19731:
---

Patch fixes the test for sure. +1.

On making this test timestamper the default: we can't, right? The proper fix is HLC. 
Failing that, a timestamper like that in patch would be a limit of about 1k ops 
a second? And the checkAndSet for time is costly? We'd have to be parsimonious 
about checking the time (currently we do it all over the code base w/o regard for 
cost).
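
To make the "about 1k ops a second" limit concrete, here is a small simulation. It is a sketch under assumptions: `UniqueClock` and `driftMillis` are hypothetical names, and the wall clock is simulated rather than read from the system. A millisecond clock that never repeats a value can only track real time up to 1000 calls per second; past that, each extra call pushes the issued timestamps 1 ms further into the future.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

final class DriftDemo {
  /** Non-repeating clock over an injectable millisecond source (hypothetical helper). */
  static final class UniqueClock {
    private final LongSupplier wallClock;
    private final AtomicLong last = new AtomicLong(Long.MIN_VALUE);

    UniqueClock(LongSupplier wallClock) {
      this.wallClock = wallClock;
    }

    long next() {
      // Take the wall clock if it moved forward, else the previous value plus 1 ms.
      return last.accumulateAndGet(wallClock.getAsLong(),
          (prev, now) -> now > prev ? now : prev + 1);
    }
  }

  /** Spreads the calls evenly over one simulated second; returns how far the clock ran ahead. */
  static long driftMillis(int opsPerSecond) {
    final long start = 1_000_000L;
    final long[] wall = {start};
    UniqueClock clock = new UniqueClock(() -> wall[0]);
    long issued = start;
    for (int i = 0; i < opsPerSecond; i++) {
      wall[0] = start + (i * 1000L) / opsPerSecond; // simulated wall-clock time of call i
      issued = clock.next();
    }
    return issued - wall[0];
  }
}
```

In this model `DriftDemo.driftMillis(1000)` is 0, while `DriftDemo.driftMillis(5000)` is 4000: after one second at 5k ops the issued timestamps are four seconds ahead of the wall clock.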

It looks like the test fails in the same place in alpha-4, so my thought that it is 
new to beta-1 doesn't hold. Makes sense. I don't see it in the general flakies 
list: 
https://builds.apache.org/job/HBASE-Find-Flaky-Tests/lastSuccessfulBuild/artifact/dashboard.html
  probably because apache jenkins is slow overall... slower than my local 
machine or JMS's (or yours).

Thanks for jumping in here [~Apache9] and confirming speculation on root issue 
(would have taken me way longer to figure...)



> TestFromClientSide#testCheckAndDeleteWithCompareOp and testNullQualifier are 
> flakey
> ---
>
> Key: HBASE-19731
> URL: https://issues.apache.org/jira/browse/HBASE-19731
> Project: HBase
>  Issue Type: Sub-task
>  Components: test
>Reporter: stack
>Assignee: stack
>Priority: Critical
> Fix For: 2.0.0-beta-2
>
> Attachments: HBASE-19731.patch
>
>
> These two tests fail frequently locally; rarely does this suite pass.
> The failures are in one or the other of these two tests. Unfortunately, running a test 
> standalone does not bring on the issue; you need to run the whole suite.
> In both cases, we have a Delete followed by a Put and then a checkAnd*-type 
> operation which does a Get expecting to find the just-written Put, but it fails on 
> occasion.
> Looks to be an mvcc issue or the Put going in at the same timestamp as the Delete. 
> It's hard to debug given that any added logging seems to make it all pass again.
> Seems this too is new in beta-1. Running tests against alpha-4 seems to pass.
> Doing a compare





[jira] [Commented] (HBASE-19731) TestFromClientSide#testCheckAndDeleteWithCompareOp and testNullQualifier are flakey

2018-01-08 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16316536#comment-16316536
 ] 

stack commented on HBASE-19731:
---

This is great [~Apache9]. I've been running tests to see if this is new since 
alpha-4. We used to have a means of ensuring we did not get the same ts on update. 
Let me find that too. Will be back. Nice work sir.



[jira] [Commented] (HBASE-19731) TestFromClientSide#testCheckAndDeleteWithCompareOp and testNullQualifier are flakey

2018-01-08 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16315858#comment-16315858
 ] 

Duo Zhang commented on HBASE-19731:
---

[~stack] Do you think we need to make this logic default? I mean, implement the 
same logic in each region, which will not assign the same timestamp twice?



[jira] [Commented] (HBASE-19731) TestFromClientSide#testCheckAndDeleteWithCompareOp and testNullQualifier are flakey

2018-01-07 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16315817#comment-16315817
 ] 

Duo Zhang commented on HBASE-19731:
---

[~stack] I think the problem is we assigned the same timestamp twice.

I added a static tsAssigned field in HRegion

{code}
  // Debug-only hook: when non-null, records every timestamp the region assigns.
  public static volatile List<Long> tsAssigned;
{code}

And in MutationBatchOperation.prepareMiniBatchOperations, I did this
{code}
  if (!region.getRegionInfo().isMetaRegion() && HRegion.tsAssigned != null) {
    HRegion.tsAssigned.add(timestamp);
  }
{code}

And I also modified the UT
{code}
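  @Test
  public void test() throws IOException {
    try {
      // Loop the flaky test, dropping its table after each run.
      for (int i = 0; i < 100; i++) {
        testCheckAndDeleteWithCompareOp();
        TEST_UTIL.deleteTable(TableName.valueOf(name.getMethodName()));
        // Stop recording; the test creates a fresh list after it creates the table.
        HRegion.tsAssigned = null;
      }
    } catch (AssertionError e) {
      // On failure, dump every timestamp assigned during the failing run.
      HRegion.tsAssigned.forEach(System.out::println);
      throw e;
    }
  }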
{code}

Note that I create HRegion.tsAssigned in testCheckAndDeleteWithCompareOp 
after the creation of the test table.

And finally I got this output
{noformat}
1515397552529
1515397552533
1515397552535
1515397552537
1515397552539
1515397552541
1515397552543
1515397552546
1515397552547
1515397552548
1515397552549
1515397552550
1515397552551
1515397552554
1515397552555
1515397552556
1515397552556
{noformat}

You can see that the test fails immediately after we issue the same ts again.
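
Spotting the repeat by eye is easy in a short list; for longer dumps a small helper can do it. The following is a hypothetical utility (the name `RepeatedTimestampFinder` is an assumption, not part of HBase) that returns the index of the first timestamp equal to the one before it.

```java
import java.util.List;
import java.util.OptionalInt;

final class RepeatedTimestampFinder {
  /** Returns the index of the first entry equal to its predecessor, if any. */
  static OptionalInt firstRepeatIndex(List<Long> timestamps) {
    for (int i = 1; i < timestamps.size(); i++) {
      // Long.equals compares values; == would compare boxed references.
      if (timestamps.get(i).equals(timestamps.get(i - 1))) {
        return OptionalInt.of(i);
      }
    }
    return OptionalInt.empty();
  }
}
```

Fed the 17 values printed above, it reports index 16, the second 1515397552556.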

Does this mean mutations are faster in beta-1, so it is easier to run 
into this situation? Maybe that is good news...
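
Why a repeated timestamp makes the checkAnd* Get miss the Put can be shown with a toy model. This is not HBase code: `ToyColumn` is a hypothetical single-column store, and it assumes the delete marker covers every version at or below its own timestamp, independent of arrival order.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy single-column model of delete masking; NOT HBase code. */
final class ToyColumn {
  private static final class Cell {
    final long ts;
    final boolean isDelete;
    final String value;

    Cell(long ts, boolean isDelete, String value) {
      this.ts = ts;
      this.isDelete = isDelete;
      this.value = value;
    }
  }

  private final List<Cell> cells = new ArrayList<>();

  void put(long ts, String value) {
    cells.add(new Cell(ts, false, value));
  }

  void delete(long ts) {
    cells.add(new Cell(ts, true, null));
  }

  /** Returns the newest put not covered by a delete marker, or null if none is visible. */
  String get() {
    long newestDelete = Long.MIN_VALUE;
    for (Cell c : cells) {
      if (c.isDelete) {
        newestDelete = Math.max(newestDelete, c.ts);
      }
    }
    Cell best = null;
    for (Cell c : cells) {
      // A marker at timestamp T hides every put with timestamp <= T,
      // no matter which operation arrived first.
      if (!c.isDelete && c.ts > newestDelete && (best == null || c.ts > best.ts)) {
        best = c;
      }
    }
    return best == null ? null : best.value;
  }
}
```

With `delete(100)` followed by `put(100, "v")`, `get()` returns null, which matches the failing Get in the test; with `put(101, "v")` instead, the value is visible.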

> TestFromClientSide#testCheckAndDeleteWithCompareOp and testNullQualifier are 
> flakey
> ---
>
> Key: HBASE-19731
> URL: https://issues.apache.org/jira/browse/HBASE-19731
> Project: HBase
>  Issue Type: Sub-task
>  Components: test
>Reporter: stack
>Assignee: stack
>Priority: Critical
> Fix For: 2.0.0-beta-2
>
>
> These two tests fail frequently locally; rarely does this suite pass.
> The failures are in one or the other of these two tests. Unfortunately, running a test 
> standalone does not bring on the issue; you need to run the whole suite.
> In both cases, we have a Delete followed by a Put and then a checkAnd*-type 
> operation which does a Get expecting to find the just-written Put, but it fails on 
> occasion.
> Looks to be an mvcc issue or the Put going in at the same timestamp as the Delete. 
> It's hard to debug given that any added logging seems to make it all pass again.
> Seems this too is new in beta-1. Running tests against alpha-4 seems to pass.
> Doing a compare





[jira] [Commented] (HBASE-19731) TestFromClientSide#testCheckAndDeleteWithCompareOp and testNullQualifier are flakey

2018-01-07 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16315797#comment-16315797
 ] 

Duo Zhang commented on HBASE-19731:
---

I wrote a method that loops testCheckAndDeleteWithCompareOp, and with it I can 
make the test fail.

{code}
  @Test
  public void test() throws IOException {
    // Repeat the test up to 100 times, dropping its table after each run.
    for (int i = 0; i < 100; i++) {
      testCheckAndDeleteWithCompareOp();
      TEST_UTIL.deleteTable(TableName.valueOf(name.getMethodName()));
    }
  }
{code}

Let me dig more.
