[jira] [Updated] (HBASE-21463) The checkOnlineRegionsReport can accidentally complete a TRSP

2018-11-10 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21463:
--
Attachment: HBASE-21463.patch

> The checkOnlineRegionsReport can accidentally complete a TRSP
> -
>
> Key: HBASE-21463
> URL: https://issues.apache.org/jira/browse/HBASE-21463
> Project: HBase
>  Issue Type: Sub-task
>  Components: amv2
>Reporter: Duo Zhang
>Priority: Critical
> Fix For: 3.0.0, 2.2.0
>
> Attachments: HBASE-21463-UT.patch, HBASE-21463.patch
>
>
> On our testing cluster, we observed a race condition:
> 1. A regionServerReport request is built.
> 2. A TRSP is scheduled to reopen the region.
> 3. The region is closed on the RS side.
> 4. The OpenRegionProcedure is created.
> 5. The regionServerReport generated at step 1 is executed, and we find that 
> the region is open on the RS, so we update the region state to OPEN.
> 6. The OpenRegionProcedure notices that the region is already in the OPEN 
> state, so it gives up and finishes.
> 7. The TRSP finishes.
> 8. The region is recorded as OPEN on the RS but is actually not open, and 
> cannot recover unless we use HBCK2.
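
For illustration, a minimal sketch of the guard such a fix implies
(hypothetical, simplified types; not the actual patch): the report-driven
path refuses to move a region to OPEN while a procedure still owns the
transition, so only the TRSP itself can complete it.

{code:java}
// Simplified stand-ins for the master's bookkeeping; names are illustrative.
enum RegionState { OPEN, CLOSED, OPENING }

class RegionStateNode {
  private RegionState state = RegionState.CLOSED;
  private Object owningProcedure; // non-null while a TRSP owns the transition

  // Called from the regionServerReport path (step 5 above).
  synchronized void reportedOpenByRegionServer() {
    if (owningProcedure != null) {
      // A TRSP is in charge of this region: a possibly stale report must not
      // move the state to OPEN behind the procedure's back (steps 6-8).
      return;
    }
    state = RegionState.OPEN;
  }
}
{code}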



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21463) The checkOnlineRegionsReport can accidentally complete a TRSP

2018-11-10 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21463:
--
Assignee: Duo Zhang
  Status: Patch Available  (was: Open)

> The checkOnlineRegionsReport can accidentally complete a TRSP
> -
>
> Key: HBASE-21463
> URL: https://issues.apache.org/jira/browse/HBASE-21463
> Project: HBase
>  Issue Type: Sub-task
>  Components: amv2
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Critical
> Fix For: 3.0.0, 2.2.0
>
> Attachments: HBASE-21463-UT.patch, HBASE-21463.patch
>
>
> On our testing cluster, we observed a race condition:
> 1. A regionServerReport request is built.
> 2. A TRSP is scheduled to reopen the region.
> 3. The region is closed on the RS side.
> 4. The OpenRegionProcedure is created.
> 5. The regionServerReport generated at step 1 is executed, and we find that 
> the region is open on the RS, so we update the region state to OPEN.
> 6. The OpenRegionProcedure notices that the region is already in the OPEN 
> state, so it gives up and finishes.
> 7. The TRSP finishes.
> 8. The region is recorded as OPEN on the RS but is actually not open, and 
> cannot recover unless we use HBCK2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21423) Procedures for meta table/region should be able to execute in separate workers

2018-11-10 Thread Allan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682754#comment-16682754
 ] 

Allan Yang commented on HBASE-21423:


Uploaded an addendum to resolve some remaining issues.

> Procedures for meta table/region should be able to execute in separate 
> workers 
> ---
>
> Key: HBASE-21423
> URL: https://issues.apache.org/jira/browse/HBASE-21423
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.1, 2.0.2
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Fix For: 2.0.3, 2.1.2
>
> Attachments: HBASE-21423.branch-2.0.001.patch, 
> HBASE-21423.branch-2.0.002.patch, HBASE-21423.branch-2.0.003.patch, 
> HBASE-21423.branch-2.0.addendum.patch
>
>
> We give meta table procedures a higher priority, but only at the queue 
> level. There is a case where the meta table is closed and an AssignProcedure 
> (or TRSP in branch-2+) is waiting to be executed, but at the same time all 
> the worker threads are executing procedures that need to write to the meta 
> table; all the workers get stuck retrying the meta writes, and no worker 
> will take the AssignProcedure for meta.
> Though we have a mechanism that detects the stuck state and adds more 
> 'KeepAlive' workers to the pool to resolve it, by the time it kicks in the 
> executor has already been stuck for a long time.
> This is a real case I encountered in ITBLL.
> So, I added an 'urgent worker' to the ProcedureExecutor which only takes 
> meta procedures (other workers can take meta procedures too); this resolves 
> this kind of stuck state.
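
For illustration, a minimal, self-contained sketch of the 'urgent worker'
idea (assumed names; the real ProcedureExecutor is far more involved): one
dedicated thread only ever takes meta procedures, so meta assignment cannot
be starved even when every shared worker is blocked retrying meta writes.

{code:java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class UrgentWorkerSketch {
  final BlockingQueue<Runnable> metaQueue = new LinkedBlockingQueue<>();
  final BlockingQueue<Runnable> sharedQueue = new LinkedBlockingQueue<>();

  void startUrgentWorker() {
    Thread worker = new Thread(() -> {
      try {
        while (!Thread.currentThread().isInterrupted()) {
          metaQueue.take().run(); // meta procedures only, never user-table ones
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "urgent-meta-worker");
    worker.setDaemon(true);
    worker.start();
  }
}
{code}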



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21423) Procedures for meta table/region should be able to execute in separate workers

2018-11-10 Thread Allan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allan Yang updated HBASE-21423:
---
Attachment: HBASE-21423.branch-2.0.addendum.patch

> Procedures for meta table/region should be able to execute in separate 
> workers 
> ---
>
> Key: HBASE-21423
> URL: https://issues.apache.org/jira/browse/HBASE-21423
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.1, 2.0.2
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Fix For: 2.0.3, 2.1.2
>
> Attachments: HBASE-21423.branch-2.0.001.patch, 
> HBASE-21423.branch-2.0.002.patch, HBASE-21423.branch-2.0.003.patch, 
> HBASE-21423.branch-2.0.addendum.patch
>
>
> We give meta table procedures a higher priority, but only at the queue 
> level. There is a case where the meta table is closed and an AssignProcedure 
> (or TRSP in branch-2+) is waiting to be executed, but at the same time all 
> the worker threads are executing procedures that need to write to the meta 
> table; all the workers get stuck retrying the meta writes, and no worker 
> will take the AssignProcedure for meta.
> Though we have a mechanism that detects the stuck state and adds more 
> 'KeepAlive' workers to the pool to resolve it, by the time it kicks in the 
> executor has already been stuck for a long time.
> This is a real case I encountered in ITBLL.
> So, I added an 'urgent worker' to the ProcedureExecutor which only takes 
> meta procedures (other workers can take meta procedures too); this resolves 
> this kind of stuck state.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (HBASE-21423) Procedures for meta table/region should be able to execute in separate workers

2018-11-10 Thread Allan Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allan Yang reopened HBASE-21423:


> Procedures for meta table/region should be able to execute in separate 
> workers 
> ---
>
> Key: HBASE-21423
> URL: https://issues.apache.org/jira/browse/HBASE-21423
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.1, 2.0.2
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Fix For: 2.0.3, 2.1.2
>
> Attachments: HBASE-21423.branch-2.0.001.patch, 
> HBASE-21423.branch-2.0.002.patch, HBASE-21423.branch-2.0.003.patch
>
>
> We give meta table procedures a higher priority, but only at the queue 
> level. There is a case where the meta table is closed and an AssignProcedure 
> (or TRSP in branch-2+) is waiting to be executed, but at the same time all 
> the worker threads are executing procedures that need to write to the meta 
> table; all the workers get stuck retrying the meta writes, and no worker 
> will take the AssignProcedure for meta.
> Though we have a mechanism that detects the stuck state and adds more 
> 'KeepAlive' workers to the pool to resolve it, by the time it kicks in the 
> executor has already been stuck for a long time.
> This is a real case I encountered in ITBLL.
> So, I added an 'urgent worker' to the ProcedureExecutor which only takes 
> meta procedures (other workers can take meta procedures too); this resolves 
> this kind of stuck state.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21401) Sanity check in BaseDecoder#parseCell

2018-11-10 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682732#comment-16682732
 ] 

stack commented on HBASE-21401:
---

Sorry. Late to this. I think the extra parse this patch adds is the wrong 
approach. Let checksums find corruption in streams. Add a check up front to 
make sure only good types get entered into the system. If there is a bug in 
the decoder, fix that rather than 'check' all values all the time? [~openinx].

> Sanity check in BaseDecoder#parseCell
> -
>
> Key: HBASE-21401
> URL: https://issues.apache.org/jira/browse/HBASE-21401
> Project: HBase
>  Issue Type: Sub-task
>  Components: regionserver
>Reporter: Zheng Hu
>Assignee: Zheng Hu
>Priority: Critical
> Fix For: 3.0.0, 2.2.0, 2.0.3, 2.1.2
>
> Attachments: HBASE-21401.v1.patch, HBASE-21401.v2.patch, 
> HBASE-21401.v3.patch, HBASE-21401.v4.patch, HBASE-21401.v4.patch, 
> HBASE-21401.v5.patch
>
>
> In KeyValueDecoder & ByteBuffKeyValueDecoder, we pass a byte buffer to 
> initialize the Cell without a sanity check (checking whether each field's 
> offset exceeds the byte buffer or not), so an ArrayIndexOutOfBoundsException 
> may happen when reading the cell's fields, as in HBASE-21379; this kind of 
> bug is hard to debug.
> An earlier check will help to find such bugs.
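
For illustration, a simplified sketch of the kind of up-front check meant
here (the real KeyValue layout has more fields than shown): fail fast with a
descriptive parse error instead of an ArrayIndexOutOfBoundsException later,
when the cell's fields are actually read.

{code:java}
import java.io.IOException;

final class CellSanityCheck {
  static void checkKeyValueBytes(byte[] buf, int offset, int length) throws IOException {
    if (offset < 0 || length < 0 || (long) offset + length > buf.length) {
      throw new IOException("Invalid KeyValue: offset=" + offset + ", length="
          + length + " exceeds buffer of size " + buf.length);
    }
    // ... similar checks that the row/family/qualifier offsets stay inside
    // [offset, offset + length)
  }
}
{code}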



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21410) A helper page that help find all problematic regions and procedures

2018-11-10 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682730#comment-16682730
 ] 

stack commented on HBASE-21410:
---

[~tianjingyun] This patch looks really good. Give me a day or so. I want to try 
it. Having it in 2.1 only is fine I think. Thanks.

> A helper page that help find all problematic regions and procedures
> ---
>
> Key: HBASE-21410
> URL: https://issues.apache.org/jira/browse/HBASE-21410
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 3.0.0, 2.2.0, 2.1.1
>Reporter: Jingyun Tian
>Assignee: Jingyun Tian
>Priority: Major
> Fix For: 3.0.0, 2.2.0
>
> Attachments: HBASE-21410.branch-2.1.001.patch, 
> HBASE-21410.branch-2.1.002.patch, HBASE-21410.master.001.patch, 
> HBASE-21410.master.002.patch, HBASE-21410.master.003.patch, 
> HBASE-21410.master.004.patch, Screenshot from 2018-10-30 19-06-21.png, 
> Screenshot from 2018-10-30 19-06-42.png, Screenshot from 2018-10-31 
> 10-11-38.png, Screenshot from 2018-10-31 10-11-56.png, Screenshot from 
> 2018-11-01 17-56-02.png, Screenshot from 2018-11-01 17-56-15.png
>
>
> *This page mainly focuses on finding regions stuck in some state such that 
> they cannot be assigned. My proposal for the page is as follows:*
> !Screenshot from 2018-10-30 19-06-21.png!
> *From this page we can see all regions in the RIT queue and their related 
> procedures. If we can determine that these regions' states are abnormal, we 
> can click the link 'Procedures as TXT' to get a full list of procedure IDs to 
> bypass them, then click 'Regions as TXT' to get a full list of encoded region 
> names to assign.*
> !Screenshot from 2018-10-30 19-06-42.png!
> *Some region names are covered by the navigation bar; I'll fix that later.*



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21445) CopyTable by bulkload will write hfile into yarn's HDFS

2018-11-10 Thread Zheng Hu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682724#comment-16682724
 ] 

Zheng Hu commented on HBASE-21445:
--

Got it, thanks [~stack].

> CopyTable by bulkload will write hfile into yarn's HDFS 
> 
>
> Key: HBASE-21445
> URL: https://issues.apache.org/jira/browse/HBASE-21445
> Project: HBase
>  Issue Type: Bug
>  Components: mapreduce
>Reporter: Zheng Hu
>Assignee: Zheng Hu
>Priority: Major
> Fix For: 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.9
>
> Attachments: HBASE-21445.v1.patch
>
>
> When using CopyTable with bulkload, I found that all hfiles are written to 
> our YARN cluster's HDFS, and loading the hfiles into the HBase cluster then 
> fails, because the YARN cluster and the HBase cluster use different HDFS 
> instances.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-20952) Re-visit the WAL API

2018-11-10 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682723#comment-16682723
 ] 

Hudson commented on HBASE-20952:


Results for branch HBASE-20952
[build #46 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-20952/46/]: 
(x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-20952/46//General_Nightly_Build_Report/]




(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-20952/46//JDK8_Nightly_Build_Report_(Hadoop2)/]


(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-20952/46//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> Re-visit the WAL API
> 
>
> Key: HBASE-20952
> URL: https://issues.apache.org/jira/browse/HBASE-20952
> Project: HBase
>  Issue Type: Improvement
>  Components: wal
>Reporter: Josh Elser
>Priority: Major
> Attachments: 20952.v1.txt
>
>
> Take a step back from the current WAL implementations and think about what an 
> HBase WAL API should look like. What are the primitive calls that we require 
> to guarantee durability of writes with a high degree of performance?
> The API needs to take the current implementations into consideration. We 
> should also have a mind for what is happening in the Ratis LogService (but 
> the LogService should not dictate what HBase's WAL API looks like, RATIS-272).
> Other "systems" inside of HBase that use WALs are replication and backup. 
> Replication has the use-case of "tail"ing the WAL, which we should provide 
> via our new API. Backup doesn't do anything fancy (IIRC). We should make sure 
> all consumers are generally going to be OK with the API we create.
> The API may be "OK" (or OK in part). We need to also consider other methods 
> which were "bolted" on, such as {{AbstractFSWAL}} and 
> {{WALFileLengthProvider}}. Other corners of "WAL use" (like {{WALSplitter}}) 
> should also be looked at to use the WAL APIs only.
> We also need to make sure that adequate interface audience and stability 
> annotations are chosen.
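
One possible shape for such a minimal API, sketched from the discussion above
(illustrative only; not the interface the branch settled on):

{code:java}
import java.io.Closeable;
import java.io.IOException;
import java.util.Iterator;

interface WriteAheadLog extends Closeable {
  /** Appends an entry and returns its sequence id. */
  long append(byte[] entry) throws IOException;

  /** Durability barrier: blocks until the given sequence id is durable. */
  void sync(long sequenceId) throws IOException;

  /** Tails the log from a sequence id; the use-case replication needs. */
  Iterator<byte[]> tail(long fromSequenceId) throws IOException;
}
{code}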



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21461) Region CoProcessor for splitting large WAL entries in smaller batches, to handle situation when faulty ingestion had created too many mutations for same cell in single

2018-11-10 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682722#comment-16682722
 ] 

stack commented on HBASE-21461:
---

Let's do option #2. I can help. It's too early (or too late -- smile) for #3. 
Should I try adding the patch here over in the tools repo?

> Region CoProcessor for splitting large WAL entries in smaller batches, to 
> handle situation when faulty ingestion had created too many mutations for 
> same cell in single batch
> -
>
> Key: HBASE-21461
> URL: https://issues.apache.org/jira/browse/HBASE-21461
> Project: HBase
>  Issue Type: New Feature
>  Components: hbase-operator-tools, Replication
>Reporter: Wellington Chevreuil
>Assignee: Wellington Chevreuil
>Priority: Minor
> Attachments: 0001-Initial-version-for-WAL-entry-splitter-CP.txt
>
>
> With replication-enabled deployments, it's possible that faulty ingestion 
> clients may lead to a single WalEntry containing too many edits for the same 
> cell. This would cause *ReplicationSink*, in the target cluster, to attempt a 
> single batch mutation with too many operations, which in turn can lead to 
> very large RPC requests that may not fit in the final target RS RPC queue. In 
> this case, the messages below are seen on the target RS trying to perform the 
> sink:
> {noformat}
> WARN org.apache.hadoop.hbase.client.AsyncProcess: #690, table=TABLE_NAME, 
> attempt=4/4 failed=2ops, last exception: 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.CallQueueTooBigException):
>  Call queue is full on /0.0.0.0:60020, is hbase.ipc.server.max.callqueue.size 
> too small? on regionserver01.example.com,60020,1524334173359, tracking 
> started Fri Sep 07 10:35:53 IST 2018; not retrying 2 - final failure
> 2018-09-07 10:40:59,506 ERROR 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSink: Unable to 
> accept edit because:
> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 2 
> actions: RemoteWithExtrasException: 2 times, 
> at 
> org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:247)
> at 
> org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$1800(AsyncProcess.java:227)
> at 
> org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.getErrors(AsyncProcess.java:1663)
> at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:982)
> at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:996){noformat}
> When this problem manifests, replication will be stuck and WAL files will 
> pile up in the source cluster's WALs/oldWALs folders. The typical workaround 
> requires manual cleanup of replication znodes in ZK, and manual WAL replay 
> for the WAL files containing the large entry.
> This CP would handle the issue by checking for large WAL entries and 
> splitting them into smaller batches in the *reReplicateLogEntries* method 
> hook.
> *Additional Note*: HBASE-18027 introduced some safeguards against such large 
> RPC requests, which may already help avoid this scenario. Those are not 
> available in 1.2 releases, though, so this CP tool may still be relevant for 
> 1.2 clusters. It may also be worth having to work around any potential 
> unknown large-RPC issue scenarios.
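
For illustration, the core of such splitting logic in a simplified form
(types are stand-ins; the actual CP works on WAL entries and their
mutations): an oversized batch is re-submitted as several smaller batches so
no single batch mutation RPC exceeds the sink's call queue limits.

{code:java}
import java.util.ArrayList;
import java.util.List;

final class WalEntrySplitterSketch {
  static <M> List<List<M>> splitIntoBatches(List<M> mutations, int maxPerBatch) {
    List<List<M>> batches = new ArrayList<>();
    for (int i = 0; i < mutations.size(); i += maxPerBatch) {
      batches.add(mutations.subList(i, Math.min(i + maxPerBatch, mutations.size())));
    }
    return batches;
  }
}
{code}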



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21463) The checkOnlineRegionsReport can accidentally complete a TRSP

2018-11-10 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682721#comment-16682721
 ] 

stack commented on HBASE-21463:
---

My bad.

> The checkOnlineRegionsReport can accidentally complete a TRSP
> -
>
> Key: HBASE-21463
> URL: https://issues.apache.org/jira/browse/HBASE-21463
> Project: HBase
>  Issue Type: Sub-task
>  Components: amv2
>Reporter: Duo Zhang
>Priority: Critical
> Fix For: 3.0.0, 2.2.0
>
> Attachments: HBASE-21463-UT.patch
>
>
> On our testing cluster, we observed a race condition:
> 1. A regionServerReport request is built.
> 2. A TRSP is scheduled to reopen the region.
> 3. The region is closed on the RS side.
> 4. The OpenRegionProcedure is created.
> 5. The regionServerReport generated at step 1 is executed, and we find that 
> the region is open on the RS, so we update the region state to OPEN.
> 6. The OpenRegionProcedure notices that the region is already in the OPEN 
> state, so it gives up and finishes.
> 7. The TRSP finishes.
> 8. The region is recorded as OPEN on the RS but is actually not open, and 
> cannot recover unless we use HBCK2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-20604) ProtobufLogReader#readNext can incorrectly loop to the same position in the stream until the WAL is rolled

2018-11-10 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682718#comment-16682718
 ] 

stack commented on HBASE-20604:
---

Understood [~apurtell]. A new (sub-)issue sounds good. Just noting that the 
reviewer asked a question back in May and it looks like it went unanswered.

> ProtobufLogReader#readNext can incorrectly loop to the same position in the 
> stream until the WAL is rolled
> --
>
> Key: HBASE-20604
> URL: https://issues.apache.org/jira/browse/HBASE-20604
> Project: HBase
>  Issue Type: Bug
>  Components: Replication, wal
>Affects Versions: 3.0.0
>Reporter: Esteban Gutierrez
>Assignee: Esteban Gutierrez
>Priority: Critical
> Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.9
>
> Attachments: HBASE-20604.002.patch, HBASE-20604.003.patch, 
> HBASE-20604.004.patch, HBASE-20604.005.patch, HBASE-20604.patch
>
>
> Every time we call {{ProtobufLogReader#readNext}} we consume the input 
> stream associated with the {{FSDataInputStream}} from the WAL that we are 
> reading. Under certain conditions, e.g. when using encryption at rest 
> ({{CryptoInputStream}}), the stream can return partial data, which can cause 
> a premature EOF that causes {{inputStream.getPos()}} to return the same 
> original position, causing {{ProtobufLogReader#readNext}} to retry the reads 
> until the WAL is rolled.
> The side effect of this issue is that {{ReplicationSource}} can get stuck 
> until the WAL is rolled, causing replication delays of up to an hour in some 
> cases.
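
For illustration, the shape of the behavior described above in simplified
form (readEntry is a stand-in, not the real parsing code): on a premature
EOF the reader seeks back to where the entry started, so a stream that keeps
returning partial data makes the caller retry until the WAL is rolled.

{code:java}
import java.io.EOFException;
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;

abstract class ReadNextSketch {
  boolean readNext(FSDataInputStream in) throws IOException {
    long originalPosition = in.getPos();
    try {
      readEntry(in); // may hit a premature EOF on a partial read
      return true;
    } catch (EOFException e) {
      in.seek(originalPosition); // getPos() returns the same original position
      return false;              // caller retries, potentially forever
    }
  }

  abstract void readEntry(FSDataInputStream in) throws IOException;
}
{code}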



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21463) The checkOnlineRegionsReport can accidentally complete a TRSP

2018-11-10 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682678#comment-16682678
 ] 

Duo Zhang commented on HBASE-21463:
---

Yes, it is only a UT (I have 'UT' in the patch name...). Will provide a patch 
soon; as said above, it will change the behavior of checkOnlineRegions.

And I believe the race could also happen on branch-2.1 and branch-2.0; 
HBASE-21421 is one possible instance. The issue described here may not be a 
problem there, as the behavior of AssignProcedure/UnassignProcedure may be 
different from TRSP's, but I'm afraid there could be other strange problems.

> The checkOnlineRegionsReport can accidentally complete a TRSP
> -
>
> Key: HBASE-21463
> URL: https://issues.apache.org/jira/browse/HBASE-21463
> Project: HBase
>  Issue Type: Sub-task
>  Components: amv2
>Reporter: Duo Zhang
>Priority: Critical
> Fix For: 3.0.0, 2.2.0
>
> Attachments: HBASE-21463-UT.patch
>
>
> On our testing cluster, we observed a race condition:
> 1. A regionServerReport request is built.
> 2. A TRSP is scheduled to reopen the region.
> 3. The region is closed on the RS side.
> 4. The OpenRegionProcedure is created.
> 5. The regionServerReport generated at step 1 is executed, and we find that 
> the region is open on the RS, so we update the region state to OPEN.
> 6. The OpenRegionProcedure notices that the region is already in the OPEN 
> state, so it gives up and finishes.
> 7. The TRSP finishes.
> 8. The region is recorded as OPEN on the RS but is actually not open, and 
> cannot recover unless we use HBCK2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-20604) ProtobufLogReader#readNext can incorrectly loop to the same position in the stream until the WAL is rolled

2018-11-10 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682677#comment-16682677
 ] 

Duo Zhang commented on HBASE-20604:
---

I'm OK with opening a new issue to address the remaining problems, and I can 
take charge of it, but the problem is that no one has told me what the real 
problem is... And "no failing UT" is not a strong argument, as I believe the 
code we added here will not be executed by our existing UTs...

The description is not very clear on what is going on. I would like to see a 
more detailed explanation, ideally pointing out the problematic code in 
CryptoInputStream. Is it the one in Hadoop or in Apache Commons? Is there an 
existing JIRA about it?

Thanks.

> ProtobufLogReader#readNext can incorrectly loop to the same position in the 
> stream until the WAL is rolled
> --
>
> Key: HBASE-20604
> URL: https://issues.apache.org/jira/browse/HBASE-20604
> Project: HBase
>  Issue Type: Bug
>  Components: Replication, wal
>Affects Versions: 3.0.0
>Reporter: Esteban Gutierrez
>Assignee: Esteban Gutierrez
>Priority: Critical
> Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.9
>
> Attachments: HBASE-20604.002.patch, HBASE-20604.003.patch, 
> HBASE-20604.004.patch, HBASE-20604.005.patch, HBASE-20604.patch
>
>
> Every time we call {{ProtobufLogReader#readNext}} we consume the input 
> stream associated with the {{FSDataInputStream}} from the WAL that we are 
> reading. Under certain conditions, e.g. when using encryption at rest 
> ({{CryptoInputStream}}), the stream can return partial data, which can cause 
> a premature EOF that causes {{inputStream.getPos()}} to return the same 
> original position, causing {{ProtobufLogReader#readNext}} to retry the reads 
> until the WAL is rolled.
> The side effect of this issue is that {{ReplicationSource}} can get stuck 
> until the WAL is rolled, causing replication delays of up to an hour in some 
> cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21376) Add some verbose log to MasterProcedureScheduler

2018-11-10 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682658#comment-16682658
 ] 

Hudson commented on HBASE-21376:


Results for branch branch-2
[build #1495 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1495/]: 
(x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1495//General_Nightly_Build_Report/]




(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1495//JDK8_Nightly_Build_Report_(Hadoop2)/]


(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1495//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> Add some verbose log to MasterProcedureScheduler
> 
>
> Key: HBASE-21376
> URL: https://issues.apache.org/jira/browse/HBASE-21376
> Project: HBase
>  Issue Type: Sub-task
>  Components: logging, proc-v2
>Reporter: Allan Yang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.0.3, 2.1.2
>
> Attachments: HBASE-21376.branch-2.0.001.patch, 
> HBASE-21376.branch-2.0.001.patch, HBASE-21376.patch, HBASE-21376.patch
>
>
> As discussed in HBASE-21364, we divided the patch there in two; the critical 
> part was already submitted in HBASE-21364 to branch-2.0 and branch-2.1, but I 
> also added some useful logs which need to be committed to all branches.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-10 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682644#comment-16682644
 ] 

Ted Yu commented on HBASE-21387:


One more note about why I chose 21387.v9.txt as the version for review:

Priority is given to taking snapshots over (delaying) cleaning snapshot files. 
This is because a failed snapshot has higher visibility compared to delayed 
snapshot cleaning.



> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, 
> 21387.v7.txt, 21387.v8.txt, 21387.v9.txt, two-pass-cleaner.v4.txt, 
> two-pass-cleaner.v6.txt, two-pass-cleaner.v9.txt
>
>
> From a recent customer report where ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is a race condition in the handling of in-progress 
> snapshot(s) between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in-progress snapshot(s).
> Suppose that when the RefreshCacheTask runs refreshCache, there is some 
> in-progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date, so the cleaner proceeds to check in-progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) being deemed unreferenced.
> Here is a timeline given by Josh illustrating the scenario:
> At time T0, we are checking if F1 is referenced. At time T1, there is a 
> snapshot S1 in progress that is referencing a file F1. refreshCache() is 
> called, but no completed snapshot references F1. At T2, the snapshot S1, 
> which references F1, completes. At T3, we check in-progress snapshots and S1 
> is not included. Thus, F1 is marked as unreferenced even though S1 references 
> it.
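
For illustration, the ordering problem from this timeline in a simplified
form (the ref-listing methods are stand-ins): S1 completes between the
completed-snapshot pass (T1) and the in-progress pass (T3), so neither pass
sees the reference to F1.

{code:java}
import java.util.HashSet;
import java.util.Set;

abstract class SnapshotCacheSketch {
  boolean isUnreferenced(String file) {
    Set<String> referenced = new HashSet<>(completedSnapshotRefs()); // T1: S1 not completed yet
    referenced.addAll(inProgressSnapshotRefs());                     // T3: S1 no longer in progress
    return !referenced.contains(file); // F1 wrongly reported unreferenced
  }

  abstract Set<String> completedSnapshotRefs();
  abstract Set<String> inProgressSnapshotRefs();
}
{code}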



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21246) Introduce WALIdentity interface

2018-11-10 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682638#comment-16682638
 ] 

Ted Yu commented on HBASE-21246:


Currently there are about 69 failing test classes.

Working through these failing tests.

> Introduce WALIdentity interface
> ---
>
> Key: HBASE-21246
> URL: https://issues.apache.org/jira/browse/HBASE-21246
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Fix For: HBASE-20952
>
> Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, 
> 21246.23.txt, 21246.24.txt, 21246.25.txt, 21246.26.txt, 21246.34.txt, 
> 21246.HBASE-20952.001.patch, 21246.HBASE-20952.002.patch, 
> 21246.HBASE-20952.004.patch, 21246.HBASE-20952.005.patch, 
> 21246.HBASE-20952.007.patch, 21246.HBASE-20952.008.patch, 
> replication-src-creates-wal-reader.jpg, wal-factory-providers.png, 
> wal-providers.png, wal-splitter-reader.jpg, wal-splitter-writer.jpg
>
>
> We are introducing the WALIdentity interface so that the WAL representation 
> can be decoupled from the distributed filesystem.
> The interface provides a getName method whose return value can represent the 
> filename in a distributed filesystem environment or the name of the stream 
> when the WAL is backed by a log stream.
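
A minimal sketch of the interface as described (the actual patch may carry
additional methods and annotations):

{code:java}
public interface WALIdentity {
  /**
   * The filename in a distributed filesystem environment, or the name of the
   * stream when the WAL is backed by a log stream.
   */
  String getName();
}
{code}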



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-20604) ProtobufLogReader#readNext can incorrectly loop to the same position in the stream until the WAL is rolled

2018-11-10 Thread Andrew Purtell (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682571#comment-16682571
 ] 

Andrew Purtell commented on HBASE-20604:


I have no strong opinion; I could go either way. Certainly better late than 
never for more review. However, it's not great for a contributor to have us 
complete a review (and it was completed, see above), then have a commit after 
due testing (which we also have), only to see the issue reopened. I argue this 
is a suboptimal process. A better alternative is a new JIRA; the reviewer who 
volunteered more suggestions can be given the option to perform the work on 
the new JIRA, or maybe Esteban would be interested. For the sake of 
predictability in our process.

> ProtobufLogReader#readNext can incorrectly loop to the same position in the 
> stream until the WAL is rolled
> --
>
> Key: HBASE-20604
> URL: https://issues.apache.org/jira/browse/HBASE-20604
> Project: HBase
>  Issue Type: Bug
>  Components: Replication, wal
>Affects Versions: 3.0.0
>Reporter: Esteban Gutierrez
>Assignee: Esteban Gutierrez
>Priority: Critical
> Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.9
>
> Attachments: HBASE-20604.002.patch, HBASE-20604.003.patch, 
> HBASE-20604.004.patch, HBASE-20604.005.patch, HBASE-20604.patch
>
>
> Every time we call {{ProtobufLogReader#readNext}} we consume the input 
> stream associated with the {{FSDataInputStream}} from the WAL that we are 
> reading. Under certain conditions, e.g. when using encryption at rest 
> ({{CryptoInputStream}}), the stream can return partial data, which can cause 
> a premature EOF that causes {{inputStream.getPos()}} to return the same 
> original position, causing {{ProtobufLogReader#readNext}} to retry the reads 
> until the WAL is rolled.
> The side effect of this issue is that {{ReplicationSource}} can get stuck 
> until the WAL is rolled, causing replication delays of up to an hour in some 
> cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21461) Region CoProcessor for splitting large WAL entries in smaller batches, to handle situation when faulty ingestion had created too many mutations for same cell in single

2018-11-10 Thread Wellington Chevreuil (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682563#comment-16682563
 ] 

Wellington Chevreuil commented on HBASE-21461:
--

{quote}It could be here on this JIRA w/ instructions on how to build. Might be 
ok given limited audience... but wouldn't encourage confidence in the 'hosed' 
operator.
{quote}
Maybe have a built jar available here with install instructions. I guess 
requiring a whole build environment setup would be too discouraging for 
admins/operators.
{quote}Or it'd be in tools repo... Your plan for a replication submodule sounds 
good. In it would be a submodule for this cp ... setting the jdk7 compile 
target and having dependency on branch-1.
{quote}
This approach's benefit is that we start tidying up the house and putting most 
of the support/operations "hacks" on their own shelves. BTW, should we have 
another jira/thread to discuss what else could be moved to the 
"/operator-tools/replication" submodule (assuming there's none yet)?
{quote}Or, we start the cp 'store' repo... where we start putting cps. 
(smile).
{quote}
That could be another way to organise extra tools/features. Are there other CPs 
planned to be moved out of the main hbase project?

> Region CoProcessor for splitting large WAL entries in smaller batches, to 
> handle situation when faulty ingestion had created too many mutations for 
> same cell in single batch
> -
>
> Key: HBASE-21461
> URL: https://issues.apache.org/jira/browse/HBASE-21461
> Project: HBase
>  Issue Type: New Feature
>  Components: hbase-operator-tools, Replication
>Reporter: Wellington Chevreuil
>Assignee: Wellington Chevreuil
>Priority: Minor
> Attachments: 0001-Initial-version-for-WAL-entry-splitter-CP.txt
>
>
> With replication-enabled deployments, it's possible that faulty ingestion 
> clients may lead to a single WalEntry containing too many edits for the same 
> cell. This would cause *ReplicationSink*, in the target cluster, to attempt a 
> single batch mutation with too many operations, which in turn can lead to 
> very large RPC requests that may not fit in the final target RS RPC queue. In 
> this case, the messages below are seen on the target RS trying to perform the 
> sink:
> {noformat}
> WARN org.apache.hadoop.hbase.client.AsyncProcess: #690, table=TABLE_NAME, 
> attempt=4/4 failed=2ops, last exception: 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.CallQueueTooBigException):
>  Call queue is full on /0.0.0.0:60020, is hbase.ipc.server.max.callqueue.size 
> too small? on regionserver01.example.com,60020,1524334173359, tracking 
> started Fri Sep 07 10:35:53 IST 2018; not retrying 2 - final failure
> 2018-09-07 10:40:59,506 ERROR 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSink: Unable to 
> accept edit because:
> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 2 
> actions: RemoteWithExtrasException: 2 times, 
> at 
> org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:247)
> at 
> org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$1800(AsyncProcess.java:227)
> at 
> org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.getErrors(AsyncProcess.java:1663)
> at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:982)
> at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:996){noformat}
> When this problem manifests, replication will be stuck and WAL files will 
> pile up in the source cluster's WALs/oldWALs folders. The typical workaround 
> requires manual cleanup of replication znodes in ZK, and manual WAL replay 
> for the WAL files containing the large entry.
> This CP would handle the issue by checking for large WAL entries and 
> splitting them into smaller batches in the *reReplicateLogEntries* method 
> hook.
> *Additional Note*: HBASE-18027 introduced some safeguards against such large 
> RPC requests, which may already help avoid this scenario. Those are not 
> available in 1.2 releases, though, so this CP tool may still be relevant for 
> 1.2 clusters. It may also be worth having to work around any potential 
> unknown large-RPC issue scenarios.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-20604) ProtobufLogReader#readNext can incorrectly loop to the same position in the stream until the WAL is rolled

2018-11-10 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-20604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682561#comment-16682561
 ] 

stack commented on HBASE-20604:
---

[~esteban] See above. Should we reopen this issue given the outstanding review?

> ProtobufLogReader#readNext can incorrectly loop to the same position in the 
> stream until the WAL is rolled
> --
>
> Key: HBASE-20604
> URL: https://issues.apache.org/jira/browse/HBASE-20604
> Project: HBase
>  Issue Type: Bug
>  Components: Replication, wal
>Affects Versions: 3.0.0
>Reporter: Esteban Gutierrez
>Assignee: Esteban Gutierrez
>Priority: Critical
> Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.9
>
> Attachments: HBASE-20604.002.patch, HBASE-20604.003.patch, 
> HBASE-20604.004.patch, HBASE-20604.005.patch, HBASE-20604.patch
>
>
> Every time we call {{ProtobufLogReader#readNext}} we consume the input 
> stream associated with the {{FSDataInputStream}} from the WAL that we are 
> reading. Under certain conditions, e.g. when using encryption at rest 
> ({{CryptoInputStream}}), the stream can return partial data, which can cause 
> a premature EOF that causes {{inputStream.getPos()}} to return the same 
> original position, causing {{ProtobufLogReader#readNext}} to retry the reads 
> until the WAL is rolled.
> The side effect of this issue is that {{ReplicationSource}} can get stuck 
> until the WAL is rolled, causing replication delays of up to an hour in some 
> cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21454) Kill zk spew

2018-11-10 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682559#comment-16682559
 ] 

stack commented on HBASE-21454:
---

I would like us to take control of what gets logged and when. For too long, zk 
has been over-sharing at INFO level anytime we start an hbase process. I like 
the [~busbey] suggestion that what is here is too radical and that we need to 
add back some of what zk was doing, but on our terms and via our classes 
rather than when zk wants to. Let me make a v2.

> Kill zk spew
> 
>
> Key: HBASE-21454
> URL: https://issues.apache.org/jira/browse/HBASE-21454
> Project: HBase
>  Issue Type: Bug
>  Components: logging, Zookeeper
>Reporter: stack
>Assignee: stack
>Priority: Major
> Fix For: 3.0.0, 2.2.0
>
> Attachments: HBASE-21454.master.001.patch
>
>
> Kill the zk spew. This is radical: it drops the startup listing of CLASSPATH 
> and all properties. We can dial back in what we need after this patch goes 
> in.
> I get spew each time I run a little command in spark-shell. Annoying. It has 
> always been annoying in all logs.
> More might be needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21445) CopyTable by bulkload will write hfile into yarn's HDFS

2018-11-10 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682555#comment-16682555
 ] 

stack commented on HBASE-21445:
---

+1 on the patch.

In the future, consider leaving out reformatting changes like those in this 
patch... They seem arbitrary... and bulk up what could have been a two-liner.

Otherwise, very nice patch. Thanks [~openinx]

> CopyTable by bulkload will write hfile into yarn's HDFS 
> 
>
> Key: HBASE-21445
> URL: https://issues.apache.org/jira/browse/HBASE-21445
> Project: HBase
>  Issue Type: Bug
>  Components: mapreduce
>Reporter: Zheng Hu
>Assignee: Zheng Hu
>Priority: Major
> Fix For: 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.9
>
> Attachments: HBASE-21445.v1.patch
>
>
> When using CopyTable with bulkload, I found that all hfiles are written to 
> our YARN cluster's HDFS, and loading the hfiles into the HBase cluster then 
> fails, because the YARN cluster and the HBase cluster use different HDFS 
> instances.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21463) The checkOnlineRegionsReport can accidentally complete a TRSP

2018-11-10 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682554#comment-16682554
 ] 

stack commented on HBASE-21463:
---

Is there a change in the patch? I see a method made visible for testing, a 
reformat of a log message, and a nice-looking UT, but where is the fix?

> The checkOnlineRegionsReport can accidentally complete a TRSP
> -
>
> Key: HBASE-21463
> URL: https://issues.apache.org/jira/browse/HBASE-21463
> Project: HBase
>  Issue Type: Sub-task
>  Components: amv2
>Reporter: Duo Zhang
>Priority: Critical
> Fix For: 3.0.0, 2.2.0
>
> Attachments: HBASE-21463-UT.patch
>
>
> On our testing cluster, we observed a race condition:
> 1. A regionServerReport request is built.
> 2. A TRSP is scheduled to reopen the region.
> 3. The region is closed on the RS side.
> 4. The OpenRegionProcedure is created.
> 5. The regionServerReport generated at step 1 is executed, and we find that 
> the region is open on the RS, so we update the region state to OPEN.
> 6. The OpenRegionProcedure notices that the region is already in the OPEN 
> state, so it gives up and finishes.
> 7. The TRSP finishes.
> 8. The region is recorded as OPEN on the RS but is actually not open, and 
> cannot recover unless we use HBCK2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21440) Assign procedure on the crashed server is not properly interrupted

2018-11-10 Thread Josh Elser (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682544#comment-16682544
 ] 

Josh Elser commented on HBASE-21440:


Ah, I think I see what the issue is. The change to AP#remoteCallFailed to 
return false would leave the AP suspended, whereas before it would wake up the 
procedure again (see UP#remoteCallFailed). Might need to rework how we 
propagate the success/failure back up the stack.

> Assign procedure on the crashed server is not properly interrupted
> --
>
> Key: HBASE-21440
> URL: https://issues.apache.org/jira/browse/HBASE-21440
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.0.2
>Reporter: Ankit Singhal
>Assignee: Ankit Singhal
>Priority: Major
> Attachments: HBASE-21440.branch-2.0.001.patch, 
> HBASE-21440.branch-2.0.002.patch, HBASE-21440.branch-2.0.003.patch
>
>
> When a server crashes, its SCP checks whether there is already a procedure 
> assigning a region on the crashed server. If one is found, the SCP just 
> interrupts the already running AssignProcedure by calling remoteCallFailed, 
> which internally just changes the region node state to OFFLINE and sends the 
> procedure back through the transition queue state for assignment with a new 
> plan.
> But due to the race between the call to remoteCallFailed and the current 
> state of the already running assign procedure (REGION_TRANSITION_FINISH: the 
> region is already opened), it is possible that the assign procedure goes 
> ahead and updates the regionStateNode to OPEN on a crashed server.
> As the SCP had already skipped this region for assignment, relying on the 
> existing assign procedure to do the right thing, this whole confusion leaves 
> the region in an inaccessible state.
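
For illustration, a hypothetical sketch (simplified, stand-in types; not the
actual patch) of the kind of guard that would close this window: re-check
server liveness under the region node's lock before publishing OPEN.

{code:java}
class AssignFinishSketch {
  enum State { OFFLINE, OPEN }

  static class RegionNode {
    State state = State.OFFLINE;
  }

  interface ServerLiveness {
    boolean isServerOnline(String serverName);
  }

  void finishTransition(RegionNode regionNode, String server, ServerLiveness servers) {
    synchronized (regionNode) {
      if (!servers.isServerOnline(server)) {
        regionNode.state = State.OFFLINE; // crashed mid-transition: let the SCP replan
        return;
      }
      regionNode.state = State.OPEN;
    }
  }
}
{code}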



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21461) Region CoProcessor for splitting large WAL entries in smaller batches, to handle situation when faulty ingestion had created too many mutations for same cell in single

2018-11-10 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682538#comment-16682538
 ] 

stack commented on HBASE-21461:
---

It could be here on this JIRA w/ instructions on how to build. Might be ok 
given limited audience... but wouldn't encourage confidence in the 'hosed' 
operator.

Or it'd be in tools repo... Your plan for a replication submodule sounds good. 
In it would be a submodule for this cp ... setting the jdk7 compile target and 
having dependency on branch-1.

Or, we start the cp 'store' repo... where we start putting cps. (smile).



> Region CoProcessor for splitting large WAL entries in smaller batches, to 
> handle situation when faulty ingestion had created too many mutations for 
> same cell in single batch
> -
>
> Key: HBASE-21461
> URL: https://issues.apache.org/jira/browse/HBASE-21461
> Project: HBase
>  Issue Type: New Feature
>  Components: hbase-operator-tools, Replication
>Reporter: Wellington Chevreuil
>Assignee: Wellington Chevreuil
>Priority: Minor
> Attachments: 0001-Initial-version-for-WAL-entry-splitter-CP.txt
>
>
> With replication-enabled deployments, it's possible that faulty ingestion 
> clients may lead to a single WalEntry containing too many edits for the same 
> cell. This would cause *ReplicationSink*, in the target cluster, to attempt a 
> single batch mutation with too many operations, which in turn can lead to 
> very large RPC requests that may not fit in the final target RS RPC queue. In 
> this case, the messages below are seen on the target RS trying to perform the 
> sink:
> {noformat}
> WARN org.apache.hadoop.hbase.client.AsyncProcess: #690, table=TABLE_NAME, 
> attempt=4/4 failed=2ops, last exception: 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.CallQueueTooBigException):
>  Call queue is full on /0.0.0.0:60020, is hbase.ipc.server.max.callqueue.size 
> too small? on regionserver01.example.com,60020,1524334173359, tracking 
> started Fri Sep 07 10:35:53 IST 2018; not retrying 2 - final failure
> 2018-09-07 10:40:59,506 ERROR 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSink: Unable to 
> accept edit because:
> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 2 
> actions: RemoteWithExtrasException: 2 times, 
> at 
> org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:247)
> at 
> org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$1800(AsyncProcess.java:227)
> at 
> org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.getErrors(AsyncProcess.java:1663)
> at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:982)
> at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:996){noformat}
> When this problem manifests, replication will be stuck and WAL files will 
> pile up in the source cluster's WALs/oldWALs folders. The typical workaround 
> requires manual cleanup of replication znodes in ZK, and manual WAL replay 
> for the WAL files containing the large entry.
> This CP would handle the issue by checking for large WAL entries and 
> splitting them into smaller batches in the *reReplicateLogEntries* method 
> hook.
> *Additional Note*: HBASE-18027 introduced some safeguards against such large 
> RPC requests, which may already help avoid this scenario. Those are not 
> available in 1.2 releases, though, so this CP tool may still be relevant for 
> 1.2 clusters. It may also be worth having to work around any potential 
> unknown large-RPC issue scenarios.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21440) Assign procedure on the crashed server is not properly interrupted

2018-11-10 Thread Josh Elser (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682532#comment-16682532
 ] 

Josh Elser commented on HBASE-21440:


Let me try to put up a v4 :)

> Assign procedure on the crashed server is not properly interrupted
> --
>
> Key: HBASE-21440
> URL: https://issues.apache.org/jira/browse/HBASE-21440
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.0.2
>Reporter: Ankit Singhal
>Assignee: Ankit Singhal
>Priority: Major
> Attachments: HBASE-21440.branch-2.0.001.patch, 
> HBASE-21440.branch-2.0.002.patch, HBASE-21440.branch-2.0.003.patch
>
>
> When a server crashes, its SCP checks whether there is already a procedure 
> assigning a region on the crashed server. If one is found, the SCP just 
> interrupts the already running AssignProcedure by calling remoteCallFailed, 
> which internally just changes the region node state to OFFLINE and sends the 
> procedure back through the transition queue state for assignment with a new 
> plan.
> But due to the race between the call to remoteCallFailed and the current 
> state of the already running assign procedure (REGION_TRANSITION_FINISH: the 
> region is already opened), it is possible that the assign procedure goes 
> ahead and updates the regionStateNode to OPEN on a crashed server.
> As the SCP had already skipped this region for assignment, relying on the 
> existing assign procedure to do the right thing, this whole confusion leaves 
> the region in an inaccessible state.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (HBASE-21457) BackupUtils#getWALFilesOlderThan refers to wrong FileSystem

2018-11-10 Thread Josh Elser (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682531#comment-16682531
 ] 

Josh Elser edited comment on HBASE-21457 at 11/10/18 6:10 PM:
--

{quote}Looking for where the timeout should be increased.
{quote}
If it's not explicit, maybe it's inherited from the test-categorization?

{quote}
 200 seconds were really long. Though I don't see meaningful exception in test 
output related to master initialization.
{quote}

Other Backup tests seem to have run for over 600s. Is an exception not being 
logged correctly? Is there some timeout happening within the Hadoop mini 
clusters? There doesn't seem to be an obvious reason in HBase as to where this 
timeout is coming from, but you have the tools to dig in and figure out why 
this happens.


was (Author: elserj):
{quote}Looking for where the timeout should be increased.
{quote}
If it's not explicit, maybe it's inherited from the test-categorization?

{quote}
 200 seconds were really long. Though I don't see meaningful exception in test 
output related to master initialization.
{quote}

Other Backup tests seem to have run for over 600s. Is an exception not being 
logged correctly? Is there some timeout happening within the Hadoop mini 
clusters?

> BackupUtils#getWALFilesOlderThan refers to wrong FileSystem
> ---
>
> Key: HBASE-21457
> URL: https://issues.apache.org/jira/browse/HBASE-21457
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Janos Gub
>Assignee: Ted Yu
>Priority: Major
> Attachments: 21457.v1.txt, 21457.v2.txt, 21457.v3.txt, 21457.v3.txt
>
>
> Janos reported seeing a backup test failure when testing with a local HDFS 
> for WALs while using WASB/ADLS only for store files.
> Janos spotted the code in BackupUtils#getWALFilesOlderThan, which uses the 
> HBase root dir for retrieving WAL files.
> We should use the helper methods from CommonFSUtils.
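
A sketch of the fix direction (assuming the CommonFSUtils helper names shown
here): resolve WAL files against the WAL root dir and its filesystem, which
may differ from the store-file root.

{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.util.CommonFSUtils;

final class WalRootSketch {
  static Path walRoot(Configuration conf) throws IOException {
    // The WAL root (and its filesystem) may differ from the store-file root,
    // e.g. a local HDFS for WALs while WASB/ADLS holds the store files.
    FileSystem walFs = CommonFSUtils.getWALFileSystem(conf);
    return walFs.makeQualified(CommonFSUtils.getWALRootDir(conf));
  }
}
{code}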



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21457) BackupUtils#getWALFilesOlderThan refers to wrong FileSystem

2018-11-10 Thread Josh Elser (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682531#comment-16682531
 ] 

Josh Elser commented on HBASE-21457:


{quote}Looking for where the timeout should be increased.
{quote}
If it's not explicit, maybe it's inherited from the test-categorization?

{quote}
 200 seconds were really long. Though I don't see meaningful exception in test 
output related to master initialization.
{quote}

Other Backup tests seem to have run for over 600s. Is an exception not being 
logged correctly? Is there some timeout happening within the Hadoop mini 
clusters?

> BackupUtils#getWALFilesOlderThan refers to wrong FileSystem
> ---
>
> Key: HBASE-21457
> URL: https://issues.apache.org/jira/browse/HBASE-21457
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Janos Gub
>Assignee: Ted Yu
>Priority: Major
> Attachments: 21457.v1.txt, 21457.v2.txt, 21457.v3.txt, 21457.v3.txt
>
>
> Janos reported seeing a backup test failure when testing with a local HDFS 
> for WALs while using WASB/ADLS only for store files.
> Janos spotted the code in BackupUtils#getWALFilesOlderThan, which uses the 
> HBase root dir for retrieving WAL files.
> We should use the helper methods from CommonFSUtils.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21376) Add some verbose log to MasterProcedureScheduler

2018-11-10 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682510#comment-16682510
 ] 

Hudson commented on HBASE-21376:


Results for branch branch-2.0
[build #1073 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1073/]: 
(/) *{color:green}+1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1073//General_Nightly_Build_Report/]




(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1073//JDK8_Nightly_Build_Report_(Hadoop2)/]


(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1073//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


> Add some verbose log to MasterProcedureScheduler
> 
>
> Key: HBASE-21376
> URL: https://issues.apache.org/jira/browse/HBASE-21376
> Project: HBase
>  Issue Type: Sub-task
>  Components: logging, proc-v2
>Reporter: Allan Yang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.0.3, 2.1.2
>
> Attachments: HBASE-21376.branch-2.0.001.patch, 
> HBASE-21376.branch-2.0.001.patch, HBASE-21376.patch, HBASE-21376.patch
>
>
> As discussed in HBASE-21364, we divided that patch into two; the 
> critical part was already committed in HBASE-21364 to branch-2.0 and 
> branch-2.1, but I also added some useful logs which need to be committed to all 
> branches.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21376) Add some verbose log to MasterProcedureScheduler

2018-11-10 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682494#comment-16682494
 ] 

Hudson commented on HBASE-21376:


Results for branch branch-2.1
[build #594 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/594/]: 
(/) *{color:green}+1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/594//General_Nightly_Build_Report/]




(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/594//JDK8_Nightly_Build_Report_(Hadoop2)/]


(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/594//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> Add some verbose log to MasterProcedureScheduler
> 
>
> Key: HBASE-21376
> URL: https://issues.apache.org/jira/browse/HBASE-21376
> Project: HBase
>  Issue Type: Sub-task
>  Components: logging, proc-v2
>Reporter: Allan Yang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.0.3, 2.1.2
>
> Attachments: HBASE-21376.branch-2.0.001.patch, 
> HBASE-21376.branch-2.0.001.patch, HBASE-21376.patch, HBASE-21376.patch
>
>
> As discussed in HBASE-21364, we divided that patch into two; the 
> critical part was already committed in HBASE-21364 to branch-2.0 and 
> branch-2.1, but I also added some useful logs which need to be committed to all 
> branches.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-13468) hbase.zookeeper.quorum supports ipv6 address

2018-11-10 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-13468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-13468:
---
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 3.0.0
   Status: Resolved  (was: Patch Available)

Thanks for the patch, maoling

Thanks for the review, Mike

> hbase.zookeeper.quorum supports ipv6 address
> 
>
> Key: HBASE-13468
> URL: https://issues.apache.org/jira/browse/HBASE-13468
> Project: HBase
>  Issue Type: Bug
>Reporter: Mingtao Zhang
>Assignee: maoling
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: HBASE-13468.master.001.patch, 
> HBASE-13468.master.002.patch, HBASE-13468.master.003.patch, 
> HBASE-13468.master.004.patch
>
>
> I put an IPv6 address in hbase.zookeeper.quorum; by the time this string reached 
> the ZooKeeper code, the address was mangled, i.e. only '[1234' was left. 
> I started out using pseudo-distributed mode with embedded zk = true.
> I downloaded 1.0.0, so I am not sure which affected version should be listed here.
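
A self-contained illustration of the parsing pitfall (a hypothetical helper, not the committed fix): split on the closing bracket for IPv6 literals instead of the first colon, which is what produces the truncated '[1234' above.

{code}
// Sketch only: bracket-aware host:port split for one quorum entry.
public final class QuorumEntryParser {
  static String[] splitHostPort(String entry, int defaultPort) {
    if (entry.startsWith("[")) {                    // bracketed IPv6 literal, assumed well-formed
      int close = entry.indexOf(']');
      String host = entry.substring(1, close);
      String rest = entry.substring(close + 1);
      String port = rest.startsWith(":") ? rest.substring(1) : String.valueOf(defaultPort);
      return new String[] { host, port };
    }
    int colon = entry.indexOf(':');                 // hostname or IPv4, optional :port
    return colon < 0
        ? new String[] { entry, String.valueOf(defaultPort) }
        : new String[] { entry.substring(0, colon), entry.substring(colon + 1) };
  }
}
{code}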



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21463) The checkOnlineRegionsReport can accidentally complete a TRSP

2018-11-10 Thread Guanghao Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682410#comment-16682410
 ] 

Guanghao Zhang commented on HBASE-21463:


Great UT. It is not easy to find this bug and reproduce the problem.

> The checkOnlineRegionsReport can accidentally complete a TRSP
> -
>
> Key: HBASE-21463
> URL: https://issues.apache.org/jira/browse/HBASE-21463
> Project: HBase
>  Issue Type: Sub-task
>  Components: amv2
>Reporter: Duo Zhang
>Priority: Critical
> Fix For: 3.0.0, 2.2.0
>
> Attachments: HBASE-21463-UT.patch
>
>
> On our testing cluster, we observed a race condition:
> 1. A regionServerReport request is built
> 2. A TRSP is scheduled to reopen the region
> 3. The region is closed on the RS side
> 4. The OpenRegionProcedure is created
> 5. The regionServerReport generated at step 1 is executed, and we find that 
> the region is open on the RS, so we update the region state to OPEN.
> 6. The OpenRegionProcedure notices that the region is already in the 
> OPEN state, so it gives up and finishes.
> 7. The TRSP finishes.
> 8. The region is recorded as OPEN on the RS when it actually is not, and it 
> cannot recover unless we use HBCK2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21463) The checkOnlineRegionsReport can accidentally complete a TRSP

2018-11-10 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682388#comment-16682388
 ] 

Duo Zhang commented on HBASE-21463:
---

A UT to reproduce the problem. [~zghaobac] FYI. If there are no other opinions, I will 
completely change the behavior of checkOnlineRegions to only report possible 
inconsistencies instead of trying to fix them.
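
To make the proposal concrete, here is a self-contained sketch of the "only report" behavior; all names are illustrative stand-ins, not the real AssignmentManager API:

{code}
import java.util.Map;
import java.util.Set;

// Sketch only: compare what the RS claims to host against the master's
// bookkeeping and log mismatches without mutating any state; repair is
// left to a procedure or to HBCK2.
public final class OnlineRegionsCheck {
  public static void report(String server, Set<String> reportedOpen,
      Map<String, String> masterStates) {
    for (String region : reportedOpen) {
      String state = masterStates.get(region);
      if (!"OPEN".equals(state)) {
        // Do NOT flip the state to OPEN here: that is exactly the race in
        // steps 5-7 above, where a stale report completes a live TRSP.
        System.out.printf("WARN: %s reports %s OPEN but master has %s%n",
            server, region, state);
      }
    }
  }
}
{code}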

> The checkOnlineRegionsReport can accidentally complete a TRSP
> -
>
> Key: HBASE-21463
> URL: https://issues.apache.org/jira/browse/HBASE-21463
> Project: HBase
>  Issue Type: Sub-task
>  Components: amv2
>Reporter: Duo Zhang
>Priority: Critical
> Fix For: 3.0.0, 2.2.0
>
> Attachments: HBASE-21463-UT.patch
>
>
> On our testing cluster, we observed a race condition:
> 1. A regionServerReport request is built
> 2. A TRSP is scheduled to reopen the region
> 3. The region is closed on the RS side
> 4. The OpenRegionProcedure is created
> 5. The regionServerReport generated at step 1 is executed, and we find that 
> the region is open on the RS, so we update the region state to OPEN.
> 6. The OpenRegionProcedure notices that the region is already in the 
> OPEN state, so it gives up and finishes.
> 7. The TRSP finishes.
> 8. The region is recorded as OPEN on the RS when it actually is not, and it 
> cannot recover unless we use HBCK2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21463) The checkOnlineRegionsReport can accidentally complete a TRSP

2018-11-10 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21463:
--
Attachment: HBASE-21463-UT.patch

> The checkOnlineRegionsReport can accidentally complete a TRSP
> -
>
> Key: HBASE-21463
> URL: https://issues.apache.org/jira/browse/HBASE-21463
> Project: HBase
>  Issue Type: Sub-task
>  Components: amv2
>Reporter: Duo Zhang
>Priority: Critical
> Fix For: 3.0.0, 2.2.0
>
> Attachments: HBASE-21463-UT.patch
>
>
> On our testing cluster, we observed a race condition:
> 1. A regionServerReport request is built
> 2. A TRSP is scheduled to reopen the region
> 3. The region is closed on the RS side
> 4. The OpenRegionProcedure is created
> 5. The regionServerReport generated at step 1 is executed, and we find that 
> the region is open on the RS, so we update the region state to OPEN.
> 6. The OpenRegionProcedure notices that the region is already in the 
> OPEN state, so it gives up and finishes.
> 7. The TRSP finishes.
> 8. The region is recorded as OPEN on the RS when it actually is not, and it 
> cannot recover unless we use HBCK2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21445) CopyTable by bulkload will write hfile into yarn's HDFS

2018-11-10 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682394#comment-16682394
 ] 

Hudson commented on HBASE-21445:


Results for branch master
[build #596 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/master/596/]: (x) 
*{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/master/596//General_Nightly_Build_Report/]




(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/master/596//JDK8_Nightly_Build_Report_(Hadoop2)/]


(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/master/596//JDK8_Nightly_Build_Report_(Hadoop3)/]


(x) {color:red}-1 source release artifact{color}
-- See build output for details.


(x) {color:red}-1 client integration test{color}
-- Something went wrong with this stage, [check relevant console 
output|https://builds.apache.org/job/HBase%20Nightly/job/master/596//console].


> CopyTable by bulkload will write hfile into yarn's HDFS 
> 
>
> Key: HBASE-21445
> URL: https://issues.apache.org/jira/browse/HBASE-21445
> Project: HBase
>  Issue Type: Bug
>  Components: mapreduce
>Reporter: Zheng Hu
>Assignee: Zheng Hu
>Priority: Major
> Fix For: 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.9
>
> Attachments: HBASE-21445.v1.patch
>
>
> When using CopyTable with bulkload, I found that all HFiles are written to 
> our YARN cluster's HDFS, and loading the HFiles into the HBase cluster failed, 
> because we use different HDFS instances for the YARN cluster and the HBase cluster.
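
A minimal sketch of the underlying idea, not the v1 patch itself: qualify the bulkload output directory against the FileSystem of hbase.rootdir rather than fs.defaultFS, which on a YARN gateway points at the YARN cluster's HDFS. BulkOutputDirExample and the /tmp path are illustrative.

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.util.FSUtils;

public class BulkOutputDirExample {
  static Path bulkOutputDir(Configuration conf, String jobName) throws java.io.IOException {
    Path hbaseRoot = FSUtils.getRootDir(conf);          // hbase.rootdir
    FileSystem hbaseFs = hbaseRoot.getFileSystem(conf); // the HBase cluster's FS
    // Qualifying pins the scheme/authority to the HBase FS, so MR tasks
    // write HFiles there instead of into the YARN cluster's default FS.
    return hbaseFs.makeQualified(new Path("/tmp/" + jobName + "-bulk"));
  }
}
{code}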



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21437) Bypassed procedure throw IllegalArgumentException when its state is WAITING_TIMEOUT

2018-11-10 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682393#comment-16682393
 ] 

Hudson commented on HBASE-21437:


Results for branch master
[build #596 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/master/596/]: (x) 
*{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/master/596//General_Nightly_Build_Report/]




(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/master/596//JDK8_Nightly_Build_Report_(Hadoop2)/]


(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/master/596//JDK8_Nightly_Build_Report_(Hadoop3)/]


(x) {color:red}-1 source release artifact{color}
-- See build output for details.


(x) {color:red}-1 client integration test{color}
-- Something went wrong with this stage, [check relevant console 
output|https://builds.apache.org/job/HBase%20Nightly/job/master/596//console].


> Bypassed procedure throw IllegalArgumentException when its state is 
> WAITING_TIMEOUT
> ---
>
> Key: HBASE-21437
> URL: https://issues.apache.org/jira/browse/HBASE-21437
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 2.1.1, 2.0.2
>Reporter: Jingyun Tian
>Assignee: Jingyun Tian
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.0.3, 2.1.2
>
> Attachments: HBASE-21437.master.001.patch, 
> HBASE-21437.master.002.patch, HBASE-21437.master.003.patch
>
>
> {code}
> 2018-11-05,18:25:52,735 WARN 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Worker terminating 
> UNNATURALLY null
> java.lang.IllegalArgumentException: NOT RUNNABLE! pid=3, 
> state=WAITING_TIMEOUT:REGION_STATE_TRANSITION_CLOSE, hasLock=true, 
> bypass=true; TransitRegionStateProcedure table=test_failover, 
> region=1bb029ba4ec03b92061be5c4329d2096, UNASSIGN
> at 
> org.apache.hbase.thirdparty.com.google.common.base.Preconditions.checkArgument(Preconditions.java:134)
> at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1620)
> at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1384)
> at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
> at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1948)
> 2018-11-05,18:25:52,736 TRACE 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Worker terminated.
> {code}
> When we bypass a WAITING_TIMEOUT procedure and resubmit it, its state 
> is still WAITING_TIMEOUT, so when the executor runs this procedure, it will 
> throw an exception and cause the worker to terminate.
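
A self-contained sketch of the fix idea (illustrative types, not the ProcedureExecutor code): force a bypassed WAITING_TIMEOUT procedure back to RUNNABLE before it is resubmitted, so the worker's "NOT RUNNABLE!" precondition above cannot fire.

{code}
enum ProcState { RUNNABLE, WAITING, WAITING_TIMEOUT, SUCCESS }

final class BypassExample {
  ProcState state = ProcState.WAITING_TIMEOUT;

  void bypassAndResubmit(Runnable scheduler) {
    if (state == ProcState.WAITING || state == ProcState.WAITING_TIMEOUT) {
      state = ProcState.RUNNABLE;  // make it executable again before queueing
    }
    scheduler.run();               // stand-in for pushing it back to the scheduler
  }
}
{code}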



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21461) Region CoProcessor for splitting large WAL entries in smaller batches, to handle situation when faulty ingestion had created too many mutations for same cell in single

2018-11-10 Thread Wellington Chevreuil (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682392#comment-16682392
 ] 

Wellington Chevreuil commented on HBASE-21461:
--

Thanks for the insights, [~stack]!
{quote}I agree it an operator-tool but it is a bit 'odd' being branch-1 only 
and a CP only (small audience – but super cool throwing these hosed operators a 
lifeline...).
{quote}
My thought was to have it as the first feature of a "replication" sub-module of 
operator-tool. Any other potential utilities for replication-operation-related 
issues could then be placed there as well. The limited audience, though, might 
indeed be something to consider when weighing whether it's really worth the effort for now.
{quote}How would we package it? Would we build a jar over in 
hbase-operator-tool and then operator would take it and install when they had a 
constipated replication stream?
{quote}
Yeah, operators would need to download it (if we are planning to expose a download 
page for operator-tool) or build it, and then install it. Put that way, it does not 
really sound like a tool, since it's not a simple matter of running an external 
application that interacts with and fixes HBase problems. Maybe we should call it a 
"medicine" (a laxative one :)).
{quote}One other thought is that we add to the refguide a section on 
constipation (smile) w/ a pointer here w/ instructions on how to install.
{quote}
Liked this idea too. In that case, where and how would we place the CP? Were you 
thinking of providing the built jar somewhere, or just the raw code in patch 
format attached to a jira? I tend to prefer the former, as a means to cover a 
broader audience of operators who may not be familiar with the build process.

 

> Region CoProcessor for splitting large WAL entries in smaller batches, to 
> handle situation when faulty ingestion had created too many mutations for 
> same cell in single batch
> -
>
> Key: HBASE-21461
> URL: https://issues.apache.org/jira/browse/HBASE-21461
> Project: HBase
>  Issue Type: New Feature
>  Components: hbase-operator-tools, Replication
>Reporter: Wellington Chevreuil
>Assignee: Wellington Chevreuil
>Priority: Minor
> Attachments: 0001-Initial-version-for-WAL-entry-splitter-CP.txt
>
>
> With replication-enabled deployments, it's possible that faulty ingestion 
> clients may lead to a single WalEntry containing too many edits for the same cell. 
> This would cause *ReplicationSink*, in the target cluster, to attempt a single 
> batch mutation with too many operations, which in turn can lead to very large 
> RPC requests that may not fit in the final target RS RPC queue. In this 
> case, the messages below are seen on the target RS trying to perform the sink:
> {noformat}
> WARN org.apache.hadoop.hbase.client.AsyncProcess: #690, table=TABLE_NAME, 
> attempt=4/4 failed=2ops, last exception: 
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.CallQueueTooBigException):
>  Call queue is full on /0.0.0.0:60020, is hbase.ipc.server.max.callqueue.size 
> too small? on regionserver01.example.com,60020,1524334173359, tracking 
> started Fri Sep 07 10:35:53 IST 2018; not retrying 2 - final failure
> 2018-09-07 10:40:59,506 ERROR 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSink: Unable to 
> accept edit because:
> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 2 
> actions: RemoteWithExtrasException: 2 times, 
> at 
> org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:247)
> at 
> org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$1800(AsyncProcess.java:227)
> at 
> org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.getErrors(AsyncProcess.java:1663)
> at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:982)
> at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:996){noformat}
> When this problem manifests, replication will be stuck and WAL files will 
> pile up in the source cluster's WALs/oldWALs folder. The typical workaround requires 
> manual cleanup of replication znodes in ZK, and manual WAL replay for the WAL 
> files containing the large entry.
> This CP would handle the issue by checking for large WAL entries and 
> splitting them into smaller batches in the *reReplicateLogEntries* method 
> hook.
> *Additional Note*: HBASE-18027 introduced some safeguards against such large RPC 
> requests, which may already help avoid this scenario. That is not available 
> for 1.2 releases, though, and this CP tool may still be relevant for 1.2 
> clusters. It may also still be worth having to work around any potential 
> unknown large RPC 
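
For the batching part, a self-contained illustration of the splitting idea (hypothetical names; the actual CP is in the attached patch): break a large list of mutations into sub-batches no bigger than a cap, so each sink RPC stays under the call-queue limits quoted above.

{code}
import java.util.ArrayList;
import java.util.List;

// Sketch only: split one oversized WAL entry's mutations into capped batches.
public final class EntrySplitter {
  public static <M> List<List<M>> split(List<M> mutations, int maxPerBatch) {
    List<List<M>> batches = new ArrayList<>();
    for (int i = 0; i < mutations.size(); i += maxPerBatch) {
      batches.add(new ArrayList<>(
          mutations.subList(i, Math.min(i + maxPerBatch, mutations.size()))));
    }
    return batches;
  }
}
{code}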

[jira] [Commented] (HBASE-21445) CopyTable by bulkload will write hfile into yarn's HDFS

2018-11-10 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682361#comment-16682361
 ] 

Hudson commented on HBASE-21445:


Results for branch branch-2.1
[build #593 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/593/]: 
(/) *{color:green}+1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/593//General_Nightly_Build_Report/]




(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/593//JDK8_Nightly_Build_Report_(Hadoop2)/]


(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/593//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> CopyTable by bulkload will write hfile into yarn's HDFS 
> 
>
> Key: HBASE-21445
> URL: https://issues.apache.org/jira/browse/HBASE-21445
> Project: HBase
>  Issue Type: Bug
>  Components: mapreduce
>Reporter: Zheng Hu
>Assignee: Zheng Hu
>Priority: Major
> Fix For: 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.9
>
> Attachments: HBASE-21445.v1.patch
>
>
> When using CopyTable with bulkload, I found that all HFiles are written to 
> our YARN cluster's HDFS, and loading the HFiles into the HBase cluster failed, 
> because we use different HDFS instances for the YARN cluster and the HBase cluster.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21445) CopyTable by bulkload will write hfile into yarn's HDFS

2018-11-10 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682356#comment-16682356
 ] 

Hudson commented on HBASE-21445:


Results for branch branch-2.0
[build #1072 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1072/]: 
(x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1072//General_Nightly_Build_Report/]




(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1072//JDK8_Nightly_Build_Report_(Hadoop2)/]


(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1072//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


> CopyTable by bulkload will write hfile into yarn's HDFS 
> 
>
> Key: HBASE-21445
> URL: https://issues.apache.org/jira/browse/HBASE-21445
> Project: HBase
>  Issue Type: Bug
>  Components: mapreduce
>Reporter: Zheng Hu
>Assignee: Zheng Hu
>Priority: Major
> Fix For: 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.9
>
> Attachments: HBASE-21445.v1.patch
>
>
> When using CopyTable with bulkload, I found that all HFiles are written to 
> our YARN cluster's HDFS, and loading the HFiles into the HBase cluster failed, 
> because we use different HDFS instances for the YARN cluster and the HBase cluster.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21445) CopyTable by bulkload will write hfile into yarn's HDFS

2018-11-10 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682351#comment-16682351
 ] 

Hudson commented on HBASE-21445:


Results for branch branch-2
[build #1494 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1494/]: 
(x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1494//General_Nightly_Build_Report/]




(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1494//JDK8_Nightly_Build_Report_(Hadoop2)/]


(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1494//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> CopyTable by bulkload will write hfile into yarn's HDFS 
> 
>
> Key: HBASE-21445
> URL: https://issues.apache.org/jira/browse/HBASE-21445
> Project: HBase
>  Issue Type: Bug
>  Components: mapreduce
>Reporter: Zheng Hu
>Assignee: Zheng Hu
>Priority: Major
> Fix For: 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.9
>
> Attachments: HBASE-21445.v1.patch
>
>
> When using CopyTable with bulkload, I found that all HFiles are written to 
> our YARN cluster's HDFS, and loading the HFiles into the HBase cluster failed, 
> because we use different HDFS instances for the YARN cluster and the HBase cluster.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21445) CopyTable by bulkload will write hfile into yarn's HDFS

2018-11-10 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682349#comment-16682349
 ] 

Hudson commented on HBASE-21445:


Results for branch branch-1.4
[build #543 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.4/543/]: 
(x) *{color:red}-1 overall{color}*

details (if available):

(x) {color:red}-1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.4/543//General_Nightly_Build_Report/]


(x) {color:red}-1 jdk7 checks{color}
-- For more information [see jdk7 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.4/543//JDK7_Nightly_Build_Report/]


(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.4/543//JDK8_Nightly_Build_Report_(Hadoop2)/]




(/) {color:green}+1 source release artifact{color}
-- See build output for details.


> CopyTable by bulkload will write hfile into yarn's HDFS 
> 
>
> Key: HBASE-21445
> URL: https://issues.apache.org/jira/browse/HBASE-21445
> Project: HBase
>  Issue Type: Bug
>  Components: mapreduce
>Reporter: Zheng Hu
>Assignee: Zheng Hu
>Priority: Major
> Fix For: 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.9
>
> Attachments: HBASE-21445.v1.patch
>
>
> When using CopyTable with bulkload, I found that all HFiles are written to 
> our YARN cluster's HDFS, and loading the HFiles into the HBase cluster failed, 
> because we use different HDFS instances for the YARN cluster and the HBase cluster.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21445) CopyTable by bulkload will write hfile into yarn's HDFS

2018-11-10 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682320#comment-16682320
 ] 

Hudson commented on HBASE-21445:


Results for branch branch-1.3
[build #536 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.3/536/]: 
(x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.3/536//General_Nightly_Build_Report/]


(/) {color:green}+1 jdk7 checks{color}
-- For more information [see jdk7 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.3/536//JDK7_Nightly_Build_Report/]


(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.3/536//JDK8_Nightly_Build_Report_(Hadoop2)/]




(/) {color:green}+1 source release artifact{color}
-- See build output for details.


> CopyTable by bulkload will write hfile into yarn's HDFS 
> 
>
> Key: HBASE-21445
> URL: https://issues.apache.org/jira/browse/HBASE-21445
> Project: HBase
>  Issue Type: Bug
>  Components: mapreduce
>Reporter: Zheng Hu
>Assignee: Zheng Hu
>Priority: Major
> Fix For: 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.9
>
> Attachments: HBASE-21445.v1.patch
>
>
> When using CopyTable with bulkload, I found that all HFiles are written to 
> our YARN cluster's HDFS, and loading the HFiles into the HBase cluster failed, 
> because we use different HDFS instances for the YARN cluster and the HBase cluster.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21445) CopyTable by bulkload will write hfile into yarn's HDFS

2018-11-10 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682307#comment-16682307
 ] 

Hudson commented on HBASE-21445:


Results for branch branch-1
[build #546 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-1/546/]: 
(x) *{color:red}-1 overall{color}*

details (if available):

(x) {color:red}-1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1/546//General_Nightly_Build_Report/]


(x) {color:red}-1 jdk7 checks{color}
-- For more information [see jdk7 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1/546//JDK7_Nightly_Build_Report/]


(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1/546//JDK8_Nightly_Build_Report_(Hadoop2)/]




(x) {color:red}-1 source release artifact{color}
-- See build output for details.


> CopyTable by bulkload will write hfile into yarn's HDFS 
> 
>
> Key: HBASE-21445
> URL: https://issues.apache.org/jira/browse/HBASE-21445
> Project: HBase
>  Issue Type: Bug
>  Components: mapreduce
>Reporter: Zheng Hu
>Assignee: Zheng Hu
>Priority: Major
> Fix For: 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.9
>
> Attachments: HBASE-21445.v1.patch
>
>
> When using CopyTable with bulkload, I found that all HFiles are written to 
> our YARN cluster's HDFS, and loading the HFiles into the HBase cluster failed, 
> because we use different HDFS instances for the YARN cluster and the HBase cluster.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21463) The checkOnlineRegionsReport can accidentally complete a TRSP

2018-11-10 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21463:
--
Priority: Critical  (was: Major)

> The checkOnlineRegionsReport can accidentally complete a TRSP
> -
>
> Key: HBASE-21463
> URL: https://issues.apache.org/jira/browse/HBASE-21463
> Project: HBase
>  Issue Type: Sub-task
>  Components: amv2
>Reporter: Duo Zhang
>Priority: Critical
> Fix For: 3.0.0, 2.2.0
>
>
> On our testing cluster, we observed a race condition:
> 1. A regionServerReport request is built
> 2. A TRSP is scheduled to reopen the region
> 3. The region is closed on the RS side
> 4. The OpenRegionProcedure is created
> 5. The regionServerReport generated at step 1 is executed, and we find that 
> the region is open on the RS, so we update the region state to OPEN.
> 6. The OpenRegionProcedure notices that the region is already in the 
> OPEN state, so it gives up and finishes.
> 7. The TRSP finishes.
> 8. The region is recorded as OPEN on the RS when it actually is not, and it 
> cannot recover unless we use HBCK2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21445) CopyTable by bulkload will write hfile into yarn's HDFS

2018-11-10 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682306#comment-16682306
 ] 

Hudson commented on HBASE-21445:


Results for branch branch-1.2
[build #545 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.2/545/]: 
(x) *{color:red}-1 overall{color}*

details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.2/545//General_Nightly_Build_Report/]


(/) {color:green}+1 jdk7 checks{color}
-- For more information [see jdk7 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.2/545//JDK7_Nightly_Build_Report/]


(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.2/545//JDK8_Nightly_Build_Report_(Hadoop2)/]




(/) {color:green}+1 source release artifact{color}
-- See build output for details.


> CopyTable by bulkload will write hfile into yarn's HDFS 
> 
>
> Key: HBASE-21445
> URL: https://issues.apache.org/jira/browse/HBASE-21445
> Project: HBase
>  Issue Type: Bug
>  Components: mapreduce
>Reporter: Zheng Hu
>Assignee: Zheng Hu
>Priority: Major
> Fix For: 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.9
>
> Attachments: HBASE-21445.v1.patch
>
>
> When using CopyTable with bulkload, I found that all HFiles are written to 
> our YARN cluster's HDFS, and loading the HFiles into the HBase cluster failed, 
> because we use different HDFS instances for the YARN cluster and the HBase cluster.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21376) Add some verbose log to MasterProcedureScheduler

2018-11-10 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21376:
--
Resolution: Fixed
  Assignee: Duo Zhang  (was: Allan Yang)
Status: Resolved  (was: Patch Available)

Pushed to branch-2.0+.

> Add some verbose log to MasterProcedureScheduler
> 
>
> Key: HBASE-21376
> URL: https://issues.apache.org/jira/browse/HBASE-21376
> Project: HBase
>  Issue Type: Sub-task
>  Components: logging, proc-v2
>Reporter: Allan Yang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 2.0.3, 2.1.2
>
> Attachments: HBASE-21376.branch-2.0.001.patch, 
> HBASE-21376.branch-2.0.001.patch, HBASE-21376.patch, HBASE-21376.patch
>
>
> As discussed in HBASE-21364, we divided that patch into two; the 
> critical part was already committed in HBASE-21364 to branch-2.0 and 
> branch-2.1, but I also added some useful logs which need to be committed to all 
> branches.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21377) Missing procedure stack index when restarting

2018-11-10 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682302#comment-16682302
 ] 

Duo Zhang commented on HBASE-21377:
---

Haven't seen this in a long time. I plan to close this after 
TestMergeTableRegionsProcedure is moved off the flaky list.

> Missing procedure stack index when restarting
> -
>
> Key: HBASE-21377
> URL: https://issues.apache.org/jira/browse/HBASE-21377
> Project: HBase
>  Issue Type: Sub-task
>  Components: proc-v2
>Reporter: Duo Zhang
>Priority: Major
> Fix For: 3.0.0, 2.2.0
>
> Attachments: HBASE-21377-debuglog.patch
>
>
> TestMergeTableRegionsProcedure is still flaky, and I found this in the output
> {noformat}
> 2018-10-24 03:46:12,842 ERROR [Time-limited test] wal.WALProcedureTree(198): 
> Missing stack id 6, max stack id is 8, root procedure is Procedure(pid=42, 
> ppid=-1, 
> class=org.apache.hadoop.hbase.master.assignment.MergeTableRegionsProcedure)
> 2018-10-24 03:46:12,847 ERROR [Time-limited test] 
> procedure2.ProcedureExecutor$2(451): Corrupt pid=42, 
> state=WAITING:MERGE_TABLE_REGIONS_CHECK_CLOSED_REGIONS, hasLock=false; 
> MergeTableRegionsProcedure table=testRollbackAndDoubleExecution, 
> regions=[72aed4d14ac73faaa1755e248a55b71a, a848f3ca26989865d59cd0683ae6], 
> forcibly=false
> 2018-10-24 03:46:12,847 ERROR [Time-limited test] 
> procedure2.ProcedureExecutor$2(451): Corrupt pid=43, ppid=42, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_CLOSED, hasLock=false; 
> TransitRegionStateProcedure table=testRollbackAndDoubleExecution, 
> region=72aed4d14ac73faaa1755e248a55b71a, UNASSIGN
> 2018-10-24 03:46:12,848 ERROR [Time-limited test] 
> procedure2.ProcedureExecutor$2(451): Corrupt pid=44, ppid=42, 
> state=WAITING:REGION_STATE_TRANSITION_CONFIRM_CLOSED, hasLock=false; 
> TransitRegionStateProcedure table=testRollbackAndDoubleExecution, 
> region=a848f3ca26989865d59cd0683ae6, UNASSIGN
> 2018-10-24 03:46:12,848 ERROR [Time-limited test] 
> procedure2.ProcedureExecutor$2(451): Corrupt pid=45, ppid=43, state=SUCCESS, 
> hasLock=false; org.apache.hadoop.hbase.master.assignment.CloseRegionProcedure
> 2018-10-24 03:46:12,849 ERROR [Time-limited test] 
> procedure2.ProcedureExecutor$2(451): Corrupt pid=46, ppid=44, state=RUNNABLE, 
> hasLock=false; org.apache.hadoop.hbase.master.assignment.CloseRegionProcedure
> {noformat}
> Need to dig more.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21376) Add some verbose log to MasterProcedureScheduler

2018-11-10 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21376:
--
Fix Version/s: 2.2.0
   3.0.0

> Add some verbose log to MasterProcedureScheduler
> 
>
> Key: HBASE-21376
> URL: https://issues.apache.org/jira/browse/HBASE-21376
> Project: HBase
>  Issue Type: Sub-task
>  Components: logging, proc-v2
>Reporter: Allan Yang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.0.3, 2.1.2
>
> Attachments: HBASE-21376.branch-2.0.001.patch, 
> HBASE-21376.branch-2.0.001.patch, HBASE-21376.patch, HBASE-21376.patch
>
>
> As discussed in HBASE-21364, we divided that patch into two; the 
> critical part was already committed in HBASE-21364 to branch-2.0 and 
> branch-2.1, but I also added some useful logs which need to be committed to all 
> branches.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21376) Add some verbose log to MasterProcedureScheduler

2018-11-10 Thread Duo Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-21376:
--
Component/s: proc-v2
 logging

> Add some verbose log to MasterProcedureScheduler
> 
>
> Key: HBASE-21376
> URL: https://issues.apache.org/jira/browse/HBASE-21376
> Project: HBase
>  Issue Type: Sub-task
>  Components: logging, proc-v2
>Reporter: Allan Yang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.0.3, 2.1.2
>
> Attachments: HBASE-21376.branch-2.0.001.patch, 
> HBASE-21376.branch-2.0.001.patch, HBASE-21376.patch, HBASE-21376.patch
>
>
> As discussed in HBASE-21364, we divided that patch into two; the 
> critical part was already committed in HBASE-21364 to branch-2.0 and 
> branch-2.1, but I also added some useful logs which need to be committed to all 
> branches.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21454) Kill zk spew

2018-11-10 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682293#comment-16682293
 ] 

Duo Zhang commented on HBASE-21454:
---

I think we should find a way to disable all logs when executing the shell. But for 
a running master or regionserver instance the log is useful...
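
One possible shape for that, as a guess rather than a committed change: ship a quieter client-side log4j profile for shell sessions, e.g.

{noformat}
# Illustrative log4j.properties overrides for `hbase shell` sessions only;
# server processes would keep their existing, more verbose config.
log4j.logger.org.apache.zookeeper=ERROR
log4j.logger.org.apache.hadoop.hbase.zookeeper=ERROR
{noformat}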

> Kill zk spew
> 
>
> Key: HBASE-21454
> URL: https://issues.apache.org/jira/browse/HBASE-21454
> Project: HBase
>  Issue Type: Bug
>  Components: logging, Zookeeper
>Reporter: stack
>Assignee: stack
>Priority: Major
> Fix For: 3.0.0, 2.2.0
>
> Attachments: HBASE-21454.master.001.patch
>
>
> Kill the zk spew. This is radical: it drops the startup listing of CLASSPATH and 
> all properties. We can dial back in what we need after this patch goes in.
> I get spew each time I run a little command in spark-shell. Annoying. Always 
> been annoying in all logs.
> More might be needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)