[jira] [Updated] (HBASE-21463) The checkOnlineRegionsReport can accidentally complete a TRSP
[ https://issues.apache.org/jira/browse/HBASE-21463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-21463: -- Attachment: HBASE-21463.patch > The checkOnlineRegionsReport can accidentally complete a TRSP > - > > Key: HBASE-21463 > URL: https://issues.apache.org/jira/browse/HBASE-21463 > Project: HBase > Issue Type: Sub-task > Components: amv2 > Reporter: Duo Zhang > Priority: Critical > Fix For: 3.0.0, 2.2.0 > > Attachments: HBASE-21463-UT.patch, HBASE-21463.patch > > > On our testing cluster, we observed a race condition: > 1. A regionServerReport request is built > 2. A TRSP is scheduled to reopen the region > 3. The region is closed on the RS side > 4. The OpenRegionProcedure is created > 5. The regionServerReport generated at step 1 is executed, and we find that > the region is opened on the RS, so we update the region state to OPEN. > 6. The OpenRegionProcedure notices that the region is already in the > OPEN state, so it gives up and finishes. > 7. The TRSP finishes. > 8. The region is recorded as OPEN on the RS but actually is not, and cannot > recover unless we use HBCK2. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21463) The checkOnlineRegionsReport can accidentally complete a TRSP
[ https://issues.apache.org/jira/browse/HBASE-21463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-21463: -- Assignee: Duo Zhang Status: Patch Available (was: Open) > The checkOnlineRegionsReport can accidentally complete a TRSP > - > > Key: HBASE-21463 > URL: https://issues.apache.org/jira/browse/HBASE-21463 > Project: HBase > Issue Type: Sub-task > Components: amv2 > Reporter: Duo Zhang > Assignee: Duo Zhang > Priority: Critical > Fix For: 3.0.0, 2.2.0 > > Attachments: HBASE-21463-UT.patch, HBASE-21463.patch > > > On our testing cluster, we observed a race condition: > 1. A regionServerReport request is built > 2. A TRSP is scheduled to reopen the region > 3. The region is closed on the RS side > 4. The OpenRegionProcedure is created > 5. The regionServerReport generated at step 1 is executed, and we find that > the region is opened on the RS, so we update the region state to OPEN. > 6. The OpenRegionProcedure notices that the region is already in the > OPEN state, so it gives up and finishes. > 7. The TRSP finishes. > 8. The region is recorded as OPEN on the RS but actually is not, and cannot > recover unless we use HBCK2. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
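The eight-step race above comes down to a stale report completing a transition that a procedure still owns. A minimal, hypothetical sketch of the guard direction (none of these names are HBase's actual AssignmentManager API; the state node and procedure-id field are illustrative):

```java
// Hypothetical sketch: a (possibly stale) regionServerReport must not move a
// region to OPEN while a TransitRegionStateProcedure (TRSP) still owns the
// region's state node. Names here are illustrative, not HBase's real API.
public class ReportGuard {
  enum State { OPEN, CLOSED }

  static final class RegionStateNode {
    State state = State.CLOSED;
    Long owningProcId = null; // non-null while a TRSP is driving this region
  }

  /** Applies a server report; ignored while a TRSP owns the region. */
  static void applyReport(RegionStateNode node, State reported) {
    if (node.owningProcId != null) {
      // A TRSP is in charge; the report must not complete it on its behalf
      // (steps 5-7 of the race above).
      return;
    }
    node.state = reported;
  }
}
```

With this guard, the step-5 report would be ignored and the OpenRegionProcedure would still have to actually open the region.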
[jira] [Commented] (HBASE-21423) Procedures for meta table/region should be able to execute in separate workers
[ https://issues.apache.org/jira/browse/HBASE-21423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682754#comment-16682754 ] Allan Yang commented on HBASE-21423: Uploaded an addendum to resolve some remaining issues. > Procedures for meta table/region should be able to execute in separate > workers > --- > > Key: HBASE-21423 > URL: https://issues.apache.org/jira/browse/HBASE-21423 > Project: HBase > Issue Type: Sub-task > Affects Versions: 2.1.1, 2.0.2 > Reporter: Allan Yang > Assignee: Allan Yang > Priority: Major > Fix For: 2.0.3, 2.1.2 > > Attachments: HBASE-21423.branch-2.0.001.patch, > HBASE-21423.branch-2.0.002.patch, HBASE-21423.branch-2.0.003.patch, > HBASE-21423.branch-2.0.addendum.patch > > > We have a higher priority for meta table procedures, but only at the queue level. > There is a case where the meta table is closed and an AssignProcedure (or TRSP > in branch-2+) is waiting to be executed, but at the same time all the > worker threads are executing procedures that need to write to the meta table; then all the > workers will be stuck retrying the meta writes, and no worker will take the > AssignProcedure for meta. > Though we have a mechanism that detects the stuck state and adds more > 'KeepAlive' workers to the pool to resolve it, by then it has already been stuck a > long time. > This is a real case I encountered in ITBLL. > So, I added an 'urgent worker' to the ProcedureExecutor, which only takes meta > procedures (other workers can take meta procedures too), which can resolve > this kind of stuck state. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
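The "urgent worker" idea described above can be sketched roughly as two queues plus an asymmetric poll policy. This is a toy model, not the real ProcedureExecutor (the class and method names are invented for illustration):

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Toy sketch of the scheduling idea: the urgent worker only ever takes meta
// procedures, so meta assignment cannot be starved by user-table work, while
// ordinary workers may still pick up meta procedures too.
public class UrgentWorkerSketch {
  static final class Procedure {
    final boolean meta;
    Procedure(boolean meta) { this.meta = meta; }
  }

  private final Queue<Procedure> metaQueue = new ArrayDeque<>();
  private final Queue<Procedure> generalQueue = new ArrayDeque<>();

  void submit(Procedure p) {
    (p.meta ? metaQueue : generalQueue).add(p);
  }

  /** The dedicated urgent worker polls only meta work. */
  Procedure pollUrgent() {
    return metaQueue.poll();
  }

  /** Ordinary workers prefer meta work but take anything available. */
  Procedure pollGeneral() {
    Procedure p = metaQueue.poll();
    return p != null ? p : generalQueue.poll();
  }
}
```

Even if every ordinary worker is blocked writing to meta, `pollUrgent()` still hands the meta AssignProcedure to the urgent worker, which is the stuck scenario from the ITBLL run.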
[jira] [Updated] (HBASE-21423) Procedures for meta table/region should be able to execute in separate workers
[ https://issues.apache.org/jira/browse/HBASE-21423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allan Yang updated HBASE-21423: --- Attachment: HBASE-21423.branch-2.0.addendum.patch > Procedures for meta table/region should be able to execute in separate > workers > --- > > Key: HBASE-21423 > URL: https://issues.apache.org/jira/browse/HBASE-21423 > Project: HBase > Issue Type: Sub-task > Affects Versions: 2.1.1, 2.0.2 > Reporter: Allan Yang > Assignee: Allan Yang > Priority: Major > Fix For: 2.0.3, 2.1.2 > > Attachments: HBASE-21423.branch-2.0.001.patch, > HBASE-21423.branch-2.0.002.patch, HBASE-21423.branch-2.0.003.patch, > HBASE-21423.branch-2.0.addendum.patch > > > We have a higher priority for meta table procedures, but only at the queue level. > There is a case where the meta table is closed and an AssignProcedure (or TRSP > in branch-2+) is waiting to be executed, but at the same time all the > worker threads are executing procedures that need to write to the meta table; then all the > workers will be stuck retrying the meta writes, and no worker will take the > AssignProcedure for meta. > Though we have a mechanism that detects the stuck state and adds more > 'KeepAlive' workers to the pool to resolve it, by then it has already been stuck a > long time. > This is a real case I encountered in ITBLL. > So, I added an 'urgent worker' to the ProcedureExecutor, which only takes meta > procedures (other workers can take meta procedures too), which can resolve > this kind of stuck state. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Reopened] (HBASE-21423) Procedures for meta table/region should be able to execute in separate workers
[ https://issues.apache.org/jira/browse/HBASE-21423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allan Yang reopened HBASE-21423: > Procedures for meta table/region should be able to execute in separate > workers > --- > > Key: HBASE-21423 > URL: https://issues.apache.org/jira/browse/HBASE-21423 > Project: HBase > Issue Type: Sub-task > Affects Versions: 2.1.1, 2.0.2 > Reporter: Allan Yang > Assignee: Allan Yang > Priority: Major > Fix For: 2.0.3, 2.1.2 > > Attachments: HBASE-21423.branch-2.0.001.patch, > HBASE-21423.branch-2.0.002.patch, HBASE-21423.branch-2.0.003.patch > > > We have a higher priority for meta table procedures, but only at the queue level. > There is a case where the meta table is closed and an AssignProcedure (or TRSP > in branch-2+) is waiting to be executed, but at the same time all the > worker threads are executing procedures that need to write to the meta table; then all the > workers will be stuck retrying the meta writes, and no worker will take the > AssignProcedure for meta. > Though we have a mechanism that detects the stuck state and adds more > 'KeepAlive' workers to the pool to resolve it, by then it has already been stuck a > long time. > This is a real case I encountered in ITBLL. > So, I added an 'urgent worker' to the ProcedureExecutor, which only takes meta > procedures (other workers can take meta procedures too), which can resolve > this kind of stuck state. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21401) Sanity check in BaseDecoder#parseCell
[ https://issues.apache.org/jira/browse/HBASE-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682732#comment-16682732 ] stack commented on HBASE-21401: --- Sorry. Late to this. I think the extra parse this patch adds is the wrong approach. Let checksum find corruption in streams. Add a check up front to make sure only good types get entered in the system. If a bug in the decoder, fix that rather than 'check' all values all the time? [~openinx]. > Sanity check in BaseDecoder#parseCell > - > > Key: HBASE-21401 > URL: https://issues.apache.org/jira/browse/HBASE-21401 > Project: HBase > Issue Type: Sub-task > Components: regionserver > Reporter: Zheng Hu > Assignee: Zheng Hu > Priority: Critical > Fix For: 3.0.0, 2.2.0, 2.0.3, 2.1.2 > > Attachments: HBASE-21401.v1.patch, HBASE-21401.v2.patch, > HBASE-21401.v3.patch, HBASE-21401.v4.patch, HBASE-21401.v4.patch, > HBASE-21401.v5.patch > > > In KeyValueDecoder & ByteBuffKeyValueDecoder, we pass a byte buffer to > initialize the Cell without a sanity check (checking whether each field's offset > exceeds the byte buffer or not), so an ArrayIndexOutOfBoundsException may happen > when reading the cell's fields, as in HBASE-21379; it's hard to debug this > kind of bug. > An earlier check will help to find such bugs. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
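The kind of up-front range check under discussion can be sketched in a few lines. This is a hedged illustration of the general technique, not the actual patch: the field names and helper signatures are invented, and a real KeyValue has several fields (row, family, qualifier, value) that would each be checked this way.

```java
// Minimal sketch of an offset/length sanity check performed before a Cell is
// materialized from a byte buffer, so corruption fails fast with a clear
// message instead of a later ArrayIndexOutOfBoundsException.
public class CellSanity {
  /** Throws if [offset, offset+length) falls outside the backing array. */
  static void checkRange(byte[] buf, int offset, int length, String field) {
    // Long arithmetic so offset + length cannot overflow int.
    if (offset < 0 || length < 0 || (long) offset + length > buf.length) {
      throw new IllegalArgumentException("Invalid " + field + ": offset=" + offset
          + ", length=" + length + ", capacity=" + buf.length);
    }
  }

  /** Non-throwing variant for callers that prefer a boolean. */
  static boolean isValidRange(byte[] buf, int offset, int length) {
    return offset >= 0 && length >= 0 && (long) offset + length <= buf.length;
  }
}
```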
[jira] [Commented] (HBASE-21410) A helper page that help find all problematic regions and procedures
[ https://issues.apache.org/jira/browse/HBASE-21410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682730#comment-16682730 ] stack commented on HBASE-21410: --- [~tianjingyun] This patch looks really good. Give me a day or so. I want to try it. Having it in 2.1 only is fine I think. Thanks. > A helper page that help find all problematic regions and procedures > --- > > Key: HBASE-21410 > URL: https://issues.apache.org/jira/browse/HBASE-21410 > Project: HBase > Issue Type: Improvement >Affects Versions: 3.0.0, 2.2.0, 2.1.1 >Reporter: Jingyun Tian >Assignee: Jingyun Tian >Priority: Major > Fix For: 3.0.0, 2.2.0 > > Attachments: HBASE-21410.branch-2.1.001.patch, > HBASE-21410.branch-2.1.002.patch, HBASE-21410.master.001.patch, > HBASE-21410.master.002.patch, HBASE-21410.master.003.patch, > HBASE-21410.master.004.patch, Screenshot from 2018-10-30 19-06-21.png, > Screenshot from 2018-10-30 19-06-42.png, Screenshot from 2018-10-31 > 10-11-38.png, Screenshot from 2018-10-31 10-11-56.png, Screenshot from > 2018-11-01 17-56-02.png, Screenshot from 2018-11-01 17-56-15.png > > > *This page is mainly focus on finding the regions stuck in some state that > cannot be assigned. My proposal of the page is as follows:* > !Screenshot from 2018-10-30 19-06-21.png! > *From this page we can see all regions in RIT queue and their related > procedures. If we can determine that these regions' state are abnormal, we > can click the link 'Procedures as TXT' to get a full list of procedure IDs to > bypass them. Then click 'Regions as TXT' to get a full list of encoded region > names to assign.* > !Screenshot from 2018-10-30 19-06-42.png! > *Some region names are covered by the navigator bar, I'll fix it later.* -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21445) CopyTable by bulkload will write hfile into yarn's HDFS
[ https://issues.apache.org/jira/browse/HBASE-21445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682724#comment-16682724 ] Zheng Hu commented on HBASE-21445: -- Got it, Thanks [~stack] . > CopyTable by bulkload will write hfile into yarn's HDFS > > > Key: HBASE-21445 > URL: https://issues.apache.org/jira/browse/HBASE-21445 > Project: HBase > Issue Type: Bug > Components: mapreduce >Reporter: Zheng Hu >Assignee: Zheng Hu >Priority: Major > Fix For: 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.9 > > Attachments: HBASE-21445.v1.patch > > > When using CopyTable with bulkload, I found that all hfile's are written in > our Yarn's HDFS cluster. and failed to load hfiles into HBase cluster, > because we use different HDFS between yarn cluster and hbase cluster. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-20952) Re-visit the WAL API
[ https://issues.apache.org/jira/browse/HBASE-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682723#comment-16682723 ] Hudson commented on HBASE-20952: Results for branch HBASE-20952 [build #46 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-20952/46/]: (x) *{color:red}-1 overall{color}* details (if available): (/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-20952/46//General_Nightly_Build_Report/] (x) {color:red}-1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-20952/46//JDK8_Nightly_Build_Report_(Hadoop2)/] (x) {color:red}-1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/HBASE-20952/46//JDK8_Nightly_Build_Report_(Hadoop3)/] (/) {color:green}+1 source release artifact{color} -- See build output for details. (/) {color:green}+1 client integration test{color} > Re-visit the WAL API > > > Key: HBASE-20952 > URL: https://issues.apache.org/jira/browse/HBASE-20952 > Project: HBase > Issue Type: Improvement > Components: wal > Reporter: Josh Elser > Priority: Major > Attachments: 20952.v1.txt > > > Take a step back from the current WAL implementations and think about what an > HBase WAL API should look like. What are the primitive calls that we require > to guarantee durability of writes with a high degree of performance? > The API needs to take the current implementations into consideration. We > should also have a mind for what is happening in the Ratis LogService (but > the LogService should not dictate what HBase's WAL API looks like RATIS-272). > Other "systems" inside of HBase that use WALs are replication and > backup. Replication has the use-case of "tail"ing the WAL, which we > should provide via our new API. Backup doesn't do anything fancy (IIRC). 
We > should make sure all consumers are generally going to be OK with the API we > create. > The API may be "OK" (or OK in part). We need to also consider other methods > which were "bolted" on, such as {{AbstractFSWAL}} and > {{WALFileLengthProvider}}. Other corners of "WAL use" (like > {{WALSplitter}}) should also be looked at to use WAL APIs only. > We also need to make sure that adequate interface audience and stability > annotations are chosen. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21461) Region CoProcessor for splitting large WAL entries in smaller batches, to handle situation when faulty ingestion had created too many mutations for same cell in single batch
[ https://issues.apache.org/jira/browse/HBASE-21461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682722#comment-16682722 ] stack commented on HBASE-21461: --- Let's do option #2. I can help. It's too early (or too late -- smile) for #3. I should try adding the patch here over in the tools repo? > Region CoProcessor for splitting large WAL entries in smaller batches, to > handle situation when faulty ingestion had created too many mutations for > same cell in single batch > - > > Key: HBASE-21461 > URL: https://issues.apache.org/jira/browse/HBASE-21461 > Project: HBase > Issue Type: New Feature > Components: hbase-operator-tools, Replication > Reporter: Wellington Chevreuil > Assignee: Wellington Chevreuil > Priority: Minor > Attachments: 0001-Initial-version-for-WAL-entry-splitter-CP.txt > > > With replication-enabled deployments, it's possible that faulty ingestion > clients may lead to a single WalEntry containing too many edits for the same cell. > This would cause *ReplicationSink*, in the target cluster, to attempt a single > batch mutation with too many operations, which in turn can lead to very large > RPC requests, which may not fit in the final target RS RPC queue. In this > case, the messages below are seen on the target RS trying to perform the sink: > {noformat} > WARN org.apache.hadoop.hbase.client.AsyncProcess: #690, table=TABLE_NAME, > attempt=4/4 failed=2ops, last exception: > org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.CallQueueTooBigException): > Call queue is full on /0.0.0.0:60020, is hbase.ipc.server.max.callqueue.size > too small? on regionserver01.example.com,60020,1524334173359, tracking > started Fri Sep 07 10:35:53 IST 2018; not retrying 2 - final failure > 2018-09-07 10:40:59,506 ERROR > org.apache.hadoop.hbase.replication.regionserver.ReplicationSink: Unable to > accept edit because: > org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 2 > actions: RemoteWithExtrasException: 2 times, > at > org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:247) > at > org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$1800(AsyncProcess.java:227) > at > org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.getErrors(AsyncProcess.java:1663) > at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:982) > at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:996){noformat} > When this problem manifests, replication will be stuck and WAL files will be > piling up in the source cluster's WALs/oldWALs folder. The typical workaround requires > manual cleanup of replication znodes in ZK, and manual WAL replay for the WAL > files containing the large entry. > This CP would handle the issue by checking for large WAL entries and > splitting them into smaller batches in the *preReplicateLogEntries* method > hook. > *Additional Note*: HBASE-18027 introduced some safeguards against such large RPC > requests, which may already help avoid this scenario. That is not available > for 1.2 releases, though, and this CP tool may still be relevant for 1.2 > clusters. It may also still be worth having to work around any potential > unknown large RPC issue scenarios. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
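The core batch-splitting operation this CP performs can be sketched generically. This is a hedged illustration, not the attached coprocessor: the class name is invented, and in the real CP the elements would be the WAL entry's mutations rather than arbitrary values.

```java
import java.util.ArrayList;
import java.util.List;

// Generic sketch: split one oversized edit list into sub-batches of at most
// maxPerBatch elements, so each replication sink RPC stays bounded.
public class WalEntrySplitterSketch {
  static <T> List<List<T>> split(List<T> edits, int maxPerBatch) {
    List<List<T>> batches = new ArrayList<>();
    for (int i = 0; i < edits.size(); i += maxPerBatch) {
      // Copy the subList view so each batch is independent of the source list.
      batches.add(new ArrayList<>(edits.subList(i, Math.min(i + maxPerBatch, edits.size()))));
    }
    return batches;
  }
}
```

Splitting 10 edits with `maxPerBatch = 4` yields batches of 4, 4, and 2 elements, each of which would then be applied as its own sink mutation.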
[jira] [Commented] (HBASE-21463) The checkOnlineRegionsReport can accidentally complete a TRSP
[ https://issues.apache.org/jira/browse/HBASE-21463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682721#comment-16682721 ] stack commented on HBASE-21463: --- My bad. > The checkOnlineRegionsReport can accidentally complete a TRSP > - > > Key: HBASE-21463 > URL: https://issues.apache.org/jira/browse/HBASE-21463 > Project: HBase > Issue Type: Sub-task > Components: amv2 > Reporter: Duo Zhang > Priority: Critical > Fix For: 3.0.0, 2.2.0 > > Attachments: HBASE-21463-UT.patch > > > On our testing cluster, we observed a race condition: > 1. A regionServerReport request is built > 2. A TRSP is scheduled to reopen the region > 3. The region is closed on the RS side > 4. The OpenRegionProcedure is created > 5. The regionServerReport generated at step 1 is executed, and we find that > the region is opened on the RS, so we update the region state to OPEN. > 6. The OpenRegionProcedure notices that the region is already in the > OPEN state, so it gives up and finishes. > 7. The TRSP finishes. > 8. The region is recorded as OPEN on the RS but actually is not, and cannot > recover unless we use HBCK2. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-20604) ProtobufLogReader#readNext can incorrectly loop to the same position in the stream until the WAL is rolled
[ https://issues.apache.org/jira/browse/HBASE-20604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682718#comment-16682718 ] stack commented on HBASE-20604: --- Understood [~apurtell]. New (sub-)issue sounds good. Just noting reviewer asked question back in May and it looked like it went unanswered. > ProtobufLogReader#readNext can incorrectly loop to the same position in the > stream until the WAL is rolled > -- > > Key: HBASE-20604 > URL: https://issues.apache.org/jira/browse/HBASE-20604 > Project: HBase > Issue Type: Bug > Components: Replication, wal > Affects Versions: 3.0.0 > Reporter: Esteban Gutierrez > Assignee: Esteban Gutierrez > Priority: Critical > Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.9 > > Attachments: HBASE-20604.002.patch, HBASE-20604.003.patch, > HBASE-20604.004.patch, HBASE-20604.005.patch, HBASE-20604.patch > > > Every time we call {{ProtobufLogReader#readNext}} we consume the input stream > associated to the {{FSDataInputStream}} from the WAL that we are reading. > Under certain conditions, e.g. when using encryption at rest > ({{CryptoInputStream}}), the stream can return partial data, which can cause a > premature EOF that causes {{inputStream.getPos()}} to return the same > original position, causing {{ProtobufLogReader#readNext}} to retry the reads > until the WAL is rolled. > The side effect of this issue is that {{ReplicationSource}} can get stuck > until the WAL is rolled, causing replication delays of up to an hour in some > cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21463) The checkOnlineRegionsReport can accidentally complete a TRSP
[ https://issues.apache.org/jira/browse/HBASE-21463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682678#comment-16682678 ] Duo Zhang commented on HBASE-21463: --- Yes, only a UT (I have 'UT' in the patch name...). Will provide a patch soon that, as said above, changes the behavior of checkOnlineRegions. And I believe the race could also happen for branch-2.1 and branch-2.0; HBASE-21421 is one possible problem. The issue described here may not be a problem, as the behavior of AssignProcedure/UnassignProcedure may be different from TRSP, but I'm afraid there could be other strange problems. > The checkOnlineRegionsReport can accidentally complete a TRSP > - > > Key: HBASE-21463 > URL: https://issues.apache.org/jira/browse/HBASE-21463 > Project: HBase > Issue Type: Sub-task > Components: amv2 > Reporter: Duo Zhang > Priority: Critical > Fix For: 3.0.0, 2.2.0 > > Attachments: HBASE-21463-UT.patch > > > On our testing cluster, we observed a race condition: > 1. A regionServerReport request is built > 2. A TRSP is scheduled to reopen the region > 3. The region is closed on the RS side > 4. The OpenRegionProcedure is created > 5. The regionServerReport generated at step 1 is executed, and we find that > the region is opened on the RS, so we update the region state to OPEN. > 6. The OpenRegionProcedure notices that the region is already in the > OPEN state, so it gives up and finishes. > 7. The TRSP finishes. > 8. The region is recorded as OPEN on the RS but actually is not, and cannot > recover unless we use HBCK2. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-20604) ProtobufLogReader#readNext can incorrectly loop to the same position in the stream until the WAL is rolled
[ https://issues.apache.org/jira/browse/HBASE-20604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682677#comment-16682677 ] Duo Zhang commented on HBASE-20604: --- I'm OK with opening a new issue to address the remaining problems, and I can take charge, but the problem is that no one has told me what the real problem is... And "no failing UT" is not a strong reason, as I believe the code we added here will not be executed in our existing UTs... The description is not very clear on what is going on. I would like to see a more detailed explanation, preferably pointing out the problematic code in CryptoInputStream. Is it the one in Hadoop or in Apache Commons? Is there an existing jira about it? Thanks. > ProtobufLogReader#readNext can incorrectly loop to the same position in the > stream until the WAL is rolled > -- > > Key: HBASE-20604 > URL: https://issues.apache.org/jira/browse/HBASE-20604 > Project: HBase > Issue Type: Bug > Components: Replication, wal > Affects Versions: 3.0.0 > Reporter: Esteban Gutierrez > Assignee: Esteban Gutierrez > Priority: Critical > Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.9 > > Attachments: HBASE-20604.002.patch, HBASE-20604.003.patch, > HBASE-20604.004.patch, HBASE-20604.005.patch, HBASE-20604.patch > > > Every time we call {{ProtobufLogReader#readNext}} we consume the input stream > associated to the {{FSDataInputStream}} from the WAL that we are reading. > Under certain conditions, e.g. when using encryption at rest > ({{CryptoInputStream}}), the stream can return partial data, which can cause a > premature EOF that causes {{inputStream.getPos()}} to return the same > original position, causing {{ProtobufLogReader#readNext}} to retry the reads > until the WAL is rolled. > The side effect of this issue is that {{ReplicationSource}} can get stuck > until the WAL is rolled, causing replication delays of up to an hour in some > cases. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
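The failure mode being debated (a short read from a decorating stream being mistaken for EOF) is the classic `InputStream.read` contract pitfall: `read(buf, off, len)` may return fewer than `len` bytes without being at end of stream. A hedged sketch of the defensive pattern, with a demo stream that deliberately returns one byte per call (all names here are illustrative; this is not the patch under review):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch of a read loop that treats a short read as "more to come" rather
// than EOF, so partial reads from a wrapper like CryptoInputStream do not
// look like a premature end of file.
public class PartialReadSketch {
  /** Reads until len bytes arrive or the stream truly ends; returns bytes read. */
  static int readFully(InputStream in, byte[] buf, int off, int len) throws IOException {
    int total = 0;
    while (total < len) {
      int n = in.read(buf, off + total, len - total);
      if (n < 0) {
        break; // genuine end of stream
      }
      total += n;
    }
    return total;
  }

  /** Demo against a stream that hands out at most one byte per call. */
  static int demo(int size) {
    InputStream oneByteAtATime = new InputStream() {
      private final InputStream inner = new ByteArrayInputStream(new byte[size]);
      @Override public int read() throws IOException { return inner.read(); }
      @Override public int read(byte[] b, int off, int len) throws IOException {
        return inner.read(b, off, Math.min(1, len)); // simulate partial reads
      }
    };
    try {
      return readFully(oneByteAtATime, new byte[size], 0, size);
    } catch (IOException e) {
      return -1;
    }
  }
}
```

A naive single `read()` against the demo stream would see only 1 of the requested bytes; the loop recovers all of them.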
[jira] [Commented] (HBASE-21376) Add some verbose log to MasterProcedureScheduler
[ https://issues.apache.org/jira/browse/HBASE-21376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682658#comment-16682658 ] Hudson commented on HBASE-21376: Results for branch branch-2 [build #1495 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1495/]: (x) *{color:red}-1 overall{color}* details (if available): (/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1495//General_Nightly_Build_Report/] (x) {color:red}-1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1495//JDK8_Nightly_Build_Report_(Hadoop2)/] (/) {color:green}+1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1495//JDK8_Nightly_Build_Report_(Hadoop3)/] (/) {color:green}+1 source release artifact{color} -- See build output for details. (/) {color:green}+1 client integration test{color} > Add some verbose log to MasterProcedureScheduler > > > Key: HBASE-21376 > URL: https://issues.apache.org/jira/browse/HBASE-21376 > Project: HBase > Issue Type: Sub-task > Components: logging, proc-v2 >Reporter: Allan Yang >Assignee: Duo Zhang >Priority: Major > Fix For: 3.0.0, 2.2.0, 2.0.3, 2.1.2 > > Attachments: HBASE-21376.branch-2.0.001.patch, > HBASE-21376.branch-2.0.001.patch, HBASE-21376.patch, HBASE-21376.patch > > > As discussed in HBASE-21364, we divided the patch in HBASE-21364 to two, the > critical one is already submitted in HBASE-21364 to branch-2.0 and > branch-2.1, but I also added some useful logs which need to commit to all > branches. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files
[ https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682644#comment-16682644 ] Ted Yu commented on HBASE-21387: One more note about why I chose 21387.v9.txt as the version for review: priority is given to taking snapshots versus (delaying) cleaning snapshot files. This is because a failed snapshot has higher visibility compared to delayed snapshot cleaning. > Race condition surrounding in progress snapshot handling in snapshot cache > leads to loss of snapshot files > -- > > Key: HBASE-21387 > URL: https://issues.apache.org/jira/browse/HBASE-21387 > Project: HBase > Issue Type: Bug > Reporter: Ted Yu > Assignee: Ted Yu > Priority: Major > Labels: snapshot > Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, > 21387.v7.txt, 21387.v8.txt, 21387.v9.txt, two-pass-cleaner.v4.txt, > two-pass-cleaner.v6.txt, two-pass-cleaner.v9.txt > > > During a recent customer report where ExportSnapshot failed: > {code} > 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] > snapshot.SnapshotReferenceUtil: Can't find hfile: > 44f6c3c646e84de6a63fe30da4fcb3aa in the real > (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > or archive > (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa) > directory for the primary table. > {code} > We found the following in the log: > {code} > 2018-10-09 18:54:23,675 DEBUG > [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] > cleaner.HFileCleaner: Removing: > hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa > from archive > {code} > The root cause is a race condition surrounding in-progress snapshot handling > between refreshCache() and getUnreferencedFiles(). > There are two callers of refreshCache: one from RefreshCacheTask#run and the > other from SnapshotHFileCleaner. 
> Let's look at the code of refreshCache: > {code} > if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) { > {code} > whose intention is to exclude in-progress snapshot(s). > Suppose that when the RefreshCacheTask runs refreshCache, there is some in-progress > snapshot (about to finish). > When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that > lastModifiedTime is up to date, so the cleaner proceeds to check in-progress > snapshot(s). However, the snapshot has completed by that time, resulting in > some file(s) being deemed unreferenced. > Here is a timeline given by Josh illustrating the scenario: > At time T0, we are checking if F1 is referenced. At time T1, there is a > snapshot S1 in progress that is referencing a file F1. refreshCache() is > called, but no completed snapshot references F1. At T2, the snapshot S1, > which references F1, completes. At T3, we check in-progress snapshots and S1 > is not included. Thus, F1 is marked as unreferenced even though S1 references > it. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
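The T0..T3 timeline above suggests one defensive shape for the cleaner's decision: re-check the completed-snapshot set *after* scanning in-progress snapshots, so a snapshot that completes inside the window is still seen. A hypothetical sketch of that decision logic only (not the actual SnapshotHFileCleaner code; the sets stand in for snapshot reference scans):

```java
import java.util.Set;

// Sketch of a two-pass unreferenced-file check: a file is deleted only if
// a second look at completed snapshots, taken after the in-progress scan,
// still does not reference it.
public class SnapshotRefSketch {
  static boolean isUnreferenced(String file,
                                Set<String> completedRefs,       // pass 1 (time T1)
                                Set<String> inProgressRefs,      // in-progress scan (T3)
                                Set<String> completedRefsRecheck // pass 2, after T3
                                ) {
    if (completedRefs.contains(file) || inProgressRefs.contains(file)) {
      return false;
    }
    // S1 may have completed between the two scans (the T1..T3 window),
    // so re-check the completed set before declaring the file garbage.
    return !completedRefsRecheck.contains(file);
  }
}
```

In the Josh scenario, F1 is absent from both the first completed scan and the in-progress scan, but the re-check sees the now-completed S1 and keeps F1.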
[jira] [Commented] (HBASE-21246) Introduce WALIdentity interface
[ https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682638#comment-16682638 ] Ted Yu commented on HBASE-21246: Currently there are about 69 failing test classes. Working through these failed tests. > Introduce WALIdentity interface > --- > > Key: HBASE-21246 > URL: https://issues.apache.org/jira/browse/HBASE-21246 > Project: HBase > Issue Type: Sub-task >Reporter: Ted Yu >Assignee: Ted Yu >Priority: Major > Fix For: HBASE-20952 > > Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, > 21246.23.txt, 21246.24.txt, 21246.25.txt, 21246.26.txt, 21246.34.txt, > 21246.HBASE-20952.001.patch, 21246.HBASE-20952.002.patch, > 21246.HBASE-20952.004.patch, 21246.HBASE-20952.005.patch, > 21246.HBASE-20952.007.patch, 21246.HBASE-20952.008.patch, > replication-src-creates-wal-reader.jpg, wal-factory-providers.png, > wal-providers.png, wal-splitter-reader.jpg, wal-splitter-writer.jpg > > > We are introducing WALIdentity interface so that the WAL representation can > be decoupled from distributed filesystem. > The interface provides getName method whose return value can represent > filename in distributed filesystem environment or, the name of the stream > when the WAL is backed by log stream. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-20604) ProtobufLogReader#readNext can incorrectly loop to the same position in the stream until the WAL is rolled
[ https://issues.apache.org/jira/browse/HBASE-20604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682571#comment-16682571 ] Andrew Purtell commented on HBASE-20604: I have no strong opinion, could go either way. Certainly better late than never for more review. However it's not great for a contributor to have us complete a review (and it was completed, see above), then have a commit after due testing (which we also have), only to see the issue reopened. I argue this is suboptimal process. A better alternative is a new Jira, and the reviewer who volunteered more suggestions can be given the option to perform the work on the new jira, or maybe Esteban would be interested. For the sake of predictability in our process. > ProtobufLogReader#readNext can incorrectly loop to the same position in the > stream until the WAL is rolled > -- > > Key: HBASE-20604 > URL: https://issues.apache.org/jira/browse/HBASE-20604 > Project: HBase > Issue Type: Bug > Components: Replication, wal > Affects Versions: 3.0.0 > Reporter: Esteban Gutierrez > Assignee: Esteban Gutierrez > Priority: Critical > Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.9 > > Attachments: HBASE-20604.002.patch, HBASE-20604.003.patch, > HBASE-20604.004.patch, HBASE-20604.005.patch, HBASE-20604.patch > > > Every time we call {{ProtobufLogReader#readNext}} we consume the input stream > associated to the {{FSDataInputStream}} from the WAL that we are reading. > Under certain conditions, e.g. when using encryption at rest > ({{CryptoInputStream}}), the stream can return partial data, which can cause a > premature EOF that causes {{inputStream.getPos()}} to return the same > original position, causing {{ProtobufLogReader#readNext}} to retry the reads > until the WAL is rolled. > The side effect of this issue is that {{ReplicationSource}} can get stuck > until the WAL is rolled, causing replication delays of up to an hour in some > cases. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
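The partial-read failure mode described above can be illustrated with a self-contained sketch (this is plain JDK Java, not the actual ProtobufLogReader or CryptoInputStream code; `TrickleStream` and `readFully` are hypothetical names for illustration): a stream that returns fewer bytes per `read()` than requested, and a defensive loop that keeps reading instead of treating a short read as EOF.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// A stream that, like CryptoInputStream under some conditions, returns
// fewer bytes per read() than requested. Naive callers that treat a short
// read as EOF see a "premature EOF" even though more data is available.
class TrickleStream extends InputStream {
    private final InputStream in;
    TrickleStream(byte[] data) { this.in = new ByteArrayInputStream(data); }
    @Override public int read() throws IOException { return in.read(); }
    @Override public int read(byte[] b, int off, int len) throws IOException {
        // Return at most 2 bytes per call, regardless of len.
        return in.read(b, off, Math.min(len, 2));
    }
}

public class ReadFullyDemo {
    // Defensive readFully: loop until the requested count arrives or the
    // stream is truly exhausted, instead of giving up on a short read.
    static int readFully(InputStream in, byte[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            int n = in.read(buf, total, buf.length - total);
            if (n < 0) break; // genuine EOF
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = new byte[10];
        for (int i = 0; i < payload.length; i++) payload[i] = (byte) i;
        byte[] buf = new byte[10];
        // Fills the full 10 bytes even though each read() returned <= 2.
        System.out.println(readFully(new TrickleStream(payload), buf));
    }
}
```

The key point is that `InputStream.read(byte[], int, int)` is contractually allowed to return fewer bytes than requested without being at EOF; only a return value of -1 means end of stream.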
[jira] [Commented] (HBASE-21461) Region CoProcessor for splitting large WAL entries in smaller batches, to handle situation when faulty ingestion had created too many mutations for same cell in single
[ https://issues.apache.org/jira/browse/HBASE-21461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682563#comment-16682563 ] Wellington Chevreuil commented on HBASE-21461: -- {quote}It could be here on this JIRA w/ instructions on how to build. Might be ok given limited audience... but wouldn't encourage confidence in the 'hosed' operator. {quote} Maybe have a built jar available here with install instructions. I guess requiring a whole build environment setup would be too discouraging for admins/operators. {quote}Or it'd be in tools repo... Your plan for a replication submodule sounds good. In it would be a submodule for this cp ... setting the jdk7 compile target and having dependency on branch-1. {quote} The benefit of this approach is that we start tidying up the house and putting most of the support/operations "hacks" on their own shelves. BTW, should we have another jira/thread to discuss what else could be moved to the "/operator-tools/replication" submodule (assuming there's none yet)? {quote}Or, we start the cp 'store' repo... where we start putting cps. (smile). {quote} That could be another way to organise extra tools/features. Are there other CPs planned to be moved out of the hbase main project? > Region CoProcessor for splitting large WAL entries in smaller batches, to > handle situation when faulty ingestion had created too many mutations for > same cell in single batch > - > > Key: HBASE-21461 > URL: https://issues.apache.org/jira/browse/HBASE-21461 > Project: HBase > Issue Type: New Feature > Components: hbase-operator-tools, Replication >Reporter: Wellington Chevreuil >Assignee: Wellington Chevreuil >Priority: Minor > Attachments: 0001-Initial-version-for-WAL-entry-splitter-CP.txt > > > With replication-enabled deployments, it's possible that faulty ingestion > clients may lead to a single WalEntry containing too many edits for the same cell. 
> This would cause *ReplicationSink*, in the target cluster, to attempt a single > batch mutation with too many operations, which in turn can lead to very large > RPC requests that may not fit in the final target RS rpc queue. In this > case, the messages below are seen on the target RS trying to perform the sink: > {noformat} > WARN org.apache.hadoop.hbase.client.AsyncProcess: #690, table=TABLE_NAME, > attempt=4/4 failed=2ops, last exception: > org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.CallQueueTooBigException): > Call queue is full on /0.0.0.0:60020, is hbase.ipc.server.max.callqueue.size > too small? on regionserver01.example.com,60020,1524334173359, tracking > started Fri Sep 07 10:35:53 IST 2018; not retrying 2 - final failure > 2018-09-07 10:40:59,506 ERROR > org.apache.hadoop.hbase.replication.regionserver.ReplicationSink: Unable to > accept edit because: > org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 2 > actions: RemoteWithExtrasException: 2 times, > at > org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:247) > at > org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$1800(AsyncProcess.java:227) > at > org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.getErrors(AsyncProcess.java:1663) > at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:982) > at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:996){noformat} > When this problem manifests, replication will be stuck and wal files will be > piling up in the source cluster's WALs/oldWALs folder. The typical workaround requires > manual cleanup of replication znodes in ZK, and manual WAL replay for the WAL > files containing the large entry. > This CP would handle the issue by checking for large wal entries and > splitting those into smaller batches in the *reReplicateLogEntries* method > hook. 
> *Additional Note*: HBASE-18027 introduced some safeguards against such large RPC > requests, which may already help avoid such a scenario. That is not available > for 1.2 releases, though, and this CP tool may still be relevant for 1.2 > clusters. It may also still be worth having as a workaround for any potential > unknown large-RPC issue scenarios. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
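The batching idea behind the proposed CP can be sketched in plain Java (this is not the attached coprocessor code; the real hook, *reReplicateLogEntries*, operates on WAL entries, and the element type here is just a stand-in): instead of submitting one huge batch that can blow past the sink RS call-queue limit, split it into bounded chunks.

```java
import java.util.ArrayList;
import java.util.List;

// Generic partitioner: split one oversized list of operations into
// sub-batches of at most maxPerBatch elements each, preserving order.
public class BatchSplitter {
    static <T> List<List<T>> split(List<T> ops, int maxPerBatch) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < ops.size(); i += maxPerBatch) {
            // Copy the sub-range so each batch is independent of the source list.
            batches.add(new ArrayList<>(ops.subList(i, Math.min(i + maxPerBatch, ops.size()))));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> ops = new ArrayList<>();
        for (int i = 0; i < 2500; i++) ops.add(i);
        List<List<Integer>> batches = split(ops, 1000);
        System.out.println(batches.size());        // 3
        System.out.println(batches.get(2).size()); // 500
    }
}
```

Each sub-batch would then be applied as its own mutation call, keeping every RPC under the sink's call-queue size limit.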
[jira] [Commented] (HBASE-20604) ProtobufLogReader#readNext can incorrectly loop to the same position in the stream until the WAL is rolled
[ https://issues.apache.org/jira/browse/HBASE-20604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682561#comment-16682561 ] stack commented on HBASE-20604: --- [~esteban] See above. Should we reopen this issue given the outstanding review? > ProtobufLogReader#readNext can incorrectly loop to the same position in the > stream until the WAL is rolled > -- > > Key: HBASE-20604 > URL: https://issues.apache.org/jira/browse/HBASE-20604 > Project: HBase > Issue Type: Bug > Components: Replication, wal >Affects Versions: 3.0.0 >Reporter: Esteban Gutierrez >Assignee: Esteban Gutierrez >Priority: Critical > Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.9 > > Attachments: HBASE-20604.002.patch, HBASE-20604.003.patch, > HBASE-20604.004.patch, HBASE-20604.005.patch, HBASE-20604.patch > > > Every time we call {{ProtobufLogReader#readNext}} we consume the input stream > associated with the {{FSDataInputStream}} from the WAL that we are reading. > Under certain conditions, e.g. when using encryption at rest > ({{CryptoInputStream}}), the stream can return partial data, which can cause a > premature EOF that causes {{inputStream.getPos()}} to return the same > original position, causing {{ProtobufLogReader#readNext}} to retry the > reads until the WAL is rolled. > The side effect of this issue is that {{ReplicationSource}} can get stuck > until the WAL is rolled, causing replication delays of up to an hour in some > cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21454) Kill zk spew
[ https://issues.apache.org/jira/browse/HBASE-21454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682559#comment-16682559 ] stack commented on HBASE-21454: --- I would like us to take control of what gets logged and when. For too long, zk has been over-sharing at INFO level anytime we start an hbase process. I like the [~busbey] suggestion that what is here is too radical and that we need to add back some of what zk was doing, but on our terms and from our classes rather than whenever zk wants to. Let me make a v2. > Kill zk spew > > > Key: HBASE-21454 > URL: https://issues.apache.org/jira/browse/HBASE-21454 > Project: HBase > Issue Type: Bug > Components: logging, Zookeeper >Reporter: stack >Assignee: stack >Priority: Major > Fix For: 3.0.0, 2.2.0 > > Attachments: HBASE-21454.master.001.patch > > > Kill the zk spew. This is radical: it drops the startup listing of CLASSPATH and > all properties. We can dial back in what we need after this patch goes in. > I get spew each time I run a little command in spark-shell. Annoying. Always > been annoying in all logs. > More might be needed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
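Independent of what the patch itself does, one conventional way operators tame this today is via the logging config. A minimal sketch, assuming the log4j 1.x properties format that HBase branches of this era ship with (exact logger names and thresholds are a judgment call, not part of the patch):

```properties
# Quiet ZooKeeper's startup/environment spew: only warnings and errors.
log4j.logger.org.apache.zookeeper=WARN
# The client-side env listing comes from ZooKeeper's Environment logger too;
# the hbase zookeeper glue can stay at INFO if its messages are wanted.
log4j.logger.org.apache.hadoop.hbase.zookeeper=INFO
```

The comment's point stands, though: relying on every operator to tune log4j is exactly the "when zk wants to" problem; logging what matters from HBase's own classes is the more predictable fix.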
[jira] [Commented] (HBASE-21445) CopyTable by bulkload will write hfile into yarn's HDFS
[ https://issues.apache.org/jira/browse/HBASE-21445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682555#comment-16682555 ] stack commented on HBASE-21445: --- +1 on the patch. In future, consider leaving out reformatting changes like those in this patch... They seem arbitrary... and bulk up what could have been a two-liner. Otherwise, very nice patch. Thanks [~openinx] > CopyTable by bulkload will write hfile into yarn's HDFS > > > Key: HBASE-21445 > URL: https://issues.apache.org/jira/browse/HBASE-21445 > Project: HBase > Issue Type: Bug > Components: mapreduce >Reporter: Zheng Hu >Assignee: Zheng Hu >Priority: Major > Fix For: 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.9 > > Attachments: HBASE-21445.v1.patch > > > When using CopyTable with bulkload, I found that all hfiles are written to > our YARN cluster's HDFS, and loading the hfiles into the HBase cluster failed, > because we use different HDFS clusters for yarn and hbase. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
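The underlying pitfall can be illustrated without Hadoop at all (this is not the actual patch, which works with Hadoop `Path`/`FileSystem` objects; `qualify` and the cluster URIs here are hypothetical): a relative output path gets resolved against the job's default filesystem, which in a split deployment is the YARN cluster's HDFS rather than HBase's, so the hfiles must be qualified against the HBase filesystem explicitly.

```java
import java.net.URI;

// Model of path qualification: resolve the bulkload output dir against the
// HBase root dir's filesystem when one is known, else fall back to the
// job's default filesystem (the YARN cluster's HDFS).
public class QualifyPath {
    static URI qualify(URI defaultFs, URI hbaseRootDir, String outputDir) {
        URI base = (hbaseRootDir != null) ? hbaseRootDir : defaultFs;
        return base.resolve(outputDir);
    }

    public static void main(String[] args) {
        URI yarnFs = URI.create("hdfs://yarn-cluster:8020/");
        URI hbaseRoot = URI.create("hdfs://hbase-cluster:8020/hbase/");
        // Unqualified: the hfiles land on the YARN cluster's HDFS (the bug).
        System.out.println(qualify(yarnFs, null, "bulkload-out"));
        // Qualified: the hfiles land where the HBase cluster can load them.
        System.out.println(qualify(yarnFs, hbaseRoot, "bulkload-out"));
    }
}
```

With a single shared HDFS the two resolutions coincide, which is why the bug only bites deployments that split storage between the YARN and HBase clusters.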
[jira] [Commented] (HBASE-21463) The checkOnlineRegionsReport can accidentally complete a TRSP
[ https://issues.apache.org/jira/browse/HBASE-21463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682554#comment-16682554 ] stack commented on HBASE-21463: --- Is there a change in the patch? I see making a method visible for testing, a reformat of a log message, and a nice looking UT but where is the fix? > The checkOnlineRegionsReport can accidentally complete a TRSP > - > > Key: HBASE-21463 > URL: https://issues.apache.org/jira/browse/HBASE-21463 > Project: HBase > Issue Type: Sub-task > Components: amv2 >Reporter: Duo Zhang >Priority: Critical > Fix For: 3.0.0, 2.2.0 > > Attachments: HBASE-21463-UT.patch > > > On our testing cluster, we observe a race condition: > 1. A regionServerReport request is built > 2. A TRSP is scheduled to reopen the region > 3. The region is closed at RS side > 4. The OpenRegionProcedure is created > 5. The regionServerReport generated at step 1 is executed, and we find that > the region is opened on the RS, so we update the region state to OPEN. > 6. The OpenRegionProcedure notices that the region has already been in the > OPEN state so gives up and finishes. > 7. The TRSP finishes. > 8. The region is recorded as OPEN on the RS but actually not, and can not > recover unless we use HBCK2. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
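The guard the fix needs can be modeled in a few lines (this is a toy model, not the AssignmentManager API; `RegionStateNode`, `attachedProcedure`, and `applyReport` are illustrative names): a stale regionServerReport must never flip a region's state while a TransitRegionStateProcedure owns the region state node, only report the inconsistency.

```java
// Minimal model of the race: a report built before the TRSP was scheduled
// (step 1) arrives while the TRSP is in flight (step 5) and would otherwise
// mark the region OPEN under the procedure's feet.
public class ReportGuard {
    enum State { OPEN, CLOSED, OPENING }

    static class RegionStateNode {
        State state = State.CLOSED;
        Object attachedProcedure; // non-null while a TRSP owns the region
    }

    // Apply a server-reported state; returns true only if it was applied.
    static boolean applyReport(RegionStateNode node, State reported) {
        if (node.attachedProcedure != null) {
            // A TRSP is in flight: report the inconsistency elsewhere, but
            // never mutate the state out from under the procedure.
            return false;
        }
        node.state = reported;
        return true;
    }

    public static void main(String[] args) {
        RegionStateNode node = new RegionStateNode();
        node.attachedProcedure = new Object();          // TRSP scheduled (step 2)
        boolean applied = applyReport(node, State.OPEN); // stale report (step 5)
        System.out.println(applied);    // false: report rejected
        System.out.println(node.state); // CLOSED: untouched by the stale report
    }
}
```

This matches the direction Duo proposes later in the thread: make checkOnlineRegions report-only rather than self-healing, so steps 6 and 7 of the race can never be triggered by a stale report.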
[jira] [Commented] (HBASE-21440) Assign procedure on the crashed server is not properly interrupted
[ https://issues.apache.org/jira/browse/HBASE-21440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682544#comment-16682544 ] Josh Elser commented on HBASE-21440: Ah, I think I see what the issue is. The change to AP#remoteCallFailed to return false would leave the AP suspended, whereas before it would wake up the procedure again (see UP#remoteCallFailed). Might need to rework how we propagate the success/failure back up the stack. > Assign procedure on the crashed server is not properly interrupted > -- > > Key: HBASE-21440 > URL: https://issues.apache.org/jira/browse/HBASE-21440 > Project: HBase > Issue Type: Bug >Affects Versions: 2.0.2 >Reporter: Ankit Singhal >Assignee: Ankit Singhal >Priority: Major > Attachments: HBASE-21440.branch-2.0.001.patch, > HBASE-21440.branch-2.0.002.patch, HBASE-21440.branch-2.0.003.patch > > > When the server crashes, its SCP checks if there is already a procedure > assigning the region on this crashed server. If it finds one, SCP will just > interrupt the already running AssignProcedure by calling remoteCallFailed, > which internally just changes the region node state to OFFLINE and sends the > procedure back with transition queue state for assignment with a new plan. > But, due to the race condition between the calling of remoteCallFailed > and the current state of the already running assign > procedure (REGION_TRANSITION_FINISH: where the region is already opened), it > is possible that the assign procedure goes ahead in updating the regionStateNode > to OPEN on a crashed server. > As SCP had already skipped this region for assignment, relying on the > existing assign procedure to do the right thing, this whole confusion leaves the > region in an inaccessible state. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
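The suspended-procedure pitfall Josh describes can be sketched as a toy (names are illustrative, not procedure-v2 internals): a remote-failure callback whose `false` return means "leave me suspended", so unless something up the stack explicitly wakes the procedure, it hangs forever.

```java
// Toy model: the callback's boolean return is a contract about who is
// responsible for rescheduling the procedure after a remote call fails.
public class WakeOnFailure {
    static class Proc {
        boolean suspended = true;
        void wake() { suspended = false; }
        // Returns true if this callback already rescheduled the procedure;
        // false means the caller now owns the wake-up.
        boolean remoteCallFailed(boolean wakeSelf) {
            if (wakeSelf) { wake(); return true; }
            return false;
        }
    }

    public static void main(String[] args) {
        Proc p = new Proc();
        boolean handled = p.remoteCallFailed(false);
        if (!handled) {
            p.wake(); // without this explicit wake-up, the proc stays suspended
        }
        System.out.println(p.suspended); // false
    }
}
```

The bug pattern is forgetting the `if (!handled)` branch: the old code path always woke the procedure, so callers never had to.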
[jira] [Commented] (HBASE-21461) Region CoProcessor for splitting large WAL entries in smaller batches, to handle situation when faulty ingestion had created too many mutations for same cell in single
[ https://issues.apache.org/jira/browse/HBASE-21461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682538#comment-16682538 ] stack commented on HBASE-21461: --- It could be here on this JIRA w/ instructions on how to build. Might be ok given limited audience... but wouldn't encourage confidence in the 'hosed' operator. Or it'd be in tools repo... Your plan for a replication submodule sounds good. In it would be a submodule for this cp ... setting the jdk7 compile target and having dependency on branch-1. Or, we start the cp 'store' repo... where we start putting cps. (smile). > Region CoProcessor for splitting large WAL entries in smaller batches, to > handle situation when faulty ingestion had created too many mutations for > same cell in single batch > - > > Key: HBASE-21461 > URL: https://issues.apache.org/jira/browse/HBASE-21461 > Project: HBase > Issue Type: New Feature > Components: hbase-operator-tools, Replication >Reporter: Wellington Chevreuil >Assignee: Wellington Chevreuil >Priority: Minor > Attachments: 0001-Initial-version-for-WAL-entry-splitter-CP.txt > > > With replication-enabled deployments, it's possible that faulty ingestion > clients may lead to a single WalEntry containing too many edits for the same cell. > This would cause *ReplicationSink*, in the target cluster, to attempt a single > batch mutation with too many operations, which in turn can lead to very large > RPC requests that may not fit in the final target RS rpc queue. In this > case, the messages below are seen on the target RS trying to perform the sink: > {noformat} > WARN org.apache.hadoop.hbase.client.AsyncProcess: #690, table=TABLE_NAME, > attempt=4/4 failed=2ops, last exception: > org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.CallQueueTooBigException): > Call queue is full on /0.0.0.0:60020, is hbase.ipc.server.max.callqueue.size > too small? 
on regionserver01.example.com,60020,1524334173359, tracking > started Fri Sep 07 10:35:53 IST 2018; not retrying 2 - final failure > 2018-09-07 10:40:59,506 ERROR > org.apache.hadoop.hbase.replication.regionserver.ReplicationSink: Unable to > accept edit because: > org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 2 > actions: RemoteWithExtrasException: 2 times, > at > org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:247) > at > org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$1800(AsyncProcess.java:227) > at > org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.getErrors(AsyncProcess.java:1663) > at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:982) > at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:996){noformat} > When this problem manifests, replication will be stuck and wal files will be > piling up in the source cluster's WALs/oldWALs folder. The typical workaround requires > manual cleanup of replication znodes in ZK, and manual WAL replay for the WAL > files containing the large entry. > This CP would handle the issue by checking for large wal entries and > splitting those into smaller batches in the *reReplicateLogEntries* method > hook. > *Additional Note*: HBASE-18027 introduced some safeguards against such large RPC > requests, which may already help avoid such a scenario. That is not available > for 1.2 releases, though, and this CP tool may still be relevant for 1.2 > clusters. It may also still be worth having as a workaround for any potential > unknown large-RPC issue scenarios. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21440) Assign procedure on the crashed server is not properly interrupted
[ https://issues.apache.org/jira/browse/HBASE-21440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682532#comment-16682532 ] Josh Elser commented on HBASE-21440: Let me try to put up a v4 :) > Assign procedure on the crashed server is not properly interrupted > -- > > Key: HBASE-21440 > URL: https://issues.apache.org/jira/browse/HBASE-21440 > Project: HBase > Issue Type: Bug >Affects Versions: 2.0.2 >Reporter: Ankit Singhal >Assignee: Ankit Singhal >Priority: Major > Attachments: HBASE-21440.branch-2.0.001.patch, > HBASE-21440.branch-2.0.002.patch, HBASE-21440.branch-2.0.003.patch > > > When the server crashes, it's SCP checks if there is already a procedure > assigning the region on this crashed server. If we found one, SCP will just > interrupt the already running AssignProcedure by calling remoteCallFailed > which internally just changes the region node state to OFFLINE and send the > procedure back with transition queue state for assignment with a new plan. > But, due to the race condition between the calling of the remoteCallFailed > and current state of the already running assign > procedure(REGION_TRANSITION_FINISH: where the region is already opened), it > is possible that assign procedure goes ahead in updating the regionStateNode > to OPEN on a crashed server. > As SCP had already skipped this region for assignment as it was relying on > existing assign procedure to do the right thing, this whole confusion leads > region to a not accessible state. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (HBASE-21457) BackupUtils#getWALFilesOlderThan refers to wrong FileSystem
[ https://issues.apache.org/jira/browse/HBASE-21457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682531#comment-16682531 ] Josh Elser edited comment on HBASE-21457 at 11/10/18 6:10 PM: -- {quote}Looking for where the timeout should be increased. {quote} If it's not explicit, maybe it's inherited from the test-categorization? {quote} 200 seconds were really long. Though I don't see meaningful exception in test output related to master initialization. {quote} Other B tests seem to have run for over 600s. Is an exception not being logged correctly? Is there some timeout happening within hadoop mini clusters? There doesn't seem to be an obvious reason in HBase as to where this timeout is coming from, but you have the tools to dig in to figure out why this happens. was (Author: elserj): {quote}Looking for where the timeout should be increased. {quote} If it's not explicit, maybe it's inherited from the test-categorization? {quote} 200 seconds were really long. Though I don't see meaningful exception in test output related to master initialization. {quote} Other B tests seem to have run for over 600s. Is an exception not being logged correctly? Is there some timeout happening within hadoop mini clusters? > BackupUtils#getWALFilesOlderThan refers to wrong FileSystem > --- > > Key: HBASE-21457 > URL: https://issues.apache.org/jira/browse/HBASE-21457 > Project: HBase > Issue Type: Bug >Affects Versions: 3.0.0 >Reporter: Janos Gub >Assignee: Ted Yu >Priority: Major > Attachments: 21457.v1.txt, 21457.v2.txt, 21457.v3.txt, 21457.v3.txt > > > Janos reported seeing backup test failure when testing a local HDFS for WALs > while using WASB/ADLS only for store files. > Janos spotted the code in BackupUtils#getWALFilesOlderThan which uses HBase > root dir for retrieving WAL files. > We should use the helper methods from CommonFSUtils. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21457) BackupUtils#getWALFilesOlderThan refers to wrong FileSystem
[ https://issues.apache.org/jira/browse/HBASE-21457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682531#comment-16682531 ] Josh Elser commented on HBASE-21457: {quote}Looking for where the timeout should be increased. {quote} If it's not explicit, maybe it's inherited from the test-categorization? {quote} 200 seconds were really long. Though I don't see meaningful exception in test output related to master initialization. {quote} Other B tests seem to have run for over 600s. Is an exception not being logged correctly? Is there some timeout happening within hadoop mini clusters? > BackupUtils#getWALFilesOlderThan refers to wrong FileSystem > --- > > Key: HBASE-21457 > URL: https://issues.apache.org/jira/browse/HBASE-21457 > Project: HBase > Issue Type: Bug >Affects Versions: 3.0.0 >Reporter: Janos Gub >Assignee: Ted Yu >Priority: Major > Attachments: 21457.v1.txt, 21457.v2.txt, 21457.v3.txt, 21457.v3.txt > > > Janos reported seeing backup test failure when testing a local HDFS for WALs > while using WASB/ADLS only for store files. > Janos spotted the code in BackupUtils#getWALFilesOlderThan which uses HBase > root dir for retrieving WAL files. > We should use the helper methods from CommonFSUtils. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21376) Add some verbose log to MasterProcedureScheduler
[ https://issues.apache.org/jira/browse/HBASE-21376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682510#comment-16682510 ] Hudson commented on HBASE-21376: Results for branch branch-2.0 [build #1073 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1073/]: (/) *{color:green}+1 overall{color}* details (if available): (/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1073//General_Nightly_Build_Report/] (/) {color:green}+1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1073//JDK8_Nightly_Build_Report_(Hadoop2)/] (/) {color:green}+1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1073//JDK8_Nightly_Build_Report_(Hadoop3)/] (/) {color:green}+1 source release artifact{color} -- See build output for details. > Add some verbose log to MasterProcedureScheduler > > > Key: HBASE-21376 > URL: https://issues.apache.org/jira/browse/HBASE-21376 > Project: HBase > Issue Type: Sub-task > Components: logging, proc-v2 >Reporter: Allan Yang >Assignee: Duo Zhang >Priority: Major > Fix For: 3.0.0, 2.2.0, 2.0.3, 2.1.2 > > Attachments: HBASE-21376.branch-2.0.001.patch, > HBASE-21376.branch-2.0.001.patch, HBASE-21376.patch, HBASE-21376.patch > > > As discussed in HBASE-21364, we divided the patch in HBASE-21364 to two, the > critical one is already submitted in HBASE-21364 to branch-2.0 and > branch-2.1, but I also added some useful logs which need to commit to all > branches. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21376) Add some verbose log to MasterProcedureScheduler
[ https://issues.apache.org/jira/browse/HBASE-21376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682494#comment-16682494 ] Hudson commented on HBASE-21376: Results for branch branch-2.1 [build #594 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/594/]: (/) *{color:green}+1 overall{color}* details (if available): (/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/594//General_Nightly_Build_Report/] (/) {color:green}+1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/594//JDK8_Nightly_Build_Report_(Hadoop2)/] (/) {color:green}+1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/594//JDK8_Nightly_Build_Report_(Hadoop3)/] (/) {color:green}+1 source release artifact{color} -- See build output for details. (/) {color:green}+1 client integration test{color} > Add some verbose log to MasterProcedureScheduler > > > Key: HBASE-21376 > URL: https://issues.apache.org/jira/browse/HBASE-21376 > Project: HBase > Issue Type: Sub-task > Components: logging, proc-v2 >Reporter: Allan Yang >Assignee: Duo Zhang >Priority: Major > Fix For: 3.0.0, 2.2.0, 2.0.3, 2.1.2 > > Attachments: HBASE-21376.branch-2.0.001.patch, > HBASE-21376.branch-2.0.001.patch, HBASE-21376.patch, HBASE-21376.patch > > > As discussed in HBASE-21364, we divided the patch in HBASE-21364 to two, the > critical one is already submitted in HBASE-21364 to branch-2.0 and > branch-2.1, but I also added some useful logs which need to commit to all > branches. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-13468) hbase.zookeeper.quorum supports ipv6 address
[ https://issues.apache.org/jira/browse/HBASE-13468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated HBASE-13468: --- Resolution: Fixed Hadoop Flags: Reviewed Fix Version/s: 3.0.0 Status: Resolved (was: Patch Available) Thanks for the patch, maoling. Thanks for the review, Mike. > hbase.zookeeper.quorum supports ipv6 address > > > Key: HBASE-13468 > URL: https://issues.apache.org/jira/browse/HBASE-13468 > Project: HBase > Issue Type: Bug >Reporter: Mingtao Zhang >Assignee: maoling >Priority: Major > Fix For: 3.0.0 > > Attachments: HBASE-13468.master.001.patch, > HBASE-13468.master.002.patch, HBASE-13468.master.003.patch, > HBASE-13468.master.004.patch > > > I put an ipv6 address in hbase.zookeeper.quorum; by the time this string reached the > zookeeper code, the address was messed up, i.e. only '[1234' was left. > I started using pseudo mode with embedded zk = true. > I downloaded 1.0.0, not sure which affected version should be here. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21463) The checkOnlineRegionsReport can accidentally complete a TRSP
[ https://issues.apache.org/jira/browse/HBASE-21463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682410#comment-16682410 ] Guanghao Zhang commented on HBASE-21463: Great UT. Not easy to find this bug and reproduce this problem. > The checkOnlineRegionsReport can accidentally complete a TRSP > - > > Key: HBASE-21463 > URL: https://issues.apache.org/jira/browse/HBASE-21463 > Project: HBase > Issue Type: Sub-task > Components: amv2 >Reporter: Duo Zhang >Priority: Critical > Fix For: 3.0.0, 2.2.0 > > Attachments: HBASE-21463-UT.patch > > > On our testing cluster, we observe a race condition: > 1. A regionServerReport request is built > 2. A TRSP is scheduled to reopen the region > 3. The region is closed at RS side > 4. The OpenRegionProcedure is created > 5. The regionServerReport generated at step 1 is executed, and we find that > the region is opened on the RS, so we update the region state to OPEN. > 6. The OpenRegionProcedure notices that the region has already been in the > OPEN state so gives up and finishes. > 7. The TRSP finishes. > 8. The region is recorded as OPEN on the RS but actually not, and can not > recover unless we use HBCK2. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21463) The checkOnlineRegionsReport can accidentally complete a TRSP
[ https://issues.apache.org/jira/browse/HBASE-21463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682388#comment-16682388 ] Duo Zhang commented on HBASE-21463: --- A UT to reproduce the problem. [~zghaobac] FYI. If there are no other opinions, I will completely change the behavior of checkOnlineRegions to only report possible inconsistencies instead of trying to fix them. > The checkOnlineRegionsReport can accidentally complete a TRSP > - > > Key: HBASE-21463 > URL: https://issues.apache.org/jira/browse/HBASE-21463 > Project: HBase > Issue Type: Sub-task > Components: amv2 >Reporter: Duo Zhang >Priority: Critical > Fix For: 3.0.0, 2.2.0 > > Attachments: HBASE-21463-UT.patch > > > On our testing cluster, we observe a race condition: > 1. A regionServerReport request is built > 2. A TRSP is scheduled to reopen the region > 3. The region is closed at RS side > 4. The OpenRegionProcedure is created > 5. The regionServerReport generated at step 1 is executed, and we find that > the region is opened on the RS, so we update the region state to OPEN. > 6. The OpenRegionProcedure notices that the region has already been in the > OPEN state so gives up and finishes. > 7. The TRSP finishes. > 8. The region is recorded as OPEN on the RS but actually not, and can not > recover unless we use HBCK2. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21463) The checkOnlineRegionsReport can accidentally complete a TRSP
[ https://issues.apache.org/jira/browse/HBASE-21463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-21463: -- Attachment: HBASE-21463-UT.patch > The checkOnlineRegionsReport can accidentally complete a TRSP > - > > Key: HBASE-21463 > URL: https://issues.apache.org/jira/browse/HBASE-21463 > Project: HBase > Issue Type: Sub-task > Components: amv2 >Reporter: Duo Zhang >Priority: Critical > Fix For: 3.0.0, 2.2.0 > > Attachments: HBASE-21463-UT.patch > > > On our testing cluster, we observe a race condition: > 1. A regionServerReport request is built > 2. A TRSP is scheduled to reopen the region > 3. The region is closed at RS side > 4. The OpenRegionProcedure is created > 5. The regionServerReport generated at step 1 is executed, and we find that > the region is opened on the RS, so we update the region state to OPEN. > 6. The OpenRegionProcedure notices that the region has already been in the > OPEN state so gives up and finishes. > 7. The TRSP finishes. > 8. The region is recorded as OPEN on the RS but actually not, and can not > recover unless we use HBCK2. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21445) CopyTable by bulkload will write hfile into yarn's HDFS
[ https://issues.apache.org/jira/browse/HBASE-21445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682394#comment-16682394 ] Hudson commented on HBASE-21445: Results for branch master [build #596 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/master/596/]: (x) *{color:red}-1 overall{color}* details (if available): (/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/master/596//General_Nightly_Build_Report/] (x) {color:red}-1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/master/596//JDK8_Nightly_Build_Report_(Hadoop2)/] (x) {color:red}-1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/master/596//JDK8_Nightly_Build_Report_(Hadoop3)/] (x) {color:red}-1 source release artifact{color} -- See build output for details. (x) {color:red}-1 client integration test{color} -- Something went wrong with this stage, [check relevant console output|https://builds.apache.org/job/HBase%20Nightly/job/master/596//console]. > CopyTable by bulkload will write hfile into yarn's HDFS > > > Key: HBASE-21445 > URL: https://issues.apache.org/jira/browse/HBASE-21445 > Project: HBase > Issue Type: Bug > Components: mapreduce >Reporter: Zheng Hu >Assignee: Zheng Hu >Priority: Major > Fix For: 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.9 > > Attachments: HBASE-21445.v1.patch > > > When using CopyTable with bulkload, I found that all hfile's are written in > our Yarn's HDFS cluster. and failed to load hfiles into HBase cluster, > because we use different HDFS between yarn cluster and hbase cluster. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21437) Bypassed procedure throw IllegalArgumentException when its state is WAITING_TIMEOUT
[ https://issues.apache.org/jira/browse/HBASE-21437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682393#comment-16682393 ] Hudson commented on HBASE-21437: Results for branch master [build #596 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/master/596/]: (x) *{color:red}-1 overall{color}* details (if available): (/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/master/596//General_Nightly_Build_Report/] (x) {color:red}-1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/master/596//JDK8_Nightly_Build_Report_(Hadoop2)/] (x) {color:red}-1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/master/596//JDK8_Nightly_Build_Report_(Hadoop3)/] (x) {color:red}-1 source release artifact{color} -- See build output for details. (x) {color:red}-1 client integration test{color} -- Something went wrong with this stage, [check relevant console output|https://builds.apache.org/job/HBase%20Nightly/job/master/596//console]. > Bypassed procedure throw IllegalArgumentException when its state is > WAITING_TIMEOUT > --- > > Key: HBASE-21437 > URL: https://issues.apache.org/jira/browse/HBASE-21437 > Project: HBase > Issue Type: Bug >Affects Versions: 2.1.1, 2.0.2 >Reporter: Jingyun Tian >Assignee: Jingyun Tian >Priority: Major > Fix For: 3.0.0, 2.2.0, 2.0.3, 2.1.2 > > Attachments: HBASE-21437.master.001.patch, > HBASE-21437.master.002.patch, HBASE-21437.master.003.patch > > > {code} > 2018-11-05,18:25:52,735 WARN > org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Worker terminating > UNNATURALLY null > java.lang.IllegalArgumentException: NOT RUNNABLE! 
pid=3, > state=WAITING_TIMEOUT:REGION_STATE_TRANSITION_CLOSE, hasLock=true, > bypass=true; TransitRegionStateProcedure table=test_fail > over, region=1bb029ba4ec03b92061be5c4329d2096, UNASSIGN > at > org.apache.hbase.thirdparty.com.google.common.base.Preconditions.checkArgument(Preconditions.java:134) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1620) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1384) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78) > at > org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1948) > 2018-11-05,18:25:52,736 TRACE > org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Worker terminated. > {code} > When we bypass a WAITING_TIMEOUT procedure and resubmit it, its state > is still WAITING_TIMEOUT, so when the executor runs the procedure it will > throw an exception and terminate the worker. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
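The failure mode in HBASE-21437 can be modeled compactly: the worker asserts a procedure is RUNNABLE before executing it, but a bypassed procedure that was parked in WAITING_TIMEOUT keeps that state when resubmitted, so the precondition fires and the worker dies. The sketch below is a simplified stand-in for the real `ProcedureExecutor` (all names here are illustrative); it shows the shape of a fix, namely normalizing the state back to RUNNABLE on the resubmit path.

```java
// Toy model of the bypass/resubmit race, not the real ProcedureExecutor.
public class BypassResubmit {
    enum State { RUNNABLE, WAITING_TIMEOUT }

    static class Proc {
        State state = State.WAITING_TIMEOUT; // parked on a timeout when bypassed
        boolean bypassed = true;
    }

    // Mirrors the Preconditions.checkArgument in execProcedure: only RUNNABLE
    // procedures may be executed by a worker.
    static void execute(Proc p) {
        if (p.state != State.RUNNABLE) {
            throw new IllegalArgumentException("NOT RUNNABLE! state=" + p.state);
        }
        // ... a bypassed procedure would simply be finished here ...
    }

    // Fixed resubmit path: reset the state before handing the procedure back
    // to a worker, so the RUNNABLE precondition holds.
    static void resubmit(Proc p) {
        if (p.bypassed && p.state == State.WAITING_TIMEOUT) {
            p.state = State.RUNNABLE;
        }
        execute(p);
    }
}
```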
[jira] [Commented] (HBASE-21461) Region CoProcessor for splitting large WAL entries in smaller batches, to handle situation when faulty ingestion had created too many mutations for same cell in single
[ https://issues.apache.org/jira/browse/HBASE-21461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682392#comment-16682392 ] Wellington Chevreuil commented on HBASE-21461: -- Thanks for the insights, [~stack]! {quote}I agree it an operator-tool but it is a bit 'odd' being branch-1 only and a CP only (small audience – but super cool throwing these hosed operators a lifeline...). {quote} My thought was to have it as the first feature of a "replication" sub-module in the operator-tools repo. Other potential utilities for replication-related operational issues could then be placed there as well. The limited audience, though, might indeed be something to consider when weighing whether it's really worth the effort for now. {quote}How would we package it? Would we build a jar over in hbase-operator-tool and then operator would take it and install when they had a constipated replication stream? {quote} Yeah, operators would need to download it (if we are planning to expose a download page for operator-tools) or build and install it themselves. Put that way, it doesn't really sound like a tool, since it's not a simple matter of running an external application that interacts with and fixes hbase problems. Maybe we should call it a "medicine" (a laxative one :)). {quote}One other thought is that we add to the refguide a section on constipation (smile) w/ a pointer here w/ instructions on how to install. {quote} Liked this idea too. In that case, where and how would we place the CP? Were you thinking of providing the built jar somewhere, or just the raw code in patch format attached to a jira? I tend to prefer the former, as a means of reaching a broader audience of operators who may not be familiar with the build process. 
> Region CoProcessor for splitting large WAL entries in smaller batches, to > handle situation when faulty ingestion had created too many mutations for > same cell in single batch > - > > Key: HBASE-21461 > URL: https://issues.apache.org/jira/browse/HBASE-21461 > Project: HBase > Issue Type: New Feature > Components: hbase-operator-tools, Replication >Reporter: Wellington Chevreuil >Assignee: Wellington Chevreuil >Priority: Minor > Attachments: 0001-Initial-version-for-WAL-entry-splitter-CP.txt > > > With replication enabled deployments, it's possible that faulty ingestion > clients may lead to a single WalEntry containing too many edits for the same cell. > This would cause *ReplicationSink,* in the target cluster, to attempt a single > batch mutation with too many operations, which in turn can lead to very large > RPC requests that may not fit in the final target RS rpc queue. In this > case, the messages below are seen on the target RS trying to perform the sink: > {noformat} > WARN org.apache.hadoop.hbase.client.AsyncProcess: #690, table=TABLE_NAME, > attempt=4/4 failed=2ops, last exception: > org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.CallQueueTooBigException): > Call queue is full on /0.0.0.0:60020, is hbase.ipc.server.max.callqueue.size > too small? 
on regionserver01.example.com,60020,1524334173359, tracking > started Fri Sep 07 10:35:53 IST 2018; not retrying 2 - final failure > 2018-09-07 10:40:59,506 ERROR > org.apache.hadoop.hbase.replication.regionserver.ReplicationSink: Unable to > accept edit because: > org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 2 > actions: RemoteWithExtrasException: 2 times, > at > org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:247) > at > org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$1800(AsyncProcess.java:227) > at > org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.getErrors(AsyncProcess.java:1663) > at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:982) > at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:996){noformat} > When this problem manifests, replication will be stuck and wal files will be > piling up in the source cluster's WALs/oldWALs folder. The typical workaround requires > manual cleanup of replication znodes in ZK, and manual WAL replay for the WAL > files containing the large entry. > This CP would handle the issue by checking for large wal entries and > splitting those into smaller batches in the *reReplicateLogEntries* method > hook. > *Additional Note*: HBASE-18027 introduced some safeguards against such large RPC > requests, which may already help avoid this scenario. That is not available > for 1.2 releases, though, and this CP tool may still be relevant for 1.2 > clusters. It may also still be worth having to work around any potential > unknown large RPC
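The core of the proposed CP is a simple partitioning step: instead of submitting one huge batch of edits from a single WAL entry, cut it into sub-batches no larger than some cap so each sink RPC stays under the call-queue limit. A minimal sketch of that idea (a hypothetical helper, not the actual coprocessor code):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative helper, not the real WAL-entry-splitter CP: partition a large
// list of edits into sub-batches of at most maxOps operations each.
public class WalEntrySplitter {
    static <T> List<List<T>> splitIntoBatches(List<T> edits, int maxOps) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < edits.size(); i += maxOps) {
            // Copy the sublist so each batch is independent of the source list.
            batches.add(new ArrayList<>(edits.subList(i, Math.min(i + maxOps, edits.size()))));
        }
        return batches;
    }
}
```

In the CP itself the same split would happen inside the replication-sink hook, with each sub-batch applied as a separate mutation call.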
[jira] [Commented] (HBASE-21445) CopyTable by bulkload will write hfile into yarn's HDFS
[ https://issues.apache.org/jira/browse/HBASE-21445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682361#comment-16682361 ] Hudson commented on HBASE-21445: Results for branch branch-2.1 [build #593 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/593/]: (/) *{color:green}+1 overall{color}* details (if available): (/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/593//General_Nightly_Build_Report/] (/) {color:green}+1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/593//JDK8_Nightly_Build_Report_(Hadoop2)/] (/) {color:green}+1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.1/593//JDK8_Nightly_Build_Report_(Hadoop3)/] (/) {color:green}+1 source release artifact{color} -- See build output for details. (/) {color:green}+1 client integration test{color} > CopyTable by bulkload will write hfile into yarn's HDFS > > > Key: HBASE-21445 > URL: https://issues.apache.org/jira/browse/HBASE-21445 > Project: HBase > Issue Type: Bug > Components: mapreduce >Reporter: Zheng Hu >Assignee: Zheng Hu >Priority: Major > Fix For: 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.9 > > Attachments: HBASE-21445.v1.patch > > > When using CopyTable with bulkload, I found that all hfile's are written in > our Yarn's HDFS cluster. and failed to load hfiles into HBase cluster, > because we use different HDFS between yarn cluster and hbase cluster. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21445) CopyTable by bulkload will write hfile into yarn's HDFS
[ https://issues.apache.org/jira/browse/HBASE-21445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682356#comment-16682356 ] Hudson commented on HBASE-21445: Results for branch branch-2.0 [build #1072 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1072/]: (x) *{color:red}-1 overall{color}* details (if available): (/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1072//General_Nightly_Build_Report/] (/) {color:green}+1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1072//JDK8_Nightly_Build_Report_(Hadoop2)/] (x) {color:red}-1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.0/1072//JDK8_Nightly_Build_Report_(Hadoop3)/] (/) {color:green}+1 source release artifact{color} -- See build output for details. > CopyTable by bulkload will write hfile into yarn's HDFS > > > Key: HBASE-21445 > URL: https://issues.apache.org/jira/browse/HBASE-21445 > Project: HBase > Issue Type: Bug > Components: mapreduce >Reporter: Zheng Hu >Assignee: Zheng Hu >Priority: Major > Fix For: 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.9 > > Attachments: HBASE-21445.v1.patch > > > When using CopyTable with bulkload, I found that all hfile's are written in > our Yarn's HDFS cluster. and failed to load hfiles into HBase cluster, > because we use different HDFS between yarn cluster and hbase cluster. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21445) CopyTable by bulkload will write hfile into yarn's HDFS
[ https://issues.apache.org/jira/browse/HBASE-21445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682351#comment-16682351 ] Hudson commented on HBASE-21445: Results for branch branch-2 [build #1494 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1494/]: (x) *{color:red}-1 overall{color}* details (if available): (/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1494//General_Nightly_Build_Report/] (x) {color:red}-1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1494//JDK8_Nightly_Build_Report_(Hadoop2)/] (/) {color:green}+1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/1494//JDK8_Nightly_Build_Report_(Hadoop3)/] (/) {color:green}+1 source release artifact{color} -- See build output for details. (/) {color:green}+1 client integration test{color} > CopyTable by bulkload will write hfile into yarn's HDFS > > > Key: HBASE-21445 > URL: https://issues.apache.org/jira/browse/HBASE-21445 > Project: HBase > Issue Type: Bug > Components: mapreduce >Reporter: Zheng Hu >Assignee: Zheng Hu >Priority: Major > Fix For: 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.9 > > Attachments: HBASE-21445.v1.patch > > > When using CopyTable with bulkload, I found that all hfile's are written in > our Yarn's HDFS cluster. and failed to load hfiles into HBase cluster, > because we use different HDFS between yarn cluster and hbase cluster. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21445) CopyTable by bulkload will write hfile into yarn's HDFS
[ https://issues.apache.org/jira/browse/HBASE-21445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682349#comment-16682349 ] Hudson commented on HBASE-21445: Results for branch branch-1.4 [build #543 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.4/543/]: (x) *{color:red}-1 overall{color}* details (if available): (x) {color:red}-1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.4/543//General_Nightly_Build_Report/] (x) {color:red}-1 jdk7 checks{color} -- For more information [see jdk7 report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.4/543//JDK7_Nightly_Build_Report/] (x) {color:red}-1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.4/543//JDK8_Nightly_Build_Report_(Hadoop2)/] (/) {color:green}+1 source release artifact{color} -- See build output for details. > CopyTable by bulkload will write hfile into yarn's HDFS > > > Key: HBASE-21445 > URL: https://issues.apache.org/jira/browse/HBASE-21445 > Project: HBase > Issue Type: Bug > Components: mapreduce >Reporter: Zheng Hu >Assignee: Zheng Hu >Priority: Major > Fix For: 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.9 > > Attachments: HBASE-21445.v1.patch > > > When using CopyTable with bulkload, I found that all hfile's are written in > our Yarn's HDFS cluster. and failed to load hfiles into HBase cluster, > because we use different HDFS between yarn cluster and hbase cluster. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21445) CopyTable by bulkload will write hfile into yarn's HDFS
[ https://issues.apache.org/jira/browse/HBASE-21445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682320#comment-16682320 ] Hudson commented on HBASE-21445: Results for branch branch-1.3 [build #536 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.3/536/]: (x) *{color:red}-1 overall{color}* details (if available): (/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.3/536//General_Nightly_Build_Report/] (/) {color:green}+1 jdk7 checks{color} -- For more information [see jdk7 report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.3/536//JDK7_Nightly_Build_Report/] (x) {color:red}-1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.3/536//JDK8_Nightly_Build_Report_(Hadoop2)/] (/) {color:green}+1 source release artifact{color} -- See build output for details. > CopyTable by bulkload will write hfile into yarn's HDFS > > > Key: HBASE-21445 > URL: https://issues.apache.org/jira/browse/HBASE-21445 > Project: HBase > Issue Type: Bug > Components: mapreduce >Reporter: Zheng Hu >Assignee: Zheng Hu >Priority: Major > Fix For: 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.9 > > Attachments: HBASE-21445.v1.patch > > > When using CopyTable with bulkload, I found that all hfile's are written in > our Yarn's HDFS cluster. and failed to load hfiles into HBase cluster, > because we use different HDFS between yarn cluster and hbase cluster. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21445) CopyTable by bulkload will write hfile into yarn's HDFS
[ https://issues.apache.org/jira/browse/HBASE-21445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682307#comment-16682307 ] Hudson commented on HBASE-21445: Results for branch branch-1 [build #546 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-1/546/]: (x) *{color:red}-1 overall{color}* details (if available): (x) {color:red}-1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1/546//General_Nightly_Build_Report/] (x) {color:red}-1 jdk7 checks{color} -- For more information [see jdk7 report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1/546//JDK7_Nightly_Build_Report/] (x) {color:red}-1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1/546//JDK8_Nightly_Build_Report_(Hadoop2)/] (x) {color:red}-1 source release artifact{color} -- See build output for details. > CopyTable by bulkload will write hfile into yarn's HDFS > > > Key: HBASE-21445 > URL: https://issues.apache.org/jira/browse/HBASE-21445 > Project: HBase > Issue Type: Bug > Components: mapreduce >Reporter: Zheng Hu >Assignee: Zheng Hu >Priority: Major > Fix For: 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.9 > > Attachments: HBASE-21445.v1.patch > > > When using CopyTable with bulkload, I found that all hfile's are written in > our Yarn's HDFS cluster. and failed to load hfiles into HBase cluster, > because we use different HDFS between yarn cluster and hbase cluster. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21463) The checkOnlineRegionsReport can accidentally complete a TRSP
[ https://issues.apache.org/jira/browse/HBASE-21463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-21463: -- Priority: Critical (was: Major) > The checkOnlineRegionsReport can accidentally complete a TRSP > - > > Key: HBASE-21463 > URL: https://issues.apache.org/jira/browse/HBASE-21463 > Project: HBase > Issue Type: Sub-task > Components: amv2 >Reporter: Duo Zhang >Priority: Critical > Fix For: 3.0.0, 2.2.0 > > > On our testing cluster, we observe a race condition: > 1. A regionServerReport request is built > 2. A TRSP is scheduled to reopen the region > 3. The region is closed at RS side > 4. The OpenRegionProcedure is created > 5. The regionServerReport generated at step 1 is executed, and we find that > the region is opened on the RS, so we update the region state to OPEN. > 6. The OpenRegionProcedure notices that the region has already been in the > OPEN state so gives up and finishes. > 7. The TRSP finishes. > 8. The region is recorded as OPEN on the RS but actually not, and can not > recover unless we use HBCK2. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
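The eight-step race above has a small invariant at its heart: the server-report path must not transition a region to OPEN while a TRSP owns that region, because the report may have been built before the TRSP started. The toy model below illustrates that guard (the names are illustrative, not the real AMv2 API):

```java
// Toy model of the race fix in HBASE-21463, not real HBase code: a stale
// regionServerReport must not flip the region state while a
// TransitRegionStateProcedure is driving the region.
public class ReportGuard {
    enum State { OPEN, CLOSED, OPENING }

    static class RegionNode {
        State state = State.OPENING;
        Object owningProc; // non-null while a TRSP is in flight for this region
    }

    // Guarded version of the checkOnlineRegionsReport state update.
    static void onServerReport(RegionNode node, boolean reportedOnline) {
        if (node.owningProc != null) {
            return; // a TRSP owns the region; ignore the possibly-stale report
        }
        if (reportedOnline) {
            node.state = State.OPEN;
        }
    }
}
```

Without the `owningProc` check, the stale report marks the region OPEN, the OpenRegionProcedure sees OPEN and gives up, and the TRSP completes even though the region is actually closed on the RS.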
[jira] [Commented] (HBASE-21445) CopyTable by bulkload will write hfile into yarn's HDFS
[ https://issues.apache.org/jira/browse/HBASE-21445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682306#comment-16682306 ] Hudson commented on HBASE-21445: Results for branch branch-1.2 [build #545 on builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.2/545/]: (x) *{color:red}-1 overall{color}* details (if available): (/) {color:green}+1 general checks{color} -- For more information [see general report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.2/545//General_Nightly_Build_Report/] (/) {color:green}+1 jdk7 checks{color} -- For more information [see jdk7 report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.2/545//JDK7_Nightly_Build_Report/] (x) {color:red}-1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://builds.apache.org/job/HBase%20Nightly/job/branch-1.2/545//JDK8_Nightly_Build_Report_(Hadoop2)/] (/) {color:green}+1 source release artifact{color} -- See build output for details. > CopyTable by bulkload will write hfile into yarn's HDFS > > > Key: HBASE-21445 > URL: https://issues.apache.org/jira/browse/HBASE-21445 > Project: HBase > Issue Type: Bug > Components: mapreduce >Reporter: Zheng Hu >Assignee: Zheng Hu >Priority: Major > Fix For: 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.9 > > Attachments: HBASE-21445.v1.patch > > > When using CopyTable with bulkload, I found that all hfile's are written in > our Yarn's HDFS cluster. and failed to load hfiles into HBase cluster, > because we use different HDFS between yarn cluster and hbase cluster. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21376) Add some verbose log to MasterProcedureScheduler
[ https://issues.apache.org/jira/browse/HBASE-21376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-21376: -- Resolution: Fixed Assignee: Duo Zhang (was: Allan Yang) Status: Resolved (was: Patch Available) Pushed to branch-2.0+. > Add some verbose log to MasterProcedureScheduler > > > Key: HBASE-21376 > URL: https://issues.apache.org/jira/browse/HBASE-21376 > Project: HBase > Issue Type: Sub-task > Components: logging, proc-v2 >Reporter: Allan Yang >Assignee: Duo Zhang >Priority: Major > Fix For: 2.0.3, 2.1.2 > > Attachments: HBASE-21376.branch-2.0.001.patch, > HBASE-21376.branch-2.0.001.patch, HBASE-21376.patch, HBASE-21376.patch > > > As discussed in HBASE-21364, we divided the patch in HBASE-21364 to two, the > critical one is already submitted in HBASE-21364 to branch-2.0 and > branch-2.1, but I also added some useful logs which need to commit to all > branches. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21377) Missing procedure stack index when restarting
[ https://issues.apache.org/jira/browse/HBASE-21377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682302#comment-16682302 ] Duo Zhang commented on HBASE-21377: --- Haven't seen this for a long time. I plan to close this after TestMergeTableRegionsProcedure is moved off the flaky list. > Missing procedure stack index when restarting > - > > Key: HBASE-21377 > URL: https://issues.apache.org/jira/browse/HBASE-21377 > Project: HBase > Issue Type: Sub-task > Components: proc-v2 >Reporter: Duo Zhang >Priority: Major > Fix For: 3.0.0, 2.2.0 > > Attachments: HBASE-21377-debuglog.patch > > > TestMergeTableRegionsProcedure is still flaky, and we found this in the output > {noformat} > 2018-10-24 03:46:12,842 ERROR [Time-limited test] wal.WALProcedureTree(198): > Missing stack id 6, max stack id is 8, root procedure is Procedure(pid=42, > ppid=-1, > class=org.apache.hadoop.hbase.master.assignment.MergeTableRegionsProcedure) > 2018-10-24 03:46:12,847 ERROR [Time-limited test] > procedure2.ProcedureExecutor$2(451): Corrupt pid=42, > state=WAITING:MERGE_TABLE_REGIONS_CHECK_CLOSED_REGIONS, hasLock=false; > MergeTableRegionsProcedure table=testRollbackAndDoubleExecution, > regions=[72aed4d14ac73faaa1755e248a55b71a, a848f3ca26989865d59cd0683ae6], > forcibly=false > 2018-10-24 03:46:12,847 ERROR [Time-limited test] > procedure2.ProcedureExecutor$2(451): Corrupt pid=43, ppid=42, > state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_CLOSED, hasLock=false; > TransitRegionStateProcedure table=testRollbackAndDoubleExecution, > region=72aed4d14ac73faaa1755e248a55b71a, UNASSIGN > 2018-10-24 03:46:12,848 ERROR [Time-limited test] > procedure2.ProcedureExecutor$2(451): Corrupt pid=44, ppid=42, > state=WAITING:REGION_STATE_TRANSITION_CONFIRM_CLOSED, hasLock=false; > TransitRegionStateProcedure table=testRollbackAndDoubleExecution, > region=a848f3ca26989865d59cd0683ae6, UNASSIGN > 2018-10-24 03:46:12,848 ERROR [Time-limited test] > procedure2.ProcedureExecutor$2(451): 
Corrupt pid=45, ppid=43, state=SUCCESS, > hasLock=false; org.apache.hadoop.hbase.master.assignment.CloseRegionProcedure > 2018-10-24 03:46:12,849 ERROR [Time-limited test] > procedure2.ProcedureExecutor$2(451): Corrupt pid=46, ppid=44, state=RUNNABLE, > hasLock=false; org.apache.hadoop.hbase.master.assignment.CloseRegionProcedure > {noformat} > Need to dig more. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
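The "Missing stack id 6, max stack id is 8" message in the log comes from a simple consistency check: the stack ids recorded across a root procedure's tree must form the contiguous range 0..max, and any hole means a WAL entry was lost, so the whole tree is treated as corrupt. A simplified sketch of that check (not the real `WALProcedureTree` code):

```java
import java.util.Collections;
import java.util.Set;

// Simplified model of the stack-id consistency check described in the log:
// valid only if the recorded stack ids cover 0..max with no holes.
public class StackIdCheck {
    static boolean isValid(Set<Integer> stackIds) {
        if (stackIds.isEmpty()) {
            return false;
        }
        int max = Collections.max(stackIds);
        for (int i = 0; i <= max; i++) {
            if (!stackIds.contains(i)) {
                return false; // missing stack id -> tree is corrupt
            }
        }
        return true;
    }
}
```

In the logged case the ids ran up to 8 but id 6 was missing, so the merge procedure and all of its children were reported as corrupt on restart.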
[jira] [Updated] (HBASE-21376) Add some verbose log to MasterProcedureScheduler
[ https://issues.apache.org/jira/browse/HBASE-21376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-21376: -- Fix Version/s: 2.2.0 3.0.0 > Add some verbose log to MasterProcedureScheduler > > > Key: HBASE-21376 > URL: https://issues.apache.org/jira/browse/HBASE-21376 > Project: HBase > Issue Type: Sub-task > Components: logging, proc-v2 >Reporter: Allan Yang >Assignee: Duo Zhang >Priority: Major > Fix For: 3.0.0, 2.2.0, 2.0.3, 2.1.2 > > Attachments: HBASE-21376.branch-2.0.001.patch, > HBASE-21376.branch-2.0.001.patch, HBASE-21376.patch, HBASE-21376.patch > > > As discussed in HBASE-21364, we divided the patch in HBASE-21364 to two, the > critical one is already submitted in HBASE-21364 to branch-2.0 and > branch-2.1, but I also added some useful logs which need to commit to all > branches. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (HBASE-21376) Add some verbose log to MasterProcedureScheduler
[ https://issues.apache.org/jira/browse/HBASE-21376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-21376: -- Component/s: proc-v2 logging > Add some verbose log to MasterProcedureScheduler > > > Key: HBASE-21376 > URL: https://issues.apache.org/jira/browse/HBASE-21376 > Project: HBase > Issue Type: Sub-task > Components: logging, proc-v2 >Reporter: Allan Yang >Assignee: Duo Zhang >Priority: Major > Fix For: 3.0.0, 2.2.0, 2.0.3, 2.1.2 > > Attachments: HBASE-21376.branch-2.0.001.patch, > HBASE-21376.branch-2.0.001.patch, HBASE-21376.patch, HBASE-21376.patch > > > As discussed in HBASE-21364, we divided the patch in HBASE-21364 to two, the > critical one is already submitted in HBASE-21364 to branch-2.0 and > branch-2.1, but I also added some useful logs which need to commit to all > branches. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HBASE-21454) Kill zk spew
[ https://issues.apache.org/jira/browse/HBASE-21454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682293#comment-16682293 ] Duo Zhang commented on HBASE-21454: --- I think we should find a way to disable all logs when executing the shell? But for a running master or regionserver instance the logs are useful... > Kill zk spew > > > Key: HBASE-21454 > URL: https://issues.apache.org/jira/browse/HBASE-21454 > Project: HBase > Issue Type: Bug > Components: logging, Zookeeper >Reporter: stack >Assignee: stack >Priority: Major > Fix For: 3.0.0, 2.2.0 > > Attachments: HBASE-21454.master.001.patch > > > Kill the zk spew. This is radical: it drops the startup listing of CLASSPATH and > all properties. Can dial back in what we need after this patch goes in. > I get spew each time I run a little command in spark-shell. Annoying. Always > been annoying in all logs. > More might be needed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)