[jira] [Commented] (HBASE-20226) Performance Improvement Taking Large Snapshots In Remote Filesystems

2020-07-27 Thread Ted Yu (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166082#comment-17166082
 ] 

Ted Yu commented on HBASE-20226:


{code}
+if (v1Regions.size() > 0 || v2Regions.size() > 0) {
{code}
It seems the thread pool is only needed when v1Regions.size() + v2Regions.size() > 1.

There are also a few findbugs warnings to be addressed.
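
To make the difference between the two conditions concrete, here is a minimal sketch. The class and method names are hypothetical and do not come from the patch; only the two size expressions are taken from the comment above.

```java
// Hypothetical helper, not part of the HBASE-20226 patch.
final class PoolCondition {
  // Condition as written in the patch: true as soon as there is any region.
  static boolean patchCondition(int v1Count, int v2Count) {
    return v1Count > 0 || v2Count > 0;
  }

  // Condition suggested in the review: a pool only pays off when there is
  // more than one region manifest to process.
  static boolean suggestedCondition(int v1Count, int v2Count) {
    return v1Count + v2Count > 1;
  }
}
```

With a single region the patch condition still creates the pool, while the suggested condition does not.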

> Performance Improvement Taking Large Snapshots In Remote Filesystems
> 
>
> Key: HBASE-20226
> URL: https://issues.apache.org/jira/browse/HBASE-20226
> Project: HBase
>  Issue Type: Improvement
>  Components: snapshots
>Affects Versions: 1.4.0
> Environment: HBase 1.4.0 running on an AWS EMR cluster with the 
> hbase.rootdir set to point to a folder in S3 
>Reporter: Saad Mufti
>Priority: Minor
> Attachments: HBASE-20226..01.patch
>
>
> When taking a snapshot of any table, one of the last steps is to delete the 
> region manifests, which have already been rolled up into a larger overall 
> manifest and thus have redundant information.
> This proposal is to do the deletion in a thread pool bounded by 
> hbase.snapshot.thread.pool.max. For large tables with a lot of regions, the 
> current single-threaded deletion takes longer than all the rest of the 
> snapshot tasks when the HBase data and the snapshot folder are both in a 
> remote filesystem like S3.
> I have a patch for this proposal almost ready and will submit it tomorrow for 
> feedback, although I haven't had a chance to write any tests yet.
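
The proposal quoted above could look roughly like the following sketch. It is not the attached patch: local temporary files stand in for the region manifests, the `maxThreads` parameter stands in for the value of hbase.snapshot.thread.pool.max, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: local files stand in for region manifests, and
// maxThreads stands in for the value of hbase.snapshot.thread.pool.max.
final class ParallelManifestDelete {

  // Deletes every path using at most maxThreads worker threads and waits
  // for all deletions to finish before returning.
  static void deleteAll(List<Path> manifests, int maxThreads) {
    ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
    try {
      List<Future<?>> pending = new ArrayList<>();
      for (Path manifest : manifests) {
        pending.add(pool.submit(() -> {
          try {
            Files.deleteIfExists(manifest);
          } catch (IOException e) {
            throw new UncheckedIOException(e);
          }
        }));
      }
      for (Future<?> f : pending) {
        f.get(); // propagate the first failure, if any
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while deleting manifests", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("manifest deletion failed", e.getCause());
    } finally {
      pool.shutdown();
    }
  }

  // Creates `count` temporary files, deletes them in parallel, and returns
  // how many still exist afterwards.
  static long demo(int count, int maxThreads) {
    try {
      Path dir = Files.createTempDirectory("manifests");
      List<Path> manifests = new ArrayList<>();
      for (int i = 0; i < count; i++) {
        manifests.add(Files.createFile(dir.resolve("region-" + i + ".manifest")));
      }
      deleteAll(manifests, maxThreads);
      long remaining = manifests.stream().filter(Files::exists).count();
      Files.deleteIfExists(dir);
      return remaining;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

On a remote filesystem each delete is dominated by round-trip latency, so issuing several at once is where the expected gain comes from; the pool size caps how many requests are in flight.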



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-24137) The max merge count of metafixer should be configurable in MetaFixer

2020-04-08 Thread Ted Yu (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-24137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17078854#comment-17078854
 ] 

Ted Yu commented on HBASE-24137:


Please find people who have touched this class to review.
I may not have time to review.

> The max merge count of metafixer should be configurable in MetaFixer
> 
>
> Key: HBASE-24137
> URL: https://issues.apache.org/jira/browse/HBASE-24137
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 3.0.0
>Reporter: Yu Wang
>Priority: Minor
> Fix For: 3.0.0
>
> Attachments: 24137_master_1.patch
>
>
> The max merge count of metafixer should be configurable in MetaFixer





[jira] [Commented] (HBASE-21619) Fix warning message caused by incorrect ternary operator evaluation

2018-12-19 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725397#comment-16725397
 ] 

Ted Yu commented on HBASE-21619:


lgtm

> Fix warning message caused by incorrect ternary operator evaluation
> ---
>
> Key: HBASE-21619
> URL: https://issues.apache.org/jira/browse/HBASE-21619
> Project: HBase
>  Issue Type: Bug
>Reporter: Wei-Chiu Chuang
>Assignee: Wei-Chiu Chuang
>Priority: Trivial
> Attachments: HBASE-21619.master.001.patch
>
>
> {code:title=LoadIncrementalHFiles#doBulkLoad}
> LOG.warn(
>   "Bulk load operation did not find any files to load in " + 
> "directory " + hfofDir != null
>   ? hfofDir.toUri().toString()
>   : "" + ".  Does it contain files in " +
>   "subdirectories that correspond to column family names?");
> {code}
> The JDK complains that {{"Bulk load operation did not find any files to load in " + 
> "directory " + hfofDir != null}} is always true, which is not what is 
> intended and produces a wrong message.
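
The precedence problem described above can be reproduced in isolation. The following is a sketch under assumptions: `dir` is a plain String instead of the hfofDir path object, the class and method names are hypothetical, and parenthesizing the ternary is shown as one possible fix, not necessarily the one in the attached patch.

```java
// Hypothetical stand-alone reproduction; `dir` is a plain String here
// instead of the hfofDir path object used in LoadIncrementalHFiles.
final class TernaryPrecedence {

  // Mirrors the original expression: `+` binds tighter than `!=`, so the
  // condition compares the concatenated String with null and is always true.
  static String buggy(String dir) {
    return "Bulk load operation did not find any files to load in "
        + "directory " + dir != null
        ? dir
        : "" + ".  Does it contain files in "
            + "subdirectories that correspond to column family names?";
  }

  // Parenthesizing the ternary restricts the null check to `dir` alone.
  static String fixed(String dir) {
    return "Bulk load operation did not find any files to load in "
        + "directory " + (dir != null ? dir : "")
        + ".  Does it contain files in "
        + "subdirectories that correspond to column family names?";
  }
}
```

In the buggy form the whole warning collapses to the directory string alone, which matches the wrong message reported in the issue.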



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21246) Introduce WALIdentity interface

2018-12-07 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21246:
---
Status: Patch Available  (was: Open)

> Introduce WALIdentity interface
> ---
>
> Key: HBASE-21246
> URL: https://issues.apache.org/jira/browse/HBASE-21246
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Fix For: HBASE-20952
>
> Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, 
> 21246.23.txt, 21246.24.txt, 21246.25.txt, 21246.26.txt, 21246.34.txt, 
> 21246.37.txt, 21246.39.txt, 21246.41.txt, 21246.43.txt, 
> 21246.HBASE-20952.001.patch, 21246.HBASE-20952.002.patch, 
> 21246.HBASE-20952.004.patch, 21246.HBASE-20952.005.patch, 
> 21246.HBASE-20952.007.patch, 21246.HBASE-20952.008.patch, 
> HBASE-21246.master.001.patch, replication-src-creates-wal-reader.jpg, 
> wal-factory-providers.png, wal-providers.png, wal-splitter-reader.jpg, 
> wal-splitter-writer.jpg
>
>
> We are introducing the WALIdentity interface so that the WAL representation can 
> be decoupled from the distributed filesystem.
> The interface provides a getName method whose return value can represent a 
> filename in a distributed filesystem environment or the name of the stream 
> when the WAL is backed by a log stream.
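
A rough sketch of the idea described above follows. The interface in the actual patches may differ: the String return type of getName and both implementing classes are assumptions made for illustration.

```java
// Sketch only: the actual interface in the HBASE-21246 patch may differ.
// The String return type and both implementing classes are assumptions.
interface WALIdentity {
  String getName();
}

// WAL stored as a file: the name is the last component of the file path.
final class FileWALIdentity implements WALIdentity {
  private final String path;

  FileWALIdentity(String path) {
    this.path = path;
  }

  @Override
  public String getName() {
    return path.substring(path.lastIndexOf('/') + 1);
  }
}

// WAL backed by a log stream: the name is the stream name itself.
final class StreamWALIdentity implements WALIdentity {
  private final String streamName;

  StreamWALIdentity(String streamName) {
    this.streamName = streamName;
  }

  @Override
  public String getName() {
    return streamName;
  }
}
```

Callers that only depend on the interface do not need to know whether a filesystem path or a stream sits behind a given WAL.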





[jira] [Commented] (HBASE-21479) Individual tests in TestHRegionReplayEvents class are failing

2018-12-01 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705903#comment-16705903
 ] 

Ted Yu commented on HBASE-21479:


Ran 
TestHRegionReplayEvents#testSkippingEditsWithSmallerSeqIdAfterRegionOpenEvent 
locally, and it passed.

> Individual tests in TestHRegionReplayEvents class are failing
> -
>
> Key: HBASE-21479
> URL: https://issues.apache.org/jira/browse/HBASE-21479
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Peter Somogyi
>Priority: Major
> Fix For: 3.0.0, 2.2.0, 2.1.2
>
> Attachments: HBASE-21479-v1.patch, testHRegionReplayEvents-output.txt
>
>
> The test fails in both the master branch and branch-2:
> {code}
> testSkippingEditsWithSmallerSeqIdAfterRegionOpenEvent(org.apache.hadoop.hbase.regionserver.TestHRegionReplayEvents)
>   Time elapsed: 3.74 sec  <<< ERROR!
> java.lang.IndexOutOfBoundsException: Index: 2, Size: 1
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegionReplayEvents.testSkippingEditsWithSmallerSeqIdAfterRegionOpenEvent(TestHRegionReplayEvents.java:1042)
> {code}





[jira] [Updated] (HBASE-21511) Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles

2018-11-26 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21511:
---
Fix Version/s: 1.4.9
   1.3.3

> Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles
> ---
>
> Key: HBASE-21511
> URL: https://issues.apache.org/jira/browse/HBASE-21511
> Project: HBase
>  Issue Type: Improvement
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Minor
> Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 1.4.9, 2.1.2
>
> Attachments: 21511.branch-1.v3.txt, 21511.branch-1.v4.txt, 
> 21511.v1.txt, 21511.v2.txt, 21511.v3.txt
>
>
> During review of HBASE-21387, [~Apache9] mentioned that the check for 
> in-progress snapshots in SnapshotFileCache#getUnreferencedFiles is no longer 
> needed now that the snapshot hfile cleaner and snapshot taking are mutually 
> exclusive.
> This issue is to address the review comment by removing the check for 
> in-progress snapshots.





[jira] [Updated] (HBASE-21511) Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles

2018-11-25 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21511:
---
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: (was: 2.0.4)
   (was: 1.4.9)
   (was: 1.3.3)
   Status: Resolved  (was: Patch Available)

Thanks for the review, Zheng.

> Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles
> ---
>
> Key: HBASE-21511
> URL: https://issues.apache.org/jira/browse/HBASE-21511
> Project: HBase
>  Issue Type: Improvement
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Minor
> Fix For: 3.0.0, 1.5.0, 2.2.0, 2.1.2
>
> Attachments: 21511.branch-1.v3.txt, 21511.branch-1.v4.txt, 
> 21511.v1.txt, 21511.v2.txt, 21511.v3.txt





[jira] [Updated] (HBASE-21511) Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles

2018-11-25 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21511:
---
Attachment: 21511.branch-1.v4.txt

> Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles
> ---
>
> Key: HBASE-21511
> URL: https://issues.apache.org/jira/browse/HBASE-21511
> Project: HBase
>  Issue Type: Improvement
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Minor
> Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 1.4.9, 2.1.2, 2.0.4
>
> Attachments: 21511.branch-1.v3.txt, 21511.branch-1.v4.txt, 
> 21511.v1.txt, 21511.v2.txt, 21511.v3.txt





[jira] [Updated] (HBASE-21511) Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles

2018-11-25 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21511:
---
Attachment: 21511.branch-1.v3.txt

> Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles
> ---
>
> Key: HBASE-21511
> URL: https://issues.apache.org/jira/browse/HBASE-21511
> Project: HBase
>  Issue Type: Improvement
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Minor
> Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 1.4.9, 2.1.2, 2.0.4
>
> Attachments: 21511.branch-1.v3.txt, 21511.v1.txt, 21511.v2.txt, 
> 21511.v3.txt





[jira] [Assigned] (HBASE-21511) Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles

2018-11-25 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu reassigned HBASE-21511:
--

Assignee: Ted Yu

> Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles
> ---
>
> Key: HBASE-21511
> URL: https://issues.apache.org/jira/browse/HBASE-21511
> Project: HBase
>  Issue Type: Improvement
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Minor
> Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 1.4.9, 2.1.2, 2.0.4
>
> Attachments: 21511.v1.txt, 21511.v2.txt, 21511.v3.txt





[jira] [Updated] (HBASE-21511) Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles

2018-11-24 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21511:
---
Attachment: 21511.v3.txt

> Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles
> ---
>
> Key: HBASE-21511
> URL: https://issues.apache.org/jira/browse/HBASE-21511
> Project: HBase
>  Issue Type: Improvement
>Reporter: Ted Yu
>Priority: Minor
> Attachments: 21511.v1.txt, 21511.v2.txt, 21511.v3.txt





[jira] [Updated] (HBASE-21511) Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles

2018-11-24 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21511:
---
Attachment: 21511.v2.txt

> Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles
> ---
>
> Key: HBASE-21511
> URL: https://issues.apache.org/jira/browse/HBASE-21511
> Project: HBase
>  Issue Type: Improvement
>Reporter: Ted Yu
>Priority: Minor
> Attachments: 21511.v1.txt, 21511.v2.txt





[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-24 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21387:
---
Attachment: 21511.v2.txt

> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>  Components: snapshots
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.10
>
> Attachments: 0001-UT.patch, 21387-suggest.txt, 21387.addendum.txt, 
> 21387.dbg.txt, 21387.v10.txt, 21387.v11.txt, 21387.v12.txt, 21387.v2.txt, 
> 21387.v3.txt, 21387.v6.txt, 21387.v7.txt, 21387.v8.txt, 21387.v9.txt, 
> 21511.v2.txt, HBASE-21387.branch-1.2.patch, HBASE-21387.branch-1.3.patch, 
> HBASE-21387.branch-1.patch, HBASE-21387.v13.patch, HBASE-21387.v14.patch, 
> HBASE-21387.v15.patch, HBASE-21387.v16.patch, HBASE-21387.v17.patch, 
> two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, two-pass-cleaner.v9.txt
>
>
> A recent customer report described an ExportSnapshot failure:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in the log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is a race condition in the handling of in-progress snapshot(s) 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in-progress snapshot(s).
> Suppose that when the RefreshCacheTask runs refreshCache, there is an 
> in-progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date, so the cleaner proceeds to check in-progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) being deemed unreferenced.
> Here is a timeline given by Josh illustrating the scenario:
> At time T0, we are checking if F1 is referenced. At time T1, there is a 
> snapshot S1 in progress that is referencing a file F1. refreshCache() is 
> called, but no completed snapshot references F1. At T2, the snapshot S1, 
> which references F1, completes. At T3, we check in-progress snapshots and S1 
> is not included. Thus, F1 is marked as unreferenced even though S1 references 
> it.
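
One way to close the window in the timeline above is to make snapshot taking and the cleaner's check mutually exclusive, with the cleaner skipping a round when a snapshot is running. The following is a minimal sketch of that idea and not the committed patch: the class, its methods, and the use of a single-permit semaphore are all assumptions, and file names are plain strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Semaphore;

// Hypothetical sketch, not the HBASE-21387 patch: one permit is shared by
// snapshot taking and the cleaner so the two never overlap.
final class SnapshotCleanerGuard {
  private final Semaphore permit = new Semaphore(1);

  // Called when a snapshot starts; blocks while the cleaner is running.
  void beginSnapshot() {
    permit.acquireUninterruptibly();
  }

  // Called once the snapshot has completed.
  void endSnapshot() {
    permit.release();
  }

  // Returns the candidates not referenced by any completed snapshot, or an
  // empty list when a snapshot is in progress (the cleaner skips this round).
  List<String> getUnreferencedFiles(Collection<String> candidates,
                                    Set<String> referenced) {
    if (!permit.tryAcquire()) {
      return new ArrayList<>();
    }
    try {
      List<String> unreferenced = new ArrayList<>();
      for (String file : candidates) {
        if (!referenced.contains(file)) {
          unreferenced.add(file);
        }
      }
      return unreferenced;
    } finally {
      permit.release();
    }
  }

  // Replays the timeline: F1 is only referenced by S1, which is in progress.
  static String demo() {
    SnapshotCleanerGuard guard = new SnapshotCleanerGuard();
    List<String> candidates = Arrays.asList("F1", "F2");
    guard.beginSnapshot(); // S1 starts
    List<String> during =
        guard.getUnreferencedFiles(candidates, Collections.<String>emptySet());
    guard.endSnapshot(); // S1 completes and now counts as a reference to F1
    List<String> after =
        guard.getUnreferencedFiles(candidates, Collections.singleton("F1"));
    return during + " " + after;
  }
}
```

While S1 is running the cleaner reports nothing as unreferenced, so F1 cannot be deleted in the gap between the cache refresh and the in-progress check; once S1 has completed, only F2 is reported.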





[jira] [Updated] (HBASE-21511) Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles

2018-11-24 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21511:
---
Attachment: 21511.v1.txt

> Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles
> ---
>
> Key: HBASE-21511
> URL: https://issues.apache.org/jira/browse/HBASE-21511
> Project: HBase
>  Issue Type: Improvement
>Reporter: Ted Yu
>Priority: Minor
> Attachments: 21511.v1.txt





[jira] [Updated] (HBASE-21511) Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles

2018-11-24 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21511:
---
Attachment: (was: 21511.v1.txt)

> Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles
> ---
>
> Key: HBASE-21511
> URL: https://issues.apache.org/jira/browse/HBASE-21511
> Project: HBase
>  Issue Type: Improvement
>Reporter: Ted Yu
>Priority: Minor
> Attachments: 21511.v1.txt





[jira] [Created] (HBASE-21511) Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles

2018-11-24 Thread Ted Yu (JIRA)
Ted Yu created HBASE-21511:
--

 Summary: Remove in progress snapshot check in 
SnapshotFileCache#getUnreferencedFiles
 Key: HBASE-21511
 URL: https://issues.apache.org/jira/browse/HBASE-21511
 Project: HBase
  Issue Type: Improvement
Reporter: Ted Yu
 Attachments: 21511.v1.txt

During review of HBASE-21387, [~Apache9] mentioned that the check for 
in-progress snapshots in SnapshotFileCache#getUnreferencedFiles is no longer 
needed now that the snapshot hfile cleaner and snapshot taking are mutually 
exclusive.

This issue is to address the review comment by removing the check for 
in-progress snapshots.





[jira] [Updated] (HBASE-21511) Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles

2018-11-24 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21511:
---
Status: Patch Available  (was: Open)

> Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles
> ---
>
> Key: HBASE-21511
> URL: https://issues.apache.org/jira/browse/HBASE-21511
> Project: HBase
>  Issue Type: Improvement
>Reporter: Ted Yu
>Priority: Minor
> Attachments: 21511.v1.txt





[jira] [Updated] (HBASE-21511) Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles

2018-11-24 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21511:
---
Attachment: 21511.v1.txt

> Remove in progress snapshot check in SnapshotFileCache#getUnreferencedFiles
> ---
>
> Key: HBASE-21511
> URL: https://issues.apache.org/jira/browse/HBASE-21511
> Project: HBase
>  Issue Type: Improvement
>Reporter: Ted Yu
>Priority: Minor
> Attachments: 21511.v1.txt





[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-23 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697623#comment-16697623
 ] 

Ted Yu commented on HBASE-21387:


Ran the failed test locally with the addendum, and it passed.

> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.10
>
> Attachments: 0001-UT.patch, 21387-suggest.txt, 21387.addendum.txt, 
> 21387.dbg.txt, 21387.v10.txt, 21387.v11.txt, 21387.v12.txt, 21387.v2.txt, 
> 21387.v3.txt, 21387.v6.txt, 21387.v7.txt, 21387.v8.txt, 21387.v9.txt, 
> HBASE-21387.branch-1.2.patch, HBASE-21387.branch-1.3.patch, 
> HBASE-21387.branch-1.patch, HBASE-21387.v13.patch, HBASE-21387.v14.patch, 
> HBASE-21387.v15.patch, HBASE-21387.v16.patch, HBASE-21387.v17.patch, 
> two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, two-pass-cleaner.v9.txt





[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-23 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21387:
---
Status: Patch Available  (was: Reopened)

> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.10
>
> Attachments: 0001-UT.patch, 21387-suggest.txt, 21387.addendum.txt, 
> 21387.dbg.txt, 21387.v10.txt, 21387.v11.txt, 21387.v12.txt, 21387.v2.txt, 
> 21387.v3.txt, 21387.v6.txt, 21387.v7.txt, 21387.v8.txt, 21387.v9.txt, 
> HBASE-21387.branch-1.2.patch, HBASE-21387.branch-1.3.patch, 
> HBASE-21387.branch-1.patch, HBASE-21387.v13.patch, HBASE-21387.v14.patch, 
> HBASE-21387.v15.patch, HBASE-21387.v16.patch, HBASE-21387.v17.patch, 
> two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, two-pass-cleaner.v9.txt





[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-23 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697520#comment-16697520
 ] 

Ted Yu commented on HBASE-21387:


TestSnapshotFileCache fails across branches.

https://builds.apache.org/job/HBase-Flaky-Tests/job/master/1987/testReport/junit/org.apache.hadoop.hbase.master.snapshot/TestSnapshotFileCache/

One condition in SnapshotFileCache was incorrect, resulting in the following 
being logged repeatedly.
{code}
  LOG.warn("Not checking unreferenced files since snapshot is running, it will "
  + "skip to clean the HFiles this time");
{code}
With the addendum, the test passes.

> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.10
>
> Attachments: 0001-UT.patch, 21387-suggest.txt, 21387.addendum.txt, 
> 21387.dbg.txt, 21387.v10.txt, 21387.v11.txt, 21387.v12.txt, 21387.v2.txt, 
> 21387.v3.txt, 21387.v6.txt, 21387.v7.txt, 21387.v8.txt, 21387.v9.txt, 
> HBASE-21387.branch-1.2.patch, HBASE-21387.branch-1.3.patch, 
> HBASE-21387.branch-1.patch, HBASE-21387.v13.patch, HBASE-21387.v14.patch, 
> HBASE-21387.v15.patch, HBASE-21387.v16.patch, HBASE-21387.v17.patch, 
> two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, two-pass-cleaner.v9.txt
>
>
> During recent report from customer where ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is a race condition in the handling of in-progress 
> snapshot(s) between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in-progress snapshot(s).
> Suppose that when RefreshCacheTask runs refreshCache, there is an 
> in-progress snapshot that is about to finish.
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date, so the cleaner proceeds to check in-progress 
> snapshot(s). However, the snapshot has completed by that time, so some 
> file(s) are deemed unreferenced.
> Here is a timeline given by Josh illustrating the scenario:
> At time T0, we are checking if F1 is referenced. At time T1, there is a 
> snapshot S1 in progress that references a file F1. refreshCache() is 
> called, but no completed snapshot references F1. At T2, the snapshot S1, 
> which references F1, completes. At T3, we check in-progress snapshots and S1 
> is not included. Thus, F1 is marked as unreferenced even though S1 references 
> it. 
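
The timeline above can be replayed deterministically. The following is a
minimal, single-threaded sketch and not HBase code: the class, its fields and
its methods are all hypothetical stand-ins for the snapshot cache, and only
the names S1, F1 and refreshCache come from the description.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical replay of the T1..T3 timeline; not the HBase implementation.
class SnapshotRaceReplay {
  // Files referenced by completed snapshots, as last seen by refreshCache().
  static final Set<String> cache = new HashSet<>();
  // In-progress snapshots and the files they reference.
  static final Map<String, Set<String>> inProgress = new HashMap<>();
  // Completed snapshots and the files they reference.
  static final Map<String, Set<String>> completed = new HashMap<>();

  // Rebuilds the cache from completed snapshots only.
  static void refreshCache() {
    cache.clear();
    for (Set<String> files : completed.values()) {
      cache.addAll(files);
    }
  }

  // Moves a snapshot from the in-progress set to the completed set.
  static void completeSnapshot(String name) {
    completed.put(name, inProgress.remove(name));
  }

  static boolean referencedByInProgress(String file) {
    for (Set<String> files : inProgress.values()) {
      if (files.contains(file)) {
        return true;
      }
    }
    return false;
  }

  // Returns true when F1 ends up deemed unreferenced.
  static boolean replay() {
    cache.clear();
    inProgress.clear();
    completed.clear();
    inProgress.put("S1", Collections.singleton("F1")); // T1: S1 is in progress
    refreshCache();                                    // T1: cache misses F1
    boolean inCache = cache.contains("F1");
    completeSnapshot("S1");                            // T2: S1 completes
    boolean inProg = referencedByInProgress("F1");     // T3: S1 is no longer listed
    return !inCache && !inProg;
  }

  public static void main(String[] args) {
    System.out.println("F1 deemed unreferenced: " + replay());
  }
}
```

Neither check sees F1 because the cache was built before S1 completed and the
in-progress check ran after it completed.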



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-23 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21387:
---
Attachment: 21387.addendum.txt

> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.10
>
> Attachments: 0001-UT.patch, 21387-suggest.txt, 21387.addendum.txt, 
> 21387.dbg.txt, 21387.v10.txt, 21387.v11.txt, 21387.v12.txt, 21387.v2.txt, 
> 21387.v3.txt, 21387.v6.txt, 21387.v7.txt, 21387.v8.txt, 21387.v9.txt, 
> HBASE-21387.branch-1.2.patch, HBASE-21387.branch-1.3.patch, 
> HBASE-21387.branch-1.patch, HBASE-21387.v13.patch, HBASE-21387.v14.patch, 
> HBASE-21387.v15.patch, HBASE-21387.v16.patch, HBASE-21387.v17.patch, 
> two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, two-pass-cleaner.v9.txt
>
>


[jira] [Reopened] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-23 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu reopened HBASE-21387:


> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.10
>
> Attachments: 0001-UT.patch, 21387-suggest.txt, 21387.dbg.txt, 
> 21387.v10.txt, 21387.v11.txt, 21387.v12.txt, 21387.v2.txt, 21387.v3.txt, 
> 21387.v6.txt, 21387.v7.txt, 21387.v8.txt, 21387.v9.txt, 
> HBASE-21387.branch-1.2.patch, HBASE-21387.branch-1.3.patch, 
> HBASE-21387.branch-1.patch, HBASE-21387.v13.patch, HBASE-21387.v14.patch, 
> HBASE-21387.v15.patch, HBASE-21387.v16.patch, HBASE-21387.v17.patch, 
> two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, two-pass-cleaner.v9.txt
>
>


[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-22 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21387:
---
Hadoop Flags: Reviewed
Release Note: To prevent the race condition between an in-progress snapshot 
(performed by TakeSnapshotHandler) and HFileCleaner, which results in data 
loss, this JIRA introduces mutual exclusion between taking a snapshot and 
running HFileCleaner. That is, at any given moment, either a snapshot is being 
taken or HFileCleaner is checking for unreferenced hfiles, but not both.
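
The mutual exclusion in the release note can be sketched as follows. This is
only an illustration under stated assumptions, not the committed patch: the
class and method names are hypothetical, and the choice of a read/write lock
(snapshots share the read lock, the cleaner tries the write lock and skips the
run if it cannot get it) is an assumption made for the example.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of snapshot/cleaner mutual exclusion; not HBase code.
class SnapshotCleanerExclusion {
  // Assumption: snapshots hold the read lock, the cleaner needs the write lock.
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  // Snapshot side: hold the read lock for the whole snapshot.
  void takeSnapshot(Runnable work) {
    lock.readLock().lock();
    try {
      work.run();
    } finally {
      lock.readLock().unlock();
    }
  }

  // Cleaner side: returns false, without running, if a snapshot is in progress.
  boolean cleanIfIdle(Runnable work) {
    if (!lock.writeLock().tryLock()) {
      return false;
    }
    try {
      work.run();
      return true;
    } finally {
      lock.writeLock().unlock();
    }
  }

  // Returns {cleaner ran during the snapshot, cleaner ran after the snapshot}.
  static boolean[] demo() {
    SnapshotCleanerExclusion ex = new SnapshotCleanerExclusion();
    boolean[] ran = new boolean[2];
    ex.takeSnapshot(() -> {
      // Run the cleaner on another thread while the snapshot holds the lock.
      Thread cleaner = new Thread(() -> ran[0] = ex.cleanIfIdle(() -> { }));
      cleaner.start();
      try {
        cleaner.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    ran[1] = ex.cleanIfIdle(() -> { });
    return ran;
  }

  public static void main(String[] args) {
    boolean[] ran = demo();
    System.out.println("during snapshot: " + ran[0] + ", after snapshot: " + ran[1]);
  }
}
```

Skipping the cleaner run instead of blocking keeps the cleaner thread from
stalling behind a long snapshot; the files are simply checked on a later run.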

> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.10
>
> Attachments: 0001-UT.patch, 21387-suggest.txt, 21387.dbg.txt, 
> 21387.v10.txt, 21387.v11.txt, 21387.v12.txt, 21387.v2.txt, 21387.v3.txt, 
> 21387.v6.txt, 21387.v7.txt, 21387.v8.txt, 21387.v9.txt, 
> HBASE-21387.branch-1.2.patch, HBASE-21387.branch-1.3.patch, 
> HBASE-21387.branch-1.patch, HBASE-21387.v13.patch, HBASE-21387.v14.patch, 
> HBASE-21387.v15.patch, HBASE-21387.v16.patch, HBASE-21387.v17.patch, 
> two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, two-pass-cleaner.v9.txt
>
>


[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-22 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16696227#comment-16696227
 ] 

Ted Yu commented on HBASE-21387:


Looks like catching FileNotFoundException is not enough to pass the new test.

Let's go with v17.
{code}
+LOG.debug("toDeleteFiles[{}] is: " + deletableFiles.get(i));
{code}
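Minor: it looks like you intended to pass both the index and the FileStatus as 
arguments. The call above has a {} placeholder, but the FileStatus is 
concatenated into the format string, so nothing fills the placeholder.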
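
To illustrate the placeholder remark, here is a small sketch. It assumes
SLF4J-style {} placeholders; render() is a hypothetical stand-in for the
logger's formatting so that the example runs without a logging library, and
the index 3 and the name "hfile-a" are made-up values.

```java
// Hypothetical demo of {} placeholder filling; render() is not a logging API.
class PlaceholderDemo {
  // Fills each "{}" in the pattern with the next argument, left to right.
  static String render(String pattern, Object... args) {
    StringBuilder out = new StringBuilder();
    int from = 0;
    int argIdx = 0;
    while (true) {
      int at = pattern.indexOf("{}", from);
      if (at < 0 || argIdx >= args.length) {
        out.append(pattern.substring(from)); // no placeholder or no argument left
        break;
      }
      out.append(pattern, from, at).append(args[argIdx++]);
      from = at + 2;
    }
    return out.toString();
  }

  public static void main(String[] args) {
    int i = 3;
    String file = "hfile-a";
    // As in the patch: concatenation, so the placeholder stays unfilled.
    System.out.println(render("toDeleteFiles[{}] is: " + file));
    // Suggested shape: one placeholder per argument, e.g.
    // LOG.debug("toDeleteFiles[{}] is: {}", i, deletableFiles.get(i));
    System.out.println(render("toDeleteFiles[{}] is: {}", i, file));
  }
}
```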

> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.10
>
> Attachments: 0001-UT.patch, 21387-suggest.txt, 21387.dbg.txt, 
> 21387.v10.txt, 21387.v11.txt, 21387.v12.txt, 21387.v2.txt, 21387.v3.txt, 
> 21387.v6.txt, 21387.v7.txt, 21387.v8.txt, 21387.v9.txt, 
> HBASE-21387.branch-1.2.patch, HBASE-21387.branch-1.3.patch, 
> HBASE-21387.branch-1.patch, HBASE-21387.v13.patch, HBASE-21387.v14.patch, 
> HBASE-21387.v15.patch, HBASE-21387.v16.patch, HBASE-21387.v17.patch, 
> two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, two-pass-cleaner.v9.txt
>
>


[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-22 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21387:
---
Attachment: 21387-suggest.txt

> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Fix For: 3.0.0, 1.5.0, 1.3.3, 2.2.0, 2.0.3, 1.4.9, 2.1.2, 1.2.10
>
> Attachments: 0001-UT.patch, 21387-suggest.txt, 21387.dbg.txt, 
> 21387.v10.txt, 21387.v11.txt, 21387.v12.txt, 21387.v2.txt, 21387.v3.txt, 
> 21387.v6.txt, 21387.v7.txt, 21387.v8.txt, 21387.v9.txt, 
> HBASE-21387.branch-1.2.patch, HBASE-21387.branch-1.3.patch, 
> HBASE-21387.branch-1.patch, HBASE-21387.v13.patch, HBASE-21387.v14.patch, 
> HBASE-21387.v15.patch, HBASE-21387.v16.patch, HBASE-21387.v17.patch, 
> two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, two-pass-cleaner.v9.txt
>
>


[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-20 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693142#comment-16693142
 ] 

Ted Yu commented on HBASE-21387:


I was aware of the above JIRA.

Thanks for the unit test, Zheng.



> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Attachments: 0001-UT.patch, 21387.dbg.txt, 21387.v10.txt, 
> 21387.v11.txt, 21387.v12.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, 
> 21387.v7.txt, 21387.v8.txt, 21387.v9.txt, HBASE-21387.v13.patch, 
> HBASE-21387.v14.patch, two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, 
> two-pass-cleaner.v9.txt
>
>


[jira] [Commented] (HBASE-21478) Make table sorted when displaying rsgroup info in shell and master web UI

2018-11-18 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16690975#comment-16690975
 ] 

Ted Yu commented on HBASE-21478:


bq. what about adding another private member as a copy of tables into 
RSGroupInfo

Have you considered the memory consumption of the extra SortedSet?
You can try this approach.
Please name the private field so that it is obvious it is for display only.

> Make table sorted when displaying rsgroup info in shell and master web UI
> -
>
> Key: HBASE-21478
> URL: https://issues.apache.org/jira/browse/HBASE-21478
> Project: HBase
>  Issue Type: Improvement
>  Components: rsgroup
>Reporter: Xiang Li
>Assignee: Xiang Li
>Priority: Minor
>
> In the output of the "get_rsgroup" command in hbase shell, and in the 
> "Server Group" section of HMaster's web UI, the tables are not sorted, so 
> they are not easy to read. For example:
> {code}
> hbase(main):003:0> get_rsgroup 'default'
> GROUP INFORMATION
> ...
> Tables:
> table3
> ns2:table22
> table1
> ns1:table11
> ...
> {code}
> They could be sorted by namespace, then by table name:
> {code}
> table1
> table3
> ns1:table11
> ns2:table22
> {code}
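
The ordering requested above can be sketched with a two-level comparator. This
is a minimal illustration and not the RSGroupInfo change itself: it works on
plain strings, the class and helper names are hypothetical, and it assumes
that names without a "ns:" prefix should sort ahead of namespaced tables.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: sort table names by namespace, then by table name.
class TableNameSort {
  // Namespace part of "ns:table"; names without ':' map to "" so they sort first.
  static String namespaceOf(String name) {
    int i = name.indexOf(':');
    return i < 0 ? "" : name.substring(0, i);
  }

  // Table part of "ns:table", or the whole name when there is no namespace.
  static String qualifierOf(String name) {
    return name.substring(name.indexOf(':') + 1);
  }

  // Returns a sorted copy and leaves the input untouched.
  static List<String> sorted(List<String> tables) {
    Comparator<String> byNamespace = Comparator.comparing(TableNameSort::namespaceOf);
    List<String> copy = new ArrayList<>(tables);
    copy.sort(byNamespace.thenComparing(TableNameSort::qualifierOf));
    return copy;
  }

  public static void main(String[] args) {
    List<String> tables = Arrays.asList("table3", "ns2:table22", "table1", "ns1:table11");
    for (String t : sorted(tables)) {
      System.out.println(t);
    }
  }
}
```

Sorting a copy at display time avoids keeping a second sorted collection in
memory, at the cost of sorting on every call.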





[jira] [Updated] (HBASE-21141) Enable MOB in backup / restore test involving incremental backup

2018-11-16 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21141:
---
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 3.0.0
       Status: Resolved  (was: Patch Available)

Thanks for the patch, Artem.

> Enable MOB in backup / restore test involving incremental backup
> 
>
> Key: HBASE-21141
> URL: https://issues.apache.org/jira/browse/HBASE-21141
> Project: HBase
>  Issue Type: Test
>  Components: backuprestore
>Reporter: Ted Yu
>Assignee: Artem Ervits
>Priority: Major
>  Labels: mob
> Fix For: 3.0.0
>
> Attachments: HBASE-21141.v01.patch, HBASE-21141.v02.patch, 
> HBASE-21141.v03.patch, HBASE-21141.v04.patch
>
>
> Currently we have only one test (TestRemoteBackup) where the MOB feature is 
> enabled. The test only performs a full backup.
> This issue is to enable MOB in backup / restore test(s) involving incremental 
> backup.





[jira] [Commented] (HBASE-21482) TestHRegion fails due to 'Too many open files'

2018-11-16 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689903#comment-16689903
 ] 

Ted Yu commented on HBASE-21482:


Toward the end of 
hbase-server/target/surefire-reports/org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt

branch-2:
{code}
2018-11-15 19:05:42,036 INFO  [Time-limited test] hbase.ResourceChecker(172): 
after: regionserver.TestHRegion#testCheckAndDelete_ThatDeleteWasWritten 
Thread=85 (was 85), OpenFileDescriptor=1276 (was 1273) - OpenFileDescriptor 
LEAK? -, MaxFileDescriptor=32000 (was 32000), SystemLoadAverage=149 (was 149), 
ProcessCount=361 (was 361), AvailableMemoryMB=36487 (was 36488)
{code}
master branch:
{code}
2018-11-16 19:06:59,290 INFO  [Time-limited test] hbase.ResourceChecker(172): 
after: regionserver.TestHRegion#testCheckAndDelete_ThatDeleteWasWritten 
Thread=79 (was 78) - Thread LEAK? -, OpenFileDescriptor=31932 (was 31934), 
MaxFileDescriptor=32000 (was 32000), SystemLoadAverage=82 (was 82), 
ProcessCount=363 (was 363), AvailableMemoryMB=36785 (was 36784) - 
AvailableMemoryMB LEAK? -
2018-11-16 19:06:59,290 WARN  [Time-limited test] hbase.ResourceChecker(135): 
OpenFileDescriptor=31932 is superior to 1024
{code}

> TestHRegion fails due to 'Too many open files'
> --
>
> Key: HBASE-21482
> URL: https://issues.apache.org/jira/browse/HBASE-21482
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Priority: Major
> Attachments: 
> org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt, 
> org.apache.hadoop.hbase.regionserver.TestHRegion.txt
>
>
> TestHRegion fails due to 'Too many open files' in the master branch.
> Here is one failed subtest:
> {code}
> testCheckAndDelete_ThatDeleteWasWritten(org.apache.hadoop.hbase.regionserver.TestHRegion)
>   Time elapsed: 2.373 sec  <<< ERROR!
> java.lang.IllegalStateException: failed to create a child event loop
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4853)
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4844)
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4835)
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testCheckAndDelete_ThatDeleteWasWritten(TestHRegion.java:2034)
> Caused by: org.apache.hbase.thirdparty.io.netty.channel.ChannelException: 
> failed to open a new selector
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4853)
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4844)
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4835)
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testCheckAndDelete_ThatDeleteWasWritten(TestHRegion.java:2034)
> Caused by: java.io.IOException: Too many open files
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4853)
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4844)
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4835)
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testCheckAndDelete_ThatDeleteWasWritten(TestHRegion.java:2034)
> {code}





[jira] [Commented] (HBASE-21141) Enable MOB in backup / restore test involving incremental backup

2018-11-16 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689834#comment-16689834
 ] 

Ted Yu commented on HBASE-21141:


You can shrink the following into a one-line comment:
{code}
+//although split fail, this may not affect following check
+//In old split without AM2, if region's best split key is not found,
+//there are not exception thrown. But in current API, exception
+//will be thrown.
{code}
That would be 3 fewer lines.

> Enable MOB in backup / restore test involving incremental backup
> 
>
> Key: HBASE-21141
> URL: https://issues.apache.org/jira/browse/HBASE-21141
> Project: HBase
>  Issue Type: Test
>  Components: backuprestore
>Reporter: Ted Yu
>Assignee: Artem Ervits
>Priority: Major
>  Labels: mob
> Attachments: HBASE-21141.v01.patch, HBASE-21141.v02.patch, 
> HBASE-21141.v03.patch
>
>


[jira] [Commented] (HBASE-21141) Enable MOB in backup / restore test involving incremental backup

2018-11-16 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689825#comment-16689825
 ] 

Ted Yu commented on HBASE-21141:


Regarding the long method body, can you reduce the method length to <= 150 
lines by:

* dropping some debug logs
* removing some empty lines

Thanks

> Enable MOB in backup / restore test involving incremental backup
> 
>
> Key: HBASE-21141
> URL: https://issues.apache.org/jira/browse/HBASE-21141
> Project: HBase
>  Issue Type: Test
>  Components: backuprestore
>Reporter: Ted Yu
>Assignee: Artem Ervits
>Priority: Major
>  Labels: mob
> Attachments: HBASE-21141.v01.patch, HBASE-21141.v02.patch, 
> HBASE-21141.v03.patch
>
>


[jira] [Commented] (HBASE-21246) Introduce WALIdentity interface

2018-11-16 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689816#comment-16689816
 ] 

Ted Yu commented on HBASE-21246:


Patch v43 is mostly a formatting change on top of v41.

> Introduce WALIdentity interface
> ---
>
> Key: HBASE-21246
> URL: https://issues.apache.org/jira/browse/HBASE-21246
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Fix For: HBASE-20952
>
> Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, 
> 21246.23.txt, 21246.24.txt, 21246.25.txt, 21246.26.txt, 21246.34.txt, 
> 21246.37.txt, 21246.39.txt, 21246.41.txt, 21246.43.txt, 
> 21246.HBASE-20952.001.patch, 21246.HBASE-20952.002.patch, 
> 21246.HBASE-20952.004.patch, 21246.HBASE-20952.005.patch, 
> 21246.HBASE-20952.007.patch, 21246.HBASE-20952.008.patch, 
> replication-src-creates-wal-reader.jpg, wal-factory-providers.png, 
> wal-providers.png, wal-splitter-reader.jpg, wal-splitter-writer.jpg
>
>
> We are introducing the WALIdentity interface so that the WAL representation 
> can be decoupled from the distributed filesystem.
> The interface provides a getName method whose return value can represent 
> the filename in a distributed filesystem environment, or the name of the 
> stream when the WAL is backed by a log stream.
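
The decoupling described above might look like the sketch below. Only the
interface name and getName() come from the description; the two
implementations, the demo class and the sample path and stream name are
hypothetical and are not taken from the patch.

```java
// Hypothetical sketch of the WALIdentity idea; not the HBASE-21246 patch.
class WALIdentityDemo {
  // Identifies a WAL without assuming it lives in a filesystem.
  interface WALIdentity {
    String getName();
  }

  // Filesystem-backed WAL: the name is the last component of the path.
  static final class FileBackedWALIdentity implements WALIdentity {
    private final String path;

    FileBackedWALIdentity(String path) {
      this.path = path;
    }

    @Override
    public String getName() {
      return path.substring(path.lastIndexOf('/') + 1);
    }
  }

  // Log-stream-backed WAL: the name is the stream name itself.
  static final class StreamBackedWALIdentity implements WALIdentity {
    private final String streamName;

    StreamBackedWALIdentity(String streamName) {
      this.streamName = streamName;
    }

    @Override
    public String getName() {
      return streamName;
    }
  }

  // Callers depend only on the interface, not on where the WAL is stored.
  static String describe(WALIdentity id) {
    return "WAL " + id.getName();
  }

  public static void main(String[] args) {
    System.out.println(describe(new FileBackedWALIdentity("/example/WALs/rs1/wal.123")));
    System.out.println(describe(new StreamBackedWALIdentity("rs1-wal-stream")));
  }
}
```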





[jira] [Updated] (HBASE-21246) Introduce WALIdentity interface

2018-11-16 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21246:
---
Attachment: 21246.43.txt

> Introduce WALIdentity interface
> ---
>
> Key: HBASE-21246
> URL: https://issues.apache.org/jira/browse/HBASE-21246
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Fix For: HBASE-20952
>
> Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, 
> 21246.23.txt, 21246.24.txt, 21246.25.txt, 21246.26.txt, 21246.34.txt, 
> 21246.37.txt, 21246.39.txt, 21246.41.txt, 21246.43.txt, 
> 21246.HBASE-20952.001.patch, 21246.HBASE-20952.002.patch, 
> 21246.HBASE-20952.004.patch, 21246.HBASE-20952.005.patch, 
> 21246.HBASE-20952.007.patch, 21246.HBASE-20952.008.patch, 
> replication-src-creates-wal-reader.jpg, wal-factory-providers.png, 
> wal-providers.png, wal-splitter-reader.jpg, wal-splitter-writer.jpg
>
>
> We are introducing the WALIdentity interface so that the WAL representation 
> can be decoupled from the distributed filesystem.
> The interface provides a getName method whose return value can represent a 
> filename in a distributed filesystem environment or the name of the stream 
> when the WAL is backed by a log stream.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21141) Enable MOB in backup / restore test involving incremental backup

2018-11-16 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689701#comment-16689701
 ] 

Ted Yu commented on HBASE-21141:


We're close.
{code}
+  // #3 - incremental backup for multiple tables
{code}
#3 is repeated. Do you mind re-numbering the steps so that they are easier to 
follow?

Please leave a blank line prior to each step for readability.
{code}
+  LOG.debug("mob has " + TEST_UTIL.countRows(hTable, mobName) + " rows");
+  Assert.assertEquals(TEST_UTIL.countRows(hTable, mobName), NB_ROWS_MOB);
{code}
countRows(hTable, mobName) is called twice: once for LOG and once for the 
assertion. Can you store the count in a variable so that counting is done 
only once?

Same applies to countRows(hTable, famName) and countRows(hTable, fam2Name).
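
A minimal sketch of the count-once pattern, assuming nothing about the real test beyond the two quoted lines: countRows below is a hypothetical stand-in that only counts array elements and records how often it is invoked; it is not the HBase test utility. As a side note, JUnit's Assert.assertEquals takes the expected value first, so the stored count would go second.

```java
// Hypothetical stand-in for the test code; countRows is NOT the HBase utility.
class CountOnceSketch {
  // Number of times the "expensive" count was performed.
  static int scans = 0;

  // Pretends to scan a table; every call costs one scan.
  static int countRows(int[] table) {
    scans++;
    return table.length;
  }

  // Counts once, then reuses the value for both the log line and the check.
  static String logAndVerify(int[] table, int expectedRows) {
    int rows = countRows(table);
    String logLine = "mob has " + rows + " rows";
    if (rows != expectedRows) {
      throw new AssertionError("expected " + expectedRows + " but was " + rows);
    }
    return logLine;
  }
}
```

The same shape would apply to the famName and fam2Name counts: one local variable per count, used by both the LOG call and the assertion.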


> Enable MOB in backup / restore test involving incremental backup
> 
>
> Key: HBASE-21141
> URL: https://issues.apache.org/jira/browse/HBASE-21141
> Project: HBase
>  Issue Type: Test
>  Components: backuprestore
>Reporter: Ted Yu
>Assignee: Artem Ervits
>Priority: Major
>  Labels: mob
> Attachments: HBASE-21141.v01.patch, HBASE-21141.v02.patch
>
>
> Currently we only have one test (TestRemoteBackup) where MOB feature is 
> enabled. The test only performs full backup.
> This issue is to enable MOB in backup / restore test(s) involving incremental 
> backup.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-16 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689656#comment-16689656
 ] 

Ted Yu commented on HBASE-21387:


Haven't got around to adding new unit test (without introducing extra 
synchronization primitive in snapshot classes).

Zheng:
If you have bandwidth, you can give it a try.

Thanks

> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Attachments: 21387.dbg.txt, 21387.v10.txt, 21387.v11.txt, 
> 21387.v12.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, 21387.v7.txt, 
> 21387.v8.txt, 21387.v9.txt, two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, 
> two-pass-cleaner.v9.txt
>
>
> A recent customer report described an ExportSnapshot failure:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in the log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is a race condition in the handling of in-progress snapshot(s) 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in-progress snapshot(s).
> Suppose that when RefreshCacheTask runs refreshCache, some snapshot is in 
> progress (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date, so the cleaner proceeds to check in-progress 
> snapshot(s). However, the snapshot has completed by that time, so 
> some file(s) are deemed unreferenced.
> Here is the timeline given by Josh illustrating the scenario:
> At time T0, we are checking if F1 is referenced. At time T1, there is a 
> snapshot S1 in progress that is referencing a file F1. refreshCache() is 
> called, but no completed snapshot references F1. At T2, the snapshot S1, 
> which references F1, completes. At T3, we check in-progress snapshots and S1 
> is not included. Thus, F1 is marked as unreferenced even though S1 references 
> it. 
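
The T1..T3 ordering in the timeline above can be replayed deterministically in a few lines. This is only a model of the scenario, not the SnapshotFileCache code: the class, its fields and its method bodies are hypothetical, and file references are reduced to plain strings.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical single-threaded model of the race; not the HBase implementation.
class SnapshotRaceSketch {
  // Files referenced by snapshots the cache believes are completed.
  final Set<String> cachedCompletedRefs = new HashSet<>();
  // Ground truth on the filesystem.
  final Set<String> inProgressRefs = new HashSet<>();
  final Set<String> completedRefs = new HashSet<>();

  // T1: refresh the cache from completed snapshots only; in-progress ones are skipped.
  void refreshCache() {
    cachedCompletedRefs.clear();
    cachedCompletedRefs.addAll(completedRefs);
  }

  // T2: the snapshot finishes, so its reference moves from in-progress to completed.
  void completeSnapshot(String file) {
    inProgressRefs.remove(file);
    completedRefs.add(file);
  }

  // T3: the cleaner trusts the (now stale) cache, then checks in-progress snapshots.
  boolean isUnreferenced(String file) {
    return !cachedCompletedRefs.contains(file) && !inProgressRefs.contains(file);
  }

  // Replays T1..T3 and reports whether F1 is wrongly deemed unreferenced.
  static boolean replayTimeline() {
    SnapshotRaceSketch s = new SnapshotRaceSketch();
    s.inProgressRefs.add("F1");    // S1 is in progress and references F1
    s.refreshCache();              // T1
    s.completeSnapshot("F1");      // T2
    return s.isUnreferenced("F1"); // T3
  }
}
```

In this model, F1 falls through both checks because the completed set was read before T2 and the in-progress set after it; calling refreshCache() again after T2 would make the file visible.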



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-15 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688926#comment-16688926
 ] 

Ted Yu commented on HBASE-21387:


Looking at Zheng's suggestion for new unit test,

bq. another thread to invoke deleteFiles 
=SnapshotHFileCleaner#getDeletableFiles;

Since the in progress snapshot is really long, 
getUnreferencedFiles(Iterable, SnapshotManager) may detect the in 
progress snapshot and miss the race condition described in the description.

Also, I have never seen unit test creating 10K hfiles.

> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Attachments: 21387.dbg.txt, 21387.v10.txt, 21387.v11.txt, 
> 21387.v12.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, 21387.v7.txt, 
> 21387.v8.txt, 21387.v9.txt, two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, 
> two-pass-cleaner.v9.txt
>
>
> A recent customer report described an ExportSnapshot failure:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in the log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is a race condition in the handling of in-progress snapshot(s) 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in-progress snapshot(s).
> Suppose that when RefreshCacheTask runs refreshCache, some snapshot is in 
> progress (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date, so the cleaner proceeds to check in-progress 
> snapshot(s). However, the snapshot has completed by that time, so 
> some file(s) are deemed unreferenced.
> Here is the timeline given by Josh illustrating the scenario:
> At time T0, we are checking if F1 is referenced. At time T1, there is a 
> snapshot S1 in progress that is referencing a file F1. refreshCache() is 
> called, but no completed snapshot references F1. At T2, the snapshot S1, 
> which references F1, completes. At T3, we check in-progress snapshots and S1 
> is not included. Thus, F1 is marked as unreferenced even though S1 references 
> it. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21141) Enable MOB in backup / restore test involving incremental backup

2018-11-15 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688648#comment-16688648
 ] 

Ted Yu commented on HBASE-21141:


{code}
+mobHcd.setMobThreshold(0L);
{code}
Please increase the threshold.

Please add assertion on the restored table(s).

> Enable MOB in backup / restore test involving incremental backup
> 
>
> Key: HBASE-21141
> URL: https://issues.apache.org/jira/browse/HBASE-21141
> Project: HBase
>  Issue Type: Test
>  Components: backuprestore
>Reporter: Ted Yu
>Assignee: Artem Ervits
>Priority: Major
>  Labels: mob
> Attachments: HBASE-21141.v01.patch
>
>
> Currently we only have one test (TestRemoteBackup) where MOB feature is 
> enabled. The test only performs full backup.
> This issue is to enable MOB in backup / restore test(s) involving incremental 
> backup.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21482) TestHRegion fails due to 'Too many open files'

2018-11-15 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688550#comment-16688550
 ] 

Ted Yu commented on HBASE-21482:


Didn't reproduce the test failure in branch-2.

Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 
2018-06-17T18:33:14Z)
Maven home: /apache-maven-3.5.4
Java version: 1.8.0_161, vendor: Oracle Corporation, runtime: /jdk1.8.0_161/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "3.10.0-327.28.3.el7.x86_64", arch: "amd64", family: 
"unix"

> TestHRegion fails due to 'Too many open files'
> --
>
> Key: HBASE-21482
> URL: https://issues.apache.org/jira/browse/HBASE-21482
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Priority: Major
> Attachments: 
> org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt, 
> org.apache.hadoop.hbase.regionserver.TestHRegion.txt
>
>
> TestHRegion fails due to 'Too many open files' in master branch.
> Here is one failed subtest :
> {code}
> testCheckAndDelete_ThatDeleteWasWritten(org.apache.hadoop.hbase.regionserver.TestHRegion)
>   Time elapsed: 2.373 sec  <<< ERROR!
> java.lang.IllegalStateException: failed to create a child event loop
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4853)
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4844)
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4835)
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testCheckAndDelete_ThatDeleteWasWritten(TestHRegion.java:2034)
> Caused by: org.apache.hbase.thirdparty.io.netty.channel.ChannelException: 
> failed to open a new selector
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4853)
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4844)
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4835)
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testCheckAndDelete_ThatDeleteWasWritten(TestHRegion.java:2034)
> Caused by: java.io.IOException: Too many open files
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4853)
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4844)
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4835)
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testCheckAndDelete_ThatDeleteWasWritten(TestHRegion.java:2034)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21482) TestHRegion fails due to 'Too many open files'

2018-11-15 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21482:
---
Attachment: org.apache.hadoop.hbase.regionserver.TestHRegion.txt

> TestHRegion fails due to 'Too many open files'
> --
>
> Key: HBASE-21482
> URL: https://issues.apache.org/jira/browse/HBASE-21482
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Priority: Major
> Attachments: 
> org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt, 
> org.apache.hadoop.hbase.regionserver.TestHRegion.txt
>
>
> TestHRegion fails due to 'Too many open files' in master branch.
> Here is one failed subtest :
> {code}
> testCheckAndDelete_ThatDeleteWasWritten(org.apache.hadoop.hbase.regionserver.TestHRegion)
>   Time elapsed: 2.373 sec  <<< ERROR!
> java.lang.IllegalStateException: failed to create a child event loop
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4853)
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4844)
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4835)
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testCheckAndDelete_ThatDeleteWasWritten(TestHRegion.java:2034)
> Caused by: org.apache.hbase.thirdparty.io.netty.channel.ChannelException: 
> failed to open a new selector
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4853)
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4844)
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4835)
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testCheckAndDelete_ThatDeleteWasWritten(TestHRegion.java:2034)
> Caused by: java.io.IOException: Too many open files
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4853)
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4844)
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4835)
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testCheckAndDelete_ThatDeleteWasWritten(TestHRegion.java:2034)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21482) TestHRegion fails due to 'Too many open files'

2018-11-15 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21482:
---
Attachment: org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt

> TestHRegion fails due to 'Too many open files'
> --
>
> Key: HBASE-21482
> URL: https://issues.apache.org/jira/browse/HBASE-21482
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Priority: Major
> Attachments: 
> org.apache.hadoop.hbase.regionserver.TestHRegion-output.txt, 
> org.apache.hadoop.hbase.regionserver.TestHRegion.txt
>
>
> TestHRegion fails due to 'Too many open files' in master branch.
> Here is one failed subtest :
> {code}
> testCheckAndDelete_ThatDeleteWasWritten(org.apache.hadoop.hbase.regionserver.TestHRegion)
>   Time elapsed: 2.373 sec  <<< ERROR!
> java.lang.IllegalStateException: failed to create a child event loop
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4853)
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4844)
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4835)
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testCheckAndDelete_ThatDeleteWasWritten(TestHRegion.java:2034)
> Caused by: org.apache.hbase.thirdparty.io.netty.channel.ChannelException: 
> failed to open a new selector
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4853)
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4844)
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4835)
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testCheckAndDelete_ThatDeleteWasWritten(TestHRegion.java:2034)
> Caused by: java.io.IOException: Too many open files
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4853)
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4844)
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4835)
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegion.testCheckAndDelete_ThatDeleteWasWritten(TestHRegion.java:2034)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21482) TestHRegion fails due to 'Too many open files'

2018-11-15 Thread Ted Yu (JIRA)
Ted Yu created HBASE-21482:
--

 Summary: TestHRegion fails due to 'Too many open files'
 Key: HBASE-21482
 URL: https://issues.apache.org/jira/browse/HBASE-21482
 Project: HBase
  Issue Type: Bug
Reporter: Ted Yu


TestHRegion fails due to 'Too many open files' in master branch.
Here is one failed subtest :
{code}
testCheckAndDelete_ThatDeleteWasWritten(org.apache.hadoop.hbase.regionserver.TestHRegion)
  Time elapsed: 2.373 sec  <<< ERROR!
java.lang.IllegalStateException: failed to create a child event loop
at 
org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4853)
at 
org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4844)
at 
org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4835)
at 
org.apache.hadoop.hbase.regionserver.TestHRegion.testCheckAndDelete_ThatDeleteWasWritten(TestHRegion.java:2034)
Caused by: org.apache.hbase.thirdparty.io.netty.channel.ChannelException: 
failed to open a new selector
at 
org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4853)
at 
org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4844)
at 
org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4835)
at 
org.apache.hadoop.hbase.regionserver.TestHRegion.testCheckAndDelete_ThatDeleteWasWritten(TestHRegion.java:2034)
Caused by: java.io.IOException: Too many open files
at 
org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4853)
at 
org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4844)
at 
org.apache.hadoop.hbase.regionserver.TestHRegion.initHRegion(TestHRegion.java:4835)
at 
org.apache.hadoop.hbase.regionserver.TestHRegion.testCheckAndDelete_ThatDeleteWasWritten(TestHRegion.java:2034)
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21246) Introduce WALIdentity interface

2018-11-15 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688506#comment-16688506
 ] 

Ted Yu commented on HBASE-21246:


Patch v41 fixes TestDLSAsyncFSWAL .

TestDLSAsyncFSWAL#countWAL was creating WALProvider in a loop which led to high 
resource consumption.
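
For readers unfamiliar with this kind of leak, the general shape of the fix is to create the expensive object once, outside the loop, and close it when done. The sketch below is an assumption about the pattern only: FakeProvider is a hypothetical stand-in, not the HBase WALProvider, and the actual change in patch v41 may look different.

```java
import java.util.List;

// Hypothetical illustration of hoisting resource creation out of a loop.
class ProviderReuseSketch {
  // Stand-in for an expensive provider; NOT the HBase WALProvider.
  static final class FakeProvider implements AutoCloseable {
    static int created = 0; // total instances ever created
    static int open = 0;    // instances not yet closed

    FakeProvider() {
      created++;
      open++;
    }

    boolean matches(String name) {
      return name.endsWith(".wal");
    }

    @Override
    public void close() {
      open--;
    }
  }

  // Creates one provider up front, reuses it for every entry, and closes it at the end.
  static int countWal(List<String> names) {
    int count = 0;
    try (FakeProvider provider = new FakeProvider()) {
      for (String name : names) {
        if (provider.matches(name)) {
          count++;
        }
      }
    }
    return count;
  }
}
```

Creating a new provider inside the for loop instead would allocate one instance per entry, which is the kind of growth that can exhaust file descriptors or threads in a long-running test.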

> Introduce WALIdentity interface
> ---
>
> Key: HBASE-21246
> URL: https://issues.apache.org/jira/browse/HBASE-21246
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Fix For: HBASE-20952
>
> Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, 
> 21246.23.txt, 21246.24.txt, 21246.25.txt, 21246.26.txt, 21246.34.txt, 
> 21246.37.txt, 21246.39.txt, 21246.41.txt, 21246.HBASE-20952.001.patch, 
> 21246.HBASE-20952.002.patch, 21246.HBASE-20952.004.patch, 
> 21246.HBASE-20952.005.patch, 21246.HBASE-20952.007.patch, 
> 21246.HBASE-20952.008.patch, replication-src-creates-wal-reader.jpg, 
> wal-factory-providers.png, wal-providers.png, wal-splitter-reader.jpg, 
> wal-splitter-writer.jpg
>
>
> We are introducing the WALIdentity interface so that the WAL representation 
> can be decoupled from the distributed filesystem.
> The interface provides a getName method whose return value can represent a 
> filename in a distributed filesystem environment or the name of the stream 
> when the WAL is backed by a log stream.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21246) Introduce WALIdentity interface

2018-11-15 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21246:
---
Attachment: 21246.41.txt

> Introduce WALIdentity interface
> ---
>
> Key: HBASE-21246
> URL: https://issues.apache.org/jira/browse/HBASE-21246
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Fix For: HBASE-20952
>
> Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, 
> 21246.23.txt, 21246.24.txt, 21246.25.txt, 21246.26.txt, 21246.34.txt, 
> 21246.37.txt, 21246.39.txt, 21246.41.txt, 21246.HBASE-20952.001.patch, 
> 21246.HBASE-20952.002.patch, 21246.HBASE-20952.004.patch, 
> 21246.HBASE-20952.005.patch, 21246.HBASE-20952.007.patch, 
> 21246.HBASE-20952.008.patch, replication-src-creates-wal-reader.jpg, 
> wal-factory-providers.png, wal-providers.png, wal-splitter-reader.jpg, 
> wal-splitter-writer.jpg
>
>
> We are introducing the WALIdentity interface so that the WAL representation 
> can be decoupled from the distributed filesystem.
> The interface provides a getName method whose return value can represent a 
> filename in a distributed filesystem environment or the name of the stream 
> when the WAL is backed by a log stream.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21479) TestHRegionReplayEvents#testSkippingEditsWithSmallerSeqIdAfterRegionOpenEvent fails with IndexOutOfBoundsException

2018-11-15 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688285#comment-16688285
 ] 

Ted Yu commented on HBASE-21479:


The single test used to pass.
e.g.
243e6cc5293dc1e2a4dfd3af4ee29087c84184c8

> TestHRegionReplayEvents#testSkippingEditsWithSmallerSeqIdAfterRegionOpenEvent 
> fails with IndexOutOfBoundsException
> --
>
> Key: HBASE-21479
> URL: https://issues.apache.org/jira/browse/HBASE-21479
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Priority: Major
> Attachments: testHRegionReplayEvents-output.txt
>
>
> The test fails in both master branch and branch-2 :
> {code}
> testSkippingEditsWithSmallerSeqIdAfterRegionOpenEvent(org.apache.hadoop.hbase.regionserver.TestHRegionReplayEvents)
>   Time elapsed: 3.74 sec  <<< ERROR!
> java.lang.IndexOutOfBoundsException: Index: 2, Size: 1
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegionReplayEvents.testSkippingEditsWithSmallerSeqIdAfterRegionOpenEvent(TestHRegionReplayEvents.java:1042)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21460) correct Document Configurable Bucket Sizes in bucketCache

2018-11-15 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21460:
---
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 3.0.0
   Status: Resolved  (was: Patch Available)

Thanks for the patch, Yechao.

> correct Document Configurable Bucket Sizes in bucketCache
> -
>
> Key: HBASE-21460
> URL: https://issues.apache.org/jira/browse/HBASE-21460
> Project: HBase
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Yechao Chen
>Assignee: Yechao Chen
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: HBASE-21460.v1.patch, HBASE-21460.v2.patch
>
>
> We use the bucket cache (offheap) and found an error in the doc:
> the bucket sizes property should be "hbase.bucketcache.bucket.sizes" instead 
> of "hfile.block.cache.sizes".
> CacheConfig.java
>  /**
>  * A comma-delimited array of values for use as bucket sizes.
>  */
>  public static final String BUCKET_CACHE_BUCKETS_KEY = 
> "hbase.bucketcache.bucket.sizes";
> the doc was:
> 
>  
>  HBASE-10641 introduced the ability to configure multiple sizes for the 
> buckets of the BucketCache, in HBase 0.98 and newer. To configurable multiple 
> bucket sizes, configure the new property {{hfile.block.cache.sizes}} (instead 
> of {{hfile.block.cache.size}}) to a comma-separated list of block sizes, 
> ordered from smallest to largest, with no spaces. The goal is to optimize the 
> bucket sizes based on your data access patterns. The following example 
> configures buckets of size 4096 and 8192.
>   hfile.block.cache.sizes 
> 4096,8192 
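
The format rules quoted from the doc (comma-separated, smallest to largest, no spaces) can be checked with a few lines of code. The sketch below is illustrative only: parseBucketSizes is a hypothetical helper, and it is not claimed that HBase validates the hbase.bucketcache.bucket.sizes value this way.

```java
// Hypothetical validator for a value such as "4096,8192"; not HBase code.
class BucketSizesSketch {
  // Parses a comma-separated list and checks that it is strictly ascending.
  static int[] parseBucketSizes(String value) {
    String[] parts = value.split(",");
    int[] sizes = new int[parts.length];
    for (int i = 0; i < parts.length; i++) {
      // No trim(): spaces are not allowed, so " 8192" fails with NumberFormatException.
      sizes[i] = Integer.parseInt(parts[i]);
      if (i > 0 && sizes[i] <= sizes[i - 1]) {
        throw new IllegalArgumentException("bucket sizes must be ascending: " + value);
      }
    }
    return sizes;
  }
}
```

With this helper, "4096,8192" parses to two sizes, while "8192,4096" and "4096, 8192" are both rejected.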



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21479) TestHRegionReplayEvents#testSkippingEditsWithSmallerSeqIdAfterRegionOpenEvent fails with IndexOutOfBoundsException

2018-11-15 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688224#comment-16688224
 ] 

Ted Yu commented on HBASE-21479:


{code}
Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 
2018-06-17T18:33:14Z)
Maven home: /apache-maven-3.5.4
Java version: 1.8.0_161, vendor: Oracle Corporation, runtime: /jdk1.8.0_161/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "3.10.0-327.28.3.el7.x86_64", arch: "amd64", family: 
"unix"
{code}
Here is the command which I used to produce the failure:

mvn clean test 
-Dtest=TestHRegionReplayEvents#testSkippingEditsWithSmallerSeqIdAfterRegionOpenEvent

> TestHRegionReplayEvents#testSkippingEditsWithSmallerSeqIdAfterRegionOpenEvent 
> fails with IndexOutOfBoundsException
> --
>
> Key: HBASE-21479
> URL: https://issues.apache.org/jira/browse/HBASE-21479
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Priority: Major
> Attachments: testHRegionReplayEvents-output.txt
>
>
> The test fails in both master branch and branch-2 :
> {code}
> testSkippingEditsWithSmallerSeqIdAfterRegionOpenEvent(org.apache.hadoop.hbase.regionserver.TestHRegionReplayEvents)
>   Time elapsed: 3.74 sec  <<< ERROR!
> java.lang.IndexOutOfBoundsException: Index: 2, Size: 1
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegionReplayEvents.testSkippingEditsWithSmallerSeqIdAfterRegionOpenEvent(TestHRegionReplayEvents.java:1042)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21479) TestHRegionReplayEvents#testSkippingEditsWithSmallerSeqIdAfterRegionOpenEvent fails with IndexOutOfBoundsException

2018-11-14 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687158#comment-16687158
 ] 

Ted Yu commented on HBASE-21479:


Progressively stepping back.
At :
9012a0b123b3eea8b08c8687cef812e83e9b491d

Still failing the same way.

> TestHRegionReplayEvents#testSkippingEditsWithSmallerSeqIdAfterRegionOpenEvent 
> fails with IndexOutOfBoundsException
> --
>
> Key: HBASE-21479
> URL: https://issues.apache.org/jira/browse/HBASE-21479
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Priority: Major
> Attachments: testHRegionReplayEvents-output.txt
>
>
> The test fails in both master branch and branch-2 :
> {code}
> testSkippingEditsWithSmallerSeqIdAfterRegionOpenEvent(org.apache.hadoop.hbase.regionserver.TestHRegionReplayEvents)
>   Time elapsed: 3.74 sec  <<< ERROR!
> java.lang.IndexOutOfBoundsException: Index: 2, Size: 1
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegionReplayEvents.testSkippingEditsWithSmallerSeqIdAfterRegionOpenEvent(TestHRegionReplayEvents.java:1042)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21457) BackupUtils#getWALFilesOlderThan refers to wrong FileSystem

2018-11-14 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687051#comment-16687051
 ] 

Ted Yu commented on HBASE-21457:


See my comment above: HBASE-21466 cleared the way.

> BackupUtils#getWALFilesOlderThan refers to wrong FileSystem
> ---
>
> Key: HBASE-21457
> URL: https://issues.apache.org/jira/browse/HBASE-21457
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Janos Gub
>Assignee: Ted Yu
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: 21457.v1.txt, 21457.v2.txt, 21457.v3.txt, 21457.v3.txt, 
> 21457.v4.txt
>
>
> Janos reported seeing backup test failure when testing a local HDFS for WALs 
> while using WASB/ADLS only for store files.
> Janos spotted the code in BackupUtils#getWALFilesOlderThan which uses HBase 
> root dir for retrieving WAL files.
> We should use the helper methods from CommonFSUtils.
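
The underlying problem can be sketched without any Hadoop classes. The example assumes, as is my understanding of HBase configuration, that WALs may be placed under a separate hbase.wal.dir and otherwise fall back to hbase.rootdir; the helper names and the URIs used with them are hypothetical, and this is not the CommonFSUtils code.

```java
import java.net.URI;
import java.util.Map;

// Hypothetical sketch of resolving the WAL location separately from the root dir.
class WalDirSketch {
  // Returns the WAL root: the dedicated setting if present, else the HBase root dir.
  static String walRootDir(Map<String, String> conf) {
    String walDir = conf.get("hbase.wal.dir");
    return (walDir == null || walDir.isEmpty()) ? conf.get("hbase.rootdir") : walDir;
  }

  // Scheme of the filesystem that actually holds the WALs.
  static String walFileSystemScheme(Map<String, String> conf) {
    return URI.create(walRootDir(conf)).getScheme();
  }
}
```

With store files on a wasb:// root and WALs on hdfs://, code that lists WAL files through the root directory's filesystem looks in the wrong place, which matches the failure described above.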



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21479) TestHRegionReplayEvents#testSkippingEditsWithSmallerSeqIdAfterRegionOpenEvent fails with IndexOutOfBoundsException

2018-11-14 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21479:
---
Attachment: testHRegionReplayEvents-output.txt

> TestHRegionReplayEvents#testSkippingEditsWithSmallerSeqIdAfterRegionOpenEvent 
> fails with IndexOutOfBoundsException
> --
>
> Key: HBASE-21479
> URL: https://issues.apache.org/jira/browse/HBASE-21479
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Priority: Major
> Attachments: testHRegionReplayEvents-output.txt
>
>
> The test fails in both master branch and branch-2 :
> {code}
> testSkippingEditsWithSmallerSeqIdAfterRegionOpenEvent(org.apache.hadoop.hbase.regionserver.TestHRegionReplayEvents)
>   Time elapsed: 3.74 sec  <<< ERROR!
> java.lang.IndexOutOfBoundsException: Index: 2, Size: 1
>   at 
> org.apache.hadoop.hbase.regionserver.TestHRegionReplayEvents.testSkippingEditsWithSmallerSeqIdAfterRegionOpenEvent(TestHRegionReplayEvents.java:1042)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (HBASE-21479) TestHRegionReplayEvents#testSkippingEditsWithSmallerSeqIdAfterRegionOpenEvent fails with IndexOutOfBoundsException

2018-11-14 Thread Ted Yu (JIRA)
Ted Yu created HBASE-21479:
--

 Summary: 
TestHRegionReplayEvents#testSkippingEditsWithSmallerSeqIdAfterRegionOpenEvent 
fails with IndexOutOfBoundsException
 Key: HBASE-21479
 URL: https://issues.apache.org/jira/browse/HBASE-21479
 Project: HBase
  Issue Type: Bug
Reporter: Ted Yu


The test fails in both master branch and branch-2 :
{code}
testSkippingEditsWithSmallerSeqIdAfterRegionOpenEvent(org.apache.hadoop.hbase.regionserver.TestHRegionReplayEvents)
  Time elapsed: 3.74 sec  <<< ERROR!
java.lang.IndexOutOfBoundsException: Index: 2, Size: 1
at 
org.apache.hadoop.hbase.regionserver.TestHRegionReplayEvents.testSkippingEditsWithSmallerSeqIdAfterRegionOpenEvent(TestHRegionReplayEvents.java:1042)
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21460) correct Document Configurable Bucket Sizes in bucketCache

2018-11-14 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686827#comment-16686827
 ] 

Ted Yu commented on HBASE-21460:


bq. (instead of `hbase.bucketcache.bucket.size`)

I don't see the above property referenced in other parts of the online reference.
I think you can remove the above snippet; referencing the actual config name 
should be good enough.

> correct Document Configurable Bucket Sizes in bucketCache
> -
>
> Key: HBASE-21460
> URL: https://issues.apache.org/jira/browse/HBASE-21460
> Project: HBase
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Yechao Chen
>Assignee: Yechao Chen
>Priority: Major
> Attachments: HBASE-21460.v1.patch
>
>
> We use the bucket cache (offheap) and found an error in the doc:
> the bucket sizes property should be "hbase.bucketcache.bucket.sizes" instead 
> of "hfile.block.cache.sizes".
> From CacheConfig.java:
>  /**
>  * A comma-delimited array of values for use as bucket sizes.
>  */
>  public static final String BUCKET_CACHE_BUCKETS_KEY = 
> "hbase.bucketcache.bucket.sizes";
> The doc said:
> 
> HBASE-10641 introduced the ability to configure multiple sizes for the 
> buckets of the BucketCache, in HBase 0.98 and newer. To configure multiple 
> bucket sizes, configure the new property {{hfile.block.cache.sizes}} (instead 
> of {{hfile.block.cache.size}}) to a comma-separated list of block sizes, 
> ordered from smallest to largest, with no spaces. The goal is to optimize the 
> bucket sizes based on your data access patterns. The following example 
> configures buckets of size 4096 and 8192.
> <property>
>   <name>hfile.block.cache.sizes</name>
>   <value>4096,8192</value>
> </property>
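
For illustration, here is a minimal sketch of how a comma-separated bucket size list could be read and checked against the wording of the doc (no spaces, smallest to largest). The property name is the one quoted from CacheConfig.java above; the class name, the parseBucketSizes helper, and the use of java.util.Properties as a stand-in for the HBase Configuration are assumptions made for this example, and HBase's own validation may differ.

```java
import java.util.Arrays;
import java.util.Properties;

final class BucketSizesSketch {
    // Property name quoted from CacheConfig.java in the issue description.
    static final String BUCKET_CACHE_BUCKETS_KEY = "hbase.bucketcache.bucket.sizes";

    // Hypothetical helper: parses a comma-separated list with no spaces and
    // checks that the sizes are positive and ordered from smallest to largest.
    static int[] parseBucketSizes(String value) {
        if (value == null || value.isEmpty()) {
            throw new IllegalArgumentException(BUCKET_CACHE_BUCKETS_KEY + " is not set");
        }
        String[] parts = value.split(",");
        int[] sizes = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            // Integer.parseInt does not trim, so embedded spaces are rejected here.
            sizes[i] = Integer.parseInt(parts[i]);
            if (sizes[i] <= 0 || (i > 0 && sizes[i] <= sizes[i - 1])) {
                throw new IllegalArgumentException(
                    "bucket sizes must be positive and ascending: " + value);
            }
        }
        return sizes;
    }

    public static void main(String[] args) {
        // Stand-in for the HBase Configuration object (an assumption of this sketch).
        Properties conf = new Properties();
        conf.setProperty(BUCKET_CACHE_BUCKETS_KEY, "4096,8192");
        int[] sizes = parseBucketSizes(conf.getProperty(BUCKET_CACHE_BUCKETS_KEY));
        System.out.println(Arrays.toString(sizes));
    }
}
```

The point of the sketch is only that the value is looked up under hbase.bucketcache.bucket.sizes, not under hfile.block.cache.sizes.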





[jira] [Commented] (HBASE-21478) Make table sorted when displaying rsgroup info in shell and master web UI

2018-11-14 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686798#comment-16686798
 ] 

Ted Yu commented on HBASE-21478:


RSGroupInfo#getTables() references a SortedSet.
Do you plan to create another Set which is sorted lexicographically?

> Make table sorted when displaying rsgroup info in shell and master web UI
> -
>
> Key: HBASE-21478
> URL: https://issues.apache.org/jira/browse/HBASE-21478
> Project: HBase
>  Issue Type: Improvement
>  Components: rsgroup
>Reporter: Xiang Li
>Assignee: Xiang Li
>Priority: Major
>
> Regarding the output of the "get_rsgroup" command in the hbase shell, or the 
> "Server Group" section of the HMaster web UI, the tables are not sorted, 
> which makes them hard to read, for example:
> {code}
> hbase(main):003:0> get_rsgroup 'default'
> GROUP INFORMATION
> ...
> Tables:
> table3
> ns2:table22
> table1
> ns1:table11
> ...
> {code}
> They could be sorted in the order of namespace then table name:
> {code}
> table1
> table3
> ns1:table11
> ns2:table22
> {code}
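
The ordering proposed above (namespace first, then table name, with tables that have no namespace prefix listed first) can be sketched with a plain comparator over the printed names. This is only an illustration on strings; the class and helper names are hypothetical, and it does not use RSGroupInfo or TableName from HBase.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class RSGroupTableSortSketch {
    // Namespace part of "ns:table"; empty string when there is no prefix,
    // so unprefixed tables sort ahead of every named namespace.
    static String namespaceOf(String name) {
        int idx = name.indexOf(':');
        return idx < 0 ? "" : name.substring(0, idx);
    }

    // Table part of "ns:table", or the whole name when there is no prefix.
    static String qualifierOf(String name) {
        int idx = name.indexOf(':');
        return idx < 0 ? name : name.substring(idx + 1);
    }

    static final Comparator<String> BY_NAMESPACE_THEN_TABLE =
        Comparator.comparing(RSGroupTableSortSketch::namespaceOf)
                  .thenComparing(RSGroupTableSortSketch::qualifierOf);

    // Returns a sorted copy and leaves the input list untouched.
    static List<String> sorted(List<String> tables) {
        List<String> copy = new ArrayList<>(tables);
        copy.sort(BY_NAMESPACE_THEN_TABLE);
        return copy;
    }

    public static void main(String[] args) {
        List<String> tables = Arrays.asList("table3", "ns2:table22", "table1", "ns1:table11");
        sorted(tables).forEach(System.out::println);
    }
}
```

With the four table names from the shell output above, the sorted copy comes out as table1, table3, ns1:table11, ns2:table22.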





[jira] [Commented] (HBASE-21246) Introduce WALIdentity interface

2018-11-14 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686778#comment-16686778
 ] 

Ted Yu commented on HBASE-21246:


There are fewer test failures with patch v39.

Need to handle failure in TestDLSAsyncFSWAL by reducing resource consumption.

> Introduce WALIdentity interface
> ---
>
> Key: HBASE-21246
> URL: https://issues.apache.org/jira/browse/HBASE-21246
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Fix For: HBASE-20952
>
> Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, 
> 21246.23.txt, 21246.24.txt, 21246.25.txt, 21246.26.txt, 21246.34.txt, 
> 21246.37.txt, 21246.39.txt, 21246.HBASE-20952.001.patch, 
> 21246.HBASE-20952.002.patch, 21246.HBASE-20952.004.patch, 
> 21246.HBASE-20952.005.patch, 21246.HBASE-20952.007.patch, 
> 21246.HBASE-20952.008.patch, replication-src-creates-wal-reader.jpg, 
> wal-factory-providers.png, wal-providers.png, wal-splitter-reader.jpg, 
> wal-splitter-writer.jpg
>
>
> We are introducing the WALIdentity interface so that the WAL representation 
> can be decoupled from the distributed filesystem.
> The interface provides a getName method whose return value can represent the 
> filename in a distributed filesystem environment or the name of the stream 
> when the WAL is backed by a log stream.
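
As a rough sketch of the idea in the description, the following shows one interface with a getName method and two implementations, one backed by a file name and one by a stream name. Only getName comes from the description; the nested type names, constructors, and sample names are hypothetical, and the interface in the attached patches may carry more methods.

```java
final class WALIdentitySketch {
    // Sketch of the abstraction: callers see only a name, not a filesystem path.
    interface WALIdentity {
        String getName();
    }

    // Hypothetical implementation for a WAL stored as a file.
    static final class FileWALIdentity implements WALIdentity {
        private final String fileName;

        FileWALIdentity(String fileName) {
            this.fileName = fileName;
        }

        @Override
        public String getName() {
            return fileName;
        }
    }

    // Hypothetical implementation for a WAL backed by a log stream.
    static final class StreamWALIdentity implements WALIdentity {
        private final String streamName;

        StreamWALIdentity(String streamName) {
            this.streamName = streamName;
        }

        @Override
        public String getName() {
            return streamName;
        }
    }

    // Code written against WALIdentity does not need to know which backing is used.
    static String describe(WALIdentity id) {
        return "WAL[" + id.getName() + "]";
    }

    public static void main(String[] args) {
        // Sample names are made up for the example.
        System.out.println(describe(new FileWALIdentity("wal.0001")));
        System.out.println(describe(new StreamWALIdentity("log-stream-42")));
    }
}
```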





[jira] [Updated] (HBASE-21246) Introduce WALIdentity interface

2018-11-14 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21246:
---
Attachment: 21246.39.txt

> Introduce WALIdentity interface
> ---
>
> Key: HBASE-21246
> URL: https://issues.apache.org/jira/browse/HBASE-21246
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Fix For: HBASE-20952
>
> Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, 
> 21246.23.txt, 21246.24.txt, 21246.25.txt, 21246.26.txt, 21246.34.txt, 
> 21246.37.txt, 21246.39.txt, 21246.HBASE-20952.001.patch, 
> 21246.HBASE-20952.002.patch, 21246.HBASE-20952.004.patch, 
> 21246.HBASE-20952.005.patch, 21246.HBASE-20952.007.patch, 
> 21246.HBASE-20952.008.patch, replication-src-creates-wal-reader.jpg, 
> wal-factory-providers.png, wal-providers.png, wal-splitter-reader.jpg, 
> wal-splitter-writer.jpg
>
>
> We are introducing the WALIdentity interface so that the WAL representation 
> can be decoupled from the distributed filesystem.
> The interface provides a getName method whose return value can represent the 
> filename in a distributed filesystem environment or the name of the stream 
> when the WAL is backed by a log stream.





[jira] [Commented] (HBASE-21246) Introduce WALIdentity interface

2018-11-13 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16685515#comment-16685515
 ] 

Ted Yu commented on HBASE-21246:


Patch v37 fixes infinite call in RegionGroupingProvider#createWALIdentity

> Introduce WALIdentity interface
> ---
>
> Key: HBASE-21246
> URL: https://issues.apache.org/jira/browse/HBASE-21246
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Fix For: HBASE-20952
>
> Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, 
> 21246.23.txt, 21246.24.txt, 21246.25.txt, 21246.26.txt, 21246.34.txt, 
> 21246.37.txt, 21246.HBASE-20952.001.patch, 21246.HBASE-20952.002.patch, 
> 21246.HBASE-20952.004.patch, 21246.HBASE-20952.005.patch, 
> 21246.HBASE-20952.007.patch, 21246.HBASE-20952.008.patch, 
> replication-src-creates-wal-reader.jpg, wal-factory-providers.png, 
> wal-providers.png, wal-splitter-reader.jpg, wal-splitter-writer.jpg
>
>
> We are introducing the WALIdentity interface so that the WAL representation 
> can be decoupled from the distributed filesystem.
> The interface provides a getName method whose return value can represent the 
> filename in a distributed filesystem environment or the name of the stream 
> when the WAL is backed by a log stream.





[jira] [Updated] (HBASE-21246) Introduce WALIdentity interface

2018-11-13 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21246:
---
Attachment: 21246.37.txt

> Introduce WALIdentity interface
> ---
>
> Key: HBASE-21246
> URL: https://issues.apache.org/jira/browse/HBASE-21246
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Fix For: HBASE-20952
>
> Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, 
> 21246.23.txt, 21246.24.txt, 21246.25.txt, 21246.26.txt, 21246.34.txt, 
> 21246.37.txt, 21246.HBASE-20952.001.patch, 21246.HBASE-20952.002.patch, 
> 21246.HBASE-20952.004.patch, 21246.HBASE-20952.005.patch, 
> 21246.HBASE-20952.007.patch, 21246.HBASE-20952.008.patch, 
> replication-src-creates-wal-reader.jpg, wal-factory-providers.png, 
> wal-providers.png, wal-splitter-reader.jpg, wal-splitter-writer.jpg
>
>
> We are introducing the WALIdentity interface so that the WAL representation 
> can be decoupled from the distributed filesystem.
> The interface provides a getName method whose return value can represent the 
> filename in a distributed filesystem environment or the name of the stream 
> when the WAL is backed by a log stream.





[jira] [Comment Edited] (HBASE-21457) BackupUtils#getWALFilesOlderThan refers to wrong FileSystem

2018-11-13 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16685384#comment-16685384
 ] 

Ted Yu edited comment on HBASE-21457 at 11/13/18 3:50 PM:
--

Thanks for the review, Stephen and Vlad.


was (Author: yuzhih...@gmail.com):
Thanks for the review, Stephen.

> BackupUtils#getWALFilesOlderThan refers to wrong FileSystem
> ---
>
> Key: HBASE-21457
> URL: https://issues.apache.org/jira/browse/HBASE-21457
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Janos Gub
>Assignee: Ted Yu
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: 21457.v1.txt, 21457.v2.txt, 21457.v3.txt, 21457.v3.txt, 
> 21457.v4.txt
>
>
> Janos reported seeing a backup test failure when testing a local HDFS for 
> WALs while using WASB/ADLS only for store files.
> Janos spotted the code in BackupUtils#getWALFilesOlderThan which uses the 
> HBase root dir for retrieving WAL files.
> We should use the helper methods from CommonFSUtils.





[jira] [Updated] (HBASE-21457) BackupUtils#getWALFilesOlderThan refers to wrong FileSystem

2018-11-13 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21457:
---
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 3.0.0
   Status: Resolved  (was: Patch Available)

Thanks for the review, Stephen.

> BackupUtils#getWALFilesOlderThan refers to wrong FileSystem
> ---
>
> Key: HBASE-21457
> URL: https://issues.apache.org/jira/browse/HBASE-21457
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Janos Gub
>Assignee: Ted Yu
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: 21457.v1.txt, 21457.v2.txt, 21457.v3.txt, 21457.v3.txt, 
> 21457.v4.txt
>
>
> Janos reported seeing a backup test failure when testing a local HDFS for 
> WALs while using WASB/ADLS only for store files.
> Janos spotted the code in BackupUtils#getWALFilesOlderThan which uses the 
> HBase root dir for retrieving WAL files.
> We should use the helper methods from CommonFSUtils.





[jira] [Comment Edited] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-12 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684625#comment-16684625
 ] 

Ted Yu edited comment on HBASE-21387 at 11/13/18 2:18 AM:
--

The two minor comments have been addressed in the latest patch.
The only remaining comment was about a new test, right?

Hopefully I can get to the test later this week.


was (Author: yuzhih...@gmail.com):
The only remaining comment was about a new test, right ?

Not sure when I can get to it this week.

> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Attachments: 21387.dbg.txt, 21387.v10.txt, 21387.v11.txt, 
> 21387.v12.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, 21387.v7.txt, 
> 21387.v8.txt, 21387.v9.txt, two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, 
> two-pass-cleaner.v9.txt
>
>
> In a recent customer report, ExportSnapshot failed with:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is a race condition in the handling of in-progress snapshot(s) 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in-progress snapshot(s).
> Suppose that when the RefreshCacheTask runs refreshCache, there is an 
> in-progress snapshot that is about to finish.
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date, so the cleaner proceeds to check in-progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) being deemed unreferenced.
> Here is the timeline given by Josh illustrating the scenario:
> At time T0, we are checking if F1 is referenced. At time T1, there is a 
> snapshot S1 in progress that is referencing a file F1. refreshCache() is 
> called, but no completed snapshot references F1. At T2, the snapshot S1, 
> which references F1, completes. At T3, we check in-progress snapshots and S1 
> is not included. Thus, F1 is marked as unreferenced even though S1 references 
> it. 
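
The timeline above can be replayed in a small single-threaded model to make the stale-cache window visible. This is a sketch under simplifying assumptions, not the SnapshotFileCache code: each snapshot references one file, the cache is a plain copy of the completed-snapshot references, and all class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

final class SnapshotCacheRaceSketch {
    // Files referenced by completed snapshots (the real state on disk).
    private final Set<String> completedRefs = new HashSet<>();
    // Files referenced by snapshots that are still in progress.
    private final Set<String> inProgressRefs = new HashSet<>();
    // Copy of completedRefs taken at the last refreshCache() call.
    private Set<String> cache = new HashSet<>();

    void startSnapshot(String file) {
        inProgressRefs.add(file);
    }

    void completeSnapshot(String file) {
        inProgressRefs.remove(file);
        completedRefs.add(file);
    }

    // Mirrors the description: only completed snapshots are loaded into the cache.
    void refreshCache() {
        cache = new HashSet<>(completedRefs);
    }

    // The racy check: possibly stale cache first, then the current in-progress set.
    boolean cleanerThinksReferenced(String file) {
        return cache.contains(file) || inProgressRefs.contains(file);
    }

    // Ground truth at the moment of the call.
    boolean actuallyReferenced(String file) {
        return completedRefs.contains(file) || inProgressRefs.contains(file);
    }

    // Replays T1..T3 and returns { cleaner's view, ground truth } for F1.
    static boolean[] replayTimeline() {
        SnapshotCacheRaceSketch s = new SnapshotCacheRaceSketch();
        s.startSnapshot("F1");    // T1: S1 is in progress and references F1
        s.refreshCache();         // T1: no completed snapshot references F1 yet
        s.completeSnapshot("F1"); // T2: S1 completes; the cache is now stale
        // T3: the cleaner checks the cache and the in-progress snapshots
        return new boolean[] { s.cleanerThinksReferenced("F1"), s.actuallyReferenced("F1") };
    }

    public static void main(String[] args) {
        boolean[] result = replayTimeline();
        System.out.println("cleaner thinks referenced: " + result[0]);
        System.out.println("actually referenced:       " + result[1]);
    }
}
```

In this model the cleaner's view and the ground truth disagree at T3, which is the window in which F1 would be removed from the archive.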





[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-12 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684625#comment-16684625
 ] 

Ted Yu commented on HBASE-21387:


The only remaining comment was about a new test, right ?

Not sure when I can get to it this week.

> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Attachments: 21387.dbg.txt, 21387.v10.txt, 21387.v11.txt, 
> 21387.v12.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, 21387.v7.txt, 
> 21387.v8.txt, 21387.v9.txt, two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, 
> two-pass-cleaner.v9.txt
>
>
> In a recent customer report, ExportSnapshot failed with:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is a race condition in the handling of in-progress snapshot(s) 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in-progress snapshot(s).
> Suppose that when the RefreshCacheTask runs refreshCache, there is an 
> in-progress snapshot that is about to finish.
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date, so the cleaner proceeds to check in-progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) being deemed unreferenced.
> Here is the timeline given by Josh illustrating the scenario:
> At time T0, we are checking if F1 is referenced. At time T1, there is a 
> snapshot S1 in progress that is referencing a file F1. refreshCache() is 
> called, but no completed snapshot references F1. At T2, the snapshot S1, 
> which references F1, completes. At T3, we check in-progress snapshots and S1 
> is not included. Thus, F1 is marked as unreferenced even though S1 references 
> it. 





[jira] [Updated] (HBASE-21457) BackupUtils#getWALFilesOlderThan refers to wrong FileSystem

2018-11-12 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21457:
---
Attachment: 21457.v4.txt

> BackupUtils#getWALFilesOlderThan refers to wrong FileSystem
> ---
>
> Key: HBASE-21457
> URL: https://issues.apache.org/jira/browse/HBASE-21457
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Janos Gub
>Assignee: Ted Yu
>Priority: Major
> Attachments: 21457.v1.txt, 21457.v2.txt, 21457.v3.txt, 21457.v3.txt, 
> 21457.v4.txt
>
>
> Janos reported seeing a backup test failure when testing a local HDFS for 
> WALs while using WASB/ADLS only for store files.
> Janos spotted the code in BackupUtils#getWALFilesOlderThan which uses the 
> HBase root dir for retrieving WAL files.
> We should use the helper methods from CommonFSUtils.





[jira] [Updated] (HBASE-21466) WALProcedureStore uses wrong FileSystem if wal.dir is not under rootdir

2018-11-12 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21466:
---
   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 2.2.0
   3.0.0
   Status: Resolved  (was: Patch Available)

> WALProcedureStore uses wrong FileSystem if wal.dir is not under rootdir
> ---
>
> Key: HBASE-21466
> URL: https://issues.apache.org/jira/browse/HBASE-21466
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Fix For: 3.0.0, 2.2.0
>
> Attachments: 21466.v2.txt, 21466.v3.txt, 21466.v3.txt
>
>
> In the WALProcedureStore ctor, the fs field is initialized this way:
> {code}
> this.fs = walDir.getFileSystem(conf);
> {code}
> However, when wal.dir is not under rootdir, the above would return the wrong 
> FileSystem.
> In the modified TestMasterProcedureEvents, without the fix, the master 
> wouldn't initialize.





[jira] [Commented] (HBASE-21466) WALProcedureStore uses wrong FileSystem if wal.dir is not under rootdir

2018-11-12 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684378#comment-16684378
 ] 

Ted Yu commented on HBASE-21466:


Without the leading slash (ahead of 'tmp'), the test would fail with:
{code}
[ERROR] 
testWalAbortOnLowReplicationWithQueuedWriters(org.apache.hadoop.hbase.master.procedure.TestWALProcedureStoreOnHDFS)
  Time elapsed: 1.4 s  <<< ERROR!
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path 
in absolute URI: hdfs://localhost:37261tmp/wal
at 
org.apache.hadoop.hbase.master.procedure.TestWALProcedureStoreOnHDFS.setupDFS(TestWALProcedureStoreOnHDFS.java:88)
{code}
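
The same "Relative path in absolute URI" condition can be reproduced with java.net.URI alone, which may help when checking a wal.dir value by hand. This is a sketch that assumes the path is assembled through the multi-argument java.net.URI constructor; the class and helper names are hypothetical, and the host and port are the ones from the error above.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class WalDirUriSketch {
    // Builds hdfs://<authority><path>. java.net.URI rejects a non-empty path
    // that does not start with '/' when a scheme is present.
    static URI build(String authority, String path) throws URISyntaxException {
        return new URI("hdfs", authority, path, null, null);
    }

    // Convenience wrapper that reports validity instead of throwing.
    static boolean isValid(String authority, String path) {
        try {
            build(authority, path);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) throws URISyntaxException {
        // With the leading slash the URI is accepted.
        System.out.println(build("localhost:37261", "/tmp/wal"));
        // Without it, authority and path run together and construction fails.
        try {
            build("localhost:37261", "tmp/wal");
        } catch (URISyntaxException e) {
            System.out.println(e.getMessage());
        }
    }
}
```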

> WALProcedureStore uses wrong FileSystem if wal.dir is not under rootdir
> ---
>
> Key: HBASE-21466
> URL: https://issues.apache.org/jira/browse/HBASE-21466
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Attachments: 21466.v2.txt, 21466.v3.txt, 21466.v3.txt
>
>
> In the WALProcedureStore ctor, the fs field is initialized this way:
> {code}
> this.fs = walDir.getFileSystem(conf);
> {code}
> However, when wal.dir is not under rootdir, the above would return the wrong 
> FileSystem.
> In the modified TestMasterProcedureEvents, without the fix, the master 
> wouldn't initialize.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (HBASE-21466) WALProcedureStore uses wrong FileSystem if wal.dir is not under rootdir

2018-11-12 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21466:
---
Attachment: 21466.v3.txt

> WALProcedureStore uses wrong FileSystem if wal.dir is not under rootdir
> ---
>
> Key: HBASE-21466
> URL: https://issues.apache.org/jira/browse/HBASE-21466
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Attachments: 21466.v2.txt, 21466.v3.txt, 21466.v3.txt
>
>
> In the WALProcedureStore ctor, the fs field is initialized this way:
> {code}
> this.fs = walDir.getFileSystem(conf);
> {code}
> However, when wal.dir is not under rootdir, the above would return the wrong 
> FileSystem.
> In the modified TestMasterProcedureEvents, without the fix, the master 
> wouldn't initialize.





[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-12 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684373#comment-16684373
 ] 

Ted Yu commented on HBASE-21387:


TestBlockEvictionFromClient failure was unrelated to patch.

> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Attachments: 21387.dbg.txt, 21387.v10.txt, 21387.v11.txt, 
> 21387.v12.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, 21387.v7.txt, 
> 21387.v8.txt, 21387.v9.txt, two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, 
> two-pass-cleaner.v9.txt
>
>
> In a recent customer report, ExportSnapshot failed with:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is a race condition in the handling of in-progress snapshot(s) 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in-progress snapshot(s).
> Suppose that when the RefreshCacheTask runs refreshCache, there is an 
> in-progress snapshot that is about to finish.
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date, so the cleaner proceeds to check in-progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) being deemed unreferenced.
> Here is the timeline given by Josh illustrating the scenario:
> At time T0, we are checking if F1 is referenced. At time T1, there is a 
> snapshot S1 in progress that is referencing a file F1. refreshCache() is 
> called, but no completed snapshot references F1. At T2, the snapshot S1, 
> which references F1, completes. At T3, we check in-progress snapshots and S1 
> is not included. Thus, F1 is marked as unreferenced even though S1 references 
> it. 





[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-12 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684197#comment-16684197
 ] 

Ted Yu commented on HBASE-21387:


TestSnapshotFileCache failed due to NPE, as pointed out by findbugs.

TestSaslFanOutOneBlockAsyncDFSOutput failure was due to port in use - not 
related to the patch.

> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Attachments: 21387.dbg.txt, 21387.v10.txt, 21387.v11.txt, 
> 21387.v12.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, 21387.v7.txt, 
> 21387.v8.txt, 21387.v9.txt, two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, 
> two-pass-cleaner.v9.txt
>
>
> In a recent customer report, ExportSnapshot failed with:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is a race condition in the handling of in-progress snapshot(s) 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in-progress snapshot(s).
> Suppose that when the RefreshCacheTask runs refreshCache, there is an 
> in-progress snapshot that is about to finish.
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date, so the cleaner proceeds to check in-progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) being deemed unreferenced.
> Here is the timeline given by Josh illustrating the scenario:
> At time T0, we are checking if F1 is referenced. At time T1, there is a 
> snapshot S1 in progress that is referencing a file F1. refreshCache() is 
> called, but no completed snapshot references F1. At T2, the snapshot S1, 
> which references F1, completes. At T3, we check in-progress snapshots and S1 
> is not included. Thus, F1 is marked as unreferenced even though S1 references 
> it. 





[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-12 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21387:
---
Attachment: 21387.v12.txt

> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Attachments: 21387.dbg.txt, 21387.v10.txt, 21387.v11.txt, 
> 21387.v12.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, 21387.v7.txt, 
> 21387.v8.txt, 21387.v9.txt, two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, 
> two-pass-cleaner.v9.txt
>
>
> In a recent customer report, ExportSnapshot failed with:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is a race condition in the handling of in-progress snapshot(s) 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in-progress snapshot(s).
> Suppose that when the RefreshCacheTask runs refreshCache, there is an 
> in-progress snapshot that is about to finish.
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date, so the cleaner proceeds to check in-progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) being deemed unreferenced.
> Here is the timeline given by Josh illustrating the scenario:
> At time T0, we are checking if F1 is referenced. At time T1, there is a 
> snapshot S1 in progress that is referencing a file F1. refreshCache() is 
> called, but no completed snapshot references F1. At T2, the snapshot S1, 
> which references F1, completes. At T3, we check in-progress snapshots and S1 
> is not included. Thus, F1 is marked as unreferenced even though S1 references 
> it. 





[jira] [Updated] (HBASE-21466) WALProcedureStore uses wrong FileSystem if wal.dir is not under rootdir

2018-11-12 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21466:
---
Attachment: 21466.v3.txt

> WALProcedureStore uses wrong FileSystem if wal.dir is not under rootdir
> ---
>
> Key: HBASE-21466
> URL: https://issues.apache.org/jira/browse/HBASE-21466
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Attachments: 21466.v2.txt, 21466.v3.txt
>
>
> In WALProcedureStore ctor , the fs field is initialized this way:
> {code}
> this.fs = walDir.getFileSystem(conf);
> {code}
> However, when wal.dir is not under rootdir, the above would return wrong 
> FileSystem.
> In the modified TestMasterProcedureEvents, without fix, the master wouldn't 
> initialize.





[jira] [Commented] (HBASE-21466) WALProcedureStore uses wrong FileSystem if wal.dir is not under rootdir

2018-11-12 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684169#comment-16684169
 ] 

Ted Yu commented on HBASE-21466:


Patch v3 addresses review comments above.

> WALProcedureStore uses wrong FileSystem if wal.dir is not under rootdir
> ---
>
> Key: HBASE-21466
> URL: https://issues.apache.org/jira/browse/HBASE-21466
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Attachments: 21466.v2.txt, 21466.v3.txt
>
>
> In WALProcedureStore ctor , the fs field is initialized this way:
> {code}
> this.fs = walDir.getFileSystem(conf);
> {code}
> However, when wal.dir is not under rootdir, the above would return wrong 
> FileSystem.
> In the modified TestMasterProcedureEvents, without fix, the master wouldn't 
> initialize.





[jira] [Updated] (HBASE-21466) WALProcedureStore uses wrong FileSystem if wal.dir is not under rootdir

2018-11-12 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21466:
---
Attachment: 21466.v2.txt

> WALProcedureStore uses wrong FileSystem if wal.dir is not under rootdir
> ---
>
> Key: HBASE-21466
> URL: https://issues.apache.org/jira/browse/HBASE-21466
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Attachments: 21466.v2.txt
>
>
> In WALProcedureStore ctor , the fs field is initialized this way:
> {code}
> this.fs = walDir.getFileSystem(conf);
> {code}
> However, when wal.dir is not under rootdir, the above would return wrong 
> FileSystem.
> In the modified TestMasterProcedureEvents, without fix, the master wouldn't 
> initialize.





[jira] [Updated] (HBASE-21466) WALProcedureStore uses wrong FileSystem if wal.dir is not under rootdir

2018-11-12 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21466:
---
Attachment: (was: 21466.v1.txt)

> WALProcedureStore uses wrong FileSystem if wal.dir is not under rootdir
> ---
>
> Key: HBASE-21466
> URL: https://issues.apache.org/jira/browse/HBASE-21466
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Attachments: 21466.v2.txt
>
>
> In WALProcedureStore ctor , the fs field is initialized this way:
> {code}
> this.fs = walDir.getFileSystem(conf);
> {code}
> However, when wal.dir is not under rootdir, the above would return wrong 
> FileSystem.
> In the modified TestMasterProcedureEvents, without fix, the master wouldn't 
> initialize.





[jira] [Commented] (HBASE-21457) BackupUtils#getWALFilesOlderThan refers to wrong FileSystem

2018-11-12 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683916#comment-16683916
 ] 

Ted Yu commented on HBASE-21457:


HBASE-21466 needs to be committed first.
Without HBASE-21466, the master wouldn't initialize when wal.dir is set to a 
directory not under rootdir.

> BackupUtils#getWALFilesOlderThan refers to wrong FileSystem
> ---
>
> Key: HBASE-21457
> URL: https://issues.apache.org/jira/browse/HBASE-21457
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Janos Gub
>Assignee: Ted Yu
>Priority: Major
> Attachments: 21457.v1.txt, 21457.v2.txt, 21457.v3.txt, 21457.v3.txt
>
>
> Janos reported seeing backup test failure when testing a local HDFS for WALs 
> while using WASB/ADLS only for store files.
> Janos spotted the code in BackupUtils#getWALFilesOlderThan which uses HBase 
> root dir for retrieving WAL files.
> We should use the helper methods from CommonFSUtils.





[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-12 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21387:
---
Attachment: 21387.v11.txt

> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Attachments: 21387.dbg.txt, 21387.v10.txt, 21387.v11.txt, 
> 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, 21387.v7.txt, 21387.v8.txt, 
> 21387.v9.txt, two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, 
> two-pass-cleaner.v9.txt
>
>
> During recent report from customer where ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is race condition surrounding in progress snapshot(s) handling 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in progress snapshot(s).
> Suppose when the RefreshCacheTask runs refreshCache, there is some in 
> progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date. So cleaner proceeds to check in progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) deemed unreferenced.
> Here is timeline given by Josh illustrating the scenario:
> At time T0, we are checking if F1 is referenced. At time T1, there is a 
> snapshot S1 in progress that is referencing a file F1. refreshCache() is 
> called, but no completed snapshot references F1. At T2, the snapshot S1, 
> which references F1, completes. At T3, we check in-progress snapshots and S1 
> is not included. Thus, F1 is marked as unreferenced even though S1 references 
> it. 





[jira] [Comment Edited] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-11 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683206#comment-16683206
 ] 

Ted Yu edited comment on HBASE-21387 at 11/12/18 5:22 AM:
--

https://reviews.apache.org/r/69316/

Adding a test may take some time. More than one countdown latch would be needed 
to control when the snapshot is moved into place. Introducing countdown latches 
solely for test purposes does not seem ideal.




was (Author: yuzhih...@gmail.com):
https://reviews.apache.org/r/69316/

Adding a test may take some time. More than one countdown latch would be needed 
to control when the snapshot is moved into place. Introducing countdown latches 
solely for test purposes does not seem ideal.

BTW I also have HBASE-21246 and HBASE-21466 going in parallel.

> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Attachments: 21387.dbg.txt, 21387.v10.txt, 21387.v2.txt, 
> 21387.v3.txt, 21387.v6.txt, 21387.v7.txt, 21387.v8.txt, 21387.v9.txt, 
> two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, two-pass-cleaner.v9.txt
>
>
> During recent report from customer where ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is race condition surrounding in progress snapshot(s) handling 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in progress snapshot(s).
> Suppose when the RefreshCacheTask runs refreshCache, there is some in 
> progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date. So cleaner proceeds to check in progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) deemed unreferenced.
> Here is timeline given by Josh illustrating the scenario:
> At time T0, we are checking if F1 is referenced. At time T1, there is a 
> snapshot S1 in progress that is referencing a file F1. refreshCache() is 
> called, but no completed snapshot references F1. At T2, the snapshot S1, 
> which references F1, completes. At T3, we check in-progress snapshots and S1 
> is not included. Thus, F1 is marked as unreferenced even though S1 references 
> it. 





[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-11 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683206#comment-16683206
 ] 

Ted Yu commented on HBASE-21387:


https://reviews.apache.org/r/69316/

Adding a test may take some time. More than one countdown latch would be needed 
to control when the snapshot is moved into place. Introducing countdown latches 
solely for test purposes does not seem ideal.
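
For reference, a rough sketch of latch-based sequencing outside of HBase. Two latches force the order "cache refreshed", "snapshot moved in place", "in-progress snapshots checked". The class name, the event strings and the two hook points are made up for illustration; a real test would need equivalent hooks in the snapshot and cleaner code, which is the intrusion mentioned above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;

public class LatchTimingSketch {
  static List<String> run() throws InterruptedException {
    List<String> events = Collections.synchronizedList(new ArrayList<>());
    // Released once the (simulated) cache refresh has finished.
    CountDownLatch cacheRefreshed = new CountDownLatch(1);
    // Released once the (simulated) snapshot has been moved into place.
    CountDownLatch snapshotMoved = new CountDownLatch(1);

    Thread snapshotter = new Thread(() -> {
      try {
        cacheRefreshed.await(); // hold the snapshot back until the refresh is done
        events.add("snapshot moved in place");
        snapshotMoved.countDown();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    snapshotter.start();

    events.add("cache refreshed");
    cacheRefreshed.countDown();
    snapshotMoved.await(); // the cleaner side proceeds only after the snapshot completed
    events.add("in-progress snapshots checked");
    snapshotter.join();
    return events;
  }

  public static void main(String[] args) throws InterruptedException {
    run().forEach(System.out::println);
  }
}
```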

BTW I also have HBASE-21246 and HBASE-21466 going in parallel.

> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Attachments: 21387.dbg.txt, 21387.v10.txt, 21387.v2.txt, 
> 21387.v3.txt, 21387.v6.txt, 21387.v7.txt, 21387.v8.txt, 21387.v9.txt, 
> two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, two-pass-cleaner.v9.txt
>
>
> During recent report from customer where ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is race condition surrounding in progress snapshot(s) handling 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in progress snapshot(s).
> Suppose when the RefreshCacheTask runs refreshCache, there is some in 
> progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date. So cleaner proceeds to check in progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) deemed unreferenced.
> Here is timeline given by Josh illustrating the scenario:
> At time T0, we are checking if F1 is referenced. At time T1, there is a 
> snapshot S1 in progress that is referencing a file F1. refreshCache() is 
> called, but no completed snapshot references F1. At T2, the snapshot S1, 
> which references F1, completes. At T3, we check in-progress snapshots and S1 
> is not included. Thus, F1 is marked as unreferenced even though S1 references 
> it. 





[jira] [Updated] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-11 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21387:
---
Attachment: 21387.v10.txt

> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Attachments: 21387.dbg.txt, 21387.v10.txt, 21387.v2.txt, 
> 21387.v3.txt, 21387.v6.txt, 21387.v7.txt, 21387.v8.txt, 21387.v9.txt, 
> two-pass-cleaner.v4.txt, two-pass-cleaner.v6.txt, two-pass-cleaner.v9.txt
>
>
> During recent report from customer where ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is race condition surrounding in progress snapshot(s) handling 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in progress snapshot(s).
> Suppose when the RefreshCacheTask runs refreshCache, there is some in 
> progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date. So cleaner proceeds to check in progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) deemed unreferenced.
> Here is timeline given by Josh illustrating the scenario:
> At time T0, we are checking if F1 is referenced. At time T1, there is a 
> snapshot S1 in progress that is referencing a file F1. refreshCache() is 
> called, but no completed snapshot references F1. At T2, the snapshot S1, 
> which references F1, completes. At T3, we check in-progress snapshots and S1 
> is not included. Thus, F1 is marked as unreferenced even though S1 references 
> it. 





[jira] [Updated] (HBASE-21466) WALProcedureStore uses wrong FileSystem if wal.dir is not under rootdir

2018-11-11 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21466:
---
Summary: WALProcedureStore uses wrong FileSystem if wal.dir is not under 
rootdir  (was: WALProcedureStore uses wrong FileSystem if wal.dir is on 
different FileSystem as rootdir)

> WALProcedureStore uses wrong FileSystem if wal.dir is not under rootdir
> ---
>
> Key: HBASE-21466
> URL: https://issues.apache.org/jira/browse/HBASE-21466
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Attachments: 21466.v1.txt
>
>
> In WALProcedureStore ctor , the fs field is initialized this way:
> {code}
> this.fs = walDir.getFileSystem(conf);
> {code}
> However, when wal.dir is on different FileSystem as rootdir, the above would 
> return wrong FileSystem.
> In the modified TestMasterProcedureEvents, without fix, the master wouldn't 
> initialize.





[jira] [Commented] (HBASE-21466) WALProcedureStore uses wrong FileSystem if wal.dir is not under rootdir

2018-11-11 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682946#comment-16682946
 ] 

Ted Yu commented on HBASE-21466:


Here is a snippet from the test output without the fix:
{code}
2018-11-11 09:12:23,731 DEBUG [WALProcedureStoreSyncThread] 
wal.WALProcedureStore(1229): Removed 
log=file:/tmp/wal/MasterProcWALs/pv2-0005.log, 
activeLogs=[file:/tmp/wal/MasterProcWALs/pv2-0006.log, 
file:/tmp/wal/MasterProcWALs/pv2-0007.log]
2018-11-11 09:12:23,731 INFO  [WALProcedureStoreSyncThread] 
wal.ProcedureWALFile(160): Archiving 
file:/tmp/wal/MasterProcWALs/pv2-0006.log to 
file:/tmp/wal/oldWALs/pv2-0006.log
2018-11-11 09:12:23,732 DEBUG [WALProcedureStoreSyncThread] 
wal.WALProcedureStore(1229): Removed 
log=file:/tmp/wal/MasterProcWALs/pv2-0006.log, 
activeLogs=[file:/tmp/wal/MasterProcWALs/pv2-0007.log]
Process Thread Dump: Thread dump because: Master not initialized after 20ms
{code}

> WALProcedureStore uses wrong FileSystem if wal.dir is not under rootdir
> ---
>
> Key: HBASE-21466
> URL: https://issues.apache.org/jira/browse/HBASE-21466
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Attachments: 21466.v1.txt
>
>
> In WALProcedureStore ctor , the fs field is initialized this way:
> {code}
> this.fs = walDir.getFileSystem(conf);
> {code}
> However, when wal.dir is not under rootdir, the above would return wrong 
> FileSystem.
> In the modified TestMasterProcedureEvents, without fix, the master wouldn't 
> initialize.





[jira] [Updated] (HBASE-21466) WALProcedureStore uses wrong FileSystem if wal.dir is not under rootdir

2018-11-11 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21466:
---
Description: 
In the WALProcedureStore constructor, the fs field is initialized this way:
{code}
this.fs = walDir.getFileSystem(conf);
{code}
However, when wal.dir is not under rootdir, the above would return the wrong 
FileSystem.
In the modified TestMasterProcedureEvents, the master wouldn't initialize 
without the fix.

  was:
In WALProcedureStore ctor , the fs field is initialized this way:
{code}
this.fs = walDir.getFileSystem(conf);
{code}
However, when wal.dir is on different FileSystem as rootdir, the above would 
return wrong FileSystem.
In the modified TestMasterProcedureEvents, without fix, the master wouldn't 
initialize.


> WALProcedureStore uses wrong FileSystem if wal.dir is not under rootdir
> ---
>
> Key: HBASE-21466
> URL: https://issues.apache.org/jira/browse/HBASE-21466
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Attachments: 21466.v1.txt
>
>
> In WALProcedureStore ctor , the fs field is initialized this way:
> {code}
> this.fs = walDir.getFileSystem(conf);
> {code}
> However, when wal.dir is not under rootdir, the above would return wrong 
> FileSystem.
> In the modified TestMasterProcedureEvents, without fix, the master wouldn't 
> initialize.





[jira] [Updated] (HBASE-21466) WALProcedureStore uses wrong FileSystem if wal.dir is on different FileSystem as rootdir

2018-11-11 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21466:
---
Attachment: 21466.v1.txt

> WALProcedureStore uses wrong FileSystem if wal.dir is on different FileSystem 
> as rootdir
> 
>
> Key: HBASE-21466
> URL: https://issues.apache.org/jira/browse/HBASE-21466
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Attachments: 21466.v1.txt
>
>
> In WALProcedureStore ctor , the fs field is initialized this way:
> {code}
> this.fs = walDir.getFileSystem(conf);
> {code}
> However, when wal.dir is on different FileSystem as rootdir, the above would 
> return wrong FileSystem.
> In the modified TestMasterProcedureEvents, without fix, the master wouldn't 
> initialize.





[jira] [Updated] (HBASE-21466) WALProcedureStore uses wrong FileSystem if wal.dir is on different FileSystem as rootdir

2018-11-11 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21466:
---
Status: Patch Available  (was: Open)

> WALProcedureStore uses wrong FileSystem if wal.dir is on different FileSystem 
> as rootdir
> 
>
> Key: HBASE-21466
> URL: https://issues.apache.org/jira/browse/HBASE-21466
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Attachments: 21466.v1.txt
>
>
> In WALProcedureStore ctor , the fs field is initialized this way:
> {code}
> this.fs = walDir.getFileSystem(conf);
> {code}
> However, when wal.dir is on different FileSystem as rootdir, the above would 
> return wrong FileSystem.
> In the modified TestMasterProcedureEvents, without fix, the master wouldn't 
> initialize.





[jira] [Created] (HBASE-21466) WALProcedureStore uses wrong FileSystem if wal.dir is on different FileSystem as rootdir

2018-11-11 Thread Ted Yu (JIRA)
Ted Yu created HBASE-21466:
--

 Summary: WALProcedureStore uses wrong FileSystem if wal.dir is on 
different FileSystem as rootdir
 Key: HBASE-21466
 URL: https://issues.apache.org/jira/browse/HBASE-21466
 Project: HBase
  Issue Type: Bug
Reporter: Ted Yu
Assignee: Ted Yu


In the WALProcedureStore constructor, the fs field is initialized this way:
{code}
this.fs = walDir.getFileSystem(conf);
{code}
However, when wal.dir is on a different FileSystem than rootdir, the above would 
return the wrong FileSystem.
In the modified TestMasterProcedureEvents, the master wouldn't initialize 
without the fix.





[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-10 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682644#comment-16682644
 ] 

Ted Yu commented on HBASE-21387:


One more note about why I chose 21387.v9.txt as the version for review:

Priority is given to taking a snapshot over cleaning snapshot files, even if 
that delays the cleaning.
This is because a failed snapshot is more visible than delayed snapshot 
cleaning.



> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, 
> 21387.v7.txt, 21387.v8.txt, 21387.v9.txt, two-pass-cleaner.v4.txt, 
> two-pass-cleaner.v6.txt, two-pass-cleaner.v9.txt
>
>
> During recent report from customer where ExportSnapshot failed:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is race condition surrounding in progress snapshot(s) handling 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in progress snapshot(s).
> Suppose when the RefreshCacheTask runs refreshCache, there is some in 
> progress snapshot (about to finish).
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date. So cleaner proceeds to check in progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) deemed unreferenced.
> Here is timeline given by Josh illustrating the scenario:
> At time T0, we are checking if F1 is referenced. At time T1, there is a 
> snapshot S1 in progress that is referencing a file F1. refreshCache() is 
> called, but no completed snapshot references F1. At T2, the snapshot S1, 
> which references F1, completes. At T3, we check in-progress snapshots and S1 
> is not included. Thus, F1 is marked as unreferenced even though S1 references 
> it. 





[jira] [Commented] (HBASE-21246) Introduce WALIdentity interface

2018-11-10 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682638#comment-16682638
 ] 

Ted Yu commented on HBASE-21246:


Currently there are about 69 failing test classes.

I am working through these failed tests.

> Introduce WALIdentity interface
> ---
>
> Key: HBASE-21246
> URL: https://issues.apache.org/jira/browse/HBASE-21246
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Fix For: HBASE-20952
>
> Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, 
> 21246.23.txt, 21246.24.txt, 21246.25.txt, 21246.26.txt, 21246.34.txt, 
> 21246.HBASE-20952.001.patch, 21246.HBASE-20952.002.patch, 
> 21246.HBASE-20952.004.patch, 21246.HBASE-20952.005.patch, 
> 21246.HBASE-20952.007.patch, 21246.HBASE-20952.008.patch, 
> replication-src-creates-wal-reader.jpg, wal-factory-providers.png, 
> wal-providers.png, wal-splitter-reader.jpg, wal-splitter-writer.jpg
>
>
> We are introducing WALIdentity interface so that the WAL representation can 
> be decoupled from distributed filesystem.
> The interface provides getName method whose return value can represent 
> filename in distributed filesystem environment or, the name of the stream 
> when the WAL is backed by log stream.
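
To make the description concrete, here is a minimal sketch of the idea. Only the getName method is taken from the description above; the actual interface lives in the attached patches and may look different. The two implementing classes, their constructors and the sample names are hypothetical.

```java
public class WALIdentitySketch {
  /** Only getName() comes from the issue description; everything else is illustrative. */
  interface WALIdentity {
    String getName();
  }

  /** Hypothetical identity for a WAL stored as a file: the name is the file name. */
  static final class FileBackedIdentity implements WALIdentity {
    private final String path;

    FileBackedIdentity(String path) {
      this.path = path;
    }

    @Override
    public String getName() {
      return path.substring(path.lastIndexOf('/') + 1);
    }
  }

  /** Hypothetical identity for a WAL backed by a log stream: the name is the stream name. */
  static final class StreamBackedIdentity implements WALIdentity {
    private final String streamName;

    StreamBackedIdentity(String streamName) {
      this.streamName = streamName;
    }

    @Override
    public String getName() {
      return streamName;
    }
  }

  /** Callers depend only on the interface, not on where the WAL is stored. */
  static String describe(WALIdentity id) {
    return "WAL " + id.getName();
  }

  public static void main(String[] args) {
    System.out.println(describe(new FileBackedIdentity("/hbase/WALs/rs1/wal.1")));
    System.out.println(describe(new StreamBackedIdentity("rs1-stream")));
  }
}
```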





[jira] [Updated] (HBASE-13468) hbase.zookeeper.quorum supports ipv6 address

2018-11-10 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-13468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-13468:
---
       Resolution: Fixed
     Hadoop Flags: Reviewed
    Fix Version/s: 3.0.0
           Status: Resolved  (was: Patch Available)

Thanks for the patch, maoling

Thanks for the review, Mike

> hbase.zookeeper.quorum supports ipv6 address
> 
>
> Key: HBASE-13468
> URL: https://issues.apache.org/jira/browse/HBASE-13468
> Project: HBase
>  Issue Type: Bug
>Reporter: Mingtao Zhang
>Assignee: maoling
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: HBASE-13468.master.001.patch, 
> HBASE-13468.master.002.patch, HBASE-13468.master.003.patch, 
> HBASE-13468.master.004.patch
>
>
> I put ipv6 address in hbase.zookeeper.quorum, by the time this string went to 
> zookeeper code, the address is messed up, i.e. only '[1234' left. 
> I started using pseudo mode with embedded zk = true.
> I downloaded 1.0.0, not sure which affected version should be here.
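
One way a value like '[1234' can come about, assumed here and not confirmed by the report, is host:port parsing that cuts at the first colon. The sketch below contrasts that with bracket-aware parsing. The address and both helper methods are made up for illustration and are not the HBase or ZooKeeper code.

```java
public class Ipv6QuorumSketch {
  /** Naive parsing: everything before the first ':' is taken as the host. */
  static String naiveHost(String hostPort) {
    int colon = hostPort.indexOf(':');
    return colon < 0 ? hostPort : hostPort.substring(0, colon);
  }

  /** Bracket-aware parsing: a leading '[' means the host runs to the matching ']'. */
  static String bracketAwareHost(String hostPort) {
    if (hostPort.startsWith("[")) {
      int close = hostPort.indexOf(']');
      if (close > 0) {
        return hostPort.substring(0, close + 1);
      }
    }
    int colon = hostPort.lastIndexOf(':');
    return colon < 0 ? hostPort : hostPort.substring(0, colon);
  }

  public static void main(String[] args) {
    String entry = "[1234:5678::1]:2181"; // made-up quorum entry
    System.out.println(naiveHost(entry));        // prints [1234
    System.out.println(bracketAwareHost(entry)); // prints [1234:5678::1]
  }
}
```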





[jira] [Updated] (HBASE-21341) DeadServer shouldn't import unshaded Preconditions

2018-11-09 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21341:
---
Fix Version/s: 3.0.0

> DeadServer shouldn't import unshaded Preconditions
> --
>
> Key: HBASE-21341
> URL: https://issues.apache.org/jira/browse/HBASE-21341
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: 21341.v1.txt
>
>
> DeadServer currently imports unshaded Preconditions :
> {code}
> import com.google.common.base.Preconditions;
> {code}
> We should import shaded version of Preconditions.
> This is the only place where unshaded class from com.google.common is imported





[jira] [Commented] (HBASE-21246) Introduce WALIdentity interface

2018-11-09 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681979#comment-16681979
 ] 

Ted Yu commented on HBASE-21246:


Patch v34 is based on current master.

I am running the test suite locally to see which tests fail.

> Introduce WALIdentity interface
> ---
>
> Key: HBASE-21246
> URL: https://issues.apache.org/jira/browse/HBASE-21246
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Fix For: HBASE-20952
>
> Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, 
> 21246.23.txt, 21246.24.txt, 21246.25.txt, 21246.26.txt, 21246.34.txt, 
> 21246.HBASE-20952.001.patch, 21246.HBASE-20952.002.patch, 
> 21246.HBASE-20952.004.patch, 21246.HBASE-20952.005.patch, 
> 21246.HBASE-20952.007.patch, 21246.HBASE-20952.008.patch, 
> replication-src-creates-wal-reader.jpg, wal-factory-providers.png, 
> wal-providers.png, wal-splitter-reader.jpg, wal-splitter-writer.jpg
>
>
> We are introducing the WALIdentity interface so that the WAL representation can 
> be decoupled from the distributed filesystem.
> The interface provides a getName method whose return value can represent the 
> filename in a distributed filesystem environment or the name of the stream 
> when the WAL is backed by a log stream.
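
The description above can be illustrated with a minimal sketch. Only the interface name WALIdentity and its getName method come from the issue; FsWalIdentity, StreamWalIdentity, WalIdentitySketch, and the sample path and stream name are hypothetical and are not taken from any patch attached to this issue.

```java
// Sketch only: WALIdentity and getName are named in the issue text.
// Every other class here is a hypothetical illustration.
class WalIdentitySketch {
    /** Code that consumes a WAL only needs its name, not its storage type. */
    static String describe(WALIdentity id) {
        return "WAL " + id.getName();
    }

    public static void main(String[] args) {
        System.out.println(describe(new FsWalIdentity("/hbase/WALs/rs1/wal.1541721600000")));
        System.out.println(describe(new StreamWalIdentity("rs1-log-stream")));
    }
}

interface WALIdentity {
    /** A file name on a distributed filesystem, or a stream name on a log stream. */
    String getName();
}

/** Hypothetical identity for a WAL stored as a file. */
final class FsWalIdentity implements WALIdentity {
    private final String path;

    FsWalIdentity(String path) {
        this.path = path;
    }

    @Override
    public String getName() {
        // Use the last path component as the WAL name.
        int slash = path.lastIndexOf('/');
        return slash >= 0 ? path.substring(slash + 1) : path;
    }
}

/** Hypothetical identity for a WAL backed by a log stream. */
final class StreamWalIdentity implements WALIdentity {
    private final String streamName;

    StreamWalIdentity(String streamName) {
        this.streamName = streamName;
    }

    @Override
    public String getName() {
        return streamName;
    }
}
```

Callers such as describe depend only on the interface, which is the decoupling the description asks for.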





[jira] [Updated] (HBASE-21246) Introduce WALIdentity interface

2018-11-09 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21246:
---
Attachment: 21246.34.txt

> Introduce WALIdentity interface
> ---
>
> Key: HBASE-21246
> URL: https://issues.apache.org/jira/browse/HBASE-21246
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Fix For: HBASE-20952
>
> Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, 
> 21246.23.txt, 21246.24.txt, 21246.25.txt, 21246.26.txt, 21246.34.txt, 
> 21246.HBASE-20952.001.patch, 21246.HBASE-20952.002.patch, 
> 21246.HBASE-20952.004.patch, 21246.HBASE-20952.005.patch, 
> 21246.HBASE-20952.007.patch, 21246.HBASE-20952.008.patch, 
> replication-src-creates-wal-reader.jpg, wal-factory-providers.png, 
> wal-providers.png, wal-splitter-reader.jpg, wal-splitter-writer.jpg
>
>
> We are introducing the WALIdentity interface so that the WAL representation can 
> be decoupled from the distributed filesystem.
> The interface provides a getName method whose return value can represent the 
> filename in a distributed filesystem environment or the name of the stream 
> when the WAL is backed by a log stream.





[jira] [Commented] (HBASE-21457) BackupUtils#getWALFilesOlderThan refers to wrong FileSystem

2018-11-09 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681855#comment-16681855
 ] 

Ted Yu commented on HBASE-21457:


The master startup delay is probably related to the procedure store WAL: the 
store WAL would be using the designated HDFS.

> BackupUtils#getWALFilesOlderThan refers to wrong FileSystem
> ---
>
> Key: HBASE-21457
> URL: https://issues.apache.org/jira/browse/HBASE-21457
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Janos Gub
>Assignee: Ted Yu
>Priority: Major
> Attachments: 21457.v1.txt, 21457.v2.txt, 21457.v3.txt, 21457.v3.txt
>
>
> Janos reported seeing a backup test failure when testing a local HDFS for WALs 
> while using WASB/ADLS only for store files.
> Janos spotted the code in BackupUtils#getWALFilesOlderThan, which uses the HBase 
> root dir for retrieving WAL files.
> We should use the helper methods from CommonFSUtils.
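
The problem described above can be sketched in plain Java, without HBase on the classpath. This is not the patch. A Map stands in for the HBase Configuration, walRootDir is a hypothetical stand-in for the CommonFSUtils helper, and its fallback to the root dir when no WAL dir is set is an assumption about how that helper behaves. The two property names are the HBase configuration keys; the URIs are invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

class WalDirSketch {
    static final String ROOT_DIR_KEY = "hbase.rootdir";
    static final String WAL_DIR_KEY = "hbase.wal.dir";

    /** The reported pattern: WAL files are always looked up under the root dir. */
    static String walRootDirFromRootOnly(Map<String, String> conf) {
        return conf.get(ROOT_DIR_KEY);
    }

    /**
     * Hypothetical stand-in for the helper: prefer the dedicated WAL dir and
     * fall back to the root dir when none is configured (assumed behavior).
     */
    static String walRootDir(Map<String, String> conf) {
        String walDir = conf.get(WAL_DIR_KEY);
        return walDir != null ? walDir : conf.get(ROOT_DIR_KEY);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // Store files on a remote filesystem, WALs on a local HDFS (example URIs).
        conf.put(ROOT_DIR_KEY, "wasb://container@account.example/hbase");
        conf.put(WAL_DIR_KEY, "hdfs://localhost:8020/hbase-wal");

        System.out.println("root-only lookup: " + walRootDirFromRootOnly(conf));
        System.out.println("helper lookup:    " + walRootDir(conf));
    }
}
```

With a split layout like the one in the report, the root-only lookup points at the store-file filesystem, where no WAL files exist.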





[jira] [Commented] (HBASE-21457) BackupUtils#getWALFilesOlderThan refers to wrong FileSystem

2018-11-09 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681852#comment-16681852
 ] 

Ted Yu commented on HBASE-21457:


The failed tests are already marked as large tests.

Looking for where the timeout should be increased.
200 seconds is really long, though I don't see any meaningful exception in the 
test output related to master initialization.


> BackupUtils#getWALFilesOlderThan refers to wrong FileSystem
> ---
>
> Key: HBASE-21457
> URL: https://issues.apache.org/jira/browse/HBASE-21457
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Janos Gub
>Assignee: Ted Yu
>Priority: Major
> Attachments: 21457.v1.txt, 21457.v2.txt, 21457.v3.txt, 21457.v3.txt
>
>
> Janos reported seeing a backup test failure when testing a local HDFS for WALs 
> while using WASB/ADLS only for store files.
> Janos spotted the code in BackupUtils#getWALFilesOlderThan, which uses the HBase 
> root dir for retrieving WAL files.
> We should use the helper methods from CommonFSUtils.





[jira] [Commented] (HBASE-21457) BackupUtils#getWALFilesOlderThan refers to wrong FileSystem

2018-11-09 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681788#comment-16681788
 ] 

Ted Yu commented on HBASE-21457:


Shall we leave the test change to another JIRA?

> BackupUtils#getWALFilesOlderThan refers to wrong FileSystem
> ---
>
> Key: HBASE-21457
> URL: https://issues.apache.org/jira/browse/HBASE-21457
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Janos Gub
>Assignee: Ted Yu
>Priority: Major
> Attachments: 21457.v1.txt, 21457.v2.txt, 21457.v3.txt, 21457.v3.txt
>
>
> Janos reported seeing a backup test failure when testing a local HDFS for WALs 
> while using WASB/ADLS only for store files.
> Janos spotted the code in BackupUtils#getWALFilesOlderThan, which uses the HBase 
> root dir for retrieving WAL files.
> We should use the helper methods from CommonFSUtils.





[jira] [Commented] (HBASE-21387) Race condition surrounding in progress snapshot handling in snapshot cache leads to loss of snapshot files

2018-11-08 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680721#comment-16680721
 ] 

Ted Yu commented on HBASE-21387:


[~openinx][~Apache9][~elserj] :
Gentle ping.

> Race condition surrounding in progress snapshot handling in snapshot cache 
> leads to loss of snapshot files
> --
>
> Key: HBASE-21387
> URL: https://issues.apache.org/jira/browse/HBASE-21387
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
>  Labels: snapshot
> Attachments: 21387.dbg.txt, 21387.v2.txt, 21387.v3.txt, 21387.v6.txt, 
> 21387.v7.txt, 21387.v8.txt, 21387.v9.txt, two-pass-cleaner.v4.txt, 
> two-pass-cleaner.v6.txt, two-pass-cleaner.v9.txt
>
>
> A recent customer report showed ExportSnapshot failing:
> {code}
> 2018-10-09 18:54:32,559 ERROR [VerifySnapshot-pool1-t2] 
> snapshot.SnapshotReferenceUtil: Can't find hfile: 
> 44f6c3c646e84de6a63fe30da4fcb3aa in the real 
> (hdfs://in.com:8020/apps/hbase/data/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  or archive 
> (hdfs://in.com:8020/apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa)
>  directory for the primary table. 
> {code}
> We found the following in the log:
> {code}
> 2018-10-09 18:54:23,675 DEBUG 
> [00:16000.activeMasterManager-HFileCleaner.large-1539035367427] 
> cleaner.HFileCleaner: Removing: 
> hdfs:///apps/hbase/data/archive/data/.../a/44f6c3c646e84de6a63fe30da4fcb3aa 
> from archive
> {code}
> The root cause is a race condition in the handling of in-progress snapshot(s) 
> between refreshCache() and getUnreferencedFiles().
> There are two callers of refreshCache: one from RefreshCacheTask#run and the 
> other from SnapshotHFileCleaner.
> Let's look at the code of refreshCache:
> {code}
>   if (!name.equals(SnapshotDescriptionUtils.SNAPSHOT_TMP_DIR_NAME)) {
> {code}
> whose intention is to exclude in-progress snapshot(s).
> Suppose that when the RefreshCacheTask runs refreshCache, there is an 
> in-progress snapshot that is about to finish.
> When SnapshotHFileCleaner calls getUnreferencedFiles(), it sees that 
> lastModifiedTime is up to date, so the cleaner proceeds to check in-progress 
> snapshot(s). However, the snapshot has completed by that time, resulting in 
> some file(s) being deemed unreferenced.
> Here is a timeline given by Josh illustrating the scenario:
> At time T0, we are checking if F1 is referenced. At time T1, there is a 
> snapshot S1 in progress that is referencing a file F1. refreshCache() is 
> called, but no completed snapshot references F1. At T2, the snapshot S1, 
> which references F1, completes. At T3, we check in-progress snapshots and S1 
> is not included. Thus, F1 is marked as unreferenced even though S1 references 
> it.
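
The timeline above can be replayed in a single thread to make the ordering problem concrete. This is a simplified model, not HBase code: the sets standing in for completed snapshots, in-progress snapshots, and the cache, as well as the class and method names, are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

class SnapshotRaceSketch {
    /**
     * Replays steps T1..T3 from the timeline for file F1.
     * Returns {deemedReferenced, actuallyReferenced}.
     */
    static boolean[] replayTimeline() {
        // Files referenced by completed snapshots (ground truth on disk).
        Set<String> completedSnapshotFiles = new HashSet<>();
        // Files referenced by snapshots that are still in progress.
        Set<String> inProgressSnapshotFiles = new HashSet<>();
        // The cleaner's cached view, built from completed snapshots only.
        Set<String> cache = new HashSet<>();

        // T1: S1 is in progress and references F1; the cache refresh skips it.
        inProgressSnapshotFiles.add("F1");
        cache.clear();
        cache.addAll(completedSnapshotFiles);
        boolean foundInCache = cache.contains("F1");

        // T2: S1 completes and moves out of the in-progress set.
        inProgressSnapshotFiles.remove("F1");
        completedSnapshotFiles.add("F1");

        // T3: the cleaner checks in-progress snapshots; S1 is no longer there.
        boolean foundInProgress = inProgressSnapshotFiles.contains("F1");

        boolean deemedReferenced = foundInCache || foundInProgress;
        boolean actuallyReferenced = completedSnapshotFiles.contains("F1");
        return new boolean[] { deemedReferenced, actuallyReferenced };
    }

    public static void main(String[] args) {
        boolean[] result = replayTimeline();
        System.out.println("cleaner deems F1 referenced: " + result[0]);
        System.out.println("F1 actually referenced:      " + result[1]);
    }
}
```

In this model the two checks each look at a different moment in time, so a snapshot that completes between them is missed by both.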





[jira] [Updated] (HBASE-21457) BackupUtils#getWALFilesOlderThan refers to wrong FileSystem

2018-11-08 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21457:
---
Attachment: 21457.v3.txt

> BackupUtils#getWALFilesOlderThan refers to wrong FileSystem
> ---
>
> Key: HBASE-21457
> URL: https://issues.apache.org/jira/browse/HBASE-21457
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Janos Gub
>Assignee: Ted Yu
>Priority: Major
> Attachments: 21457.v1.txt, 21457.v2.txt, 21457.v3.txt, 21457.v3.txt
>
>
> Janos reported seeing a backup test failure when testing a local HDFS for WALs 
> while using WASB/ADLS only for store files.
> Janos spotted the code in BackupUtils#getWALFilesOlderThan, which uses the HBase 
> root dir for retrieving WAL files.
> We should use the helper methods from CommonFSUtils.





[jira] [Commented] (HBASE-21246) Introduce WALIdentity interface

2018-11-08 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680549#comment-16680549
 ] 

Ted Yu commented on HBASE-21246:


Looking at the compilation errors for patch v22: there are 321 lines in the 
compilation output.

It would be non-trivial to make v22 or v23 pass both compilation and the unit 
test suite.
Also, v26 is the closest to what we want WALFactory and WALProvider to be.
So it would be nice if we can build upon v26, possibly using HBASE-21456.

> Introduce WALIdentity interface
> ---
>
> Key: HBASE-21246
> URL: https://issues.apache.org/jira/browse/HBASE-21246
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Major
> Fix For: HBASE-20952
>
> Attachments: 21246.003.patch, 21246.20.txt, 21246.21.txt, 
> 21246.23.txt, 21246.24.txt, 21246.25.txt, 21246.26.txt, 
> 21246.HBASE-20952.001.patch, 21246.HBASE-20952.002.patch, 
> 21246.HBASE-20952.004.patch, 21246.HBASE-20952.005.patch, 
> 21246.HBASE-20952.007.patch, 21246.HBASE-20952.008.patch, 
> replication-src-creates-wal-reader.jpg, wal-factory-providers.png, 
> wal-providers.png, wal-splitter-reader.jpg, wal-splitter-writer.jpg
>
>
> We are introducing the WALIdentity interface so that the WAL representation can 
> be decoupled from the distributed filesystem.
> The interface provides a getName method whose return value can represent the 
> filename in a distributed filesystem environment or the name of the stream 
> when the WAL is backed by a log stream.





[jira] [Commented] (HBASE-21457) BackupUtils#getWALFilesOlderThan refers to wrong FileSystem

2018-11-08 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680531#comment-16680531
 ] 

Ted Yu commented on HBASE-21457:


The test passed during a local run.

Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 102.933 sec - 
in org.apache.hadoop.hbase.backup.TestRemoteRestore


> BackupUtils#getWALFilesOlderThan refers to wrong FileSystem
> ---
>
> Key: HBASE-21457
> URL: https://issues.apache.org/jira/browse/HBASE-21457
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Janos Gub
>Assignee: Ted Yu
>Priority: Major
> Attachments: 21457.v1.txt, 21457.v2.txt, 21457.v3.txt
>
>
> Janos reported seeing a backup test failure when testing a local HDFS for WALs 
> while using WASB/ADLS only for store files.
> Janos spotted the code in BackupUtils#getWALFilesOlderThan, which uses the HBase 
> root dir for retrieving WAL files.
> We should use the helper methods from CommonFSUtils.





[jira] [Commented] (HBASE-21457) BackupUtils#getWALFilesOlderThan refers to wrong FileSystem

2018-11-08 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680500#comment-16680500
 ] 

Ted Yu commented on HBASE-21457:


I tried to retrieve output for the failed test but it looks like archiving 
wasn't successful.

> BackupUtils#getWALFilesOlderThan refers to wrong FileSystem
> ---
>
> Key: HBASE-21457
> URL: https://issues.apache.org/jira/browse/HBASE-21457
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Janos Gub
>Assignee: Ted Yu
>Priority: Major
> Attachments: 21457.v1.txt, 21457.v2.txt, 21457.v3.txt
>
>
> Janos reported seeing a backup test failure when testing a local HDFS for WALs 
> while using WASB/ADLS only for store files.
> Janos spotted the code in BackupUtils#getWALFilesOlderThan, which uses the HBase 
> root dir for retrieving WAL files.
> We should use the helper methods from CommonFSUtils.





[jira] [Commented] (HBASE-21457) BackupUtils#getWALFilesOlderThan refers to wrong FileSystem

2018-11-08 Thread Ted Yu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680435#comment-16680435
 ] 

Ted Yu commented on HBASE-21457:


I searched the hbase-backup module for places where we retrieve a FileSystem.

The one fixed in the patch is the only one I found for WAL.

If you know of any other call(s) for WAL FS which should be changed, please let 
me know.

> BackupUtils#getWALFilesOlderThan refers to wrong FileSystem
> ---
>
> Key: HBASE-21457
> URL: https://issues.apache.org/jira/browse/HBASE-21457
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Janos Gub
>Assignee: Ted Yu
>Priority: Major
> Attachments: 21457.v1.txt, 21457.v2.txt, 21457.v3.txt
>
>
> Janos reported seeing a backup test failure when testing a local HDFS for WALs 
> while using WASB/ADLS only for store files.
> Janos spotted the code in BackupUtils#getWALFilesOlderThan, which uses the HBase 
> root dir for retrieving WAL files.
> We should use the helper methods from CommonFSUtils.





[jira] [Updated] (HBASE-21457) BackupUtils#getWALFilesOlderThan refers to wrong FileSystem

2018-11-08 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21457:
---
Attachment: 21457.v3.txt

> BackupUtils#getWALFilesOlderThan refers to wrong FileSystem
> ---
>
> Key: HBASE-21457
> URL: https://issues.apache.org/jira/browse/HBASE-21457
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Janos Gub
>Assignee: Ted Yu
>Priority: Major
> Attachments: 21457.v1.txt, 21457.v2.txt, 21457.v3.txt
>
>
> Janos reported seeing a backup test failure when testing a local HDFS for WALs 
> while using WASB/ADLS only for store files.
> Janos spotted the code in BackupUtils#getWALFilesOlderThan, which uses the HBase 
> root dir for retrieving WAL files.
> We should use the helper methods from CommonFSUtils.





[jira] [Updated] (HBASE-21457) BackupUtils#getWALFilesOlderThan refers to wrong FileSystem

2018-11-08 Thread Ted Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/HBASE-21457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-21457:
---
Attachment: (was: 21457.v3.txt)

> BackupUtils#getWALFilesOlderThan refers to wrong FileSystem
> ---
>
> Key: HBASE-21457
> URL: https://issues.apache.org/jira/browse/HBASE-21457
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Janos Gub
>Assignee: Ted Yu
>Priority: Major
> Attachments: 21457.v1.txt, 21457.v2.txt
>
>
> Janos reported seeing a backup test failure when testing a local HDFS for WALs 
> while using WASB/ADLS only for store files.
> Janos spotted the code in BackupUtils#getWALFilesOlderThan, which uses the HBase 
> root dir for retrieving WAL files.
> We should use the helper methods from CommonFSUtils.




