[jira] [Comment Edited] (HBASE-18090) Improve TableSnapshotInputFormat to allow more multiple mappers per region
[ https://issues.apache.org/jira/browse/HBASE-18090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16256374#comment-16256374 ]

xinxin fan edited comment on HBASE-18090 at 11/17/17 2:38 AM:
--

Thank you [~stack], I have added a release note.

Another thing: I notice that there have been many build failures recently, especially for branch-1 patches, here: https://builds.apache.org/job/PreCommit-HBASE-Build. Any reason?

was (Author: xinxin fan):
Thank you stack, I have added a release note.

Another thing: I notice that there have been many build failures recently, especially for branch-1 patches, here: https://builds.apache.org/job/PreCommit-HBASE-Build. Any reason?

> Improve TableSnapshotInputFormat to allow more multiple mappers per region
> --
>
> Key: HBASE-18090
> URL: https://issues.apache.org/jira/browse/HBASE-18090
> Project: HBase
> Issue Type: Improvement
> Components: mapreduce
> Reporter: Mikhail Antonov
> Assignee: xinxin fan
> Fix For: 2.0.0-beta-1
>
> Attachments: HBASE-18090-V3-master.patch,
> HBASE-18090-V4-master.patch, HBASE-18090-V5-master.patch,
> HBASE-18090-branch-1-v2.patch, HBASE-18090-branch-1-v2.patch,
> HBASE-18090-branch-1.3-v1.patch, HBASE-18090-branch-1.3-v2.patch,
> HBASE-18090.branch-1.patch
>
>
> TableSnapshotInputFormat runs one map task per region in the table snapshot.
> This places unnecessary restriction that the region layout of the original
> table needs to take the processing resources available to MR job into
> consideration. Allowing to run multiple mappers per region (assuming
> reasonably even key distribution) would be useful.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
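The issue description asks for several mappers per region, assuming a reasonably even key distribution. A minimal sketch of what that could mean in practice is shown here: cut one region's key range into n sub-ranges of roughly equal width and give each sub-range to its own mapper. This is not the code from the attached patches. The class `RegionRangeSplitter` and its methods are hypothetical names used only for illustration, keys are treated as unsigned big-endian integers (right-padded with zero bytes to a common width), and the empty start/end keys of the first and last region are deliberately not handled.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of HBase: splits one key range into n pieces.
class RegionRangeSplitter {

    /** Left-aligns a key in a fixed-width buffer, padding with zero bytes on the right. */
    static byte[] pad(byte[] key, int width) {
        return Arrays.copyOf(key, width);
    }

    /** Converts a non-negative value back to exactly {@code width} bytes, big-endian. */
    static byte[] toFixedWidth(BigInteger value, int width) {
        byte[] raw = value.toByteArray(); // may carry a leading 0x00 sign byte or be shorter
        byte[] out = new byte[width];
        int copy = Math.min(raw.length, width);
        System.arraycopy(raw, raw.length - copy, out, width - copy, copy);
        return out;
    }

    /**
     * Returns n + 1 boundary keys that cut [start, end) into n sub-ranges of
     * roughly equal width. Assumes non-empty keys with start sorting before end.
     * If the range holds fewer than n distinct keys, some boundaries repeat.
     */
    static List<byte[]> boundaries(byte[] start, byte[] end, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        int width = Math.max(start.length, end.length);
        BigInteger lo = new BigInteger(1, pad(start, width));
        BigInteger hi = new BigInteger(1, pad(end, width));
        if (lo.compareTo(hi) >= 0) {
            throw new IllegalArgumentException("start must sort before end");
        }
        BigInteger span = hi.subtract(lo);
        List<byte[]> keys = new ArrayList<>();
        keys.add(start);
        for (int i = 1; i < n; i++) {
            // i-th interior boundary: lo + span * i / n
            BigInteger point = lo.add(span.multiply(BigInteger.valueOf(i)).divide(BigInteger.valueOf(n)));
            keys.add(toFixedWidth(point, width));
        }
        keys.add(end);
        return keys;
    }

    public static void main(String[] args) {
        // One region covering [0x00, 0x10), shared by 4 mappers.
        for (byte[] key : boundaries(new byte[] {0x00}, new byte[] {0x10}, 4)) {
            StringBuilder hex = new StringBuilder();
            for (byte b : key) {
                hex.append(String.format("%02x", b & 0xff));
            }
            System.out.println(hex);
        }
    }
}
```

Each pair of adjacent boundaries would become the start and stop row of one input split. If the keys are skewed, equal-width sub-ranges hold unequal amounts of data, which is why the description qualifies the idea with "reasonably even key distribution".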
[jira] [Comment Edited] (HBASE-18090) Improve TableSnapshotInputFormat to allow more multiple mappers per region
[ https://issues.apache.org/jira/browse/HBASE-18090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16256374#comment-16256374 ]

xinxin fan edited comment on HBASE-18090 at 11/17/17 2:37 AM:
--

Thank you stack, I have added a release note.

Another thing: I notice that there have been many build failures recently, especially for branch-1 patches, here: https://builds.apache.org/job/PreCommit-HBASE-Build. Any reason?

was (Author: xinxin fan):
Thank you @stack, I have added a release note.

Another thing: I notice that there have been many build failures recently, especially for branch-1 patches, here: https://builds.apache.org/job/PreCommit-HBASE-Build. Any reason?

> Improve TableSnapshotInputFormat to allow more multiple mappers per region
> --
>
> Key: HBASE-18090
> URL: https://issues.apache.org/jira/browse/HBASE-18090
> Project: HBase
> Issue Type: Improvement
> Components: mapreduce
> Reporter: Mikhail Antonov
> Assignee: xinxin fan
> Fix For: 2.0.0-beta-1
>
> Attachments: HBASE-18090-V3-master.patch,
> HBASE-18090-V4-master.patch, HBASE-18090-V5-master.patch,
> HBASE-18090-branch-1-v2.patch, HBASE-18090-branch-1-v2.patch,
> HBASE-18090-branch-1.3-v1.patch, HBASE-18090-branch-1.3-v2.patch,
> HBASE-18090.branch-1.patch
>
>
> TableSnapshotInputFormat runs one map task per region in the table snapshot.
> This places unnecessary restriction that the region layout of the original
> table needs to take the processing resources available to MR job into
> consideration. Allowing to run multiple mappers per region (assuming
> reasonably even key distribution) would be useful.
[jira] [Comment Edited] (HBASE-18090) Improve TableSnapshotInputFormat to allow more multiple mappers per region
[ https://issues.apache.org/jira/browse/HBASE-18090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16246793#comment-16246793 ]

xinxin fan edited comment on HBASE-18090 at 11/9/17 11:57 PM:
--

Manually triggered QA retry to build branch-1.

was (Author: xinxin fan):
Manually triggered QA retry.

> Improve TableSnapshotInputFormat to allow more multiple mappers per region
> --
>
> Key: HBASE-18090
> URL: https://issues.apache.org/jira/browse/HBASE-18090
> Project: HBase
> Issue Type: Improvement
> Components: mapreduce
> Reporter: Mikhail Antonov
> Assignee: xinxin fan
> Fix For: 2.0.0-beta-1
>
> Attachments: HBASE-18090-V3-master.patch,
> HBASE-18090-V4-master.patch, HBASE-18090-V5-master.patch,
> HBASE-18090-branch-1.3-v1.patch, HBASE-18090-branch-1.3-v2.patch,
> HBASE-18090.branch-1.patch
>
>
> TableSnapshotInputFormat runs one map task per region in the table snapshot.
> This places unnecessary restriction that the region layout of the original
> table needs to take the processing resources available to MR job into
> consideration. Allowing to run multiple mappers per region (assuming
> reasonably even key distribution) would be useful.
[jira] [Comment Edited] (HBASE-18090) Improve TableSnapshotInputFormat to allow more multiple mappers per region
[ https://issues.apache.org/jira/browse/HBASE-18090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167191#comment-16167191 ]

xinxin fan edited comment on HBASE-18090 at 9/15/17 1:27 AM:
-

[~mantonov] I have created a patch for master and tested it on a real cluster. I have now put the patch on Review Board: https://reviews.apache.org/r/62267/

was (Author: xinxin fan):
[~mantonov] I have created a patch for master and tested it on a real cluster. I have now put the patch on Review Board: https://reviews.apache.org/r/62267/ , could you have a look?

> Improve TableSnapshotInputFormat to allow more multiple mappers per region
> --
>
> Key: HBASE-18090
> URL: https://issues.apache.org/jira/browse/HBASE-18090
> Project: HBase
> Issue Type: Improvement
> Components: mapreduce
> Affects Versions: 1.4.0
> Reporter: Mikhail Antonov
> Assignee: xinxin fan
> Attachments: HBASE-18090-branch-1.3-v1.patch,
> HBASE-18090-branch-1.3-v2.patch, HBASE-18090-V3-master.patch
>
>
> TableSnapshotInputFormat runs one map task per region in the table snapshot.
> This places unnecessary restriction that the region layout of the original
> table needs to take the processing resources available to MR job into
> consideration. Allowing to run multiple mappers per region (assuming
> reasonably even key distribution) would be useful.
[jira] [Comment Edited] (HBASE-18090) Improve TableSnapshotInputFormat to allow more multiple mappers per region
[ https://issues.apache.org/jira/browse/HBASE-18090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16164090#comment-16164090 ]

xinxin fan edited comment on HBASE-18090 at 9/13/17 3:19 AM:
-

[~ted_yu] [~mantonov] I have created a patch for master and put it on Review Board :) https://reviews.apache.org/r/62267/

was (Author: xinxin fan):
mantonov Mikhail Antonov I have created a patch for master and put it on Review Board: https://reviews.apache.org/r/62267/

> Improve TableSnapshotInputFormat to allow more multiple mappers per region
> --
>
> Key: HBASE-18090
> URL: https://issues.apache.org/jira/browse/HBASE-18090
> Project: HBase
> Issue Type: Improvement
> Components: mapreduce
> Affects Versions: 1.4.0
> Reporter: Mikhail Antonov
> Assignee: xinxin fan
> Attachments: HBASE-18090-branch-1.3-v1.patch,
> HBASE-18090-branch-1.3-v2.patch
>
>
> TableSnapshotInputFormat runs one map task per region in the table snapshot.
> This places unnecessary restriction that the region layout of the original
> table needs to take the processing resources available to MR job into
> consideration. Allowing to run multiple mappers per region (assuming
> reasonably even key distribution) would be useful.
[jira] [Comment Edited] (HBASE-18090) Improve TableSnapshotInputFormat to allow more multiple mappers per region
[ https://issues.apache.org/jira/browse/HBASE-18090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16159594#comment-16159594 ]

xinxin fan edited comment on HBASE-18090 at 9/9/17 12:38 AM:
-

Thanks for your review!

{quote}Before I go in reviews..opening regions in read-only mode for snapshots seems reasonable. That change would only affect MR over snapshots codebase or some other paths too?{quote}

I think read-only regions only affect the MR-over-snapshots codebase.

{quote} if we set readonly flag we skip replaying WAL and don't create those tmp files. {quote}

It seems that primary regions, even when opened in read-only mode, still replay the edits; see HRegion#initializeRegionInternals:

{code:java}
if (ServerRegionReplicaUtil.shouldReplayRecoveredEdits(this)) {
  // Recover any edits if available.
  maxSeqId = Math.max(maxSeqId,
      replayRecoveredEditsIfAny(this.fs.getRegionDir(), maxSeqIdInStores, reporter, status));
  // Make sure mvcc is up to max.
  this.mvcc.advanceTo(maxSeqId);
}
{code}

{quote}Will that work for snapshots created with skipFlush option? Is it always safe to skip WAL in that case?{quote}

The MR job works only on the snapshot store files, so I think it makes no difference whether the region is read-only or not. What do you think?

was (Author: xinxin fan):
[[mailto:Mikhail Antonov]] Thanks for your review!

{quote}Before I go in reviews..opening regions in read-only mode for snapshots seems reasonable. That change would only affect MR over snapshots codebase or some other paths too?{quote}

I think read-only regions only affect the MR-over-snapshots codebase.

{quote} if we set readonly flag we skip replaying WAL and don't create those tmp files. {quote}

It seems that primary regions, even when opened in read-only mode, still replay the edits; see HRegion#initializeRegionInternals:

{code:java}
if (ServerRegionReplicaUtil.shouldReplayRecoveredEdits(this)) {
  // Recover any edits if available.
  maxSeqId = Math.max(maxSeqId,
      replayRecoveredEditsIfAny(this.fs.getRegionDir(), maxSeqIdInStores, reporter, status));
  // Make sure mvcc is up to max.
  this.mvcc.advanceTo(maxSeqId);
}
{code}

{quote}Will that work for snapshots created with skipFlush option? Is it always safe to skip WAL in that case?{quote}

The MR job works only on the snapshot store files, so I think it makes no difference whether the region is read-only or not. What do you think?

> Improve TableSnapshotInputFormat to allow more multiple mappers per region
> --
>
> Key: HBASE-18090
> URL: https://issues.apache.org/jira/browse/HBASE-18090
> Project: HBase
> Issue Type: Improvement
> Components: mapreduce
> Affects Versions: 1.4.0
> Reporter: Mikhail Antonov
> Assignee: xinxin fan
> Attachments: HBASE-18090-branch-1.3-v1.patch,
> HBASE-18090-branch-1.3-v2.patch
>
>
> TableSnapshotInputFormat runs one map task per region in the table snapshot.
> This places unnecessary restriction that the region layout of the original
> table needs to take the processing resources available to MR job into
> consideration. Allowing to run multiple mappers per region (assuming
> reasonably even key distribution) would be useful.
[jira] [Comment Edited] (HBASE-18090) Improve TableSnapshotInputFormat to allow more multiple mappers per region
[ https://issues.apache.org/jira/browse/HBASE-18090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158394#comment-16158394 ]

xinxin fan edited comment on HBASE-18090 at 9/8/17 11:25 AM:
-

patch v2 (rebased branch-1.3) for review

was (Author: xinxin fan):
patch v2 for review

> Improve TableSnapshotInputFormat to allow more multiple mappers per region
> --
>
> Key: HBASE-18090
> URL: https://issues.apache.org/jira/browse/HBASE-18090
> Project: HBase
> Issue Type: Improvement
> Components: mapreduce
> Affects Versions: 1.4.0
> Reporter: Mikhail Antonov
> Assignee: xinxin fan
> Attachments: HBASE-18090-branch-1.3-v1.patch,
> HBASE-18090-branch-1.3-v2.patch
>
>
> TableSnapshotInputFormat runs one map task per region in the table snapshot.
> This places unnecessary restriction that the region layout of the original
> table needs to take the processing resources available to MR job into
> consideration. Allowing to run multiple mappers per region (assuming
> reasonably even key distribution) would be useful.
[jira] [Comment Edited] (HBASE-18090) Improve TableSnapshotInputFormat to allow more multiple mappers per region
[ https://issues.apache.org/jira/browse/HBASE-18090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158394#comment-16158394 ]

xinxin fan edited comment on HBASE-18090 at 9/8/17 11:20 AM:
-

patch v2 for review

was (Author: xinxin fan):
patch v2 for review

> Improve TableSnapshotInputFormat to allow more multiple mappers per region
> --
>
> Key: HBASE-18090
> URL: https://issues.apache.org/jira/browse/HBASE-18090
> Project: HBase
> Issue Type: Improvement
> Components: mapreduce
> Affects Versions: 1.4.0
> Reporter: Mikhail Antonov
> Assignee: xinxin fan
> Attachments: HBASE-18090-branch-1.3-v1.patch,
> HBASE-18090-branch-1.3-v2.patch
>
>
> TableSnapshotInputFormat runs one map task per region in the table snapshot.
> This places unnecessary restriction that the region layout of the original
> table needs to take the processing resources available to MR job into
> consideration. Allowing to run multiple mappers per region (assuming
> reasonably even key distribution) would be useful.
[jira] [Comment Edited] (HBASE-18090) Improve TableSnapshotInputFormat to allow more multiple mappers per region
[ https://issues.apache.org/jira/browse/HBASE-18090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158178#comment-16158178 ]

xinxin fan edited comment on HBASE-18090 at 9/8/17 6:21 AM:


Yeah, if the region is read-only, writesEnabled will be false and the regionSequenceIdFile will not be created.

{code:java}
if (this.writestate.writesEnabled) {
  nextSeqid = WALSplitter.writeRegionSequenceIdFile(this.fs.getFileSystem(),
      this.fs.getRegionDir(), nextSeqid,
      (this.recovering ? (this.flushPerChanges + 1000) : 1));
} else {
  nextSeqid++;
}
{code}

I also noticed that a given region is set to read-only mode only when the table is read-only or the region is not the default replica, so it seems feasible to set the HTableDescriptor read-only before the region is opened (in ClientSideRegionScanner.java):

{code:java}
public ClientSideRegionScanner(Configuration conf, FileSystem fs, Path rootDir,
    HTableDescriptor htd, HRegionInfo hri, Scan scan, ScanMetrics scanMetrics)
    throws IOException {
  // region is immutable, set isolation level
  scan.setIsolationLevel(IsolationLevel.READ_UNCOMMITTED);
  // region should be set read only
  htd.setReadOnly(true);
  // open region from the snapshot directory
  this.region = HRegion.openHRegion(conf, fs, rootDir, hri, htd, null, null, null);
  // ... rest of the constructor omitted
}
{code}

I have tested this approach: the split tasks work well and the exception disappears.

was (Author: xinxin fan):
Yeah, if the region is read-only, writesEnabled will be false and the regionSequenceIdFile will not be created.

{code:java}
if (this.writestate.writesEnabled) {
  nextSeqid = WALSplitter.writeRegionSequenceIdFile(this.fs.getFileSystem(),
      this.fs.getRegionDir(), nextSeqid,
      (this.recovering ? (this.flushPerChanges + 1000) : 1));
} else {
  nextSeqid++;
}
{code}

I also noticed that a given region is set to read-only mode only when the table is read-only or the region is not the default replica, so it seems feasible to set the HTableDescriptor read-only before the region is opened (in ClientSideRegionScanner.java):

{code:java}
public ClientSideRegionScanner(Configuration conf, FileSystem fs, Path rootDir,
    HTableDescriptor htd, HRegionInfo hri, Scan scan, ScanMetrics scanMetrics)
    throws IOException {
  // region is immutable, set isolation level
  scan.setIsolationLevel(IsolationLevel.READ_UNCOMMITTED);
  htd.setReadOnly(true);
  // open region from the snapshot directory
  this.region = HRegion.openHRegion(conf, fs, rootDir, hri, htd, null, null, null);
  // ... rest of the constructor omitted
}
{code}

I have tested this approach: the split tasks work well and the exception disappears.

> Improve TableSnapshotInputFormat to allow more multiple mappers per region
> --
>
> Key: HBASE-18090
> URL: https://issues.apache.org/jira/browse/HBASE-18090
> Project: HBase
> Issue Type: Improvement
> Components: mapreduce
> Affects Versions: 1.4.0
> Reporter: Mikhail Antonov
> Attachments: HBASE-18090-branch-1.3-v1.patch
>
>
> TableSnapshotInputFormat runs one map task per region in the table snapshot.
> This places unnecessary restriction that the region layout of the original
> table needs to take the processing resources available to MR job into
> consideration. Allowing to run multiple mappers per region (assuming
> reasonably even key distribution) would be useful.
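The effect described in the comment above (a read-only open leaves the region directory untouched, while a writable open writes a sequence-id file into it) can be illustrated with a small stand-alone model. This is only a toy sketch that mirrors the shape of the `writesEnabled` branch quoted in the comment. It does not use or reproduce HBase code, and `ReadOnlyOpenSketch`, `open`, `newRegionDir`, `fileCount`, and the `.seqid` file name are all hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Toy model only: hypothetical names, no HBase classes involved.
class ReadOnlyOpenSketch {

    /** Creates an empty temporary directory standing in for a region directory. */
    static Path newRegionDir() {
        try {
            return Files.createTempDirectory("region-sketch");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Models opening a region and returns the next sequence id.
     * A writable open persists a marker file; a read-only open only
     * bumps the id in memory and never touches the directory.
     */
    static long open(Path regionDir, boolean readOnly, long nextSeqId) {
        boolean writesEnabled = !readOnly;
        if (writesEnabled) {
            try {
                Files.createFile(regionDir.resolve(nextSeqId + ".seqid"));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return nextSeqId;
        }
        return nextSeqId + 1;
    }

    /** Counts the entries directly inside the directory. */
    static long fileCount(Path dir) {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dir = newRegionDir();
        open(dir, true, 10L);
        System.out.println("files after read-only open: " + fileCount(dir));
        open(dir, false, 10L);
        System.out.println("files after writable open: " + fileCount(dir));
    }
}
```

In this model, any number of read-only opens can share one directory without creating files in it, which is the property that matters once several mappers open the same snapshot region.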