[jira] [Commented] (HBASE-21874) Bucket cache on Persistent memory
[ https://issues.apache.org/jira/browse/HBASE-21874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769508#comment-16769508 ]

Jean-Daniel Cryans commented on HBASE-21874:

Thank you Anoop and Ram for addressing my questions. Some follow-ups:

bq. We can remove that part if it is distracting but it may be needed if you need to create bigger sized cache. It seems like a java restriction from mmapping buffers.

We should document this limitation and add a check in the code for the moment.

bq. So it becomes a direct replacement of the DRAM based IOEngine. (you don't copy the buffers onheap)

That's the part I don't get. ByteBufferIOEngine basically offers the same functionality if we don't care about persistence, and the Intel DCPMM comes in a DIMM form factor, so shouldn't that _just work_?

> Bucket cache on Persistent memory
> ---------------------------------
>
> Key: HBASE-21874
> URL: https://issues.apache.org/jira/browse/HBASE-21874
> Project: HBase
> Issue Type: New Feature
> Components: BucketCache
> Affects Versions: 3.0.0
> Reporter: ramkrishna.s.vasudevan
> Assignee: ramkrishna.s.vasudevan
> Priority: Major
> Fix For: 3.0.0
>
> Attachments: HBASE-21874.patch, HBASE-21874.patch, HBASE-21874_V2.patch, Pmem_BC.png
>
> Non volatile persistent memory devices are byte addressable like DRAM (e.g. Intel DCPMM). The bucket cache implementation can take advantage of this new memory type and can make use of the existing offheap data structures to serve data directly from this memory area without having to bring the data onheap.
> The patch is a new IOEngine implementation that works with persistent memory.
> Note: here we don't make use of the persistence nature of the device and just make use of the big memory it provides.
> Performance numbers to follow.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
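The "java restriction" mentioned above is most likely that `FileChannel.map` returns a `MappedByteBuffer`, which is int-indexed, so a single mapping cannot exceed `Integer.MAX_VALUE` bytes. A minimal sketch (not the patch's actual code; class and method names here are made up for illustration) of how a cache bigger than 2 GB could be backed by multiple mapped chunks:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class ChunkedMmap {
    // Each MappedByteBuffer is capped at Integer.MAX_VALUE bytes,
    // so a larger cache must be split into multiple mapped chunks.
    static final long MAX_CHUNK = Integer.MAX_VALUE;

    // Number of mappings needed to cover `capacity` bytes.
    static int chunkCount(long capacity) {
        return (int) ((capacity + MAX_CHUNK - 1) / MAX_CHUNK);
    }

    // Map one file as a sequence of READ_WRITE chunks.
    static MappedByteBuffer[] mapFile(String path, long capacity) throws IOException {
        int n = chunkCount(capacity);
        MappedByteBuffer[] buffers = new MappedByteBuffer[n];
        try (RandomAccessFile raf = new RandomAccessFile(path, "rw")) {
            raf.setLength(capacity);
            FileChannel ch = raf.getChannel();
            long offset = 0;
            for (int i = 0; i < n; i++) {
                long len = Math.min(MAX_CHUNK, capacity - offset);
                buffers[i] = ch.map(FileChannel.MapMode.READ_WRITE, offset, len);
                offset += len;
            }
        }
        return buffers;
    }
}
```

A 5 GiB cache, for instance, would need three mappings under this scheme, which is presumably why the patch grew a configurable buffer size in the first place.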
[jira] [Commented] (HBASE-21874) Bucket cache on Persistent memory
[ https://issues.apache.org/jira/browse/HBASE-21874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769510#comment-16769510 ]

Jean-Daniel Cryans commented on HBASE-21874:

Oh, additionally, maybe we shouldn't call this PmemIOEngine but something more generic like SharedMemoryFileMmapEngine if it's not going to be persistent.
[jira] [Commented] (HBASE-21874) Bucket cache on Persistent memory
[ https://issues.apache.org/jira/browse/HBASE-21874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16768748#comment-16768748 ]

Jean-Daniel Cryans commented on HBASE-21874:

Hi Ram. Two questions:

- Why add the configurable buffer size? That seems to be what most of the patch is about, which is distracting.
- You say that the patch doesn't make use of the persistent nature of the device, but PmemIOEngine extends FileMmapEngine, which advertises itself as persistent, so it does a few other things on top. I might be confused about what it implies but this is also confusing :)
[jira] [Commented] (HBASE-12169) Document IPC binding options
[ https://issues.apache.org/jira/browse/HBASE-12169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069042#comment-16069042 ]

Jean-Daniel Cryans commented on HBASE-12169:

It's tools _and_ practices on how to run a cluster, I think it fits.

> Document IPC binding options
>
> Key: HBASE-12169
> URL: https://issues.apache.org/jira/browse/HBASE-12169
> Project: HBase
> Issue Type: Task
> Components: documentation
> Affects Versions: 0.98.0, 0.94.7, 0.99.0
> Reporter: Sean Busbey
> Assignee: Mike Drob
> Priority: Minor
> Attachments: HBASE-12169.patch
>
> HBASE-8148 added options to change binding component services, but there aren't any docs for it.

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HBASE-12169) Document IPC binding options
[ https://issues.apache.org/jira/browse/HBASE-12169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068722#comment-16068722 ]

Jean-Daniel Cryans commented on HBASE-12169:

Instead of adding this to hbase-default, one option could be to document "How to have HBase processes bind to specific addresses", maybe on this page: http://hbase.apache.org/book.html#ops_mgt ?
[jira] [Commented] (HBASE-12169) Document IPC binding options
[ https://issues.apache.org/jira/browse/HBASE-12169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068578#comment-16068578 ]

Jean-Daniel Cryans commented on HBASE-12169:

Hey Mike, I could see a case for documenting why someone would want to change those configs, have you considered it?

Also, and this might show how long it's been since I've looked at the HBase docs, but isn't the hbase-default stuff just pulled from hbase-default.xml? The configs you are documenting aren't in that file AFAIK, so they should either be documented elsewhere or the configs have to be pulled into hbase-default.xml.
[jira] [Commented] (HBASE-13135) Move replication ops mgmt stuff from Javadoc to Ref Guide
[ https://issues.apache.org/jira/browse/HBASE-13135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349168#comment-14349168 ]

Jean-Daniel Cryans commented on HBASE-13135:

Good stuff Misty, a few more things while we're here.

I think we can remove this line since it's on by default; maybe say instead that the operator should check it wasn't disabled:

bq. On the source cluster, enable replication by setting `hbase.replication` to `true` in _hbase-site.xml_.

The following can be removed, we don't print this out anymore, at least since the endpoint changes:

{quote}
Considering 1 rs, with ratio 0.1
Getting 1 rs from peer cluster # 0
Choosing peer 10.10.1.49:62020
{quote}

This is a more reliable thing to point out instead, coming from ReplicationSource:

bq. LOG.info("Replicating " + clusterId + " -> " + peerClusterId);

> Move replication ops mgmt stuff from Javadoc to Ref Guide
> ---------------------------------------------------------
>
> Key: HBASE-13135
> URL: https://issues.apache.org/jira/browse/HBASE-13135
> Project: HBase
> Issue Type: Bug
> Components: documentation, Replication
> Reporter: Misty Stanley-Jones
> Assignee: Misty Stanley-Jones
> Attachments: HBASE-13135-v1.patch, HBASE-13135-v2.patch, HBASE-13135.patch
>
> As per discussion with [~jmhsieh] and [~saint@gmail.com]

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13135) Move replication ops mgmt stuff from Javadoc to Ref Guide
[ https://issues.apache.org/jira/browse/HBASE-13135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14343406#comment-14343406 ]

Jean-Daniel Cryans commented on HBASE-13135:

Looking at the current doc, this line should be changed, since at the end it includes a link that points back to the doc we're moving:

bq. The following is a simplified procedure for configuring cluster replication. It may not cover every edge case. For more information, see the API documentation for replication.

This whole section is very similar to the tail of the "Configuring Cluster Replication" section (merge or skip?):

bq. + Verifying Replicated Data
[jira] [Commented] (HBASE-7631) Region w/ few, fat reads was hard to find on a box carrying hundreds of regions
[ https://issues.apache.org/jira/browse/HBASE-7631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273783#comment-14273783 ]

Jean-Daniel Cryans commented on HBASE-7631:

[~clehene] Things have changed a lot since then, so I'm not sure what kind of work is required now. I'd close this unless someone is planning on working on it now that there's noise.

> Region w/ few, fat reads was hard to find on a box carrying hundreds of regions
> -------------------------------------------------------------------------------
>
> Key: HBASE-7631
> URL: https://issues.apache.org/jira/browse/HBASE-7631
> Project: HBase
> Issue Type: Improvement
> Components: metrics
> Reporter: stack
>
> Of a sudden on a prod cluster, a table's rows gained girth... hundreds of thousands of rows... and the application was pulling them all back every time, but only once a second or so. The regionserver was carrying hundreds of regions. It was plain that there was lots of network-out traffic. It was tough figuring out which region was the culprit (JD's trick was moving the regions off one at a time while watching network-out traffic on the cluster to see whose spiked next -- it worked but took some time).
> If we had per-region read/write sizes in metrics, that would have saved a bunch of diagnostic time.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-8318) TableOutputFormat.TableRecordWriter should accept Increments
[ https://issues.apache.org/jira/browse/HBASE-8318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273760#comment-14273760 ]

Jean-Daniel Cryans commented on HBASE-8318:

[~clehene] Hey, so re-reading what I wrote, my opinion remains the same. The fact that nobody else picked this up seems to imply that it isn't a common use case either. I'm fine closing this.

> TableOutputFormat.TableRecordWriter should accept Increments
>
> Key: HBASE-8318
> URL: https://issues.apache.org/jira/browse/HBASE-8318
> Project: HBase
> Issue Type: Bug
> Reporter: Jean-Marc Spaggiari
> Assignee: Jean-Marc Spaggiari
> Attachments: HBASE-8318-v0-trunk.patch, HBASE-8318-v1-trunk.patch, HBASE-8318-v2-trunk.patch
>
> TableOutputFormat.TableRecordWriter can take Puts and Deletes, but it should also accept Increments.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12774) Fix the inconsistent permission checks for bulkloading.
[ https://issues.apache.org/jira/browse/HBASE-12774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14262361#comment-14262361 ]

Jean-Daniel Cryans commented on HBASE-12774:

Can you fix up the javadoc that I missed in HBASE-11008 while you're there?

https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/security/access/AccessController.java#L1961

Also, I'd like [~apurtell]'s benediction on this.

> Fix the inconsistent permission checks for bulkloading.
> -------------------------------------------------------
>
> Key: HBASE-12774
> URL: https://issues.apache.org/jira/browse/HBASE-12774
> Project: HBase
> Issue Type: Bug
> Reporter: Srikanth Srungarapu
> Assignee: Srikanth Srungarapu
> Priority: Minor
> Attachments: HBASE-12774.patch, HBASE-12774_v2.patch
>
> Three checks (prePrepareBulkLoad, preCleanupBulkLoad, preBulkLoadHFile) are being done while performing secure bulk load, and it looks like the former two check for 'W', while the latter checks for 'C'. After an offline chat with [~jdcryans], it looks like the inconsistency among checks was unintended. So we can address this in multiple ways:
> * All checks should be for 'W'
> * All checks should be for 'C'
> * All checks should be for both 'W' and 'C'
> Posting the initial patch going by the first option. Open to discussion.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12774) Fix the inconsistent permission checks for bulkloading.
[ https://issues.apache.org/jira/browse/HBASE-12774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14260570#comment-14260570 ]

Jean-Daniel Cryans commented on HBASE-12774:

Like we discussed, and because of HBASE-11008, it needs to be C, which is your second option. Now, adding a C+W check isn't something we're doing anywhere else (I think?) and I'm not sure if it makes sense. The way things currently work, I'm not sure it makes sense for a user to have C without having W.
[jira] [Commented] (HBASE-12770) Don't transfer all the queued hlogs of a dead server to the same alive server
[ https://issues.apache.org/jira/browse/HBASE-12770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14260345#comment-14260345 ]

Jean-Daniel Cryans commented on HBASE-12770:

bq. is it reasonable that the alive server only transfer one peer's hlogs from the dead server

Just to make sure I understand you correctly: you're saying that in the case where we have many peers, instead of one server grabbing all the queues, the servers should each grab only one so that the load is spread out. If so, this is reasonable, but I would change the jira's title to reflect that (it's kind of vague right now, unless that was the goal, to spur discussion).

bq. Regionservers could publish their queue depths and if there is a substantial difference detected, one RS could transfer a queue to the less loaded peer.

Yeah, but the details might be hard to get right. There might be a reason why a particular queue is growing (blocked on whatever), so it would start bouncing around. We could at least have admin tools to move queues though; right now we'd have to move all of them at the same time, but with the above work it might be possible to do it more granularly.

> Don't transfer all the queued hlogs of a dead server to the same alive server
> -----------------------------------------------------------------------------
>
> Key: HBASE-12770
> URL: https://issues.apache.org/jira/browse/HBASE-12770
> Project: HBase
> Issue Type: Improvement
> Components: Replication
> Reporter: cuijianwei
> Priority: Minor
>
> When a region server is down (or the cluster restarts), all of its hlog queues will be transferred to the same alive region server. In a shared cluster, we might create several peers replicating data to different peer clusters. There might be lots of hlogs queued for these peers for several reasons: some peers might be disabled, errors from a peer cluster might prevent the replication, or the replication sources may fail to read some hlog because of an hdfs problem. Then, if the server is down or restarted, another alive server will take all the replication jobs of the dead server. This might put a big pressure on the resources (network/disk read) of the alive server, and is also not fast enough to replicate the queued hlogs. And if that alive server goes down, all the replication jobs, including those taken from other dead servers, will once again be transferred wholesale to another alive server; this might leave one server with a large number of queued hlogs (in our shared cluster, we found one server could have thousands of hlogs queued for replication). As an option, is it reasonable for an alive server to only transfer one peer's hlogs from the dead server at a time? Then other alive region servers would have the opportunity to transfer the hlogs of the remaining peers. This may also help the queued hlogs be processed faster. Any discussion is welcome.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
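The proposal discussed above (each survivor claims only one peer's queue, so the dead server's load spreads out) can be sketched as a simple round-robin assignment. This is purely illustrative; the class and method names are made up and this is not the ReplicationQueues API:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class QueueFailoverSketch {
    // Hypothetical sketch: spread the dead server's per-peer hlog queues
    // round-robin across the surviving region servers instead of letting
    // one survivor claim all of them.
    static Map<String, List<String>> distribute(List<String> peerQueues,
                                                List<String> aliveServers) {
        Map<String, List<String>> assignment = new LinkedHashMap<>();
        for (String rs : aliveServers) {
            assignment.put(rs, new ArrayList<>());
        }
        for (int i = 0; i < peerQueues.size(); i++) {
            // i-th queue goes to the (i mod n)-th survivor
            String rs = aliveServers.get(i % aliveServers.size());
            assignment.get(rs).add(peerQueues.get(i));
        }
        return assignment;
    }
}
```

With three peer queues and two survivors, one server ends up with two queues and the other with one, instead of one server taking all three.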
[jira] [Commented] (HBASE-12665) When aborting, dump metrics
[ https://issues.apache.org/jira/browse/HBASE-12665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240456#comment-14240456 ]

Jean-Daniel Cryans commented on HBASE-12665:

+1, still seems like a lot, but better than showing everything. Also, I like it better when logged like that. We could also consider printing it out in a more compact way (it's basically how the metrics dump was before), but we can do that later if needed.

It seems that 0.98 will need a different patch.

> When aborting, dump metrics
> ---------------------------
>
> Key: HBASE-12665
> URL: https://issues.apache.org/jira/browse/HBASE-12665
> Project: HBase
> Issue Type: Bug
> Components: Operability
> Reporter: stack
> Assignee: stack
> Fix For: 1.0.0, 2.0.0, 0.98.9
> Attachments: 0001-First-cut.patch, 0001-HBASE-12665-When-aborting-dump-metrics.patch, 12665v3.txt, 12665v4.txt, 12665v5.txt, dump.txt
>
> We used to dump out all metrics when we were exiting on abort. It was of use debugging why the abort happened. [~jdcryans] noticed it was dropped by his brother [~eclark] over in HBASE-6410 "Move RegionServer Metrics to metrics2". To stop the two brothers fighting, I intervened with this patch.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12665) When aborting, dump metrics
[ https://issues.apache.org/jira/browse/HBASE-12665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240277#comment-14240277 ]

Jean-Daniel Cryans commented on HBASE-12665:

[~stack] From a usability standpoint, how many lines do you get when dumping the metrics? Looks like it would go into the hundreds (also, you are showing the master's metrics, standalone?). What I liked about what we had before was that it was concise and contained mostly stuff I was interested in, whereas now we seem to be getting all the histograms.

Also, it goes to stdout instead of being logged, was that intentional? I'm not sure how that's gonna come out if there's loads of stuff being logged (like there usually is when a RS is aborting), it might get mixed a bit.

Finally, it would be nice to get some description of what's being dumped. I was actually confused when I saw this part:

{noformat}
2014-12-09 13:44:21,244 FATAL [main] regionserver.HRegionServer(1927): RegionServer abort: loaded coprocessors are: [org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]
{
  "beans" : [ {
{noformat}

It felt like the metrics were a continuation of the coprocessors printout. I'm easy to confuse so that might just be me.

Thanks for picking this up. In this time of the year, family values are more important than ever.
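For reference, the JSON dump being reviewed here is built from the JVM's platform MBeanServer, the standard JMX registry that metrics2 publishes into. A minimal, hedged sketch of walking the registered beans (this is plain JMX API usage, not the JSONBean code from the patch):

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class BeanDump {
    // Return the ObjectNames registered with the platform MBeanServer,
    // the same source an abort-time metrics dump would read from.
    static Set<ObjectName> beanNames() throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // A null name pattern and null query match every registered bean.
        return server.queryNames(null, null);
    }

    public static void main(String[] args) throws Exception {
        for (ObjectName name : beanNames()) {
            System.out.println(name);
        }
    }
}
```

Even in a bare JVM this prints dozens of beans (memory, threading, GC, ...), which hints at why a full dump from a loaded region server runs to hundreds of lines.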
[jira] [Commented] (HBASE-12665) When aborting, dump metrics
[ https://issues.apache.org/jira/browse/HBASE-12665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240046#comment-14240046 ]

Jean-Daniel Cryans commented on HBASE-12665:

I think your first cut is missing something: I see you lifting a bunch of stuff from JMXJsonServlet into JSONBean, but the former still has most of the latter's code.
[jira] [Commented] (HBASE-12419) "Partial cell read caused by EOF" ERRORs on replication source during replication
[ https://issues.apache.org/jira/browse/HBASE-12419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199480#comment-14199480 ]

Jean-Daniel Cryans commented on HBASE-12419:

Yeah, getting partial reads has always been a reality in ReplicationSource, even before PBing everything. I'm +1 on TRACEing this, but what about all the other users of BaseDecoder? Are there cases where we rely on this LOG.error to show that something went wrong?

> "Partial cell read caused by EOF" ERRORs on replication source during replication
> ---------------------------------------------------------------------------------
>
> Key: HBASE-12419
> URL: https://issues.apache.org/jira/browse/HBASE-12419
> Project: HBase
> Issue Type: Bug
> Affects Versions: 0.98.7
> Reporter: Andrew Purtell
> Fix For: 2.0.0, 0.98.8, 0.99.2
> Attachments: TestReplicationIngest.patch
>
> We are seeing exceptions like these on the replication sources when replication is active:
> {noformat}
> 2014-11-04 01:20:19,738 ERROR [regionserver8120-EventThread.replicationSource,1] codec.BaseDecoder: Partial cell read caused by EOF: java.io.IOException: Premature EOF from inputStream
> {noformat}
> HBase 0.98.8-SNAPSHOT, Hadoop 2.4.1.
> Happens both with and without short circuit reads on the source cluster.
> I'm able to reproduce this reliably:
> # Set up two clusters. Can be single slave.
> # Enable replication in configuration
> # Use LoadTestTool -init_only on both clusters
> # On source cluster via shell: alter 'cluster_test',{NAME=>'test_cf',REPLICATION_SCOPE=>1}
> # On source cluster via shell: add_peer 'remote:port:/hbase'
> # On source cluster, LoadTestTool -skip_init -write 1:1024:10 -num_keys 100
> # Wait for LoadTestTool to complete
> # Use the shell to verify 1M rows are in 'cluster_test' on the target cluster.
> All 1M rows will replicate without data loss, but I'll see 5-15 instances of "Partial cell read caused by EOF" messages logged from codec.BaseDecoder at ERROR level on the replication source.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-11316) Expand info about compactions beyond HBASE-11120
[ https://issues.apache.org/jira/browse/HBASE-11316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078104#comment-14078104 ]

Jean-Daniel Cryans commented on HBASE-11316:

+1. Thanks for looking at this, Stack, and sorry I overlooked those failing tests.

> Expand info about compactions beyond HBASE-11120
>
> Key: HBASE-11316
> URL: https://issues.apache.org/jira/browse/HBASE-11316
> Project: HBase
> Issue Type: Bug
> Components: Compaction, documentation
> Reporter: Misty Stanley-Jones
> Assignee: Misty Stanley-Jones
> Fix For: 2.0.0
> Attachments: HBASE-11316-1.patch, HBASE-11316-2.patch, HBASE-11316-3.patch, HBASE-11316-4.patch, HBASE-11316-5.patch, HBASE-11316-6-rebased-v2.patch, HBASE-11316-6-rebased.patch, HBASE-11316-6.patch, HBASE-11316-7.patch, HBASE-11316-8.patch, HBASE-11316-9.patch, HBASE-11316.patch, addendum.txt, ch9_compactions.pdf
>
> Round 2 - expand info about the algorithms, talk about stripe compaction, and talk more about configuration. Hopefully find some rules of thumb.

-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HBASE-11316) Expand info about compactions beyond HBASE-11120
[ https://issues.apache.org/jira/browse/HBASE-11316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans updated HBASE-11316:

Resolution: Fixed
Fix Version/s: 2.0.0
Hadoop Flags: Reviewed
Status: Resolved (was: Patch Available)

Pushed to master. Thanks for all the hard work documenting our most complicated stuff, Misty.
[jira] [Commented] (HBASE-11316) Expand info about compactions beyond HBASE-11120
[ https://issues.apache.org/jira/browse/HBASE-11316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077225#comment-14077225 ]

Jean-Daniel Cryans commented on HBASE-11316:

Comments:

{noformat}
-Name of the failsafe snapshot taken by the restore operation.
+Name of the failsafe snapshot taken by the restore opecompn.
{noformat}

A typo slipped in there.

{noformat}
+ Being Stuck
{noformat}

Looking at the output, this line (and the other titles at that level) is comically small. There's probably little we can do short of changing the style sheet.

And I'm good with the rest.
[jira] [Updated] (HBASE-11522) Move Replication information into the Ref Guide
[ https://issues.apache.org/jira/browse/HBASE-11522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans updated HBASE-11522:

Resolution: Fixed
Fix Version/s: 2.0.0
Hadoop Flags: Reviewed
Status: Resolved (was: Patch Available)

I compiled the site and looked at the end result, +1. Pushed to master. Thanks Misty!

> Move Replication information into the Ref Guide
> -----------------------------------------------
>
> Key: HBASE-11522
> URL: https://issues.apache.org/jira/browse/HBASE-11522
> Project: HBase
> Issue Type: Bug
> Components: documentation
> Reporter: Misty Stanley-Jones
> Assignee: Misty Stanley-Jones
> Fix For: 2.0.0
> Attachments: HBASE-11522-1.patch, HBASE-11522.patch, replication_overview.png
>
> Currently, information about replication lives at http://hbase.apache.org/replication.html and http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/replication/package-summary.html#requirements. IMHO, it makes sense at least for the first link to be in the Ref Guide instead of where it is now. This came up because of HBASE-10398.

-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (HBASE-11388) The order parameter is wrong when invoking the constructor of the ReplicationPeer In the method "getPeer" of the class ReplicationPeersZKImpl
[ https://issues.apache.org/jira/browse/HBASE-11388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans resolved HBASE-11388.

Resolution: Fixed
Fix Version/s: (was: 0.99.0)
Hadoop Flags: Reviewed

You're right, so I just pushed the patch to 0.98. Thanks!

> The order parameter is wrong when invoking the constructor of the ReplicationPeer In the method "getPeer" of the class ReplicationPeersZKImpl
> ---------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-11388
> URL: https://issues.apache.org/jira/browse/HBASE-11388
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Affects Versions: 0.99.0, 0.98.3
> Reporter: Qianxi Zhang
> Assignee: Qianxi Zhang
> Priority: Minor
> Fix For: 0.98.5
> Attachments: HBASE_11388.patch, HBASE_11388_trunk_V1.patch
>
> The parameters are "Configuration", "clusterKey" and "id" in the constructor of the class ReplicationPeer, but the argument order is "Configuration", "id" and "clusterKey" when invoking the constructor in the method "getPeer" of the class ReplicationPeersZKImpl.
> ReplicationPeer#76
> {code}
> public ReplicationPeer(Configuration conf, String key, String id) throws ReplicationException {
>   this.conf = conf;
>   this.clusterKey = key;
>   this.id = id;
>   try {
>     this.reloadZkWatcher();
>   } catch (IOException e) {
>     throw new ReplicationException("Error connecting to peer cluster with peerId=" + id, e);
>   }
> }
> {code}
> ReplicationPeersZKImpl#498
> {code}
> ReplicationPeer peer =
>     new ReplicationPeer(peerConf, peerId, ZKUtil.getZooKeeperClusterKey(peerConf));
> {code}

-- This message was sent by Atlassian JIRA (v6.2#6252)
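The bug fixed above is the classic same-typed-argument swap: clusterKey and id are both Strings, so passing them in the wrong order compiles cleanly and only misbehaves at runtime. A toy reproduction (hypothetical names, not the real ReplicationPeer class):

```java
public class PeerArgsDemo {
    // Both constructor parameters are Strings, so a swapped call site
    // compiles without any warning -- the shape of the bug in
    // ReplicationPeersZKImpl#getPeer.
    static final class Peer {
        final String clusterKey;
        final String id;
        Peer(String clusterKey, String id) {
            this.clusterKey = clusterKey;
            this.id = id;
        }
    }

    static Peer buggy(String clusterKey, String id) {
        return new Peer(id, clusterKey);   // swapped: compiles, stores wrong data
    }

    static Peer fixed(String clusterKey, String id) {
        return new Peer(clusterKey, id);   // arguments in declared order
    }
}
```

One common defense against this class of bug is wrapping such values in distinct types (or a builder), so the compiler can catch a swap instead of the ZooKeeper connection failing later.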
[jira] [Updated] (HBASE-11388) The parameter order is wrong when invoking the constructor of ReplicationPeer in the method "getPeer" of the class ReplicationPeersZKImpl
[ https://issues.apache.org/jira/browse/HBASE-11388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans updated HBASE-11388: --- Status: Open (was: Patch Available) Well, this needs a rebase. I'm back from vacation now, [~qianxiZhang], so if you resubmit a patch this week we could get this in quickly. > The parameter order is wrong when invoking the constructor of > ReplicationPeer in the method "getPeer" of the class ReplicationPeersZKImpl > - > > Key: HBASE-11388 > URL: https://issues.apache.org/jira/browse/HBASE-11388 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 0.98.3, 0.99.0 >Reporter: Qianxi Zhang >Assignee: Qianxi Zhang >Priority: Minor > Fix For: 0.99.0, 0.98.5 > > Attachments: HBASE_11388.patch, HBASE_11388_trunk_V1.patch > > > The parameters are "Configuration", "ClusterKey" and "id" in the constructor > of the class ReplicationPeer, but the parameter order is "Configuration", > "id" and "ClusterKey" when invoking the constructor of ReplicationPeer in > the method "getPeer" of the class ReplicationPeersZKImpl > ReplicationPeer#76 > {code} > public ReplicationPeer(Configuration conf, String key, String id) throws > ReplicationException { > this.conf = conf; > this.clusterKey = key; > this.id = id; > try { > this.reloadZkWatcher(); > } catch (IOException e) { > throw new ReplicationException("Error connecting to peer cluster with > peerId=" + id, e); > } > } > {code} > ReplicationPeersZKImpl#498 > {code} > ReplicationPeer peer = > new ReplicationPeer(peerConf, peerId, > ZKUtil.getZooKeeperClusterKey(peerConf)); > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HBASE-11388) The parameter order is wrong when invoking the constructor of ReplicationPeer in the method "getPeer" of the class ReplicationPeersZKImpl
[ https://issues.apache.org/jira/browse/HBASE-11388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans updated HBASE-11388: --- Status: Patch Available (was: Open) +1 on the second patch, I'll get a Hadoop QA run for good measure. > The parameter order is wrong when invoking the constructor of > ReplicationPeer in the method "getPeer" of the class ReplicationPeersZKImpl > - > > Key: HBASE-11388 > URL: https://issues.apache.org/jira/browse/HBASE-11388 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 0.98.3, 0.99.0 >Reporter: Qianxi Zhang >Assignee: Qianxi Zhang >Priority: Minor > Fix For: 0.99.0, 0.98.5 > > Attachments: HBASE_11388.patch, HBASE_11388_trunk_V1.patch > > > The parameters are "Configuration", "ClusterKey" and "id" in the constructor > of the class ReplicationPeer, but the parameter order is "Configuration", > "id" and "ClusterKey" when invoking the constructor of ReplicationPeer in > the method "getPeer" of the class ReplicationPeersZKImpl > ReplicationPeer#76 > {code} > public ReplicationPeer(Configuration conf, String key, String id) throws > ReplicationException { > this.conf = conf; > this.clusterKey = key; > this.id = id; > try { > this.reloadZkWatcher(); > } catch (IOException e) { > throw new ReplicationException("Error connecting to peer cluster with > peerId=" + id, e); > } > } > {code} > ReplicationPeersZKImpl#498 > {code} > ReplicationPeer peer = > new ReplicationPeer(peerConf, peerId, > ZKUtil.getZooKeeperClusterKey(peerConf)); > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-11522) Move Replication information into the Ref Guide
[ https://issues.apache.org/jira/browse/HBASE-11522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072071#comment-14072071 ] Jean-Daniel Cryans commented on HBASE-11522: A few little things: {noformat} +Apache HBase (TM) provides a replication mechanism to copy data between HBase {noformat} We needed to put the whole Apache HBase (TM) here because we were mentioning HBase for the first time, but now that it's in the middle of the documentation we can replace it with just "HBase provides a replication...". {noformat} + {noformat} It's not a procedure, it's more a description of what happens when you replicate. {noformat} + Replication FAQ {noformat} We can remove the whole section, it's stale and nobody's been asking those questions. Finally, we could use a few more "xml:id" tags, at least for the first level below replication. > Move Replication information into the Ref Guide > --- > > Key: HBASE-11522 > URL: https://issues.apache.org/jira/browse/HBASE-11522 > Project: HBase > Issue Type: Bug > Components: documentation >Reporter: Misty Stanley-Jones >Assignee: Misty Stanley-Jones > Attachments: HBASE-11522.patch, replication_overview.png > > > Currently, information about replication lives at > http://hbase.apache.org/replication.html and > http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/replication/package-summary.html#requirements. > IMHO, it makes sense at least for the first link to be in the Ref Guide > instead of where it is now. This came up because of HBASE-10398. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-11316) Expand info about compactions beyond HBASE-11120
[ https://issues.apache.org/jira/browse/HBASE-11316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049476#comment-14049476 ] Jean-Daniel Cryans commented on HBASE-11316: IMO, unless we specifically refer to something related to the HFile implementation, we should always use StoreFile. That's also how we present them in the web UI when listing the number of Stores and StoreFiles in a region. > Expand info about compactions beyond HBASE-11120 > > > Key: HBASE-11316 > URL: https://issues.apache.org/jira/browse/HBASE-11316 > Project: HBase > Issue Type: Bug > Components: Compaction, documentation >Reporter: Misty Stanley-Jones >Assignee: Misty Stanley-Jones > Attachments: HBASE-11316-1.patch, HBASE-11316-2.patch, > HBASE-11316-3.patch, HBASE-11316-4.patch, HBASE-11316-5.patch, > HBASE-11316.patch, ch9_compactions.pdf > > > Round 2 - expand info about the algorithms, talk about stripe compaction, and > talk more about configuration. Hopefully find some rules of thumb. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-11388) The parameter order is wrong when invoking the constructor of ReplicationPeer in the method "getPeer" of the class ReplicationPeersZKImpl
[ https://issues.apache.org/jira/browse/HBASE-11388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14042287#comment-14042287 ] Jean-Daniel Cryans commented on HBASE-11388: Ah, sorry, I wasn't clear. I was thinking that we should only remove it from the constructor, but then populate that field within the constructor using getZooKeeperClusterKey. > The parameter order is wrong when invoking the constructor of > ReplicationPeer in the method "getPeer" of the class ReplicationPeersZKImpl > - > > Key: HBASE-11388 > URL: https://issues.apache.org/jira/browse/HBASE-11388 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 0.99.0, 0.98.3 >Reporter: Qianxi Zhang >Assignee: Qianxi Zhang >Priority: Minor > Fix For: 0.99.0 > > Attachments: HBASE_11388.patch > > > The parameters are "Configuration", "ClusterKey" and "id" in the constructor > of the class ReplicationPeer, but the parameter order is "Configuration", > "id" and "ClusterKey" when invoking the constructor of ReplicationPeer in > the method "getPeer" of the class ReplicationPeersZKImpl > ReplicationPeer#76 > {code} > public ReplicationPeer(Configuration conf, String key, String id) throws > ReplicationException { > this.conf = conf; > this.clusterKey = key; > this.id = id; > try { > this.reloadZkWatcher(); > } catch (IOException e) { > throw new ReplicationException("Error connecting to peer cluster with > peerId=" + id, e); > } > } > {code} > ReplicationPeersZKImpl#498 > {code} > ReplicationPeer peer = > new ReplicationPeer(peerConf, peerId, > ZKUtil.getZooKeeperClusterKey(peerConf)); > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
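The suggestion above — drop the clusterKey parameter and derive it inside the constructor so callers can no longer swap arguments — could look roughly like the following. This is a minimal self-contained sketch, not the real HBase code: the Configuration and ZKUtil classes below are simplified stand-ins for the actual HBase classes, and the default quorum/port/znode values are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for org.apache.hadoop.conf.Configuration.
class Configuration {
    private final Map<String, String> props = new HashMap<>();
    String get(String key, String dflt) { return props.getOrDefault(key, dflt); }
    void set(String key, String value) { props.put(key, value); }
}

// Simplified stand-in mirroring ZKUtil#getZooKeeperClusterKey: quorum:port:parent.
class ZKUtil {
    static String getZooKeeperClusterKey(Configuration conf) {
        return conf.get("hbase.zookeeper.quorum", "localhost") + ":"
             + conf.get("hbase.zookeeper.property.clientPort", "2181") + ":"
             + conf.get("zookeeper.znode.parent", "/hbase");
    }
}

class ReplicationPeer {
    final Configuration conf;
    final String id;
    final String clusterKey;

    // Only (conf, id) remain, so id and clusterKey can no longer be passed
    // in the wrong order; clusterKey is inferred from the peer's Configuration.
    ReplicationPeer(Configuration conf, String id) {
        this.conf = conf;
        this.id = id;
        this.clusterKey = ZKUtil.getZooKeeperClusterKey(conf);
    }
}
```

With this shape, the call site in ReplicationPeersZKImpl shrinks to `new ReplicationPeer(peerConf, peerId)` and the original bug becomes impossible to write.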
[jira] [Commented] (HBASE-11388) The parameter order is wrong when invoking the constructor of ReplicationPeer in the method "getPeer" of the class ReplicationPeersZKImpl
[ https://issues.apache.org/jira/browse/HBASE-11388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040977#comment-14040977 ] Jean-Daniel Cryans commented on HBASE-11388: Wow nice one [~qianxiZhang], I'm surprised that things even work, but it looks like we don't do anything useful with clusterKey in ReplicationPeer, we just use the configuration. I think a better interface would be to just remove the passing of the clusterKey in ReplicationPeer's constructor since we can infer it using ZKUtil#getZooKeeperClusterKey(). > The parameter order is wrong when invoking the constructor of > ReplicationPeer in the method "getPeer" of the class ReplicationPeersZKImpl > - > > Key: HBASE-11388 > URL: https://issues.apache.org/jira/browse/HBASE-11388 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 0.99.0, 0.98.3 >Reporter: Qianxi Zhang >Assignee: Qianxi Zhang >Priority: Minor > Fix For: 0.99.0 > > Attachments: HBASE_11388.patch > > > The parameters are "Configuration", "ClusterKey" and "id" in the constructor > of the class ReplicationPeer, but the parameter order is "Configuration", > "id" and "ClusterKey" when invoking the constructor of ReplicationPeer in > the method "getPeer" of the class ReplicationPeersZKImpl > ReplicationPeer#76 > {code} > public ReplicationPeer(Configuration conf, String key, String id) throws > ReplicationException { > this.conf = conf; > this.clusterKey = key; > this.id = id; > try { > this.reloadZkWatcher(); > } catch (IOException e) { > throw new ReplicationException("Error connecting to peer cluster with > peerId=" + id, e); > } > } > {code} > ReplicationPeersZKImpl#498 > {code} > ReplicationPeer peer = > new ReplicationPeer(peerConf, peerId, > ZKUtil.getZooKeeperClusterKey(peerConf)); > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-10504) Define Replication Interface
[ https://issues.apache.org/jira/browse/HBASE-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033216#comment-14033216 ] Jean-Daniel Cryans commented on HBASE-10504: bq. Do you want to do this in a subtask? Yes, please. > Define Replication Interface > > > Key: HBASE-10504 > URL: https://issues.apache.org/jira/browse/HBASE-10504 > Project: HBase > Issue Type: Task >Reporter: stack >Assignee: Enis Soztutar >Priority: Blocker > Fix For: 0.99.0 > > Attachments: hbase-10504_v1.patch, hbase-10504_wip1.patch > > > HBase has replication. Fellas have been hijacking the replication apis to do > all kinds of perverse stuff like indexing hbase content (hbase-indexer > https://github.com/NGDATA/hbase-indexer) and our [~toffer] just showed up w/ > overrides that replicate via an alternate channel (over a secure thrift > channel between dcs over on HBASE-9360). This issue is about surfacing these > APIs as public with guarantees to downstreamers similar to those we have on > our public client-facing APIs (and so we don't break them for downstreamers). > Any input [~phunt] or [~gabriel.reid] or [~toffer]? > Thanks. > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-6192) Document ACL matrix in the book
[ https://issues.apache.org/jira/browse/HBASE-6192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14032601#comment-14032601 ] Jean-Daniel Cryans commented on HBASE-6192: --- bq. The code seems to indicate that BulkLoadHFile requires CREATE but the PDF says it requires WRITE. Which is true? I changed it in HBASE-11008, the explanation is in the release note. > Document ACL matrix in the book > --- > > Key: HBASE-6192 > URL: https://issues.apache.org/jira/browse/HBASE-6192 > Project: HBase > Issue Type: Task > Components: documentation, security >Affects Versions: 0.94.1, 0.95.2 >Reporter: Enis Soztutar >Assignee: Misty Stanley-Jones > Labels: documentaion, security > Fix For: 0.99.0 > > Attachments: HBASE-6192-rebased.patch, HBASE-6192.patch, HBase > Security-ACL Matrix.pdf, HBase Security-ACL Matrix.pdf, HBase Security-ACL > Matrix.pdf, HBase Security-ACL Matrix.xls, HBase Security-ACL Matrix.xls, > HBase Security-ACL Matrix.xls > > > We have an excellent matrix at > https://issues.apache.org/jira/secure/attachment/12531252/Security-ACL%20Matrix.pdf > for ACL. Once the changes are done, we can adapt that and put it in the > book, also add some more documentation about the new authorization features. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HBASE-8844) Document the removal of replication state AKA start/stop_replication
[ https://issues.apache.org/jira/browse/HBASE-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans updated HBASE-8844: -- Resolution: Fixed Fix Version/s: 0.99.0 Status: Resolved (was: Patch Available) Pushed to trunk, thanks Misty! > Document the removal of replication state AKA start/stop_replication > > > Key: HBASE-8844 > URL: https://issues.apache.org/jira/browse/HBASE-8844 > Project: HBase > Issue Type: Task > Components: documentation >Reporter: Patrick >Assignee: Misty Stanley-Jones >Priority: Minor > Fix For: 0.99.0 > > Attachments: HBASE-8844-v2.patch, HBASE-8844-v3-rebased.patch, > HBASE-8844-v3.patch, HBASE-8844.patch > > > The first two tutorials for enabling replication that google gives me [1], > [2] take very different tones with regard to stop_replication. The HBase docs > [1] make it sound fine to start and stop replication as desired. The Cloudera > docs [2] say it may cause data loss. > Which is true? If data loss is possible, are we talking about data loss in > the primary cluster, or data loss in the standby cluster (presumably would > require reinitializing the sync with a new CopyTable). > [1] > http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/replication/package-summary.html#requirements > [2] > http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/CDH4-Installation-Guide/cdh4ig_topic_20_11.html -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HBASE-8844) Document the removal of replication state AKA start/stop_replication
[ https://issues.apache.org/jira/browse/HBASE-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans updated HBASE-8844: -- Summary: Document the removal of replication state AKA start/stop_replication (was: Document stop_replication danger) Updating the jira's summary to better reflect what we ended up doing. > Document the removal of replication state AKA start/stop_replication > > > Key: HBASE-8844 > URL: https://issues.apache.org/jira/browse/HBASE-8844 > Project: HBase > Issue Type: Task > Components: documentation >Reporter: Patrick >Assignee: Misty Stanley-Jones >Priority: Minor > Attachments: HBASE-8844-v2.patch, HBASE-8844-v3-rebased.patch, > HBASE-8844-v3.patch, HBASE-8844.patch > > > The first two tutorials for enabling replication that google gives me [1], > [2] take very different tones with regard to stop_replication. The HBase docs > [1] make it sound fine to start and stop replication as desired. The Cloudera > docs [2] say it may cause data loss. > Which is true? If data loss is possible, are we talking about data loss in > the primary cluster, or data loss in the standby cluster (presumably would > require reinitializing the sync with a new CopyTable). > [1] > http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/replication/package-summary.html#requirements > [2] > http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/CDH4-Installation-Guide/cdh4ig_topic_20_11.html -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-10504) Define Replication Interface
[ https://issues.apache.org/jira/browse/HBASE-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029806#comment-14029806 ] Jean-Daniel Cryans commented on HBASE-10504: As long as the RE daemons are disjoint from the region servers that would work, since you need to be able to have as many/few of them as you want, plus to be able to have them on different machines. I could even see this as being useful for certain use cases that want inter-HBase replication to be done through specific gateways. We then have to architect a way to discover those endpoints, etc... kind of what the current replication code already does with sinks. > Define Replication Interface > > > Key: HBASE-10504 > URL: https://issues.apache.org/jira/browse/HBASE-10504 > Project: HBase > Issue Type: Task >Reporter: stack >Assignee: Enis Soztutar >Priority: Blocker > Fix For: 0.99.0 > > Attachments: hbase-10504_v1.patch, hbase-10504_wip1.patch > > > HBase has replication. Fellas have been hijacking the replication apis to do > all kinds of perverse stuff like indexing hbase content (hbase-indexer > https://github.com/NGDATA/hbase-indexer) and our [~toffer] just showed up w/ > overrides that replicate via an alternate channel (over a secure thrift > channel between dcs over on HBASE-9360). This issue is about surfacing these > APIs as public with guarantees to downstreamers similar to those we have on > our public client-facing APIs (and so we don't break them for downstreamers). > Any input [~phunt] or [~gabriel.reid] or [~toffer]? > Thanks. > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-10504) Define Replication Interface
[ https://issues.apache.org/jira/browse/HBASE-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029275#comment-14029275 ] Jean-Daniel Cryans commented on HBASE-10504: [~enis], as per the discussion above, I think what you are proposing is good but it doesn't solve the problem that this jira is about. How about we open a new jira, titled something like "Make ReplicationSource extendable", and do the review/commit over there? > Define Replication Interface > > > Key: HBASE-10504 > URL: https://issues.apache.org/jira/browse/HBASE-10504 > Project: HBase > Issue Type: Task >Reporter: stack >Assignee: Enis Soztutar >Priority: Blocker > Fix For: 0.99.0 > > Attachments: hbase-10504_v1.patch, hbase-10504_wip1.patch > > > HBase has replication. Fellas have been hijacking the replication apis to do > all kinds of perverse stuff like indexing hbase content (hbase-indexer > https://github.com/NGDATA/hbase-indexer) and our [~toffer] just showed up w/ > overrides that replicate via an alternate channel (over a secure thrift > channel between dcs over on HBASE-9360). This issue is about surfacing these > APIs as public with guarantees to downstreamers similar to those we have on > our public client-facing APIs (and so we don't break them for downstreamers). > Any input [~phunt] or [~gabriel.reid] or [~toffer]? > Thanks. > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-8844) Document stop_replication danger
[ https://issues.apache.org/jira/browse/HBASE-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025308#comment-14025308 ] Jean-Daniel Cryans commented on HBASE-8844: --- [~misty] I was about to commit and was double checking the documentation and I figured that we should remove the whole "state znode" section since that was also connected to start/stop_replication. I can do it on commit if that's ok with you. > Document stop_replication danger > > > Key: HBASE-8844 > URL: https://issues.apache.org/jira/browse/HBASE-8844 > Project: HBase > Issue Type: Task > Components: documentation >Reporter: Patrick >Assignee: Misty Stanley-Jones >Priority: Minor > Attachments: HBASE-8844-v2.patch, HBASE-8844-v3-rebased.patch, > HBASE-8844-v3.patch, HBASE-8844.patch > > > The first two tutorials for enabling replication that google gives me [1], > [2] take very different tones with regard to stop_replication. The HBase docs > [1] make it sound fine to start and stop replication as desired. The Cloudera > docs [2] say it may cause data loss. > Which is true? If data loss is possible, are we talking about data loss in > the primary cluster, or data loss in the standby cluster (presumably would > require reinitializing the sync with a new CopyTable). > [1] > http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/replication/package-summary.html#requirements > [2] > http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/CDH4-Installation-Guide/cdh4ig_topic_20_11.html -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-11120) Update documentation about major compaction algorithm
[ https://issues.apache.org/jira/browse/HBASE-11120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025292#comment-14025292 ] Jean-Daniel Cryans commented on HBASE-11120: bq. I don't see hbase.hstore.compaction.min.size in code base any more Yeah I was confused by this too, [~stack], until I found this thread that you started for this reason a while back: http://osdir.com/ml/general/2013-06/msg41068.html Basically, {{CompactionConfiguration}} was written with a "configuration prefix" so doing a grep on that config (or hbase.hstore.compaction.max.size) won't return anything. I think we should undo the "refactoring" in that class since it obviously breaks a pattern we're used to, which is to completely spell out the configuration names so we can easily search for them. > Update documentation about major compaction algorithm > - > > Key: HBASE-11120 > URL: https://issues.apache.org/jira/browse/HBASE-11120 > Project: HBase > Issue Type: Bug > Components: Compaction, documentation >Affects Versions: 0.98.2 >Reporter: Misty Stanley-Jones >Assignee: Misty Stanley-Jones > Attachments: HBASE-11120-2.patch, HBASE-11120-3-rebased.patch, > HBASE-11120-3.patch, HBASE-11120-4-rebased.patch, HBASE-11120-4.patch, > HBASE-11120.patch > > > [14:20:38] seems that there's > http://hbase.apache.org/book.html#compaction and > http://hbase.apache.org/book.html#managed.compactions > [14:20:56] the latter doesn't say much, except that you > should manage them > [14:21:44] the former gives a good description of the > _old_ selection algo > [14:45:25] this is the new selection algo since C5 / > 0.96.0: https://issues.apache.org/jira/browse/HBASE-7842 -- This message was sent by Atlassian JIRA (v6.2#6252)
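The "configuration prefix" problem described above can be illustrated with a tiny sketch. This is not the real CompactionConfiguration class, just a hypothetical reconstruction of the pattern being criticized: the key name is assembled from a prefix at runtime, so the full string never appears in the source and a grep for it finds nothing.

```java
// Hypothetical reconstruction of the prefix pattern in CompactionConfiguration.
class CompactionConfigKeys {
    static final String CONFIG_PREFIX = "hbase.hstore.compaction.";

    // The literal "hbase.hstore.compaction.min.size" never appears in the
    // source file, so `grep -r "hbase.hstore.compaction.min.size"` comes up
    // empty even though the key is read and honored at runtime.
    static String minSizeKey() { return CONFIG_PREFIX + "min.size"; }
    static String maxSizeKey() { return CONFIG_PREFIX + "max.size"; }
}
```

Undoing the refactoring would mean declaring each key as a fully spelled-out string constant, restoring grep-ability at the cost of a little repetition.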
[jira] [Commented] (HBASE-11120) Update documentation about major compaction algorithm
[ https://issues.apache.org/jira/browse/HBASE-11120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012875#comment-14012875 ] Jean-Daniel Cryans commented on HBASE-11120: Spent some more time reading through and I saw that I had missed a few things. BTW, thanks again Misty for working on this, it is much needed. I'm not sure about this part: {quote} However, if the normal compaction +algorithm do not find any normally-eligible StoreFiles, a major compaction is +the only way to get out of this situation, and is forced. This is also called a +size-based or size-triggered major compaction. {quote} The exploration compaction will select the smallest set of files to compact and it won't trigger a major compaction. Also, I looked for "size-based" and "size-triggered" in the code but couldn't find anything so I'm not sure what this is referring to. This is wrong: {quote} +If this compaction was user-requested, do a major compaction. + + Compactions can run on a schedule or can be initiated manually. If a +compaction is requested manually, it is always a major compaction. If the +compaction is user-requested, the major compaction still happens even if the are +more than hbase.hstore.compaction.max files that need +compaction. + {quote} It will do a minor compaction if the user asks for that, or a major compaction if it's specifically asked for. It might be good to mention that hbase.hstore.compactionThreshold was the old name for hbase.hstore.compaction.min in this section: {quote} + +hbase.hstore.compaction.min +The minimum number of files which must be eligible for compaction before + compaction can run. +3 + {quote} Those two are wrong (if 1000 bytes was the max we'd probably never compact :P): {quote} + +hbase.hstore.compaction.min.size +A StoreFile smaller than this size (in bytes) will always be eligible for + minor compaction. 
+10 + + +hbase.hstore.compaction.max.size +A StoreFile larger than this size (in bytes) will be excluded from minor + compaction. +1000 + {quote} It should be 128MB and Long.MAX_VALUE as per: https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/compactions/CompactionConfiguration.java#L71 > Update documentation about major compaction algorithm > - > > Key: HBASE-11120 > URL: https://issues.apache.org/jira/browse/HBASE-11120 > Project: HBase > Issue Type: Bug > Components: Compaction, documentation >Affects Versions: 0.98.2 >Reporter: Misty Stanley-Jones >Assignee: Misty Stanley-Jones > Attachments: HBASE-11120-2.patch, HBASE-11120-3-rebased.patch, > HBASE-11120-3.patch, HBASE-11120.patch > > > [14:20:38] seems that there's > http://hbase.apache.org/book.html#compaction and > http://hbase.apache.org/book.html#managed.compactions > [14:20:56] the latter doesn't say much, except that you > should manage them > [14:21:44] the former gives a good description of the > _old_ selection algo > [14:45:25] this is the new selection algo since C5 / > 0.96.0: https://issues.apache.org/jira/browse/HBASE-7842 -- This message was sent by Atlassian JIRA (v6.2#6252)
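The corrected defaults referenced by that GitHub link can be summarized in a small sketch. This is a simplified stand-in, not the real CompactionConfiguration: min.size falls back to the memstore flush size (128 MB by default) and max.size to Long.MAX_VALUE, i.e. effectively no upper bound.

```java
// Simplified stand-in for the default values discussed above.
class CompactionDefaultsSketch {
    // hbase.hregion.memstore.flush.size default: 128 MB.
    static final long DEFAULT_MEMSTORE_FLUSH_SIZE = 128L * 1024 * 1024;

    // Value used when hbase.hstore.compaction.min.size is unset: StoreFiles
    // below this size are always eligible for minor compaction.
    static long defaultMinCompactSize() { return DEFAULT_MEMSTORE_FLUSH_SIZE; }

    // Value used when hbase.hstore.compaction.max.size is unset: with
    // Long.MAX_VALUE, no StoreFile is excluded from minor compaction by size.
    static long defaultMaxCompactSize() { return Long.MAX_VALUE; }
}
```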
[jira] [Commented] (HBASE-8844) Document stop_replication danger
[ https://issues.apache.org/jira/browse/HBASE-8844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012492#comment-14012492 ] Jean-Daniel Cryans commented on HBASE-8844: --- This part isn't right: {quote} It is also cached locally using an AtomicBoolean in the ReplicationZookeeper -class. This value can be changed on a live cluster using the stop_replication +class. This value can be changed on a live cluster using the disable_peer {quote} The AtomicBoolean was completely removed, so there's now no way to toggle hbase.replication while the cluster is running. So that part can be completely removed, and hbase.replication is now only read when HBase starts. Regarding disable_peer, it might be good to add that what it does is it stops sending the edits to that peer cluster, but it still keeps track of all the new WALs that it will have to replicate when it's re-enabled. > Document stop_replication danger > > > Key: HBASE-8844 > URL: https://issues.apache.org/jira/browse/HBASE-8844 > Project: HBase > Issue Type: Task > Components: documentation >Reporter: Patrick >Assignee: Misty Stanley-Jones >Priority: Minor > Attachments: HBASE-8844-v2.patch, HBASE-8844.patch > > > The first two tutorials for enabling replication that google gives me [1], > [2] take very different tones with regard to stop_replication. The HBase docs > [1] make it sound fine to start and stop replication as desired. The Cloudera > docs [2] say it may cause data loss. > Which is true? If data loss is possible, are we talking about data loss in > the primary cluster, or data loss in the standby cluster (presumably would > require reinitializing the sync with a new CopyTable). 
> [1] > http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/replication/package-summary.html#requirements > [2] > http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/CDH4-Installation-Guide/cdh4ig_topic_20_11.html -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-11196) Update description of -ROOT- in ref guide
[ https://issues.apache.org/jira/browse/HBASE-11196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011827#comment-14011827 ] Jean-Daniel Cryans commented on HBASE-11196: Nit: bq. tracked by Zookeeper Should be "stored in ZooKeeper". Regarding the section about {{HTablePool}}, it's now a deprecated class, so we should eventually update that too (but let's stick to this jira's scope). The rest looks good! > Update description of -ROOT- in ref guide > - > > Key: HBASE-11196 > URL: https://issues.apache.org/jira/browse/HBASE-11196 > Project: HBase > Issue Type: Bug > Components: documentation >Reporter: Dima Spivak >Assignee: Misty Stanley-Jones > Attachments: HBASE-11196.patch > > > Since the resolution of HBASE-3171, \-ROOT- is no longer used to store the > location(s) of .META. . Unfortunately, not all of [our > documentation|http://hbase.apache.org/book/arch.catalog.html] has been > updated to reflect this change in architecture. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-11120) Update documentation about major compaction algorithm
[ https://issues.apache.org/jira/browse/HBASE-11120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14006480#comment-14006480 ] Jean-Daniel Cryans commented on HBASE-11120: On top of Sergey's comments: This could also use a sentence regarding TTLs, they are simply filtered out and tombstones aren't created: {noformat} bq. Compaction and Deletions {noformat} Same kind of comment for this para, the old versions are filtered more than deleted: {noformat} +Compaction and Versions {noformat} One thing bothers me about this para: {noformat} +The compaction algorithms used by HBase have evolved over time. HBase 0.96 + introduced a new algorithm for compaction file selection. To find out about the old + algorithm, see . The rest of this section describes the new algorithm, which + was implemented in https://issues.apache.org/jira/browse/HBASE-7842";>HBASE-7842. {noformat} What's written next is much more about how to pick the files that will then be considered for compaction than the actual new policy which is in ExploringCompactionPolicy.java. In other words, the text that follows what I just quoted is ok but not really new in 0.96 via HBASE-7842, and it seems to me that we should explain what HBASE-7842 does. 
> Update documentation about major compaction algorithm > - > > Key: HBASE-11120 > URL: https://issues.apache.org/jira/browse/HBASE-11120 > Project: HBase > Issue Type: Bug > Components: Compaction, documentation >Affects Versions: 0.98.2 >Reporter: Misty Stanley-Jones >Assignee: Misty Stanley-Jones > Attachments: HBASE-11120.patch > > > [14:20:38] seems that there's > http://hbase.apache.org/book.html#compaction and > http://hbase.apache.org/book.html#managed.compactions > [14:20:56] the latter doesn't say much, except that you > should manage them > [14:21:44] the former gives a good description of the > _old_ selection algo > [14:45:25] this is the new selection algo since C5 / > 0.96.0: https://issues.apache.org/jira/browse/HBASE-7842 -- This message was sent by Atlassian JIRA (v6.2#6252)
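The distinction drawn above — picking candidate files versus the actual HBASE-7842 policy — can be made concrete with a greatly simplified sketch of the ExploringCompactionPolicy idea: scan contiguous windows of store-file sizes within the min/max file-count bounds, discard windows whose total size is too large, and prefer the window that compacts the most files (breaking ties by smallest total size). This is an illustration under those assumptions, not the real selection code.

```java
import java.util.ArrayList;
import java.util.List;

class ExploringSketch {
    // sizes: store-file sizes in age order; returns the chosen window.
    static List<Long> select(List<Long> sizes, int minFiles, int maxFiles, long maxTotal) {
        List<Long> best = new ArrayList<>();
        long bestSize = Long.MAX_VALUE;
        for (int start = 0; start < sizes.size(); start++) {
            for (int end = start + minFiles; end <= Math.min(sizes.size(), start + maxFiles); end++) {
                List<Long> window = sizes.subList(start, end);
                long total = window.stream().mapToLong(Long::longValue).sum();
                if (total > maxTotal) continue;  // window too expensive to compact
                // More files wins; on a tie, the smaller total size wins.
                if (window.size() > best.size()
                        || (window.size() == best.size() && total < bestSize)) {
                    best = new ArrayList<>(window);
                    bestSize = total;
                }
            }
        }
        return best;
    }
}
```

For example, with sizes [100, 12, 12, 12, 500], minFiles=3, maxFiles=4 and maxTotal=200, the window [100, 12, 12, 12] is chosen: it has the most files among windows under the size cap.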
[jira] [Resolved] (HBASE-10946) hbase.metrics.showTableName causes “hbase xxx.Hfile” runs to report “Inconsistent configuration”
[ https://issues.apache.org/jira/browse/HBASE-10946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans resolved HBASE-10946. Resolution: Duplicate I'm fixing this in HBASE-11188. > hbase.metrics.showTableName causes “hbase xxx.Hfile” runs to report > “Inconsistent configuration” > --- > > Key: HBASE-10946 > URL: https://issues.apache.org/jira/browse/HBASE-10946 > Project: HBase > Issue Type: Bug > Components: HFile >Affects Versions: 0.94.17 > Environment: hadoop 1.2.1 hbase0.94.17 >Reporter: Zhang Jingpeng >Priority: Minor > > When I run hbase org.apache.hadoop.hbase.io.hfile.HFile, it prints the error > "ERROR metrics.SchemaMetrics: Inconsistent configuration. Previous > configuration for using table name in metrics: true, new configuration: false". > I traced the message to the setUseTableName method reached from HFile, > called via > final boolean useTableNameNew = > conf.getBoolean(SHOW_TABLE_NAME_CONF_KEY, false); > setUseTableName(useTableNameNew); > but hbase-default.xml sets hbase.metrics.showTableName > to true -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-11133) Add an option to skip snapshot verification after Export
[ https://issues.apache.org/jira/browse/HBASE-11133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13993007#comment-13993007 ] Jean-Daniel Cryans commented on HBASE-11133: Something like "-skip-copy-sanity-check" would be better, also maybe describe more what the sanity check is doing? > Add an option to skip snapshot verification after Export > > > Key: HBASE-11133 > URL: https://issues.apache.org/jira/browse/HBASE-11133 > Project: HBase > Issue Type: Bug > Components: snapshots >Affects Versions: 0.99.0 >Reporter: Matteo Bertozzi >Assignee: Matteo Bertozzi >Priority: Trivial > Fix For: 0.99.0 > > Attachments: HBASE-11133-v0.patch, HBASE-11133-v1.patch > > > Add a "-skip-dst-verify" option to skip snapshot verification after Export -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HBASE-11188) "Inconsistent configuration" for SchemaMetrics is always shown
Jean-Daniel Cryans created HBASE-11188: -- Summary: "Inconsistent configuration" for SchemaMetrics is always shown Key: HBASE-11188 URL: https://issues.apache.org/jira/browse/HBASE-11188 Project: HBase Issue Type: Bug Components: metrics Affects Versions: 0.94.19 Reporter: Jean-Daniel Cryans Assignee: Jean-Daniel Cryans Fix For: 0.94.20 Attachments: HBASE-11188-0.94-v2.patch, HBASE-11188-0.94.patch Some users have been complaining about this message: {noformat} ERROR org.apache.hadoop.hbase.regionserver.metrics.SchemaMetrics: Inconsistent configuration. Previous configuration for using table name in metrics: true, new configuration: false {noformat} The interesting thing is that we see it with default configurations, which made me think that some code path must have been passing the wrong thing. I found that if SchemaConfigured is passed a null Configuration in its constructor that it will then pass null to SchemaMetrics#configureGlobally which will interpret useTableName as being false: {code} public static void configureGlobally(Configuration conf) { if (conf != null) { final boolean useTableNameNew = conf.getBoolean(SHOW_TABLE_NAME_CONF_KEY, false); setUseTableName(useTableNameNew); } else { setUseTableName(false); } } {code} It should be set to true since that's the new default, meaning we missed it in HBASE-5671. I found one code path that passes a null configuration, StoreFile.Reader extends SchemaConfigured and uses the constructor that only passes a Path, so the Configuration is set to null. I'm planning on just passing true instead of false, fixing the problem for almost everyone (those that disable this feature will get the error message). IMO it's not worth more efforts since it's a 0.94-only problem and it's not actually doing anything bad. I'm closing both HBASE-10990 and HBASE-10946 as duplicates. -- This message was sent by Atlassian JIRA (v6.2#6252)
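The fix described above amounts to flipping the fallback default when no Configuration is available. As a standalone sketch (simplified: a plain public boolean flag and a stand-in Conf interface replace the real SchemaMetrics state and Hadoop Configuration class), assuming the patch only changes the two defaults:

```java
// Simplified sketch of the HBASE-11188 fix: when configureGlobally gets a
// null configuration, fall back to TRUE, matching the new default of
// hbase.metrics.showTableName introduced by HBASE-5671.
public class SchemaMetricsSketch {
  public static final String SHOW_TABLE_NAME_CONF_KEY = "hbase.metrics.showTableName";
  public static boolean useTableName;

  // Stand-in for the relevant part of Hadoop's Configuration API.
  public interface Conf { boolean getBoolean(String key, boolean dflt); }

  public static void configureGlobally(Conf conf) {
    if (conf != null) {
      // The default passed here also changes from false to true.
      useTableName = conf.getBoolean(SHOW_TABLE_NAME_CONF_KEY, true);
    } else {
      // Before the patch this branch set the flag to false, so any code path
      // passing a null conf (e.g. StoreFile.Reader via the Path-only
      // SchemaConfigured constructor) disagreed with the default and
      // triggered the "Inconsistent configuration" error.
      useTableName = true;
    }
  }
}
```

With this change, only deployments that explicitly set the key to false see the (harmless) message, as the release note below describes.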
[jira] [Commented] (HBASE-6990) Pretty print TTL
[ https://issues.apache.org/jira/browse/HBASE-6990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000440#comment-14000440 ] Jean-Daniel Cryans commented on HBASE-6990: --- That representation is pretty much my original ask, +1 > Pretty print TTL > > > Key: HBASE-6990 > URL: https://issues.apache.org/jira/browse/HBASE-6990 > Project: HBase > Issue Type: Improvement > Components: Usability >Reporter: Jean-Daniel Cryans >Assignee: Esteban Gutierrez >Priority: Minor > Attachments: HBASE-6990.v0.patch, HBASE-6990.v1.patch, > HBASE-6990.v2.patch, HBASE-6990.v3.patch, HBASE-6990.v4.patch > > > I've seen a lot of users getting confused by the TTL configuration and I > think that if we just pretty printed it it would solve most of the issues. > For example, let's say a user wanted to set a TTL of 90 days. That would be > 7776000. But let's say that it was typo'd to 77760000 instead, it gives you > 900 days! > So when we print the TTL we could do something like "x days, x hours, x > minutes, x seconds (real_ttl_value)". This would also help people when they > use ms instead of seconds as they would see really big values in there. -- This message was sent by Atlassian JIRA (v6.2#6252)
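The representation being +1'd in the description could look like the sketch below; the class and method names are illustrative, not the committed patch:

```java
// Sketch of pretty-printing a TTL given in seconds as
// "x days x hours x minutes x seconds (real_ttl_value)", so a typo'd
// 77760000 is immediately visible as 900 days rather than the intended 90.
public class PrettyTtl {
  public static String humanReadableTtl(long ttlSeconds) {
    long days = ttlSeconds / 86400;
    long hours = (ttlSeconds % 86400) / 3600;
    long minutes = (ttlSeconds % 3600) / 60;
    long seconds = ttlSeconds % 60;
    // Keep the raw value in parentheses so scripts can still parse it.
    return days + " days " + hours + " hours " + minutes
        + " minutes " + seconds + " seconds (" + ttlSeconds + ")";
  }
}
```

For example, `humanReadableTtl(7776000)` yields "90 days 0 hours 0 minutes 0 seconds (7776000)", while the typo'd 77760000 starts with "900 days", and a value accidentally given in milliseconds stands out as an absurdly large day count.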
[jira] [Updated] (HBASE-11188) "Inconsistent configuration" for SchemaMetrics is always shown
[ https://issues.apache.org/jira/browse/HBASE-11188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans updated HBASE-11188: --- Attachment: HBASE-11188-0.94-v2.patch To be extra correct there's another line that should default to true, attaching the patch. > "Inconsistent configuration" for SchemaMetrics is always shown > -- > > Key: HBASE-11188 > URL: https://issues.apache.org/jira/browse/HBASE-11188 > Project: HBase > Issue Type: Bug > Components: metrics >Affects Versions: 0.94.19 >Reporter: Jean-Daniel Cryans >Assignee: Jean-Daniel Cryans > Fix For: 0.94.20 > > Attachments: HBASE-11188-0.94-v2.patch, HBASE-11188-0.94.patch > > > Some users have been complaining about this message: > {noformat} > ERROR org.apache.hadoop.hbase.regionserver.metrics.SchemaMetrics: > Inconsistent configuration. Previous configuration for using table name in > metrics: true, new configuration: false > {noformat} > The interesting thing is that we see it with default configurations, which > made me think that some code path must have been passing the wrong thing. I > found that if SchemaConfigured is passed a null Configuration in its > constructor that it will then pass null to SchemaMetrics#configureGlobally > which will interpret useTableName as being false: > {code} > public static void configureGlobally(Configuration conf) { > if (conf != null) { > final boolean useTableNameNew = > conf.getBoolean(SHOW_TABLE_NAME_CONF_KEY, false); > setUseTableName(useTableNameNew); > } else { > setUseTableName(false); > } > } > {code} > It should be set to true since that's the new default, meaning we missed it > in HBASE-5671. > I found one code path that passes a null configuration, StoreFile.Reader > extends SchemaConfigured and uses the constructor that only passes a Path, so > the Configuration is set to null. 
> I'm planning on just passing true instead of false, fixing the problem for > almost everyone (those that disable this feature will get the error message). > IMO it's not worth more efforts since it's a 0.94-only problem and it's not > actually doing anything bad. > I'm closing both HBASE-10990 and HBASE-10946 as duplicates. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HBASE-11188) "Inconsistent configuration" for SchemaMetrics is always shown
[ https://issues.apache.org/jira/browse/HBASE-11188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans updated HBASE-11188: --- Attachment: HBASE-11188-0.94.patch Proposing this one-liner patch, passes both TestSchemaConfigured and TestSchemaMetrics. > "Inconsistent configuration" for SchemaMetrics is always shown > -- > > Key: HBASE-11188 > URL: https://issues.apache.org/jira/browse/HBASE-11188 > Project: HBase > Issue Type: Bug > Components: metrics >Affects Versions: 0.94.19 >Reporter: Jean-Daniel Cryans >Assignee: Jean-Daniel Cryans > Fix For: 0.94.20 > > Attachments: HBASE-11188-0.94-v2.patch, HBASE-11188-0.94.patch > > > Some users have been complaining about this message: > {noformat} > ERROR org.apache.hadoop.hbase.regionserver.metrics.SchemaMetrics: > Inconsistent configuration. Previous configuration for using table name in > metrics: true, new configuration: false > {noformat} > The interesting thing is that we see it with default configurations, which > made me think that some code path must have been passing the wrong thing. I > found that if SchemaConfigured is passed a null Configuration in its > constructor that it will then pass null to SchemaMetrics#configureGlobally > which will interpret useTableName as being false: > {code} > public static void configureGlobally(Configuration conf) { > if (conf != null) { > final boolean useTableNameNew = > conf.getBoolean(SHOW_TABLE_NAME_CONF_KEY, false); > setUseTableName(useTableNameNew); > } else { > setUseTableName(false); > } > } > {code} > It should be set to true since that's the new default, meaning we missed it > in HBASE-5671. > I found one code path that passes a null configuration, StoreFile.Reader > extends SchemaConfigured and uses the constructor that only passes a Path, so > the Configuration is set to null. 
> I'm planning on just passing true instead of false, fixing the problem for > almost everyone (those that disable this feature will get the error message). > IMO it's not worth more efforts since it's a 0.94-only problem and it's not > actually doing anything bad. > I'm closing both HBASE-10990 and HBASE-10946 as duplicates. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (HBASE-11188) "Inconsistent configuration" for SchemaMetrics is always shown
[ https://issues.apache.org/jira/browse/HBASE-11188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans resolved HBASE-11188. Resolution: Fixed Release Note: Region servers with the default value for hbase.metrics.showTableName will stop showing the error message "ERROR org.apache.hadoop.hbase.regionserver.metrics.SchemaMetrics: Inconsistent configuration. Previous configuration for using table name in metrics: true, new configuration: false". Region servers configured with hbase.metrics.showTableName=false should now get a message like this one: "ERROR org.apache.hadoop.hbase.regionserver.metrics.SchemaMetrics: Inconsistent configuration. Previous configuration for using table name in metrics: false, new configuration: true", and it's nothing to be concerned about. Hadoop Flags: Reviewed Grazie Matteo, I committed the patch to 0.94 > "Inconsistent configuration" for SchemaMetrics is always shown > -- > > Key: HBASE-11188 > URL: https://issues.apache.org/jira/browse/HBASE-11188 > Project: HBase > Issue Type: Bug > Components: metrics >Affects Versions: 0.94.19 >Reporter: Jean-Daniel Cryans >Assignee: Jean-Daniel Cryans > Fix For: 0.94.20 > > Attachments: HBASE-11188-0.94-v2.patch, HBASE-11188-0.94.patch > > > Some users have been complaining about this message: > {noformat} > ERROR org.apache.hadoop.hbase.regionserver.metrics.SchemaMetrics: > Inconsistent configuration. Previous configuration for using table name in > metrics: true, new configuration: false > {noformat} > The interesting thing is that we see it with default configurations, which > made me think that some code path must have been passing the wrong thing. 
I > found that if SchemaConfigured is passed a null Configuration in its > constructor that it will then pass null to SchemaMetrics#configureGlobally > which will interpret useTableName as being false: > {code} > public static void configureGlobally(Configuration conf) { > if (conf != null) { > final boolean useTableNameNew = > conf.getBoolean(SHOW_TABLE_NAME_CONF_KEY, false); > setUseTableName(useTableNameNew); > } else { > setUseTableName(false); > } > } > {code} > It should be set to true since that's the new default, meaning we missed it > in HBASE-5671. > I found one code path that passes a null configuration, StoreFile.Reader > extends SchemaConfigured and uses the constructor that only passes a Path, so > the Configuration is set to null. > I'm planning on just passing true instead of false, fixing the problem for > almost everyone (those that disable this feature will get the error message). > IMO it's not worth more efforts since it's a 0.94-only problem and it's not > actually doing anything bad. > I'm closing both HBASE-10990 and HBASE-10946 as duplicates. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (HBASE-10990) Running CompressionTest yields unrelated schema metrics configuration errors.
[ https://issues.apache.org/jira/browse/HBASE-10990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans resolved HBASE-10990. Resolution: Duplicate I'm fixing this in HBASE-11188. > Running CompressionTest yields unrelated schema metrics configuration errors. > - > > Key: HBASE-10990 > URL: https://issues.apache.org/jira/browse/HBASE-10990 > Project: HBase > Issue Type: Bug >Affects Versions: 0.94.18 >Reporter: Srikanth Srungarapu >Assignee: Srikanth Srungarapu >Priority: Minor > Attachments: HBASE-10990.patch > > > In a vanilla configuration, running CompressionTest yields the following > error: > sudo -u hdfs hbase org.apache.hadoop.hbase.util.CompressionTest > /path/to/hfile gz > Output: > 13/03/07 14:49:40 ERROR metrics.SchemaMetrics: Inconsistent configuration. > Previous configuration for using table name in metrics: true, new > configuration: false -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-5671) hbase.metrics.showTableName should be true by default
[ https://issues.apache.org/jira/browse/HBASE-5671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999172#comment-13999172 ] Jean-Daniel Cryans commented on HBASE-5671: --- For anyone stumbling here because of this message "Inconsistent configuration. Previous configuration for using table name in metrics: true, new configuration: false", please see HBASE-11188. > hbase.metrics.showTableName should be true by default > - > > Key: HBASE-5671 > URL: https://issues.apache.org/jira/browse/HBASE-5671 > Project: HBase > Issue Type: Improvement > Components: metrics >Reporter: Enis Soztutar >Assignee: Enis Soztutar >Priority: Critical > Fix For: 0.94.0, 0.95.0 > > Attachments: HBASE-5671_v1.patch > > > HBASE-4768 added per-cf metrics and a new configuration option > "hbase.metrics.showTableName". We should switch the conf option to true by > default, since it is not intuitive (at least to me) to aggregate per-cf > across tables by default, and it seems confusing to report on cf's without > table names. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-10504) Define Replication Interface
[ https://issues.apache.org/jira/browse/HBASE-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997727#comment-13997727 ] Jean-Daniel Cryans commented on HBASE-10504: [~enis], sounds like a plan. Oh and something we should not forget is setting the right InterfaceAudience. [~gabriel.reid] you down with this? It means that this class (1) will need a few changes just to continue working like it currently does, and eventually a lot more will need to change to use the replication consumer but it should simplify the code a lot. It will also have an impact on the deployment since you won't need a fake slave cluster anymore. 1. https://github.com/NGDATA/hbase-indexer/blob/master/hbase-sep/hbase-sep-impl-0.98/src/main/java/com/ngdata/sep/impl/SepReplicationSource.java > Define Replication Interface > > > Key: HBASE-10504 > URL: https://issues.apache.org/jira/browse/HBASE-10504 > Project: HBase > Issue Type: Task >Reporter: stack >Assignee: stack >Priority: Blocker > Fix For: 0.99.0 > > Attachments: hbase-10504_wip1.patch > > > HBase has replication. Fellas have been hijacking the replication apis to do > all kinds of perverse stuff like indexing hbase content (hbase-indexer > https://github.com/NGDATA/hbase-indexer) and our [~toffer] just showed up w/ > overrides that replicate via an alternate channel (over a secure thrift > channel between dcs over on HBASE-9360). This issue is about surfacing these > APIs as public with guarantees to downstreamers similar to those we have on > our public client-facing APIs (and so we don't break them for downstreamers). > Any input [~phunt] or [~gabriel.reid] or [~toffer]? > Thanks. > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HBASE-10251) Restore API Compat for PerformanceEvaluation.generateValue()
[ https://issues.apache.org/jira/browse/HBASE-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans updated HBASE-10251: --- Resolution: Fixed Fix Version/s: (was: 0.98.2) (was: 0.98.1) (was: 0.98.0) 0.98.3 Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Committed to 0.98 and trunk, thanks Dima! > Restore API Compat for PerformanceEvaluation.generateValue() > > > Key: HBASE-10251 > URL: https://issues.apache.org/jira/browse/HBASE-10251 > Project: HBase > Issue Type: Bug > Components: Client >Affects Versions: 0.98.0, 0.98.1, 0.99.0, 0.98.2 >Reporter: Aleksandr Shulman >Assignee: Dima Spivak > Labels: api_compatibility > Fix For: 0.99.0, 0.98.3 > > Attachments: HBASE-10251-v2.patch, HBASE_10251.patch, > HBASE_10251_v3.patch > > > Observed: > A couple of my client tests fail to compile against trunk because the method > PerformanceEvaluation.generateValue was removed as part of HBASE-8496. > This is an issue because it was used in a number of places, including unit > tests. Since we did not explicitly label this API as private, it's ambiguous > as to whether this could/should have been used by people writing apps against > 0.96. If they used it, then they would be broken upon upgrade to 0.98 and > trunk. > Potential Solution: > The method was renamed to generateData, but the logic is still the same. We > can reintroduce it as deprecated in 0.98, as compat shim over generateData. > The patch should be a few lines. We may also consider doing so in trunk, but > I'd be just as fine with leaving it out. > More generally, this raises the question about what other code is in this > "grey-area", where it is public, is used outside of the package, but is not > explicitly labeled with an AudienceInterface. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (HBASE-10251) Restore API Compat for PerformanceEvaluation.generateValue()
[ https://issues.apache.org/jira/browse/HBASE-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans reassigned HBASE-10251: -- Assignee: Dima Spivak (was: Jean-Daniel Cryans) God I hate jira, so easy to misclick. Reassigning Dima. > Restore API Compat for PerformanceEvaluation.generateValue() > > > Key: HBASE-10251 > URL: https://issues.apache.org/jira/browse/HBASE-10251 > Project: HBase > Issue Type: Bug > Components: Client >Affects Versions: 0.98.0, 0.98.1, 0.99.0, 0.98.2 >Reporter: Aleksandr Shulman >Assignee: Dima Spivak > Labels: api_compatibility > Fix For: 0.98.0, 0.98.1, 0.99.0, 0.98.2 > > Attachments: HBASE-10251-v2.patch, HBASE_10251.patch > > > Observed: > A couple of my client tests fail to compile against trunk because the method > PerformanceEvaluation.generateValue was removed as part of HBASE-8496. > This is an issue because it was used in a number of places, including unit > tests. Since we did not explicitly label this API as private, it's ambiguous > as to whether this could/should have been used by people writing apps against > 0.96. If they used it, then they would be broken upon upgrade to 0.98 and > trunk. > Potential Solution: > The method was renamed to generateData, but the logic is still the same. We > can reintroduce it as deprecated in 0.98, as compat shim over generateData. > The patch should be a few lines. We may also consider doing so in trunk, but > I'd be just as fine with leaving it out. > More generally, this raises the question about what other code is in this > "grey-area", where it is public, is used outside of the package, but is not > explicitly labeled with an AudienceInterface. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (HBASE-10251) Restore API Compat for PerformanceEvaluation.generateValue()
[ https://issues.apache.org/jira/browse/HBASE-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans reassigned HBASE-10251: -- Assignee: Jean-Daniel Cryans (was: Dima Spivak) > Restore API Compat for PerformanceEvaluation.generateValue() > > > Key: HBASE-10251 > URL: https://issues.apache.org/jira/browse/HBASE-10251 > Project: HBase > Issue Type: Bug > Components: Client >Affects Versions: 0.98.0, 0.98.1, 0.99.0, 0.98.2 >Reporter: Aleksandr Shulman >Assignee: Jean-Daniel Cryans > Labels: api_compatibility > Fix For: 0.98.0, 0.98.1, 0.99.0, 0.98.2 > > Attachments: HBASE-10251-v2.patch, HBASE_10251.patch > > > Observed: > A couple of my client tests fail to compile against trunk because the method > PerformanceEvaluation.generateValue was removed as part of HBASE-8496. > This is an issue because it was used in a number of places, including unit > tests. Since we did not explicitly label this API as private, it's ambiguous > as to whether this could/should have been used by people writing apps against > 0.96. If they used it, then they would be broken upon upgrade to 0.98 and > trunk. > Potential Solution: > The method was renamed to generateData, but the logic is still the same. We > can reintroduce it as deprecated in 0.98, as compat shim over generateData. > The patch should be a few lines. We may also consider doing so in trunk, but > I'd be just as fine with leaving it out. > More generally, this raises the question about what other code is in this > "grey-area", where it is public, is used outside of the package, but is not > explicitly labeled with an AudienceInterface. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-10251) Restore API Compat for PerformanceEvaluation.generateValue()
[ https://issues.apache.org/jira/browse/HBASE-10251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996651#comment-13996651 ] Jean-Daniel Cryans commented on HBASE-10251: Pass VALUE_LENGTH directly instead of re-declaring it. > Restore API Compat for PerformanceEvaluation.generateValue() > > > Key: HBASE-10251 > URL: https://issues.apache.org/jira/browse/HBASE-10251 > Project: HBase > Issue Type: Bug > Components: Client >Affects Versions: 0.98.0, 0.98.1, 0.99.0, 0.98.2 >Reporter: Aleksandr Shulman >Assignee: Dima Spivak > Labels: api_compatibility > Fix For: 0.98.0, 0.98.1, 0.99.0, 0.98.2 > > Attachments: HBASE_10251.patch > > > Observed: > A couple of my client tests fail to compile against trunk because the method > PerformanceEvaluation.generateValue was removed as part of HBASE-8496. > This is an issue because it was used in a number of places, including unit > tests. Since we did not explicitly label this API as private, it's ambiguous > as to whether this could/should have been used by people writing apps against > 0.96. If they used it, then they would be broken upon upgrade to 0.98 and > trunk. > Potential Solution: > The method was renamed to generateData, but the logic is still the same. We > can reintroduce it as deprecated in 0.98, as compat shim over generateData. > The patch should be a few lines. We may also consider doing so in trunk, but > I'd be just as fine with leaving it out. > More generally, this raises the question about what other code is in this > "grey-area", where it is public, is used outside of the package, but is not > explicitly labeled with an AudienceInterface. -- This message was sent by Atlassian JIRA (v6.2#6252)
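The compat shim discussed in this issue, reintroducing the removed method as deprecated and delegating to its renamed successor while passing VALUE_LENGTH directly per the review comment, could look roughly like this. Signatures, the constant's value, and the data-generation logic are illustrative, not the actual PerformanceEvaluation code:

```java
// Sketch of an API-compatibility shim: the removed public method comes back,
// marked @Deprecated, and simply forwards to the renamed implementation.
public class CompatShimSketch {
  static final int VALUE_LENGTH = 1000; // hypothetical constant

  // The renamed method that holds the real logic (body is a placeholder).
  public static byte[] generateData(java.util.Random r, int length) {
    byte[] b = new byte[length];
    r.nextBytes(b);
    return b;
  }

  /** @deprecated use {@link #generateData(java.util.Random, int)} instead. */
  @Deprecated
  public static byte[] generateValue(java.util.Random r) {
    // Pass VALUE_LENGTH directly instead of re-declaring it.
    return generateData(r, VALUE_LENGTH);
  }
}
```

Callers compiled against the old API keep working and merely see a deprecation warning nudging them toward generateData.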
[jira] [Commented] (HBASE-10504) Define Replication Interface
[ https://issues.apache.org/jira/browse/HBASE-10504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995685#comment-13995685 ] Jean-Daniel Cryans commented on HBASE-10504: [~enis] So IIUC you are going a completely different way than what was discussed before, right? I don't mind it but I want to make sure that we're on the same track. So instead of having people define their sinks, they instead put that code in the ReplicationConsumer. Seems like a good idea, you don't have to mess around getting a fake cluster up and running, among other issues. Might still be nice to offer some tools like removeNonReplicableEdits(), the basic metrics, the handling of a disabled peer, etc., for the other consumers. > Define Replication Interface > > > Key: HBASE-10504 > URL: https://issues.apache.org/jira/browse/HBASE-10504 > Project: HBase > Issue Type: Task >Reporter: stack >Assignee: stack >Priority: Blocker > Fix For: 0.99.0 > > Attachments: hbase-10504_wip1.patch > > > HBase has replication. Fellas have been hijacking the replication apis to do > all kinds of perverse stuff like indexing hbase content (hbase-indexer > https://github.com/NGDATA/hbase-indexer) and our [~toffer] just showed up w/ > overrides that replicate via an alternate channel (over a secure thrift > channel between dcs over on HBASE-9360). This issue is about surfacing these > APIs as public with guarantees to downstreamers similar to those we have on > our public client-facing APIs (and so we don't break them for downstreamers). > Any input [~phunt] or [~gabriel.reid] or [~toffer]? > Thanks. > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-11143) ageOfLastShippedOp metric is confusing
[ https://issues.apache.org/jira/browse/HBASE-11143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995525#comment-13995525 ] Jean-Daniel Cryans commented on HBASE-11143: I'm +1 for the trunk patch too. > ageOfLastShippedOp metric is confusing > -- > > Key: HBASE-11143 > URL: https://issues.apache.org/jira/browse/HBASE-11143 > Project: HBase > Issue Type: Bug > Components: Replication >Reporter: Lars Hofhansl >Assignee: Lars Hofhansl > Fix For: 0.99.0, 0.96.3, 0.94.20, 0.98.3 > > Attachments: 11143-0.94-v2.txt, 11143-0.94.txt, 11143-trunk.txt > > > We are trying to report on replication lag and find that there is no good > single metric to do that. > ageOfLastShippedOp is close, but unfortunately it is increased even when > there is nothing to ship on a particular RegionServer. > I would like discuss a few options here: > Add a new metric: replicationQueueTime (or something) with the above meaning. > I.e. if we have something to ship we set the age of that last shipped edit, > if we fail we increment that last time (just like we do now). But if there is > nothing to replicate we set it to current time (and hence that metric is > reported to close to 0). > Alternatively we could change the meaning of ageOfLastShippedOp to mean to do > that. That might lead to surprises, but the current behavior is clearly weird > when there is nothing to replicate. > Comments? [~jdcryans], [~stack]. > If approach sounds good, I'll make a patch for all branches. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-11143) ageOfLastShippedOp metric is confusing
[ https://issues.apache.org/jira/browse/HBASE-11143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995242#comment-13995242 ] Jean-Daniel Cryans commented on HBASE-11143: +1, and can we get the new metric in 0.96+? > ageOfLastShippedOp metric is confusing > -- > > Key: HBASE-11143 > URL: https://issues.apache.org/jira/browse/HBASE-11143 > Project: HBase > Issue Type: Bug > Components: Replication >Reporter: Lars Hofhansl >Assignee: Lars Hofhansl > Fix For: 0.94.20 > > Attachments: 11143-0.94-v2.txt, 11143-0.94.txt > > > We are trying to report on replication lag and find that there is no good > single metric to do that. > ageOfLastShippedOp is close, but unfortunately it is increased even when > there is nothing to ship on a particular RegionServer. > I would like to discuss a few options here: > Add a new metric: replicationQueueTime (or something) with the above meaning. > I.e. if we have something to ship we set the age of that last shipped edit, > if we fail we increment that last time (just like we do now). But if there is > nothing to replicate we set it to current time (and hence that metric is > reported as close to 0). > Alternatively we could change the meaning of ageOfLastShippedOp to mean to do > that. That might lead to surprises, but the current behavior is clearly weird > when there is nothing to replicate. > Comments? [~jdcryans], [~stack]. > If this approach sounds good, I'll make a patch for all branches. -- This message was sent by Atlassian JIRA (v6.2#6252)
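The behavior proposed in the description, where the metric reports the age of the last shipped edit while shipping but drops to about zero when there is nothing left to replicate, can be sketched as follows. Class and method names are illustrative, not the actual replication metrics code:

```java
// Sketch of the proposed lag metric: while shipping, report how old the
// newest shipped edit was; when the queue is empty, report ~0 instead of
// letting the age keep growing, which is what made the old metric confusing.
public class ReplicationLagSketch {
  private long ageOfLastShippedOp;

  // Called after successfully shipping a batch whose newest edit has this
  // timestamp (both values in milliseconds).
  public void shippedBatch(long lastEditTimestamp, long now) {
    ageOfLastShippedOp = now - lastEditTimestamp;
  }

  // Called when there is nothing to replicate on this RegionServer:
  // lag is effectively zero, not "age of whatever we shipped last".
  public void nothingToReplicate() {
    ageOfLastShippedOp = 0;
  }

  public long getAgeOfLastShippedOp() {
    return ageOfLastShippedOp;
  }
}
```

On shipping failure the sketch simply leaves the last value in place to keep growing on the next update, mirroring the "if we fail we increment that last time" behavior described above.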
[jira] [Commented] (HBASE-10993) Deprioritize long-running scanners
[ https://issues.apache.org/jira/browse/HBASE-10993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13991179#comment-13991179 ] Jean-Daniel Cryans commented on HBASE-10993: I'm +1, there are a few things to fix in the comments but you can do that on commit (like "mantain"). BTW have you run this on a cluster where clients > handlers? > Deprioritize long-running scanners > -- > > Key: HBASE-10993 > URL: https://issues.apache.org/jira/browse/HBASE-10993 > Project: HBase > Issue Type: Sub-task >Reporter: Matteo Bertozzi >Assignee: Matteo Bertozzi >Priority: Minor > Fix For: 0.99.0 > > Attachments: HBASE-10993-v0.patch, HBASE-10993-v1.patch, > HBASE-10993-v2.patch, HBASE-10993-v3.patch, HBASE-10993-v4.patch, > HBASE-10993-v4.patch > > > Currently we have a single call queue that serves all the "normal user" > requests, and the requests are executed in FIFO. > When running map-reduce jobs and user-queries on the same machine, we want to > prioritize the user-queries. > Without changing too much code, and not having the user giving hints, we can > add a “vtime” field to the scanner, to keep track from how long is running. > And we can replace the callQueue with a priorityQueue. In this way we can > deprioritize long-running scans, the longer a scan request lives the less > priority it gets. -- This message was sent by Atlassian JIRA (v6.2#6252)
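The idea in the description, adding a "vtime" field tracking how long a scanner has been running and replacing the FIFO callQueue with a priority queue, can be sketched like this (a simplified standalone version, not the actual patch):

```java
import java.util.PriorityQueue;

// Sketch of deprioritizing long-running scanners: each call carries the
// accumulated vtime of its scanner, and a PriorityQueue orders calls so
// that the longer a scan has been running, the later it is served.
public class VtimeQueueSketch {
  public static class Call {
    public final String name;
    public final long scannerVtime; // how long this call's scanner has run

    public Call(String name, long scannerVtime) {
      this.name = name;
      this.scannerVtime = scannerVtime;
    }
  }

  // Lower vtime == higher priority; this replaces the FIFO callQueue.
  private final PriorityQueue<Call> callQueue =
      new PriorityQueue<>((a, b) -> Long.compare(a.scannerVtime, b.scannerVtime));

  public void add(Call c) { callQueue.add(c); }
  public Call poll() { return callQueue.poll(); }
}
```

A fresh user query (vtime near zero) thus jumps ahead of a map-reduce scan that has been running for minutes, without the user having to give any hints.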
[jira] [Commented] (HBASE-11116) Allow ReplicationSink to write to tables when READONLY is set
[ https://issues.apache.org/jira/browse/HBASE-11116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990246#comment-13990246 ] Jean-Daniel Cryans commented on HBASE-11116: I wonder if you could sort of enable that right now with ACLs. Basically, only the user that runs hbase can write to the tables on the slave cluster, everyone else is disallowed. > Allow ReplicationSink to write to tables when READONLY is set > - > > Key: HBASE-11116 > URL: https://issues.apache.org/jira/browse/HBASE-11116 > Project: HBase > Issue Type: New Feature > Components: regionserver, Replication >Reporter: Geoffrey Anderson > > A common practice with MySQL replication is to enable the read_only flag > (http://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html#sysvar_read_only) > on slaves in order to disallow any user changes to a slave, while allowing > replication from the master to flow in. In HBase, enabling the READONLY > field on a table will cause ReplicationSink to fail to apply changes. It'd > be ideal to have the default behavior allow replication to continue going > through, a configurable option to influence this type of behavior, or perhaps > even a flag that manipulates a peer's state (e.g. via add_peer) that dictates > if replication can write to READONLY tables. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HBASE-11008) Align bulk load, flush, and compact to require Action.CREATE
[ https://issues.apache.org/jira/browse/HBASE-11008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans updated HBASE-11008: --- Resolution: Fixed Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) And committed to 0.94, thanks everyone. > Align bulk load, flush, and compact to require Action.CREATE > > > Key: HBASE-11008 > URL: https://issues.apache.org/jira/browse/HBASE-11008 > Project: HBase > Issue Type: Improvement > Components: security >Reporter: Jean-Daniel Cryans >Assignee: Jean-Daniel Cryans > Fix For: 0.99.0, 0.98.2, 0.96.3, 0.94.20 > > Attachments: HBASE-11008-0.94.patch, HBASE-11008-v2.patch, > HBASE-11008-v3.patch, HBASE-11008.patch > > > Over in HBASE-10958 we noticed that it might make sense to require > Action.CREATE for bulk load, flush, and compact since it is also required for > things like enable and disable. > This means the following changes: > - preBulkLoadHFile goes from WRITE to CREATE > - compact/flush go from ADMIN to ADMIN or CREATE -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HBASE-10958) [dataloss] Bulk loading with seqids can prevent some log entries from being replayed
[ https://issues.apache.org/jira/browse/HBASE-10958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans updated HBASE-10958: --- Resolution: Fixed Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Now committed to 0.94 (took some time because I wanted to double check the tests I modified were all green). Thanks everyone. > [dataloss] Bulk loading with seqids can prevent some log entries from being > replayed > > > Key: HBASE-10958 > URL: https://issues.apache.org/jira/browse/HBASE-10958 > Project: HBase > Issue Type: Bug >Affects Versions: 0.96.2, 0.98.1, 0.94.18 >Reporter: Jean-Daniel Cryans >Assignee: Jean-Daniel Cryans >Priority: Blocker > Fix For: 0.99.0, 0.98.2, 0.96.3, 0.94.20 > > Attachments: HBASE-10958-0.94.patch, > HBASE-10958-less-intrusive-hack-0.96.patch, > HBASE-10958-quick-hack-0.96.patch, HBASE-10958-v2.patch, > HBASE-10958-v3.patch, HBASE-10958.patch > > > We found an issue with bulk loads causing data loss when assigning sequence > ids (HBASE-6630) that is triggered when replaying recovered edits. We're > nicknaming this issue *Blindspot*. > The problem is that the sequence id given to a bulk loaded file is higher > than those of the edits in the region's memstore. When replaying recovered > edits, the rule to skip some of them is that they have to be _lower than the > highest sequence id_. In other words, the edits that have a sequence id lower > than the highest one in the store files *should* have also been flushed. This > is not the case with bulk loaded files since we now have an HFile with a > sequence id higher than unflushed edits. > The log recovery code takes this into account by simply skipping the bulk > loaded files, but this "bulk loaded status" is *lost* on compaction. The > edits in the logs that have a sequence id lower than the bulk loaded file > that got compacted are put in a blind spot and are skipped during replay. 
> Here's the easiest way to recreate this issue: > - Create an empty table > - Put one row in it (let's say it gets seqid 1) > - Bulk load one file (it gets seqid 2). I used ImportTsv and set > hbase.mapreduce.bulkload.assign.sequenceNumbers. > - Bulk load a second file the same way (it gets seqid 3). > - Major compact the table (the new file has seqid 3 and isn't considered > bulk loaded). > - Kill the region server that holds the table's region. > - Scan the table once the region is made available again. The first row, at > seqid 1, will be missing since the HFile with seqid 3 makes us believe that > everything that came before it was flushed. -- This message was sent by Atlassian JIRA (v6.2#6252)
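The replay rule quoted above can be reduced to a toy simulation, using the seqid values from the repro steps (illustrative only, not HBase code):

```java
import java.util.ArrayList;
import java.util.List;

public class BlindspotSketch {
    public static void main(String[] args) {
        // The WAL still holds the unflushed put with seqid 1.
        List<Integer> walEdits = List.of(1);
        // After major compaction, the bulk-loaded file's seqid 3 is an
        // ordinary store-file seqid; its "bulk loaded" flag is gone.
        int maxStoreFileSeqId = 3;
        // Replay rule: an edit is skipped when its seqid is <= the
        // highest store-file seqid, on the assumption it was flushed.
        List<Integer> replayed = new ArrayList<>();
        for (int seqId : walEdits) {
            if (seqId > maxStoreFileSeqId) {
                replayed.add(seqId);
            }
        }
        // The put at seqid 1 falls in the blind spot and is lost.
        System.out.println("replayed=" + replayed);
    }
}
```

Nothing gets replayed: the high bulk-load seqid makes the recovery code believe everything before it was flushed.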
[jira] [Commented] (HBASE-10958) [dataloss] Bulk loading with seqids can prevent some log entries from being replayed
[ https://issues.apache.org/jira/browse/HBASE-10958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986053#comment-13986053 ] Jean-Daniel Cryans commented on HBASE-10958: Sorry, went to do something else, lemme get that in (with HBASE-11008 first). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HBASE-10958) [dataloss] Bulk loading with seqids can prevent some log entries from being replayed
[ https://issues.apache.org/jira/browse/HBASE-10958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans updated HBASE-10958: --- Release Note: Bulk loading with sequence IDs, an option in late 0.94 releases and the default since 0.96.0, will now trigger a flush per region that loads an HFile (if there's data that needs to be flushed). Committed to 0.96 and up. Like for HBASE-11008, I'm waiting to commit to 0.94, or I can open a backport jira. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HBASE-11008) Align bulk load, flush, and compact to require Action.CREATE
[ https://issues.apache.org/jira/browse/HBASE-11008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans updated HBASE-11008: --- Release Note: preBulkLoadHFile now requires CREATE, which it effectively already needed, since LoadIncrementalHFiles calls getTableDescriptor (itself requiring CREATE) before bulk loading. compact and flush can now be issued by users with CREATE permission. I committed to 0.96 and up. I'm waiting on [~lhofhansl]'s +1 for 0.94 (or I can create a backport jira). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (HBASE-10932) Improve RowCounter to allow mapper number set/control
[ https://issues.apache.org/jira/browse/HBASE-10932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans resolved HBASE-10932. Resolution: Won't Fix Resolving as won't fix. If you want to work on a more general solution, like adding this option to the TIF, please open a new jira. Thanks. > Improve RowCounter to allow mapper number set/control > - > > Key: HBASE-10932 > URL: https://issues.apache.org/jira/browse/HBASE-10932 > Project: HBase > Issue Type: Improvement > Components: mapreduce >Reporter: Yu Li >Assignee: Yu Li >Priority: Minor > Attachments: HBASE-10932_v1.patch, HBASE-10932_v2.patch > > > The typical use case of RowCounter is to do some kind of data integrity > checking, like after exporting some data from RDBMS to HBase, or from one > HBase cluster to another, making sure the row(record) number matches. Such > check commonly won't require much on response time. > Meanwhile, based on current impl, RowCounter will launch one mapper per > region, and each mapper will send one scan request. Assuming the table is > kind of big like having tens of regions, and the cpu core number of the whole > MR cluster is also enough, the parallel scan requests sent by mapper would be > a real burden for the HBase cluster. > So in this JIRA, we're proposing to make rowcounter support an additional > option "--maps" to specify mapper number, and make each mapper able to scan > more than one region of the target table. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-10932) Improve RowCounter to allow mapper number set/control
[ https://issues.apache.org/jira/browse/HBASE-10932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13981271#comment-13981271 ] Jean-Daniel Cryans commented on HBASE-10932: bq. But I didn't quite catch the point of job scheduler, in my understanding job scheduler is cluster-level and cannot be configured per-job, right? Well, by using a scheduler, you can constrain certain types of jobs so that they don't run as fast as they can. For example, with the fair scheduler you can configure a pool (let's call it the "slow pool") to have only {{maxMaps}} running concurrently on the cluster. Then, when you run your {{RowCounter}} jobs and whatnot, you can tie them automatically to the slow pool. Hadoop cluster operators usually know how to use a scheduler, whereas having to rely on the person who runs the jobs to configure them correctly can lead to human errors like "oops I forgot to pass the maps configuration to my row counter and now the website is down". It also works well if you have two users who want to concurrently run a row counter; they'll both get in the slow pool and only two mappers will run (alternating between the two jobs, unless you set different weights because one user is more important than the other, etc etc). If you were to rely on individual users specifying the correct number of maps, and they both set their job to use two, then you'd have four mappers running. Back to square one. Anyways, all of this to say that there's a more generic way of doing this, and it already exists. Can we close this jira, [~carp84]? 
-- This message was sent by Atlassian JIRA (v6.2#6252)
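The "slow pool" described in the comment above could be expressed with a fair scheduler allocation file along these lines (pool name and caps are illustrative; the element names follow the MR1 Fair Scheduler's allocation format):

```xml
<!-- fair-scheduler allocation file (illustrative sketch). Jobs
     submitted with mapred.fairscheduler.pool=slow share at most
     two concurrent map slots cluster-wide, regardless of how many
     RowCounter/VerifyReplication/Export jobs are running. -->
<allocations>
  <pool name="slow">
    <maxMaps>2</maxMaps>
    <maxReduces>1</maxReduces>
  </pool>
</allocations>
```

This is the operator-side equivalent of the proposed `--maps` option: the cap is enforced centrally, so two users running row counters concurrently still only get two mappers between them.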
[jira] [Updated] (HBASE-10981) taskmonitor use subList may cause recursion and get a java.lang.StackOverflowError
[ https://issues.apache.org/jira/browse/HBASE-10981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans updated HBASE-10981: --- Resolution: Duplicate Status: Resolved (was: Patch Available) Yup, resolving as dup of HBASE-10312. > taskmonitor use subList may cause recursion and get a > java.lang.StackOverflowError > -- > > Key: HBASE-10981 > URL: https://issues.apache.org/jira/browse/HBASE-10981 > Project: HBase > Issue Type: Bug >Affects Versions: 0.98.0, 0.96.0, 0.96.2, 0.98.1, 0.96.1.1, 0.98.2 > Environment: java version "1.7.0_03" 64bit >Reporter: sinfox >Priority: Critical > Attachments: fixsubListException.patch > > Original Estimate: 5m > Remaining Estimate: 5m > > 2014-04-14 11:52:45,905 ERROR [RS_CLOSE_REGION-in16-062:60020-1] > executor.EventHandler: Caught throwable while processing event > M_RS_CLOSE_REGION > java.lang.RuntimeException: java.lang.StackOverflowError > at > org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:161) > at > org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) > at java.lang.Thread.run(Thread.java:722) > Caused by: java.lang.StackOverflowError > at java.util.ArrayList$SubList.add(ArrayList.java:965) > at java.util.ArrayList$SubList.add(ArrayList.java:965) -- This message was sent by Atlassian JIRA (v6.2#6252)
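The overflow comes from re-assigning a subList view back onto the same variable: each call wraps another SubList, and on the JDK 7 builds in the report, SubList.add() delegates recursively through every wrapper. A minimal sketch of the anti-pattern and the copy-based fix (class and loop count are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class SubListNesting {
    public static void main(String[] args) {
        List<Integer> tasks = new ArrayList<>();
        tasks.add(0);
        // Anti-pattern: re-assigning the subList view onto the same
        // variable wraps a fresh SubList around the old one each time,
        // so the view chain grows on every call.
        for (int i = 0; i < 200_000; i++) {
            tasks = tasks.subList(0, tasks.size());
        }
        // On JDK 7/8, SubList.add() delegates to its parent view, so a
        // mutation at this point would recurse through all 200k
        // wrappers and die with the StackOverflowError in the report.
        // Fix: copy the range into a fresh ArrayList so no view chain
        // is retained (the issue was resolved as a dup of HBASE-10312).
        tasks = new ArrayList<>(tasks);
        tasks.add(1);
        System.out.println(tasks);
    }
}
```

Iterating a nested view is cheap (the iterator reads the backing array directly), so the copy is the safe way to "trim" a list in place.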
[jira] [Commented] (HBASE-3489) .oldlogs not being cleaned out
[ https://issues.apache.org/jira/browse/HBASE-3489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977835#comment-13977835 ] Jean-Daniel Cryans commented on HBASE-3489: --- We removed the kill switch, start/stop_replication don't exist anymore. > .oldlogs not being cleaned out > -- > > Key: HBASE-3489 > URL: https://issues.apache.org/jira/browse/HBASE-3489 > Project: HBase > Issue Type: Bug >Affects Versions: 0.90.0 > Environment: 10 Nodes Write Heavy Cluster >Reporter: Wayne > Attachments: oldlog.txt > > > The .oldlogs folder is never being cleaned up. The > hbase.master.logcleaner.ttl has been set to clean up the old logs but the > clean up is never kicking in. The limit of 10 files is not the problem. After > running for 5 days not a single log file has ever been deleted and the > logcleaner is set to 2 days (from the default of 7 days). It is assumed that > the replication changes that want to be sure to keep these logs around if > needed have caused the cleanup to be blocked. There is no replication defined > (knowingly). > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-11008) Align bulk load, flush, and compact to require Action.CREATE
[ https://issues.apache.org/jira/browse/HBASE-11008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977612#comment-13977612 ] Jean-Daniel Cryans commented on HBASE-11008: Thanks [~apurtell]. [~stack], [~lhofhansl], you guys ok for 0.96 and 0.94 respectively? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-11008) Align bulk load, flush, and compact to require Action.CREATE
[ https://issues.apache.org/jira/browse/HBASE-11008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977605#comment-13977605 ] Jean-Daniel Cryans commented on HBASE-11008: Gently prodding for reviews and +1s; if we commit this, we can get HBASE-10958 in. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HBASE-11008) Align bulk load, flush, and compact to require Action.CREATE
[ https://issues.apache.org/jira/browse/HBASE-11008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans updated HBASE-11008: --- Attachment: HBASE-11008-0.94.patch Attaching the patch for 0.94. I have to force using sequence ids else it won't exercise the flush that's coming in HBASE-10958. It doesn't change the semantics of the test itself. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-11043) Users with table's read/write permission can't get table's description
[ https://issues.apache.org/jira/browse/HBASE-11043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13976108#comment-13976108 ] Jean-Daniel Cryans commented on HBASE-11043: It's this way because of HBASE-8692. > Users with table's read/write permission can't get table's description > -- > > Key: HBASE-11043 > URL: https://issues.apache.org/jira/browse/HBASE-11043 > Project: HBase > Issue Type: Bug > Components: security >Affects Versions: 0.99.0 >Reporter: Liu Shaohui >Assignee: Liu Shaohui >Priority: Minor > Attachments: HBASE-11043-trunk-v1.diff > > > AccessController#preGetTableDescriptors only allow users with admin or create > permission to get table's description. > {quote} > requirePermission("getTableDescriptors", nameAsBytes, null, null, > Permission.Action.ADMIN, Permission.Action.CREATE); > {quote} > I think Users with table's read/write permission should also be able to get > table's description. > Eg: when create a hive table on HBase, hive will get the table description > to check if the mapping is right. Usually the hive users only have the read > permission of table. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-10932) Improve RowCounter to allow mapper number set/control
[ https://issues.apache.org/jira/browse/HBASE-10932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13974306#comment-13974306 ] Jean-Daniel Cryans commented on HBASE-10932: Hey [~carp84], I forgot about this issue, let me address your latest replies. bq. I thought it's designed for purpose to make each mapper just scan one single region That's more an implementation detail than a design, and we can further improve the implementation by giving more control to the power users. bq. This is useful especially in multi-tenant env, when we need to check data integrity for one user after data importing meanwhile don't want the scan burden to slow down RT of other users' request. Right, but again, resource management is a broader issue. I doubt that RowCounter is the only job that needs to be throttled, what about VerifyReplication? Or Export? Those jobs usually aren't latency sensitive and can run in the background. This can be simply handled by a correctly configured job scheduler, that's what they do. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HBASE-11008) Align bulk load, flush, and compact to require Action.CREATE
[ https://issues.apache.org/jira/browse/HBASE-11008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans updated HBASE-11008: --- Attachment: HBASE-11008-v3.patch New patch that fixes the documentation for flush and compact. This is good to go for 0.96 and up I think. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-10958) [dataloss] Bulk loading with seqids can prevent some log entries from being replayed
[ https://issues.apache.org/jira/browse/HBASE-10958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13973789#comment-13973789 ] Jean-Daniel Cryans commented on HBASE-10958: bq. method = name.getMethodName(); Will fix (looks like I copied that from one of the few methods in that class that doesn't do it). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-11008) Align bulk load, flush, and compact to require Action.CREATE
[ https://issues.apache.org/jira/browse/HBASE-11008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13973653#comment-13973653 ] Jean-Daniel Cryans commented on HBASE-11008: I can, but I'm not sure where to put them. That table doesn't have the concept of operations having multiple perms. Flush and compact are now both ADMIN and CREATE. I can put them in ADMIN until we find a better representation. Sounds good? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HBASE-10958) [dataloss] Bulk loading with seqids can prevent some log entries from being replayed
[ https://issues.apache.org/jira/browse/HBASE-10958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans updated HBASE-10958: --- Attachment: HBASE-10958-0.94.patch HBASE-10958-v3.patch Attaching 2 patches. One is a backport for 0.94. While doing the backport I saw that TestSnapshotFromMaster was failing, and Matteo was able to see that it was an error in my patch: flushing always returned that it needed compaction. I need to rerun all the tests now, but it was the only one that failed (haven't tried with security either). I also added a test in TestHRegion for that. The second patch is for trunk, in which I ported the same test to TestHRegion. Interestingly, it didn't work. I found that in 0.94 we compact if num_files > compactionThreshold, but in trunk it's >=, so it seems that we compact more often now. This patch also has the fixes from [~stack]'s comments. -- This message was sent by Atlassian JIRA (v6.2#6252)
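The off-by-one difference noted in the comment above (0.94 compacts on `>`, trunk on `>=`) as a two-line sketch, with a hypothetical threshold value:

```java
public class CompactionTrigger {
    public static void main(String[] args) {
        // Hypothetical value of hbase.hstore.compactionThreshold.
        int compactionThreshold = 3;
        int numFiles = 3; // store has exactly threshold-many files
        boolean compacts094 = numFiles > compactionThreshold;    // 0.94 rule
        boolean compactsTrunk = numFiles >= compactionThreshold; // trunk rule
        System.out.println(compacts094 + " " + compactsTrunk);
    }
}
```

At exactly threshold-many files, 0.94 waits while trunk compacts, which is why the ported test behaved differently.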
[jira] [Updated] (HBASE-11008) Align bulk load, flush, and compact to require Action.CREATE
[ https://issues.apache.org/jira/browse/HBASE-11008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans updated HBASE-11008: --- Attachment: HBASE-11008-v2.patch Patch with the modified documentation. FWIW I think that chapter needs a lot more love but it's out of scope. > Align bulk load, flush, and compact to require Action.CREATE > > > Key: HBASE-11008 > URL: https://issues.apache.org/jira/browse/HBASE-11008 > Project: HBase > Issue Type: Improvement > Components: security >Reporter: Jean-Daniel Cryans >Assignee: Jean-Daniel Cryans > Fix For: 0.99.0, 0.98.2, 0.96.3, 0.94.20 > > Attachments: HBASE-11008-v2.patch, HBASE-11008.patch > > > Over in HBASE-10958 we noticed that it might make sense to require > Action.CREATE for bulk load, flush, and compact since it is also required for > things like enable and disable. > This means the following changes: > - preBulkLoadHFile goes from WRITE to CREATE > - compact/flush go from ADMIN to ADMIN or CREATE -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-11011) Avoid extra getFileStatus() calls on Region startup
[ https://issues.apache.org/jira/browse/HBASE-11011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13973470#comment-13973470 ] Jean-Daniel Cryans commented on HBASE-11011: Alright, +1 > Avoid extra getFileStatus() calls on Region startup > --- > > Key: HBASE-11011 > URL: https://issues.apache.org/jira/browse/HBASE-11011 > Project: HBase > Issue Type: Bug > Components: regionserver >Affects Versions: 0.96.2, 0.98.1, 1.0.0 >Reporter: Matteo Bertozzi >Assignee: Matteo Bertozzi >Priority: Minor > Fix For: 1.0.0, 0.98.2, 0.96.3 > > Attachments: HBASE-11011-v0.patch, HBASE-11011-v1.patch, > HBASE-11011-v2.patch > > > On load we already have a StoreFileInfo and we create it from the path, > this will result in an extra fs.getFileStatus() call. > In completeCompactionMarker() we do a fs.exists() and later a > fs.getFileStatus() > to create the StoreFileInfo, we can avoid the exists. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-11011) Avoid extra getFileStatus() calls on Region startup
[ https://issues.apache.org/jira/browse/HBASE-11011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13973419#comment-13973419 ] Jean-Daniel Cryans commented on HBASE-11011: Does the changed code in completeCompactionMarker require a unit test? Or is there already one? Also fix those lines: {quote} +// If we scan the directory and the file is not present, may means: +// so, we can't do anything with the "compaction output list" since or is +// already loaded on startup, because in the store folder, or it may be not {quote} > Avoid extra getFileStatus() calls on Region startup > --- > > Key: HBASE-11011 > URL: https://issues.apache.org/jira/browse/HBASE-11011 > Project: HBase > Issue Type: Bug > Components: regionserver >Affects Versions: 0.96.2, 0.98.1, 1.0.0 >Reporter: Matteo Bertozzi >Assignee: Matteo Bertozzi >Priority: Minor > Fix For: 1.0.0, 0.98.2, 0.96.3 > > Attachments: HBASE-11011-v0.patch, HBASE-11011-v1.patch > > > On load we already have a StoreFileInfo and we create it from the path, > this will result in an extra fs.getFileStatus() call. > In completeCompactionMarker() we do a fs.exists() and later a > fs.getFileStatus() > to create the StoreFileInfo, we can avoid the exists. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-11011) Avoid extra getFileStatus() calls on Region startup
[ https://issues.apache.org/jira/browse/HBASE-11011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13973196#comment-13973196 ] Jean-Daniel Cryans commented on HBASE-11011: bq. In this case is not bad, is an "expected" situation, where the RS died before moving the compacted file. This might be, but the log message doesn't convey any of that. Taken out of context (so most users reading our logs), it just says a file is missing. bq. since is trying to lookup files in /table/region/family/ and if they are there are already loaded for sure. I'm trusting you on this one :) > Avoid extra getFileStatus() calls on Region startup > --- > > Key: HBASE-11011 > URL: https://issues.apache.org/jira/browse/HBASE-11011 > Project: HBase > Issue Type: Bug > Components: regionserver >Affects Versions: 0.96.2, 0.98.1, 1.0.0 >Reporter: Matteo Bertozzi >Assignee: Matteo Bertozzi >Priority: Minor > Fix For: 1.0.0, 0.98.2, 0.96.3 > > Attachments: HBASE-11011-v0.patch > > > On load we already have a StoreFileInfo and we create it from the path, > this will result in an extra fs.getFileStatus() call. > In completeCompactionMarker() we do a fs.exists() and later a > fs.getFileStatus() > to create the StoreFileInfo, we can avoid the exists. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-11011) Avoid extra getFileStatus() calls on Region startup
[ https://issues.apache.org/jira/browse/HBASE-11011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13973168#comment-13973168 ] Jean-Daniel Cryans commented on HBASE-11011: What should one do when seeing this message? {code} + LOG.debug("compaction file missing: " + outputPath); {code} The fact that a file is missing seems pretty bad, yet it's at DEBUG. > Avoid extra getFileStatus() calls on Region startup > --- > > Key: HBASE-11011 > URL: https://issues.apache.org/jira/browse/HBASE-11011 > Project: HBase > Issue Type: Bug > Components: regionserver >Affects Versions: 0.96.2, 0.98.1, 1.0.0 >Reporter: Matteo Bertozzi >Assignee: Matteo Bertozzi >Priority: Minor > Fix For: 1.0.0, 0.98.2, 0.96.3 > > Attachments: HBASE-11011-v0.patch > > > On load we already have a StoreFileInfo and we create it from the path, > this will result in an extra fs.getFileStatus() call. > In completeCompactionMarker() we do a fs.exists() and later a > fs.getFileStatus() > to create the StoreFileInfo, we can avoid the exists. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-11008) Align bulk load, flush, and compact to require Action.CREATE
[ https://issues.apache.org/jira/browse/HBASE-11008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13972130#comment-13972130 ] Jean-Daniel Cryans commented on HBASE-11008: bq. If we choose to make it more restrictive, The problem, as stated in HBASE-10958, is that bulk loading requires WRITE but effectively by going through {{LoadIncrementalHFiles}} it already requires CREATE. > Align bulk load, flush, and compact to require Action.CREATE > > > Key: HBASE-11008 > URL: https://issues.apache.org/jira/browse/HBASE-11008 > Project: HBase > Issue Type: Improvement > Components: security >Reporter: Jean-Daniel Cryans >Assignee: Jean-Daniel Cryans > Fix For: 0.99.0, 0.98.2, 0.96.3, 0.94.20 > > Attachments: HBASE-11008.patch > > > Over in HBASE-10958 we noticed that it might make sense to require > Action.CREATE for bulk load, flush, and compact since it is also required for > things like enable and disable. > This means the following changes: > - preBulkLoadHFile goes from WRITE to CREATE > - compact/flush go from ADMIN to ADMIN or CREATE -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-10958) [dataloss] Bulk loading with seqids can prevent some log entries from being replayed
[ https://issues.apache.org/jira/browse/HBASE-10958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13972111#comment-13972111 ] Jean-Daniel Cryans commented on HBASE-10958: Forgot to reply to [~stack]'s comment: bq. flushSucceeded method (should it be isFlushSucceeded) and then you go get the result by accessing the data member directly. Minor inconsistency. flushSucceeded() isn't just looking up a field though, it's checking two things. > [dataloss] Bulk loading with seqids can prevent some log entries from being > replayed > > > Key: HBASE-10958 > URL: https://issues.apache.org/jira/browse/HBASE-10958 > Project: HBase > Issue Type: Bug >Affects Versions: 0.96.2, 0.98.1, 0.94.18 >Reporter: Jean-Daniel Cryans >Assignee: Jean-Daniel Cryans >Priority: Blocker > Fix For: 0.99.0, 0.94.19, 0.98.2, 0.96.3 > > Attachments: HBASE-10958-less-intrusive-hack-0.96.patch, > HBASE-10958-quick-hack-0.96.patch, HBASE-10958-v2.patch, HBASE-10958.patch > > > We found an issue with bulk loads causing data loss when assigning sequence > ids (HBASE-6630) that is triggered when replaying recovered edits. We're > nicknaming this issue *Blindspot*. > The problem is that the sequence id given to a bulk loaded file is higher > than those of the edits in the region's memstore. When replaying recovered > edits, the rule to skip some of them is that they have to be _lower than the > highest sequence id_. In other words, the edits that have a sequence id lower > than the highest one in the store files *should* have also been flushed. This > is not the case with bulk loaded files since we now have an HFile with a > sequence id higher than unflushed edits. > The log recovery code takes this into account by simply skipping the bulk > loaded files, but this "bulk loaded status" is *lost* on compaction. The > edits in the logs that have a sequence id lower than the bulk loaded file > that got compacted are put in a blind spot and are skipped during replay. 
> Here's the easiest way to recreate this issue: > - Create an empty table > - Put one row in it (let's say it gets seqid 1) > - Bulk load one file (it gets seqid 2). I used ImporTsv and set > hbase.mapreduce.bulkload.assign.sequenceNumbers. > - Bulk load a second file the same way (it gets seqid 3). > - Major compact the table (the new file has seqid 3 and isn't considered > bulk loaded). > - Kill the region server that holds the table's region. > - Scan the table once the region is made available again. The first row, at > seqid 1, will be missing since the HFile with seqid 3 makes us believe that > everything that came before it was flushed. -- This message was sent by Atlassian JIRA (v6.2#6252)
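The point about flushSucceeded() not being a plain accessor can be seen in a stripped-down sketch. The enum values are modeled loosely on the patch's flush-result type and should be treated as illustrative, not as the exact committed code:

```java
public class FlushResult {
    enum Result { FLUSHED_NO_COMPACTION_NEEDED, FLUSHED_COMPACTION_NEEDED, CANNOT_FLUSH }

    final Result result;

    FlushResult(Result result) { this.result = result; }

    // Not a simple field lookup: success covers two of the three outcomes,
    // which is why the is-prefix getter naming convention doesn't quite fit.
    boolean flushSucceeded() {
        return result == Result.FLUSHED_NO_COMPACTION_NEEDED
            || result == Result.FLUSHED_COMPACTION_NEEDED;
    }

    public static void main(String[] args) {
        System.out.println(new FlushResult(Result.FLUSHED_COMPACTION_NEEDED).flushSucceeded()); // prints true
        System.out.println(new FlushResult(Result.CANNOT_FLUSH).flushSucceeded());              // prints false
    }
}
```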
[jira] [Updated] (HBASE-11008) Align bulk load, flush, and compact to require Action.CREATE
[ https://issues.apache.org/jira/browse/HBASE-11008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans updated HBASE-11008: --- Attachment: HBASE-11008.patch Attaching a patch for trunk. Not planning on committing until we get more consensus on this issue. I do think that this is standalone from HBASE-10958 though, in the sense that these changes make sense to me on their own. > Align bulk load, flush, and compact to require Action.CREATE > > > Key: HBASE-11008 > URL: https://issues.apache.org/jira/browse/HBASE-11008 > Project: HBase > Issue Type: Improvement > Components: security >Reporter: Jean-Daniel Cryans >Assignee: Jean-Daniel Cryans > Fix For: 0.99.0, 0.98.2, 0.96.3, 0.94.20 > > Attachments: HBASE-11008.patch > > > Over in HBASE-10958 we noticed that it might make sense to require > Action.CREATE for bulk load, flush, and compact since it is also required for > things like enable and disable. > This means the following changes: > - preBulkLoadHFile goes from WRITE to CREATE > - compact/flush go from ADMIN to ADMIN or CREATE -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HBASE-11008) Align bulk load, flush, and compact to require Action.CREATE
[ https://issues.apache.org/jira/browse/HBASE-11008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans updated HBASE-11008: --- Status: Patch Available (was: Open) > Align bulk load, flush, and compact to require Action.CREATE > > > Key: HBASE-11008 > URL: https://issues.apache.org/jira/browse/HBASE-11008 > Project: HBase > Issue Type: Improvement > Components: security >Reporter: Jean-Daniel Cryans >Assignee: Jean-Daniel Cryans > Fix For: 0.99.0, 0.98.2, 0.96.3, 0.94.20 > > Attachments: HBASE-11008.patch > > > Over in HBASE-10958 we noticed that it might make sense to require > Action.CREATE for bulk load, flush, and compact since it is also required for > things like enable and disable. > This means the following changes: > - preBulkLoadHFile goes from WRITE to CREATE > - compact/flush go from ADMIN to ADMIN or CREATE -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HBASE-11008) Align bulk load, flush, and compact to require Action.CREATE
Jean-Daniel Cryans created HBASE-11008: -- Summary: Align bulk load, flush, and compact to require Action.CREATE Key: HBASE-11008 URL: https://issues.apache.org/jira/browse/HBASE-11008 Project: HBase Issue Type: Improvement Components: security Reporter: Jean-Daniel Cryans Assignee: Jean-Daniel Cryans Fix For: 0.99.0, 0.98.2, 0.96.3, 0.94.20 Over in HBASE-10958 we noticed that it might make sense to require Action.CREATE for bulk load, flush, and compact since it is also required for things like enable and disable. This means the following changes: - preBulkLoadHFile goes from WRITE to CREATE - compact/flush go from ADMIN to ADMIN or CREATE -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-10958) [dataloss] Bulk loading with seqids can prevent some log entries from being replayed
[ https://issues.apache.org/jira/browse/HBASE-10958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13971998#comment-13971998 ] Jean-Daniel Cryans commented on HBASE-10958: bq. The requirement on 'CREATE' for bulk load seems to come from here. Is this even intended? Thanks for doing this. There's also another place where this is called, splitStoreFile(), in order to get an HCD. I don't think it's necessary to call it twice, but it seems necessary to call it at least once else you can't: - verify that the families exist - get the schema for each family so that we create the HFiles with the correct configurations when splitting > [dataloss] Bulk loading with seqids can prevent some log entries from being > replayed > > > Key: HBASE-10958 > URL: https://issues.apache.org/jira/browse/HBASE-10958 > Project: HBase > Issue Type: Bug >Affects Versions: 0.96.2, 0.98.1, 0.94.18 >Reporter: Jean-Daniel Cryans >Assignee: Jean-Daniel Cryans >Priority: Blocker > Fix For: 0.99.0, 0.94.19, 0.98.2, 0.96.3 > > Attachments: HBASE-10958-less-intrusive-hack-0.96.patch, > HBASE-10958-quick-hack-0.96.patch, HBASE-10958-v2.patch, HBASE-10958.patch > > > We found an issue with bulk loads causing data loss when assigning sequence > ids (HBASE-6630) that is triggered when replaying recovered edits. We're > nicknaming this issue *Blindspot*. > The problem is that the sequence id given to a bulk loaded file is higher > than those of the edits in the region's memstore. When replaying recovered > edits, the rule to skip some of them is that they have to be _lower than the > highest sequence id_. In other words, the edits that have a sequence id lower > than the highest one in the store files *should* have also been flushed. This > is not the case with bulk loaded files since we now have an HFile with a > sequence id higher than unflushed edits. 
> The log recovery code takes this into account by simply skipping the bulk > loaded files, but this "bulk loaded status" is *lost* on compaction. The > edits in the logs that have a sequence id lower than the bulk loaded file > that got compacted are put in a blind spot and are skipped during replay. > Here's the easiest way to recreate this issue: > - Create an empty table > - Put one row in it (let's say it gets seqid 1) > - Bulk load one file (it gets seqid 2). I used ImporTsv and set > hbase.mapreduce.bulkload.assign.sequenceNumbers. > - Bulk load a second file the same way (it gets seqid 3). > - Major compact the table (the new file has seqid 3 and isn't considered > bulk loaded). > - Kill the region server that holds the table's region. > - Scan the table once the region is made available again. The first row, at > seqid 1, will be missing since the HFile with seqid 3 makes us believe that > everything that came before it was flushed. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-10871) Indefinite OPEN/CLOSE wait on busy RegionServers
[ https://issues.apache.org/jira/browse/HBASE-10871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13971945#comment-13971945 ] Jean-Daniel Cryans commented on HBASE-10871: bq. Perhaps we should add some delay here. Well we drop the region on the floor, and the delay right now is the RIT timeout (30 minutes in 0.94, 10 minutes after). We don't do tight retries. > Indefinite OPEN/CLOSE wait on busy RegionServers > > > Key: HBASE-10871 > URL: https://issues.apache.org/jira/browse/HBASE-10871 > Project: HBase > Issue Type: Improvement > Components: Balancer, master, Region Assignment >Affects Versions: 0.94.6 >Reporter: Harsh J > > We observed a case where, when a specific RS got bombarded by a large amount > of regular requests, spiking and filling up its RPC queue, the balancer's > invoked unassigns and assigns for regions that dealt with this server entered > into an indefinite retry loop. > The regions specifically began waiting in PENDING_CLOSE/PENDING_OPEN states > indefinitely cause of the HBase Client RPC from the ServerManager at the > master was running into SocketTimeouts. This caused a region unavailability > in the server for the affected regions. The timeout monitor retry default of > 30m in 0.94's AM compounded the waiting gap further a bit more (this is now > 10m in 0.95+'s new AM, and has further retries before we get there, which is > good). > Wonder if there's a way to improve this situation generally. PENDING_OPENs > may be easy to handle - we can switch them out and move them elsewhere. > PENDING_CLOSEs may be a bit more tricky, but there must perhaps at least be a > way to "give up" permanently on a movement plan, and letting things be for a > while hoping for the RS to recover itself on its own (such that clients also > have a chance of getting things to work in the meantime)? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-10958) [dataloss] Bulk loading with seqids can prevent some log entries from being replayed
[ https://issues.apache.org/jira/browse/HBASE-10958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13971753#comment-13971753 ] Jean-Daniel Cryans commented on HBASE-10958: bq. I can see that folks would not want to grant users CREATE (or ADMIN) so that they cannot create/drop/enable/disable tables, but still do allow them to load data via bulk load. That would now no longer possible. Bulk load already needs CREATE as shown above in TestAccessController. bq. Have we dismissed this option? I don't think so, but no one has been pushing for it. It creates a precedent, nowhere else in the code do we use User.runAs to override the current user that came in via a RPC (I'm not even sure if it works, but that's because I don't know that code very well). > [dataloss] Bulk loading with seqids can prevent some log entries from being > replayed > > > Key: HBASE-10958 > URL: https://issues.apache.org/jira/browse/HBASE-10958 > Project: HBase > Issue Type: Bug >Affects Versions: 0.96.2, 0.98.1, 0.94.18 >Reporter: Jean-Daniel Cryans >Assignee: Jean-Daniel Cryans >Priority: Blocker > Fix For: 0.99.0, 0.94.19, 0.98.2, 0.96.3 > > Attachments: HBASE-10958-less-intrusive-hack-0.96.patch, > HBASE-10958-quick-hack-0.96.patch, HBASE-10958-v2.patch, HBASE-10958.patch > > > We found an issue with bulk loads causing data loss when assigning sequence > ids (HBASE-6630) that is triggered when replaying recovered edits. We're > nicknaming this issue *Blindspot*. > The problem is that the sequence id given to a bulk loaded file is higher > than those of the edits in the region's memstore. When replaying recovered > edits, the rule to skip some of them is that they have to be _lower than the > highest sequence id_. In other words, the edits that have a sequence id lower > than the highest one in the store files *should* have also been flushed. 
This > is not the case with bulk loaded files since we now have an HFile with a > sequence id higher than unflushed edits. > The log recovery code takes this into account by simply skipping the bulk > loaded files, but this "bulk loaded status" is *lost* on compaction. The > edits in the logs that have a sequence id lower than the bulk loaded file > that got compacted are put in a blind spot and are skipped during replay. > Here's the easiest way to recreate this issue: > - Create an empty table > - Put one row in it (let's say it gets seqid 1) > - Bulk load one file (it gets seqid 2). I used ImporTsv and set > hbase.mapreduce.bulkload.assign.sequenceNumbers. > - Bulk load a second file the same way (it gets seqid 3). > - Major compact the table (the new file has seqid 3 and isn't considered > bulk loaded). > - Kill the region server that holds the table's region. > - Scan the table once the region is made available again. The first row, at > seqid 1, will be missing since the HFile with seqid 3 makes us believe that > everything that came before it was flushed. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-10958) [dataloss] Bulk loading with seqids can prevent some log entries from being replayed
[ https://issues.apache.org/jira/browse/HBASE-10958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13971584#comment-13971584 ] Jean-Daniel Cryans commented on HBASE-10958: bq. we'd like ADMIN to be granted sparingly to admins or delegates +1 bq. IIRC enable and disable are ADMIN actions also, since disabling or enabling a 1 region table has consequences. This is what I see in the code in trunk: {code} requirePermission("preBulkLoadHFile", getTableDesc().getTableName(), el.getFirst(), null, Permission.Action.WRITE); requirePermission("enableTable", tableName, null, null, Action.ADMIN, Action.CREATE); requirePermission("disableTable", tableName, null, null, Action.ADMIN, Action.CREATE); requirePermission("compact", getTableName(e.getEnvironment()), null, null, Action.ADMIN); requirePermission("flush", getTableName(e.getEnvironment()), null, null, Action.ADMIN); {code} IMO flush should have lower or same perms as disableTable. So here's a list of changes I believe are needed: - preBulkLoadHFile goes from WRITE to CREATE (seems more in line with what's really needed to bulk load given the code I posted yesterday) - compact/flush go from ADMIN to ADMIN or CREATE This should not have an impact on the current users. If we can agree on the changes, I'll open a new jira that's going to be blocking this one. 
> [dataloss] Bulk loading with seqids can prevent some log entries from being > replayed > > > Key: HBASE-10958 > URL: https://issues.apache.org/jira/browse/HBASE-10958 > Project: HBase > Issue Type: Bug >Affects Versions: 0.96.2, 0.98.1, 0.94.18 >Reporter: Jean-Daniel Cryans >Assignee: Jean-Daniel Cryans >Priority: Blocker > Fix For: 0.99.0, 0.94.19, 0.98.2, 0.96.3 > > Attachments: HBASE-10958-less-intrusive-hack-0.96.patch, > HBASE-10958-quick-hack-0.96.patch, HBASE-10958-v2.patch, HBASE-10958.patch > > > We found an issue with bulk loads causing data loss when assigning sequence > ids (HBASE-6630) that is triggered when replaying recovered edits. We're > nicknaming this issue *Blindspot*. > The problem is that the sequence id given to a bulk loaded file is higher > than those of the edits in the region's memstore. When replaying recovered > edits, the rule to skip some of them is that they have to be _lower than the > highest sequence id_. In other words, the edits that have a sequence id lower > than the highest one in the store files *should* have also been flushed. This > is not the case with bulk loaded files since we now have an HFile with a > sequence id higher than unflushed edits. > The log recovery code takes this into account by simply skipping the bulk > loaded files, but this "bulk loaded status" is *lost* on compaction. The > edits in the logs that have a sequence id lower than the bulk loaded file > that got compacted are put in a blind spot and are skipped during replay. > Here's the easiest way to recreate this issue: > - Create an empty table > - Put one row in it (let's say it gets seqid 1) > - Bulk load one file (it gets seqid 2). I used ImporTsv and set > hbase.mapreduce.bulkload.assign.sequenceNumbers. > - Bulk load a second file the same way (it gets seqid 3). > - Major compact the table (the new file has seqid 3 and isn't considered > bulk loaded). > - Kill the region server that holds the table's region. 
> - Scan the table once the region is made available again. The first row, at > seqid 1, will be missing since the HFile with seqid 3 makes us believe that > everything that came before it was flushed. -- This message was sent by Atlassian JIRA (v6.2#6252)
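The "ADMIN or CREATE" semantics proposed above (any one of the listed actions suffices) can be modeled in a few lines. The Action enum and permitted() below are illustrative stand-ins, not AccessController's actual API:

```java
import java.util.EnumSet;
import java.util.Set;

public class PermissionDemo {
    enum Action { READ, WRITE, CREATE, ADMIN }

    // Mirrors requirePermission's varargs semantics: the caller lists
    // acceptable actions, and the user needs at least one of them.
    static boolean permitted(Set<Action> granted, Action... required) {
        for (Action a : required) {
            if (granted.contains(a)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<Action> userCreate = EnumSet.of(Action.CREATE);
        // Before the change: flush demands ADMIN, so a CREATE-only user is denied.
        System.out.println(permitted(userCreate, Action.ADMIN));                // prints false
        // After the change: ADMIN or CREATE, so the same user is allowed,
        // consistent with enableTable/disableTable.
        System.out.println(permitted(userCreate, Action.ADMIN, Action.CREATE)); // prints true
    }
}
```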
[jira] [Commented] (HBASE-10871) Indefinite OPEN/CLOSE wait on busy RegionServers
[ https://issues.apache.org/jira/browse/HBASE-10871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13970255#comment-13970255 ] Jean-Daniel Cryans commented on HBASE-10871: I got access to the original issue's details, and I don't think that this part is accurate: bq. the balancer's invoked unassigns and assigns for regions that dealt with this server entered into an indefinite retry loop. What I saw from the logs is that the master got a {{SocketTimeoutException}} trying to send the openRegion() to the region server and fell into this part of the code in {{AssignmentManager}}: {code} } else if (t instanceof java.net.SocketTimeoutException && this.serverManager.isServerOnline(plan.getDestination())) { LOG.warn("Call openRegion() to " + plan.getDestination() + " has timed out when trying to assign " + region.getRegionNameAsString() + ", but the region might already be opened on " + plan.getDestination() + ".", t); return; {code} At that point we don't know if the region was opened or not, so we stop keeping track of it. Unfortunately, the 0.94 code has a default of 30 minutes for RIT timeout so it took that long to finally see this message in the log: "Region has been OFFLINE for too long, reassigning regionx to a random server". The looping isn't indefinite, it's just that a few of them were getting the socket timeout repeatedly. Some machines were in a bad state. There was also something strange with the call queues, I didn't see any mention of a connection reset so it doesn't seem like openRegion even made it to the queue. So it behaved the way we expect it to, although it's really not ideal, and stayed in limbo in either OFFLINE, PENDING_OPEN, or PENDING_CLOSE. Intuitively, I'd say we just ask the server directly if it has the region or not, but the fact that we got a socket timeout in the first place kinda tells us that that RS is currently unreachable. 
Maybe it's going to die, which is good since we'll know for sure the region is closed, but if it was opened then it might already be serving requests. I can think of a concept of having a list of regions in that special state that for every X seconds we try to contact the region server, but it seems brittle. FWIW the current situation of relying on the timeout isn't great either since the region could be open on the RS but we timeout anyways and double assign it. Pinging [~jxiang] since he loves that part of the code for more inputs :) > Indefinite OPEN/CLOSE wait on busy RegionServers > > > Key: HBASE-10871 > URL: https://issues.apache.org/jira/browse/HBASE-10871 > Project: HBase > Issue Type: Improvement > Components: Balancer, master, Region Assignment >Affects Versions: 0.94.6 >Reporter: Harsh J > > We observed a case where, when a specific RS got bombarded by a large amount > of regular requests, spiking and filling up its RPC queue, the balancer's > invoked unassigns and assigns for regions that dealt with this server entered > into an indefinite retry loop. > The regions specifically began waiting in PENDING_CLOSE/PENDING_OPEN states > indefinitely cause of the HBase Client RPC from the ServerManager at the > master was running into SocketTimeouts. This caused a region unavailability > in the server for the affected regions. The timeout monitor retry default of > 30m in 0.94's AM compounded the waiting gap further a bit more (this is now > 10m in 0.95+'s new AM, and has further retries before we get there, which is > good). > Wonder if there's a way to improve this situation generally. PENDING_OPENs > may be easy to handle - we can switch them out and move them elsewhere. 
> PENDING_CLOSEs may be a bit more tricky, but there must perhaps at least be a > way to "give up" permanently on a movement plan, and letting things be for a > while hoping for the RS to recover itself on its own (such that clients also > have a chance of getting things to work in the meantime)? -- This message was sent by Atlassian JIRA (v6.2#6252)
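The "keep probing the RS instead of waiting out the full RIT timeout" idea floated above could be sketched as bounded probes, where an unanswered probe models the socket timeout. Everything here is hypothetical, not HBase code:

```java
import java.util.function.Supplier;

public class RegionProbe {
    // Probes the server up to maxProbes times. A null answer models an
    // unreachable server (e.g. socket timeout); a non-null answer is the
    // server's definitive word on whether it holds the region.
    static boolean confirmOpen(Supplier<Boolean> isRegionOnline, int maxProbes) {
        for (int i = 0; i < maxProbes; i++) {
            Boolean answer = isRegionOnline.get();
            if (answer != null) {
                return answer;
            }
            // Real code would back off between probes rather than spin.
        }
        return false; // give up and fall back to timeout-driven reassignment
    }

    public static void main(String[] args) {
        // Server unreachable twice, then confirms the region is open.
        int[] calls = {0};
        Supplier<Boolean> flaky = () -> ++calls[0] < 3 ? null : Boolean.TRUE;
        System.out.println(confirmOpen(flaky, 5)); // prints true
    }
}
```

As the comment notes, this stays brittle: if every probe times out we still cannot distinguish "region open but RS swamped" from "region never opened", which is exactly the double-assignment risk the timeout-based approach already has.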
[jira] [Commented] (HBASE-10958) [dataloss] Bulk loading with seqids can prevent some log entries from being replayed
[ https://issues.apache.org/jira/browse/HBASE-10958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13970228#comment-13970228 ] Jean-Daniel Cryans commented on HBASE-10958: bq. Wait. Are you saying the region server itself cannot issue a flush as part of the bulkLoadHFiles RPC? Exactly. That's why the User.runAs (and then run as the region server itself) solution seems to make sense to me. I read some more code, and it seems bulk load actually requires CREATE even though the call itself requires WRITE. From TestAccessController: {code} // User performing bulk loads must have privilege to read table metadata // (ADMIN or CREATE) verifyAllowed(bulkLoadAction, SUPERUSER, USER_ADMIN, USER_OWNER, USER_CREATE); verifyDenied(bulkLoadAction, USER_RW, USER_NONE, USER_RO); {code} So another option is to set flush and compact as CREATE actions (in line with create table, alter, disable, enable, delete). > [dataloss] Bulk loading with seqids can prevent some log entries from being > replayed > > > Key: HBASE-10958 > URL: https://issues.apache.org/jira/browse/HBASE-10958 > Project: HBase > Issue Type: Bug >Affects Versions: 0.96.2, 0.98.1, 0.94.18 >Reporter: Jean-Daniel Cryans >Assignee: Jean-Daniel Cryans >Priority: Blocker > Fix For: 0.99.0, 0.94.19, 0.98.2, 0.96.3 > > Attachments: HBASE-10958-less-intrusive-hack-0.96.patch, > HBASE-10958-quick-hack-0.96.patch, HBASE-10958-v2.patch, HBASE-10958.patch > > > We found an issue with bulk loads causing data loss when assigning sequence > ids (HBASE-6630) that is triggered when replaying recovered edits. We're > nicknaming this issue *Blindspot*. > The problem is that the sequence id given to a bulk loaded file is higher > than those of the edits in the region's memstore. When replaying recovered > edits, the rule to skip some of them is that they have to be _lower than the > highest sequence id_. 
In other words, the edits that have a sequence id lower > than the highest one in the store files *should* have also been flushed. This > is not the case with bulk loaded files since we now have an HFile with a > sequence id higher than unflushed edits. > The log recovery code takes this into account by simply skipping the bulk > loaded files, but this "bulk loaded status" is *lost* on compaction. The > edits in the logs that have a sequence id lower than the bulk loaded file > that got compacted are put in a blind spot and are skipped during replay. > Here's the easiest way to recreate this issue: > - Create an empty table > - Put one row in it (let's say it gets seqid 1) > - Bulk load one file (it gets seqid 2). I used ImporTsv and set > hbase.mapreduce.bulkload.assign.sequenceNumbers. > - Bulk load a second file the same way (it gets seqid 3). > - Major compact the table (the new file has seqid 3 and isn't considered > bulk loaded). > - Kill the region server that holds the table's region. > - Scan the table once the region is made available again. The first row, at > seqid 1, will be missing since the HFile with seqid 3 makes us believe that > everything that came before it was flushed. -- This message was sent by Atlassian JIRA (v6.2#6252)
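The runAs/doAs idea discussed above, where the internal flush runs under the region server's own identity rather than the RPC caller's, can be shown in a stripped-down form. UserIdentity and bulkLoadHFiles here are illustrative stand-ins, not org.apache.hadoop.hbase.security.User or HBase's RPC layer:

```java
import java.security.PrivilegedAction;

public class DoAsSketch {
    // Stand-in for a user principal that can execute an action as itself.
    interface UserIdentity {
        <T> T runAs(PrivilegedAction<T> action);
    }

    // The caller only needs bulk-load permission; the flush inside the call
    // is executed as the region server, so the caller's identity never has
    // to carry the flush permission (ADMIN/CREATE).
    static String bulkLoadHFiles(UserIdentity regionServerUser) {
        return regionServerUser.runAs(() -> "flushed");
    }

    public static void main(String[] args) {
        UserIdentity rsUser = new UserIdentity() {
            public <T> T runAs(PrivilegedAction<T> action) {
                return action.run(); // a real impl would switch security context
            }
        };
        System.out.println(bulkLoadHFiles(rsUser)); // prints flushed
    }
}
```

This also illustrates the precedent concern raised later in the thread: nothing else in the write path deliberately swaps out the RPC caller's identity mid-call.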
[jira] [Commented] (HBASE-10958) [dataloss] Bulk loading with seqids can prevent some log entries from being replayed
[ https://issues.apache.org/jira/browse/HBASE-10958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13970173#comment-13970173 ] Jean-Daniel Cryans commented on HBASE-10958: Solutions off the top of my head: - Do doAs inside the method. The rationale is that it's the region server that's trying to do the flush here, not the user. I'm not sure if this is doable, and [~mbertozzi] is laughing at me. - Matteo thinks that we should just make bulk load an ADMIN method. I agree, but I'm not fond of breaking secure setups that use bulk loads. [~apurtell]?
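The doAs idea above amounts to switching the effective principal around the internal flush. A hypothetical sketch in plain Java of what that principal switch looks like; this is not the actual HBase User/AccessController API, and `runAs`, `Action`, and `currentPrincipal` are illustrative names:

```java
public class RunAsSketch {
    interface Action<T> { T run(); }

    // Simulated "current caller" identity; a real implementation would use
    // the security framework's thread-bound user, not a static field.
    static String currentPrincipal = "end-user";

    // Run the action under another principal, restoring the caller after.
    // The bulkLoadHFiles RPC arrives as the end user, but the flush it
    // triggers would execute as the region server itself, so the end user
    // needs no ADMIN grant for the flush.
    static <T> T runAs(String principal, Action<T> action) {
        String caller = currentPrincipal;
        currentPrincipal = principal;
        try {
            return action.run();
        } finally {
            currentPrincipal = caller;
        }
    }

    public static void main(String[] args) {
        String flushedBy = runAs("regionserver", () -> currentPrincipal);
        System.out.println(flushedBy);        // regionserver
        System.out.println(currentPrincipal); // end-user (restored)
    }
}
```

The design point is that the permission check then sees the region server's identity only for the nested flush, while the bulk load call itself is still authorized against the end user.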
[jira] [Commented] (HBASE-10958) [dataloss] Bulk loading with seqids can prevent some log entries from being replayed
[ https://issues.apache.org/jira/browse/HBASE-10958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13970166#comment-13970166 ] Jean-Daniel Cryans commented on HBASE-10958: Right now bulk loading is WRITE and flush is ADMIN. Problematic!
[jira] [Updated] (HBASE-10312) Flooding the cluster with administrative actions leads to collapse
[ https://issues.apache.org/jira/browse/HBASE-10312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans updated HBASE-10312: --- Resolution: Fixed Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Committed to 0.99/8/6/4, thanks for the input, guys. > Flooding the cluster with administrative actions leads to collapse > -- > > Key: HBASE-10312 > URL: https://issues.apache.org/jira/browse/HBASE-10312 > Project: HBase > Issue Type: Bug >Reporter: Andrew Purtell >Assignee: Jean-Daniel Cryans >Priority: Critical > Fix For: 0.99.0, 0.94.19, 0.98.2, 0.96.3 > > Attachments: HBASE-10312-0.94.patch, HBASE-10312.patch > > > Steps to reproduce: > 1. Start a cluster. > 2. Start an ingest process. > 3. In the HBase shell, do this: > {noformat} > while true do >flush 'table' > end > {noformat} > We should reject abuse via administrative requests like this. > What happens on the cluster is that the requests back up, leading to lots of these: > {noformat} > 2014-01-10 18:55:55,293 WARN [Priority.RpcServer.handler=2,port=8120] > monitoring.TaskMonitor: Too many actions in action monitor! Purging some. > {noformat} > At this point we could lower a gate on further requests for actions until the > backlog clears. > Continuing, all of the regionservers will eventually die with a > StackOverflowError of unknown origin: > {noformat} > 2014-01-10 19:02:02,783 ERROR [Priority.RpcServer.handler=3,port=8120] > ipc.RpcServer: Unexpected throwable object java.lang.StackOverflowError > at java.util.ArrayList$SubList.add(ArrayList.java:965) > [...] > {noformat}
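The "lower a gate" suggestion in the description can be read as: collapse duplicate flush requests per region instead of queueing them. A hypothetical sketch of such a gate; `FlushGate`, `requestFlush`, and `flushDone` are illustrative names, not the actual HBase flush queue or TaskMonitor code:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class FlushGate {
    // Regions with a flush currently queued or running.
    private final Set<String> pending = ConcurrentHashMap.newKeySet();

    // Returns true if the flush was accepted; false if a flush for the same
    // region is already pending, in which case the request is dropped rather
    // than stacked on the backlog.
    public boolean requestFlush(String region) {
        return pending.add(region);
    }

    // Called when the flush for the region completes, reopening the gate.
    public void flushDone(String region) {
        pending.remove(region);
    }

    public static void main(String[] args) {
        FlushGate gate = new FlushGate();
        System.out.println(gate.requestFlush("region-1")); // true
        System.out.println(gate.requestFlush("region-1")); // false: dropped
        gate.flushDone("region-1");
        System.out.println(gate.requestFlush("region-1")); // true again
    }
}
```

Under the shell loop from the description, a gate like this keeps the action backlog bounded at one entry per region instead of growing without limit.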
[jira] [Commented] (HBASE-10958) [dataloss] Bulk loading with seqids can prevent some log entries from being replayed
[ https://issues.apache.org/jira/browse/HBASE-10958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13970158#comment-13970158 ] Jean-Daniel Cryans commented on HBASE-10958: Yeah, and bulk loading itself is kind of like a flush since you end up with an HFile.
[jira] [Commented] (HBASE-10312) Flooding the cluster with administrative actions leads to collapse
[ https://issues.apache.org/jira/browse/HBASE-10312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13970152#comment-13970152 ] Jean-Daniel Cryans commented on HBASE-10312: Doing now.
[jira] [Commented] (HBASE-10958) [dataloss] Bulk loading with seqids can prevent some log entries from being replayed
[ https://issues.apache.org/jira/browse/HBASE-10958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13970118#comment-13970118 ] Jean-Daniel Cryans commented on HBASE-10958: That's a real test failure: the user doing the bulk load doesn't have permission to flush. Looking.
[jira] [Commented] (HBASE-10958) [dataloss] Bulk loading with seqids can prevent some log entries from being replayed
[ https://issues.apache.org/jira/browse/HBASE-10958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13970015#comment-13970015 ] Jean-Daniel Cryans commented on HBASE-10958: Oh, and I found that testRegionMadeOfBulkLoadedFilesOnly wasn't really testing WAL replay with a bulk loaded file: the KV was being added at LATEST_TIMESTAMP, so far in the future that it wasn't visible. I changed the test to check that we get all the rows back.