[jira] [Resolved] (KUDU-26) Handle corrupt Tablets at startup
[ https://issues.apache.org/jira/browse/KUDU-26?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Wong resolved KUDU-26.
-----------------------------
    Fix Version/s: 1.5.0
       Resolution: Fixed

This was fixed a while ago in a series of commits.

> Handle corrupt Tablets at startup
> ---------------------------------
>
>              Key: KUDU-26
>              URL: https://issues.apache.org/jira/browse/KUDU-26
>          Project: Kudu
>       Issue Type: Improvement
>       Components: tablet, tserver
> Affects Versions: M3
>         Reporter: Todd Lipcon
>         Priority: Major
>          Fix For: 1.5.0
>
> Currently if any tablet fails to load at startup, the whole tserver fails to
> start. Instead, it should mark those tablets as being hosted by the server,
> but in a CORRUPT state. This will help admins find the issue and hopefully
> debug/recover, instead of causing cluster-wide downtime.
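The fix the report asks for is essentially a per-tablet try/catch around load. A hedged sketch of that idea, in Java purely for illustration (the real tserver is C++ and its replica state machine differs; TabletReplica, ReplicaState, and loadTablet are invented names):

{code:java}
import java.util.ArrayList;
import java.util.List;

enum ReplicaState { RUNNING, CORRUPT }

class TabletReplica {
  final String tabletId;
  ReplicaState state;
  String error;
  TabletReplica(String tabletId) { this.tabletId = tabletId; }
}

class TabletServerStartup {
  // Load every tablet; a corrupt one is registered in a CORRUPT state
  // (visible to admins for debugging) instead of aborting the whole process.
  static List<TabletReplica> loadAll(List<String> tabletIds) {
    List<TabletReplica> replicas = new ArrayList<>();
    for (String id : tabletIds) {
      TabletReplica r = new TabletReplica(id);
      try {
        loadTablet(id);  // may throw on corrupt metadata or data
        r.state = ReplicaState.RUNNING;
      } catch (Exception e) {
        r.state = ReplicaState.CORRUPT;  // isolate the failure to this tablet
        r.error = e.getMessage();
      }
      replicas.add(r);
    }
    return replicas;
  }

  static void loadTablet(String id) throws Exception {
    // Placeholder: read tablet metadata, open rowsets, replay WAL.
  }
}
{code}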
[jira] [Commented] (KUDU-2971) Add a generic Java library wrapper
[ https://issues.apache.org/jira/browse/KUDU-2971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16977938#comment-16977938 ]

ASF subversion and git services commented on KUDU-2971:
--------------------------------------------------------

Commit 68f9fbc420a7ade895ffa639971978713fbbee4f in kudu's branch refs/heads/master from Hao Hao
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=68f9fbc ]

KUDU-2971 p1: add subprocess module

Utility classes exist that allow for IPC over stdin/stdout via protobuf and
JSON-encoded protobuf. This commit moves those classes into their own
directory so they can be reused by other subprocesses. Following commits can
then extend the module to support concurrent communication with the
subprocess.

There are no functional changes in this patch.

Change-Id: If73e27772e1897a04f04229c4906a24c61e361f2
Reviewed-on: http://gerrit.cloudera.org:8080/14425
Tested-by: Kudu Jenkins
Reviewed-by: Andrew Wong

> Add a generic Java library wrapper
> ----------------------------------
>
>              Key: KUDU-2971
>              URL: https://issues.apache.org/jira/browse/KUDU-2971
>          Project: Kudu
>       Issue Type: Sub-task
> Affects Versions: 1.11.0
>         Reporter: Hao Hao
>         Assignee: Hao Hao
>         Priority: Major
>
> For Ranger integration, to call the Java Ranger plugin from masters, we need
> to create a wrapper (via a Java subprocess). This should be generic enough
> to be used by future integrations (e.g. Atlas) that need to call other Java
> libraries.
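For a sense of what such a wrapper's transport looks like: a minimal sketch of length-prefixed message framing over stdin/stdout, the shape of IPC the commit message describes. MessageIO is an invented name, and the payload is raw bytes here; in Kudu it would be a serialized protobuf.

{code:java}
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class MessageIO {
  private final DataInputStream in;
  private final DataOutputStream out;

  MessageIO(DataInputStream in, DataOutputStream out) {
    this.in = in;
    this.out = out;
  }

  // Write one message as a 4-byte big-endian length followed by the payload;
  // flush so the peer process sees it promptly. Synchronized so multiple
  // writer threads don't interleave frames.
  synchronized void write(byte[] payload) throws IOException {
    out.writeInt(payload.length);
    out.write(payload);
    out.flush();
  }

  // Read one length-prefixed message, blocking until it is complete.
  byte[] read() throws IOException {
    int len = in.readInt();
    byte[] payload = new byte[len];
    in.readFully(payload);
    return payload;
  }
}
{code}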
[jira] [Created] (KUDU-3003) TestAsyncKuduSession.testTabletCacheInvalidatedDuringWrites is flaky
Hao Hao created KUDU-3003:
--------------------------
    Summary: TestAsyncKuduSession.testTabletCacheInvalidatedDuringWrites is flaky
        Key: KUDU-3003
        URL: https://issues.apache.org/jira/browse/KUDU-3003
    Project: Kudu
 Issue Type: Bug
   Reporter: Hao Hao
Attachments: test-output.txt

testTabletCacheInvalidatedDuringWrites of the org.apache.kudu.client.TestAsyncKuduSession test sometimes fails with an error like the one below. The full test log is attached.

{noformat}
There was 1 failure:
1) testTabletCacheInvalidatedDuringWrites(org.apache.kudu.client.TestAsyncKuduSession)
org.apache.kudu.client.PleaseThrottleException: all buffers are currently flushing
	at org.apache.kudu.client.AsyncKuduSession.apply(AsyncKuduSession.java:579)
	at org.apache.kudu.client.TestAsyncKuduSession.testTabletCacheInvalidatedDuringWrites(TestAsyncKuduSession.java:371)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.lang.Thread.run(Thread.java:748)
{noformat}
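For context on the failure mode: PleaseThrottleException is the async client's back-pressure signal, thrown by AsyncKuduSession.apply() when every buffer is mid-flush. A minimal sketch of the handling pattern the client javadoc suggests (wait on the exception's deferred, then retry), which a test driving writes this hard presumably needs rather than letting the exception escape. This assumes the kudu-client artifact; getDeferred() is the accessor from that API as I recall it.

{code:java}
import org.apache.kudu.client.AsyncKuduSession;
import org.apache.kudu.client.Operation;
import org.apache.kudu.client.PleaseThrottleException;

class ThrottleAwareWriter {
  // Apply one operation, waiting out back-pressure instead of failing.
  // In real code the Deferred returned by apply() should also be tracked.
  static void applyWithThrottle(AsyncKuduSession session, Operation op)
      throws Exception {
    while (true) {
      try {
        session.apply(op);
        return;
      } catch (PleaseThrottleException e) {
        // Block until a buffer frees up, then retry the same operation.
        e.getDeferred().join();
      }
    }
  }
}
{code}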
[jira] [Updated] (KUDU-3002) consider compactions as a mechanism to flush many DMSs
[ https://issues.apache.org/jira/browse/KUDU-3002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Wong updated KUDU-3002:
-------------------------------
    Description:
When under memory pressure, we'll aggressively perform the maintenance operation that frees the most memory. Right now, the only ops that register memory are MRS and DMS flushes. In practice, this means a couple of things:
* In most cases, we'll prioritize flushing MRSs way ahead of flushing DMSs, since updates are spread across many DMSs and will therefore tend to be small, whereas any non-trivial insert workload will well up into a single MRS for an entire tablet.
* We'll only flush a single DMS at a time to free memory.

Because of this, and because we'll likely prioritize MRS flushes over DMS flushes, we may end up with a ton of tiny DMSs in a tablet that we'll never flush. This can end up bloating the WALs because each DMS may be anchoring some WAL segments.

A couple of thoughts on small things we can do to improve this:
* Register the DMS size as RAM anchored by a compaction. This will mean that we can schedule compactions to flush DMSs en masse. This would still mean that we could end up always prioritizing MRS flushes, depending on how quickly we're inserting.
* We currently register the amount of disk space a LogGC would free up. We could do something similar, but register how many log anchors an op could release. This would be a bit trickier, since the log anchors aren't solely determined by the mem-stores (e.g. we'll anchor segments to catch up slow followers).
* Introduce a new op (or change the flush-DMS op) that would flush as many DMSs as we can for a given tablet.

Between these, the first seems like it'd be an easy win.

  was:
When under memory pressure, we'll aggressively perform the maintenance operation that frees the most memory. Right now, the only ops that register memory are MRS and DMS flushes. In practice, this means a couple of things:
* In most cases, we'll prioritize flushing MRSs way ahead of flushing DMSs, since updates are spread across many DMSs and will therefore tend to be small, whereas any non-trivial insert workload will well up into a single MRS for an entire tablet.
* We'll only flush a single DMS at a time to free memory.

Because of this, and because we'll likely prioritize MRS flushes over DMS flushes, we may end up with a ton of tiny DMSs in a tablet that we'll never flush. This can end up bloating the WALs because each DMS may be anchoring some WAL segments.

A couple of thoughts on small things we can do to improve this:
* Register the DMS size as RAM anchored by a compaction. This will mean that we can schedule compactions to flush DMSs en masse. This would still mean that we could end up always prioritizing MRS flushes, depending on how quickly we're inserting.
* We currently register the amount of disk space a LogGC would free up. We could do something similar, but register how many log anchors an op could release. This would be a bit trickier, since the log anchors aren't solely determined by the mem-stores (e.g. we'll anchor segments to catch up slow followers).

Between the two, the first seems like it'd be an easy win.

> consider compactions as a mechanism to flush many DMSs
> -------------------------------------------------------
>
>              Key: KUDU-3002
>              URL: https://issues.apache.org/jira/browse/KUDU-3002
>          Project: Kudu
>       Issue Type: Improvement
>       Components: perf, tablet
>         Reporter: Andrew Wong
>         Priority: Major
>
> When under memory pressure, we'll aggressively perform the maintenance
> operation that frees the most memory. Right now, the only ops that register
> memory are MRS and DMS flushes.
> In practice, this means a couple of things:
> * In most cases, we'll prioritize flushing MRSs way ahead of flushing DMSs,
> since updates are spread across many DMSs and will therefore tend to be
> small, whereas any non-trivial insert workload will well up into a single MRS
> for an entire tablet.
> * We'll only flush a single DMS at a time to free memory.
> Because of this, and because we'll likely prioritize MRS flushes over DMS
> flushes, we may end up with a ton of tiny DMSs in a tablet that we'll never
> flush. This can end up bloating the WALs because each DMS may be anchoring
> some WAL segments.
> A couple of thoughts on small things we can do to improve this:
> * Register the DMS size as RAM anchored by a compaction. This will mean
> that we can schedule compactions to flush DMSs en masse. This would still
> mean that we could end up always prioritizing MRS flushes, depending on how
> quickly we're inserting.
> * We currently register the amount of disk space a LogGC would free up. We
> could do something similar, but register how many log anchors an op could
> release. This would be a bit trickier, since the log anchors aren't solely
> determined by the mem-stores (e.g. we'll anchor segments to catch up slow
> followers).
> * Introduce a new op (or change the flush-DMS op) that would flush as many
> DMSs as we can for a given tablet.
> Between these, the first seems like it'd be an easy win.
[jira] [Created] (KUDU-3002) consider compactions as a mechanism to flush many DMSs
Andrew Wong created KUDU-3002:
-------------------------------
    Summary: consider compactions as a mechanism to flush many DMSs
        Key: KUDU-3002
        URL: https://issues.apache.org/jira/browse/KUDU-3002
    Project: Kudu
 Issue Type: Improvement
 Components: perf, tablet
   Reporter: Andrew Wong

When under memory pressure, we'll aggressively perform the maintenance operation that frees the most memory. Right now, the only ops that register memory are MRS and DMS flushes. In practice, this means a couple of things:
* In most cases, we'll prioritize flushing MRSs way ahead of flushing DMSs, since updates are spread across many DMSs and will therefore tend to be small, whereas any non-trivial insert workload will well up into a single MRS for an entire tablet.
* We'll only flush a single DMS at a time to free memory.

Because of this, and because we'll likely prioritize MRS flushes over DMS flushes, we may end up with a ton of tiny DMSs in a tablet that we'll never flush. This can end up bloating the WALs because each DMS may be anchoring some WAL segments.

A couple of thoughts on small things we can do to improve this:
* Register the DMS size as RAM anchored by a compaction. This will mean that we can schedule compactions to flush DMSs en masse. This would still mean that we could end up always prioritizing MRS flushes, depending on how quickly we're inserting.
* We currently register the amount of disk space a LogGC would free up. We could do something similar, but register how many log anchors an op could release. This would be a bit trickier, since the log anchors aren't solely determined by the mem-stores (e.g. we'll anchor segments to catch up slow followers).

Between the two, the first seems like it'd be an easy win.
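To make the first suggestion concrete, a hedged sketch in Java purely for illustration (the real maintenance manager is C++; MaintenanceOp, CompactRowSetsOp, and ramAnchoredBytes are invented names): the compaction op reports the summed size of the DMSs it would flush as its RAM-anchored figure, so the memory-pressure scheduler has a reason to pick it, and one compaction stands in for flushing many small DMSs at once.

{code:java}
import java.util.List;

interface MaintenanceOp {
  long ramAnchoredBytes();   // memory the op would free if performed
  void perform();
}

class CompactRowSetsOp implements MaintenanceOp {
  private final List<Long> dmsSizes;  // sizes of the DMSs the compaction covers

  CompactRowSetsOp(List<Long> dmsSizes) { this.dmsSizes = dmsSizes; }

  // Registering the summed DMS size makes the compaction competitive with
  // flushes when the scheduler is hunting for memory to free.
  @Override public long ramAnchoredBytes() {
    return dmsSizes.stream().mapToLong(Long::longValue).sum();
  }

  @Override public void perform() {
    // Merge the rowsets, flushing their DMSs as a side effect.
  }
}
{code}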
[jira] [Commented] (KUDU-2929) Don't starve compactions under memory pressure
[ https://issues.apache.org/jira/browse/KUDU-2929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16977885#comment-16977885 ]

Andrew Wong commented on KUDU-2929:
------------------------------------

Also [~ZhangYao] good point about registering the DMS size as anchored memory. That seems especially important for update-heavy workloads. I've filed KUDU-3002 for it.

> Don't starve compactions under memory pressure
> ----------------------------------------------
>
>              Key: KUDU-2929
>              URL: https://issues.apache.org/jira/browse/KUDU-2929
>          Project: Kudu
>       Issue Type: Improvement
>       Components: perf, tablet
>         Reporter: Andrew Wong
>         Assignee: Andrew Wong
>         Priority: Major
>          Fix For: 1.12.0
>
> When a server is under memory pressure, the maintenance manager will
> exclusively look for the maintenance op that frees up the most memory. Some
> operations, like compactions, do not register any amount of "anchored
> memory" and effectively don't qualify for consideration.
> This means that when a tablet server is under memory pressure, compactions
> will never be scheduled, even though compacting may actually end up reducing
> memory (e.g. combining many rowsets' worth of CFileReaders into a single
> rowset). While it makes sense to prefer flushes to compactions, it probably
> doesn't make sense to do nothing rather than compact.
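The starvation follows directly from the selection rule. A simplified sketch of that rule, reusing the hypothetical MaintenanceOp interface from the KUDU-3002 sketch above: under memory pressure only ramAnchoredBytes() is consulted, so an op that reports zero can never win.

{code:java}
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

class MaintenanceScheduler {
  // Under memory pressure, pick the op anchoring the most RAM. Ops that
  // report zero (as compactions do today) are filtered out, so they starve.
  static Optional<MaintenanceOp> pickOp(List<MaintenanceOp> ops,
                                        boolean underMemoryPressure) {
    if (underMemoryPressure) {
      return ops.stream()
          .filter(op -> op.ramAnchoredBytes() > 0)
          .max(Comparator.comparingLong(MaintenanceOp::ramAnchoredBytes));
    }
    // Outside memory pressure the real scheduler scores ops on performance
    // improvement versus cost; elided here.
    return ops.stream().findFirst();
  }
}
{code}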
[jira] [Commented] (KUDU-38) bootstrap should not replay logs that are known to be fully flushed
[ https://issues.apache.org/jira/browse/KUDU-38?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1695#comment-1695 ]

Todd Lipcon commented on KUDU-38:
----------------------------------

bq. Guaranteeing that every complete segment has a fully sync'ed index file makes for a nice invariant, but isn't it overkill for the task at hand? Couldn't we get away with sync'ing whichever index file contains the earliest anchored index at TabletMetadata flush time? I'm particularly concerned about the backwards compatibility implications: how do we establish this invariant after upgrading to a release including this fix? Or, how do we detect that it's not present in existing log index files?

I think we need to make sure that all prior indexes are also synced, because it's possible that there is a lagging peer that will still need to catch up from a very old record. The index file is what allows a leader to go find those old log entries and send them along. Without it, the old log segments aren't useful.

bq. I'm particularly concerned about the backwards compatibility implications: how do we establish this invariant after upgrading to a release including this fix? Or, how do we detect that it's not present in existing log index files?

Yep, we'd need to take that into account, eg by adding some new flag to the tablet metadata indicating that the indexes are durable or somesuch.

bq. Alternatively, what about forgoing the log index file and rather than storing the earliest anchored index in the TabletMetadata, storing the "physical index" (i.e. the LogIndexEntry corresponding to the anchor)?

Again per above, we can't be alive with an invalid index file, or else consensus won't be happy.

Ignoring the rest of your questions for a minute, let me throw out an alternative idea or two:

*Option 1:* We could add a new separate piece of metadata next to the logs called a "sync point" or somesuch (this could even be at a well-known offset in the existing log file or something). We can periodically wake up a background process for a log (eg when we see that the sync point is too far back) and then: (1) look up the earliest durability-anchored offset, (2) msync the log indexes up to that point, (3) write that point to the special "sync point" metadata file. This is just an offset, so it can be written atomically and lazily flushed (it only moves forward).

At startup, if we see a sync point metadata file, we know we can start replaying (and reconstructing the index) from that point, without having to reconstruct any earlier index entries. If we do this lazily (eg once every few seconds, and only on actively-written tablets) the performance overhead should be negligible.

We also need to think about how this interacts with tablet copy -- right now, a newly copied tablet relies on replaying the WALs from the beginning because it doesn't copy log indexes. We may need to change that.

*Option 2:* get rid of the "log index"

This is the "nuke everything from orbit" option: the whole log index thing was convenient but it's somewhat annoying for a number of reasons: (1) the issues described here, (2) we are using mmapped IO, which is dangerous since IO errors crash the process, (3) it's just another bit of code to worry about and transfer around in tablet copy, etc.

The alternative is to embed the index in the WAL itself. One sketch of an implementation would be something like:
- divide the WAL into fixed-size pages, each with a header. The header would have term/index info and some kind of "continuation" flag for when entries span multiple pages. This is more or less the postgres WAL design.
- this allows us to binary-search the WAL instead of having a separate index.
- we have to consider how truncations work -- I guess we would move to physical truncation.

Another possible idea would be to not use fixed-size pages, but instead embed a tree structure into the WAL itself. For example, it wouldn't be too tough to add a back-pointer from each entry to the previous entry to enable backward scanning. If we then take a skip-list-like approach (n/2 nodes have a skip-2 pointer, n/4 nodes have a skip-4 pointer, n/8 nodes have a skip-8 pointer, etc.) then we can get logarithmic access time to past log entries. Again, we need to consider truncation.

Either of these options has the advantage that we no longer need to worry about indexes, but we still do need to worry about figuring out where to start replaying from, and we could take the same strategy as the first suggestion for that.

> bootstrap should not replay logs that are known to be fully flushed
> -------------------------------------------------------------------
>
>              Key: KUDU-38
>              URL: https://issues.apache.org/jira/browse/KUDU-38
>          Project: Kudu
>       Issue Type: Sub-task
>       Components: tablet
> Affects Versions: M3
>         Reporter: Todd Lipcon
>
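The skip-list idea is easy to make concrete. A toy model in plain Java (purely illustrative, not Kudu code): let entry n carry a back-pointer to n - 2^k for every power of two 2^k that divides n; then seeking backwards from any entry to any earlier one takes O(log n) hops, by repeatedly following the largest pointer that doesn't overshoot.

{code:java}
class SkipLog {
  // Largest power of two dividing n (n >= 1); this is the longest
  // back-pointer that entry n carries in this scheme.
  static long maxSkip(long n) {
    return n & -n;
  }

  // Walk from entry `from` back to entry `to`, counting hops. Every pointer
  // we follow is valid because any 2^j dividing the longest skip also
  // divides the current entry number.
  static int hopsBack(long from, long to) {
    int hops = 0;
    long cur = from;
    while (cur > to) {
      long skip = maxSkip(cur);
      // Shrink the skip until it doesn't jump past the target.
      while (cur - skip < to) {
        skip >>= 1;
      }
      cur -= skip;
      hops++;
    }
    return hops;
  }

  public static void main(String[] args) {
    // Roughly 2*log2(n) hops in the worst case: a few dozen for a
    // million-entry log, versus a million for a plain backward scan.
    System.out.println(hopsBack(1_000_000, 3));
  }
}
{code}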
[jira] [Created] (KUDU-3001) Multi-thread to load containers in a data directory
Yingchun Lai created KUDU-3001:
-------------------------------
    Summary: Multi-thread to load containers in a data directory
        Key: KUDU-3001
        URL: https://issues.apache.org/jira/browse/KUDU-3001
    Project: Kudu
 Issue Type: Improvement
   Reporter: Yingchun Lai
   Assignee: Yingchun Lai

As [~tlipcon] mentioned in https://issues.apache.org/jira/browse/KUDU-2014, we can improve tserver startup time by loading the containers in a data directory with multiple threads.
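A hedged sketch of the idea, assuming containers can be opened independently (ContainerLoader, loadContainer, and containerPaths are hypothetical stand-ins, not the real C++ log block manager API): fan the per-container work out to a fixed-size pool and join on the futures so the first failure still fails startup.

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ContainerLoader {
  // Load all containers in a data directory on `threads` worker threads
  // instead of one-by-one.
  static void loadAll(List<String> containerPaths, int threads)
      throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<?>> results = new ArrayList<>();
      for (String path : containerPaths) {
        results.add(pool.submit(() -> loadContainer(path)));
      }
      for (Future<?> f : results) {
        f.get();  // propagate the first load failure, if any
      }
    } finally {
      pool.shutdown();
    }
  }

  static void loadContainer(String path) {
    // Placeholder: parse the container's metadata and open its data file.
  }
}
{code}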
[jira] [Comment Edited] (KUDU-2453) kudu should stop creating tablet infinitely
[ https://issues.apache.org/jira/browse/KUDU-2453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16977296#comment-16977296 ]

Yingchun Lai edited comment on KUDU-2453 at 11/19/19 9:25 AM:
--------------------------------------------------------------

We also happened to see this issue. I created another Jira to track it, and also gave some ideas to resolve it.

was (Author: acelyc111):
We also happened to see this issue. I created another Jira to track it, and also give some ideas to resolve it.

> kudu should stop creating tablet infinitely
> -------------------------------------------
>
>              Key: KUDU-2453
>              URL: https://issues.apache.org/jira/browse/KUDU-2453
>          Project: Kudu
>       Issue Type: Bug
>       Components: master, tserver
> Affects Versions: 1.4.0, 1.7.2
>         Reporter: LiFu He
>         Priority: Major
>
> I have met this problem again on 2018/10/26. And now the kudu version is 1.7.2.
> -----
> We modified the flag 'max_create_tablets_per_ts' (2000) in master.conf, and there was some load on the kudu cluster. Then someone else created a big table which had tens of thousands of tablets from impala-shell (that was a mistake).
> {code:java}
> CREATE TABLE XXX(
>   ...
>   PRIMARY KEY (...)
> )
> PARTITION BY HASH (...) PARTITIONS 100,
> RANGE (...)
> (
>   PARTITION "2018-10-24" <= VALUES < "2018-10-24\000",
>   PARTITION "2018-10-25" <= VALUES < "2018-10-25\000",
>   ...
>   PARTITION "2018-12-07" <= VALUES < "2018-12-07\000"
> )
> STORED AS KUDU
> TBLPROPERTIES ('kudu.master_addresses'= '...');
> {code}
> Here are the logs after creating the table (picking only one tablet as an example):
> {code:java}
> --Kudu-master log
> ==e884bda6bbd3482f94c07ca0f34f99a4==
> W1024 11:40:51.914397 180146 catalog_manager.cc:2664] TS 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050): Create Tablet RPC failed for tablet e884bda6bbd3482f94c07ca0f34f99a4: Remote error: Service unavailable: CreateTablet request on kudu.tserver.TabletServerAdminService from 10.120.219.118:50247 dropped due to backpressure. The service queue is full; it has 512 items.
> I1024 11:40:51.914412 180146 catalog_manager.cc:2700] Scheduling retry of CreateTablet RPC for tablet e884bda6bbd3482f94c07ca0f34f99a4 on TS 39f15fcf42ef45bba0c95a3223dc25ee with a delay of 42 ms (attempt = 1)
> ...
> ==Be replaced by 0b144c00f35d48cca4d4981698faef72==
> W1024 11:41:22.114512 180202 catalog_manager.cc:3949] T P f6c9a09da7ef4fc191cab6276b942ba3: Tablet e884bda6bbd3482f94c07ca0f34f99a4 (table quasi_realtime_user_feature [id=946d6dd03ec544eab96231e5a03bed59]) was not created within the allowed timeout. Replacing with a new tablet 0b144c00f35d48cca4d4981698faef72
> ...
> I1024 11:41:22.391916 180202 catalog_manager.cc:3806] T P f6c9a09da7ef4fc191cab6276b942ba3: Sending DeleteTablet for 3 replicas of tablet e884bda6bbd3482f94c07ca0f34f99a4
> ...
> I1024 11:41:22.391927 180202 catalog_manager.cc:2922] Sending DeleteTablet(TABLET_DATA_DELETED) for tablet e884bda6bbd3482f94c07ca0f34f99a4 on 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050) (Replaced by 0b144c00f35d48cca4d4981698faef72 at 2018-10-24 11:41:22 CST)
> ...
> W1024 11:41:22.428129 180146 catalog_manager.cc:2892] TS 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050): delete failed for tablet e884bda6bbd3482f94c07ca0f34f99a4 with error code TABLET_NOT_RUNNING: Already present: State transition of tablet e884bda6bbd3482f94c07ca0f34f99a4 already in progress: creating tablet
> ...
> I1024 11:41:22.428143 180146 catalog_manager.cc:2700] Scheduling retry of e884bda6bbd3482f94c07ca0f34f99a4 Delete Tablet RPC for TS=39f15fcf42ef45bba0c95a3223dc25ee with a delay of 35 ms (attempt = 1)
> ...
> W1024 11:41:22.683702 180145 catalog_manager.cc:2664] TS b251540e606b4863bb576091ff961892 (kudu1.lt.163.org:7050): Create Tablet RPC failed for tablet 0b144c00f35d48cca4d4981698faef72: Remote error: Service unavailable: CreateTablet request on kudu.tserver.TabletServerAdminService from 10.120.219.118:59735 dropped due to backpressure. The service queue is full; it has 512 items.
> I1024 11:41:22.683717 180145 catalog_manager.cc:2700] Scheduling retry of CreateTablet RPC for tablet 0b144c00f35d48cca4d4981698faef72 on TS b251540e606b4863bb576091ff961892 with a delay of 46 ms (attempt = 1)
> ...
> ==Be replaced by c0e0acc448fc42fc9e48f5025b112a75==
> W1024 11:41:52.775420 180202 catalog_manager.cc:3949] T P f6c9a09da7ef4fc191cab6276b942ba3: Tablet 0b144c00f35d48cca4d4981698faef72 (table quasi_realtime_user_feature [id=946d6dd03ec544eab96231e5a03bed59]) was not created within the allowed timeout. Replacing with a new tablet c0e0acc448fc42fc9e48f5025b112a75
> ...
[jira] [Commented] (KUDU-2453) kudu should stop creating tablet infinitely
[ https://issues.apache.org/jira/browse/KUDU-2453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16977296#comment-16977296 ]

Yingchun Lai commented on KUDU-2453:
-------------------------------------

We also happened to see this issue. I created another Jira to track it, and also give some ideas to resolve it.

> kudu should stop creating tablet infinitely
> -------------------------------------------
>
>              Key: KUDU-2453
>              URL: https://issues.apache.org/jira/browse/KUDU-2453
>          Project: Kudu
>       Issue Type: Bug
>       Components: master, tserver
> Affects Versions: 1.4.0, 1.7.2
>         Reporter: LiFu He
>         Priority: Major
>
> I have met this problem again on 2018/10/26. And now the kudu version is 1.7.2.
> -----
> We modified the flag 'max_create_tablets_per_ts' (2000) in master.conf, and there was some load on the kudu cluster. Then someone else created a big table which had tens of thousands of tablets from impala-shell (that was a mistake).
> {code:java}
> CREATE TABLE XXX(
>   ...
>   PRIMARY KEY (...)
> )
> PARTITION BY HASH (...) PARTITIONS 100,
> RANGE (...)
> (
>   PARTITION "2018-10-24" <= VALUES < "2018-10-24\000",
>   PARTITION "2018-10-25" <= VALUES < "2018-10-25\000",
>   ...
>   PARTITION "2018-12-07" <= VALUES < "2018-12-07\000"
> )
> STORED AS KUDU
> TBLPROPERTIES ('kudu.master_addresses'= '...');
> {code}
> Here are the logs after creating the table (picking only one tablet as an example):
> {code:java}
> --Kudu-master log
> ==e884bda6bbd3482f94c07ca0f34f99a4==
> W1024 11:40:51.914397 180146 catalog_manager.cc:2664] TS 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050): Create Tablet RPC failed for tablet e884bda6bbd3482f94c07ca0f34f99a4: Remote error: Service unavailable: CreateTablet request on kudu.tserver.TabletServerAdminService from 10.120.219.118:50247 dropped due to backpressure. The service queue is full; it has 512 items.
> I1024 11:40:51.914412 180146 catalog_manager.cc:2700] Scheduling retry of CreateTablet RPC for tablet e884bda6bbd3482f94c07ca0f34f99a4 on TS 39f15fcf42ef45bba0c95a3223dc25ee with a delay of 42 ms (attempt = 1)
> ...
> ==Be replaced by 0b144c00f35d48cca4d4981698faef72==
> W1024 11:41:22.114512 180202 catalog_manager.cc:3949] T P f6c9a09da7ef4fc191cab6276b942ba3: Tablet e884bda6bbd3482f94c07ca0f34f99a4 (table quasi_realtime_user_feature [id=946d6dd03ec544eab96231e5a03bed59]) was not created within the allowed timeout. Replacing with a new tablet 0b144c00f35d48cca4d4981698faef72
> ...
> I1024 11:41:22.391916 180202 catalog_manager.cc:3806] T P f6c9a09da7ef4fc191cab6276b942ba3: Sending DeleteTablet for 3 replicas of tablet e884bda6bbd3482f94c07ca0f34f99a4
> ...
> I1024 11:41:22.391927 180202 catalog_manager.cc:2922] Sending DeleteTablet(TABLET_DATA_DELETED) for tablet e884bda6bbd3482f94c07ca0f34f99a4 on 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050) (Replaced by 0b144c00f35d48cca4d4981698faef72 at 2018-10-24 11:41:22 CST)
> ...
> W1024 11:41:22.428129 180146 catalog_manager.cc:2892] TS 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050): delete failed for tablet e884bda6bbd3482f94c07ca0f34f99a4 with error code TABLET_NOT_RUNNING: Already present: State transition of tablet e884bda6bbd3482f94c07ca0f34f99a4 already in progress: creating tablet
> ...
> I1024 11:41:22.428143 180146 catalog_manager.cc:2700] Scheduling retry of e884bda6bbd3482f94c07ca0f34f99a4 Delete Tablet RPC for TS=39f15fcf42ef45bba0c95a3223dc25ee with a delay of 35 ms (attempt = 1)
> ...
> W1024 11:41:22.683702 180145 catalog_manager.cc:2664] TS b251540e606b4863bb576091ff961892 (kudu1.lt.163.org:7050): Create Tablet RPC failed for tablet 0b144c00f35d48cca4d4981698faef72: Remote error: Service unavailable: CreateTablet request on kudu.tserver.TabletServerAdminService from 10.120.219.118:59735 dropped due to backpressure. The service queue is full; it has 512 items.
> I1024 11:41:22.683717 180145 catalog_manager.cc:2700] Scheduling retry of CreateTablet RPC for tablet 0b144c00f35d48cca4d4981698faef72 on TS b251540e606b4863bb576091ff961892 with a delay of 46 ms (attempt = 1)
> ...
> ==Be replaced by c0e0acc448fc42fc9e48f5025b112a75==
> W1024 11:41:52.775420 180202 catalog_manager.cc:3949] T P f6c9a09da7ef4fc191cab6276b942ba3: Tablet 0b144c00f35d48cca4d4981698faef72 (table quasi_realtime_user_feature [id=946d6dd03ec544eab96231e5a03bed59]) was not created within the allowed timeout. Replacing with a new tablet c0e0acc448fc42fc9e48f5025b112a75
> ...
> --Kudu-tserver log
> I1024 11:40:52.014571 137358
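What the issue's summary asks for amounts to bounding the retry/replace loop visible in these logs. A hypothetical sketch of that behavior (all names invented; the real retry logic lives in the C++ master's catalog manager): retry CreateTablet with exponential backoff, and after a capped number of attempts fail the table's creation instead of replacing the tablet and starting over.

{code:java}
class CreateTabletRetrier {
  static final int MAX_ATTEMPTS = 10;
  static final long BASE_DELAY_MS = 50;

  // Returns true if the tablet was created; false means the caller should
  // fail the table creation rather than spawn a replacement tablet.
  static boolean createWithBackoff(String tabletId) throws InterruptedException {
    for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
      if (sendCreateTablet(tabletId)) {
        return true;
      }
      // Exponential backoff keeps retries from piling onto a tablet
      // server whose service queue is already full.
      Thread.sleep(BASE_DELAY_MS << Math.min(attempt, 6));
    }
    return false;
  }

  static boolean sendCreateTablet(String tabletId) {
    // Placeholder for the CreateTablet RPC; returns whether it succeeded.
    return false;
  }
}
{code}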