Adar, Andrew, thanks for your detailed and prompt replies. "Fortunately" (for your questions) we have another TS whose WAL disk is currently about 80% full (and three more whose WAL disks are above 50%), and I suspect it will be the next one we'll have to restart in a few nights.
On this TS the output from 'lsof' this morning shows 175,550 files open, of which 84,876 are Kudu data files and 90,340 are Kudu WALs.

For this server these are the top 10 tablets by WAL size (in kB):

du -sk * | sort -nr | head -10
31400552 b3facf00fcff403293d36c1032811e6e
31204488 354088f7954047908b4e68e0627836b8
30584928 90a0465b7b4f4ed7a3c5e43d993cf52e
30536168 c5369eb17772432bbe326abf902c8055
30535900 4540cba1b331429a8cbacf08acf2a321
30503820 cb302d0772a64a5795913196bdf43ed3
30428040 f318cbf0cdcc4e10979fc1366027dde5
30379552 4779c42e5ef6493ba4d51dc44c97f7f7
29671692 a4ac51eefc59467abf45938529080c17
29539940 b00cb81794594d9b8366980a24bf79ad

and these are the top 10 tablets by number of WAL segments:

for t in *; do echo "$(ls $t | grep -c '^wal-') $t"; done | sort -nr | head -10
3813 b3facf00fcff403293d36c1032811e6e
3784 354088f7954047908b4e68e0627836b8
3716 90a0465b7b4f4ed7a3c5e43d993cf52e
3705 c5369eb17772432bbe326abf902c8055
3705 4540cba1b331429a8cbacf08acf2a321
3700 cb302d0772a64a5795913196bdf43ed3
3698 f318cbf0cdcc4e10979fc1366027dde5
3685 4779c42e5ef6493ba4d51dc44c97f7f7
3600 a4ac51eefc59467abf45938529080c17
3585 b00cb81794594d9b8366980a24bf79ad

As you can see, the largest tablets by WAL size are also the largest ones by number of WAL segments.

Taking a more detailed look at the largest of these tablets (b3facf00fcff403293d36c1032811e6e), these are the TS's that host a replica of that tablet, from the output of the command 'kudu table list':

T b3facf00fcff403293d36c1032811e6e
  L e4a4195a39df41f0b04887fdcae399d8 ts07:7050
  V 147fcef6fb49437aa19f7a95fb26c091 ts11:7050
  V 59fe260f21da48059ff5683c364070ce ts31:7050

where ts07 (the leader) is the TS whose WAL disk is about 80% full.

I looked at the 'ops_behind_leader' metric for that tablet on the other two TS's (ts11 and ts31) by querying their metrics endpoints, and it is 0 on both. As for the memory pressure, the leader (ts07) shows the following metrics:

leader_memory_pressure_rejections: 22397
transaction_memory_pressure_rejections: 0
follower_memory_pressure_rejections: 0

(Rough sketches of the commands used to gather these numbers are appended at the end of this message, before the quoted thread.)

Finally, a couple of non-technical comments about KUDU-3002 (https://issues.apache.org/jira/browse/KUDU-3002):

- I can see it has been fixed in Kudu 1.12.0; however, we (like probably most other enterprise customers) depend on a vendor distribution, so it won't really be available to us until the vendor packages it (I think the current version of Kudu in their runtime is 1.11.0, so I guess 1.12.0 could only be a month or two away).

- The other major problem is that vendor distributions like the one we are using bundle a couple of dozen products together, so if we want to upgrade Kudu to the latest available version, we also have to upgrade everything else: HDFS (major upgrade from 2.6 to 3.x), Kafka (major upgrade), HBase (major upgrade), etc. In many cases these upgrades also bring significant changes/deprecations in other components, like Parquet, which means we have to change (and in some cases rewrite) our code that uses Parquet or Kafka, since these products are evolving rapidly and often in ways that break compatibility with old versions - in other words, it's a big mess.

I apologize for the final rant; I understand that it is not your or Kudu's fault, and I don't know if there's an easy solution to this conundrum within the constraints of a vendor-supported approach, but for us it makes zero-maintenance cloud solutions attractive, at the cost of sacrificing the flexibility and "customizability" of an in-house solution.
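For reference, a rough sketch of how the lsof breakdown above can be reproduced (a sketch only: the process name and the '/wals/' and '/data/' path components are assumptions about the on-disk layout, adjust to your fs_wal_dir/fs_data_dirs settings):

# Break down the kudu-tserver process's open file descriptors by path.
# Assumptions: a single kudu-tserver process on the box, WAL segments under a
# path containing "/wals/" and data blocks under a path containing "/data/".
PID=$(pgrep -o -f kudu-tserver)
lsof -p "$PID" -Fn | grep '^n/' | sed 's/^n//' > /tmp/kudu_open_files.txt
echo "total: $(wc -l < /tmp/kudu_open_files.txt)"
echo "wals:  $(grep -c '/wals/' /tmp/kudu_open_files.txt)"
echo "data:  $(grep -c '/data/' /tmp/kudu_open_files.txt)"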
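The replica listing above came from 'kudu table list'; a sketch of the invocation (the master addresses and table name are placeholders, and the exact flags may differ by version):

# List the table's tablets together with their replicas; in the output, 'T' is
# a tablet id, 'L' its leader replica and 'V' a voting follower.
# master-1/2/3:7051 and my_table are placeholders, not our real names.
kudu table list master-1:7051,master-2:7051,master-3:7051 \
    -tables=my_table -list_tablets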
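And a sketch of how the per-tablet metrics (ops_behind_leader and the memory pressure rejection counters) can be pulled over HTTP; port 8050 is the default tablet server web UI port in our setup, and the jq filter is just one way to slice the JSON:

# Fetch ops_behind_leader and the memory-pressure rejection counters for one
# tablet from each TS's /metrics endpoint. The ?metrics= parameter filters by
# metric-name substring; jq then selects the entity for the tablet of interest.
TABLET=b3facf00fcff403293d36c1032811e6e
for ts in ts07 ts11 ts31; do
  echo "== $ts"
  curl -s "http://$ts:8050/metrics?metrics=ops_behind_leader,memory_pressure_rejections" |
    jq --arg id "$TABLET" '.[] | select(.type == "tablet" and .id == $id) | .metrics'
done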
Franco

> On March 30, 2020 at 2:22 PM Andrew Wong <aw...@cloudera.com> wrote:
>
> > Alternatively, if the servers in question are under constant memory
> > pressure and receive a fair number of updates, they may be
> > prioritizing flushing of inserted rows at the expense of updates,
> > causing the tablets to retain a great number of WAL segments
> > (containing older updates) for durability's sake.
>
> Just an FYI in case it helps confirm or rule it out, this refers to
> KUDU-3002 (https://issues.apache.org/jira/browse/KUDU-3002), which will be
> fixed in the upcoming release. Can you determine whether your tablet
> servers are under memory pressure?
>
> On Mon, Mar 30, 2020 at 11:17 AM Adar Lieber-Dembo <a...@cloudera.com> wrote:
>
> > > - the number of open files in the Kudu process in the tablet
> > > servers has increased to now more than 150,000 (as counted using 'lsof');
> > > we raised the limit of maximum number of open files twice already to
> > > avoid a crash, but we (and our vendor) are concerned that something
> > > might not be right with such a high number of open files.
> >
> > Using lsof, can you figure out which files are open? WAL segments?
> > Data files? Something else? Given the high WAL usage, I'm guessing
> > it's the former and these are actually one and the same problem, but
> > would be good to confirm nonetheless.
> >
> > > - in some of the tablet servers the disk space used by the WALs
> > > is significantly (and concerningly) higher than in most of the other
> > > tablet servers; we use a 1TB SSD drive (about 950GB usable) to store the
> > > WALs on each tablet server, and this week was the second time where we
> > > saw a tablet server almost fill the whole WAL disk. We had to stop and
> > > restart the tablet server, so its tablets would be migrated to different
> > > TS's, and we could manually clean up the WALs directory, but this is
> > > definitely not something we would like to do in the future. We took a
> > > look inside the WAL directory on that TS before wiping it, and we
> > > observed that there were a few tablets whose WALs were in excess of
> > > 30GB. Another piece of information is that the table that the largest of
> > > these tablets belong to, receives about 15M transactions a day, of which
> > > about 25% are new inserts and the rest are updates of existing rows.
> >
> > Sounds like there are at least several tablets with follower replicas
> > that have fallen behind their leaders and are trying to catch up. In
> > these situations, a leader will preserve as many WAL segments as
> > necessary in order to catch up the lagging follower replica, at least
> > until some threshold is reached (at which point the master will bring
> > a new replica online and the lagging replica will be evicted). These
> > calculations are done in terms of the number of WAL segments; in the
> > affected tablets, do you recall how many WAL segment files there were
> > before you deleted the directories?
> >
> > Alternatively, if the servers in question are under constant memory
> > pressure and receive a fair number of updates, they may be
> > prioritizing flushing of inserted rows at the expense of updates,
> > causing the tablets to retain a great number of WAL segments
> > (containing older updates) for durability's sake. If you recall the
> > affected tablet IDs, do your logs indicate the nature of the
> > background operations performed for those tablets?
> >
> > Some of these questions can also be answered via Kudu metrics. There's
> > the ops_behind_leader tablet-level metric, which can tell you how far
> > behind a replica may be. Unfortunately I can't find a metric for
> > average number of WAL segments retained (or a histogram); I thought we
> > had that, but maybe not.
>
> --
> Andrew Wong