> - the number of open files in the Kudu process in the tablet servers has 
> increased to now more than 150,000 (as counted using 'lsof'); we raised the 
> limit on the maximum number of open files twice already to avoid a crash, but we 
> (and our vendor) are concerned that something might not be right with such a 
> high number of open files.

Using lsof, can you figure out which files are open? WAL segments?
Data files? Something else? Given the high WAL usage, I'm guessing
it's the former and these are actually one and the same problem, but
it would be good to confirm nonetheless.
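
If it helps, here's a rough way to break that down by directory (untested
sketch; it assumes the process command name is kudu-tserver, otherwise
substitute 'lsof -p <pid>'):

    #!/usr/bin/env python
    # Group the tserver's open files by directory so it's easy to see
    # whether WAL segments, data blocks, or something else dominates.
    # Assumes 'lsof' is available and the command name is kudu-tserver.
    import collections
    import os
    import subprocess

    out = subprocess.check_output(["lsof", "-nP", "-c", "kudu-tserver"])
    counts = collections.Counter()
    for line in out.decode("utf-8", "replace").splitlines()[1:]:
        path = line.split()[-1]
        if path.startswith("/"):          # skip sockets, pipes, etc.
            counts[os.path.dirname(path)] += 1

    for directory, n in counts.most_common(20):
        print("%8d  %s" % (n, directory))

If most of the open files land under your WAL directory, that would confirm
the two problems are one and the same.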

> - in some of the tablet servers the disk space used by the WALs is 
> significantly (and concerningly) higher than in most of the other tablet 
> servers; we use a 1TB SSD drive (about 950GB usable) to store the WALs on 
> each tablet server, and this week was the second time where we saw a tablet 
> server almost fill the whole WAL disk. We had to stop and restart the tablet 
> server, so its tablets would be migrated to different TS's, and we could 
> manually clean up the WALs directory, but this is definitely not something we 
> would like to do in the future. We took a look inside the WAL directory on 
> that TS before wiping it, and we observed that there were a few tablets whose 
> WALs were in excess of 30GB. Another piece of information is that the table 
> that the largest of these tablets belongs to receives about 15M transactions 
> a day, of which about 25% are new inserts and the rest are updates of 
> existing rows.

Sounds like there are at least several tablets with follower replicas
that have fallen behind their leaders and are trying to catch up. In
these situations, a leader will preserve as many WAL segments as
necessary in order to catch up the lagging follower replica, at least
until some threshold is reached (at which point the master will bring
a new replica online and the lagging replica will be evicted). These
calculations are done in terms of the number of WAL segments; in the
affected tablets, do you recall how many WAL segment files there were
before you deleted the directories?
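
If this happens again before you wipe anything, a quick per-tablet count
would tell us a lot. Something along these lines (rough sketch; WAL_ROOT is
a guess, point it at the 'wals' directory on your WAL drive):

    #!/usr/bin/env python
    # Count retained WAL segments per tablet. Assumes the usual layout of
    # one wals/<tablet_id> directory per tablet, with segments named
    # 'wal-<sequence number>'. WAL_ROOT below is a placeholder.
    import os

    WAL_ROOT = "/path/to/kudu/wals"   # adjust for your deployment

    counts = []
    for tablet_id in os.listdir(WAL_ROOT):
        tablet_dir = os.path.join(WAL_ROOT, tablet_id)
        if not os.path.isdir(tablet_dir):
            continue
        segments = [f for f in os.listdir(tablet_dir) if f.startswith("wal-")]
        counts.append((len(segments), tablet_id))

    for n, tablet_id in sorted(counts, reverse=True)[:20]:
        print("%6d segments  %s" % (n, tablet_id))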

Alternatively, if the servers in question are under constant memory
pressure and receive a fair number of updates, they may be
prioritizing flushing of inserted rows at the expense of updates,
causing the tablets to retain a great number of WAL segments
(containing older updates) for durability's sake. If you recall the
affected tablet IDs, do your logs indicate the nature of the
background operations performed for those tablets?
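
Grepping the tserver INFO log for the tablet ID together with the
maintenance op names is usually enough to see what was scheduled. A sketch
(the op names listed are the common ones rather than an exhaustive list,
and the log path will vary by install):

    #!/usr/bin/env python
    # Print maintenance-related log lines for a single tablet. LOG_PATH and
    # TABLET_ID are placeholders; the op-name list is a best guess at the
    # interesting ops (flushes, compactions, log GC).
    import re

    LOG_PATH = "/var/log/kudu/kudu-tserver.INFO"   # adjust as needed
    TABLET_ID = "<affected tablet id>"             # fill in

    OPS = re.compile(r"FlushMRSOp|FlushDeltaMemStoresOp|CompactRowSetsOp|"
                     r"MinorDeltaCompactionOp|MajorDeltaCompactionOp|LogGCOp")

    with open(LOG_PATH) as log:
        for line in log:
            if TABLET_ID in line and OPS.search(line):
                print(line.rstrip())

Lots of FlushMRSOp activity with few delta flushes or compactions for that
tablet would be consistent with the memory-pressure theory above.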

Some of these questions can also be answered via Kudu metrics. There's
the ops_behind_leader tablet-level metric, which can tell you how far
behind a replica may be. Unfortunately I can't find a metric for
average number of WAL segments retained (or a histogram); I thought we
had that, but maybe not.
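
The tablet server's embedded webserver exposes those metrics as JSON, so
you can poll them without restarting anything. Rough example (8050 is the
default tserver web port; substitute your own host):

    #!/usr/bin/env python
    # Fetch ops_behind_leader for every tablet replica on one tablet server
    # via its /metrics endpoint. TSERVER below is a placeholder.
    import json
    try:
        from urllib.request import urlopen    # Python 3
    except ImportError:
        from urllib2 import urlopen           # Python 2

    TSERVER = "http://tserver-host:8050"      # adjust host (and port, if changed)

    url = TSERVER + "/metrics?metrics=ops_behind_leader"
    entities = json.loads(urlopen(url).read().decode("utf-8"))
    for entity in entities:
        if entity.get("type") != "tablet":
            continue
        for metric in entity.get("metrics", []):
            if metric.get("name") == "ops_behind_leader":
                print("%10d  %s" % (metric["value"], entity["id"]))

A replica reporting a persistently large value there is lagging, and its
leader (possibly on another tablet server) is the one that will be
retaining extra WAL segments, per the scenario above.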
