In our production environment at work we run a Kudu cluster with 55 tablet
servers on Kudu version 1.7.0.
The cluster has 263 tables and 17,620 tablets (52,860 tablet replicas at 3x
replication), with about 58 TB of data. The output of 'kudu cluster ksck'
shows that all the tables are healthy.


In the last few months we have been seeing a couple of concerning phenomena:

- the number of open files held by the Kudu process on the tablet servers has
grown to more than 150,000 (as counted using 'lsof'); we have already raised
the open-file limit twice to avoid a crash, but we (and our vendor) are
concerned that something might not be right with such a high number of open
files.

- on some of the tablet servers the disk space used by the WALs is
significantly (and worryingly) higher than on most of the others; we use a
1 TB SSD drive (about 950 GB usable) to store the WALs on each tablet server,
and this week, for the second time, we saw a tablet server almost fill the
whole WAL disk. We had to stop and restart the tablet server, so that its
tablets would be migrated to other tablet servers and we could manually clean
up the WAL directory, but this is definitely not something we want to keep
doing. We took a look inside the WAL directory on that tablet server before
wiping it, and we observed that a few tablets had WALs in excess of 30 GB.
One more piece of information: the table to which the largest of these
tablets belongs receives about 15M transactions a day, of which about 25% are
new inserts and the rest are updates of existing rows.
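
For anyone who wants to check the same two things on their own tablet
servers, here is a minimal Python sketch of how both numbers could be
collected. It is only an illustration under some assumptions: it needs Linux
(it reads /proc), it assumes the WAL directory contains one subdirectory per
tablet, and the function names and the threshold are made up for this
example and are not part of Kudu. Note also that 'lsof' output can include
entries that are not file descriptors (for example memory-mapped files), so a
count taken from /proc/<pid>/fd may come out lower than the 'lsof' figure.

```python
import os
from pathlib import Path


def count_open_fds(pid):
    """Count file descriptors currently open in a process (Linux only).

    Reads /proc/<pid>/fd, so it must run as the process owner or as root.
    """
    return len(os.listdir(f"/proc/{pid}/fd"))


def wal_bytes_by_tablet(wals_root):
    """Return {subdirectory name: total bytes of the files below it}.

    Assumption: each immediate subdirectory of wals_root holds the WAL
    segments of exactly one tablet.
    """
    usage = {}
    for entry in Path(wals_root).iterdir():
        if entry.is_dir():
            usage[entry.name] = sum(
                f.stat().st_size for f in entry.rglob("*") if f.is_file()
            )
    return usage


def largest_tablets(usage, threshold_bytes):
    """List (tablet, bytes) pairs at or above the threshold, largest first."""
    big = [(tablet, size) for tablet, size in usage.items()
           if size >= threshold_bytes]
    return sorted(big, key=lambda pair: pair[1], reverse=True)
```

A call such as largest_tablets(wal_bytes_by_tablet(wals_root), 30 * 1024**3),
with wals_root pointing at the per-tablet WAL directory, would list the
tablets whose WALs exceed 30 GB, which is the kind of outlier described
above.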

We opened a couple of support cases with our vendor, and they are currently
reviewing the logs. We also thought it would be useful to post this to the
Kudu users mailing list, in case someone has ideas about what could cause this
behavior and how to address it, and to find out whether anyone else here has
noticed something similar on their Kudu clusters or whether it is peculiar to
our configuration and type of load.

Thanks in advance,
Franco Venturi
