Hi Paul,

As you discovered, Kudu holds WAL segments open until the tablets they belong to are deleted. block_manager_max_open_files won't help here; it only applies to files opened for accessing data blocks, not WAL segments.

As far as WAL segments are concerned, we've previously discussed "quiescing" tablets that haven't been used in some time, which would involve halting their Raft consensus state machine and perhaps closing their WAL segments. I can't find a JIRA for this feature, but I'm also not aware of anyone working on it. If you're interested in contributing to Kudu, this could be a worthwhile avenue for you to explore further.

I'm a little fuzzy on the details, but I believe that by default a tablet will retain anywhere from 2 to 10 WAL segments, all of them open. The exact number depends on how "caught up" the replication group is; if one peer is behind, more segments may be retained in order to help that peer catch up in the future. The settings that control these numbers are log_min_segments_to_retain and log_max_segments_to_retain.

Out of curiosity, how many tablet replicas did your 334 tables generate in total? You can work that out by multiplying, for each table, its number of partitions by its replication factor, then summing across tables. And across how many tservers were they all distributed? By design, tservers can handle many tablets, but as usual, the implementation lags the design, and at the moment we're recommending no more than 100 tablets per tserver (http://kudu.apache.org/docs/known_issues.html#_other_known_issues).

On Thu, Feb 16, 2017 at 8:42 AM, Paul Brannan <[email protected]> wrote:
> I wrote a quick script today to see how kudu behaves if I create many
> tables. After creating 334 tables, I started getting timeouts. I see this
> in the master log file:
>
> W0216 11:37:48.961221 49810 catalog_manager.cc:2490] CreateTablet RPC for
> tablet 9b259d5c5ff74f04820240f2159bc1a0 on TS
> faaf4e9b6e5945d7a14953c4cc34f164 (telx-sb-dev2:7050) failed: IO error:
> Couldn't create tablet metadata: Failed to write tablet metadata
> 9b259d5c5ff74f04820240f2159bc1a0: Call to mkstemp() failed on name template
> /var/lib/kudu/tserver/tablet-meta/9b259d5c5ff74f04820240f2159bc1a0.tmp.XXXXXX:
> Too many open files (error 24)
>
> I decreased block_manager_max_open_files, but still got the same result.
> Lsof shows that the open files are for the WAL:
>
> kudu-tser 49648 kudu 1021u REG 8,5 67108864 16385457
> /var/lib/kudu/tserver/wals/62b73d1b7f7a4e61a0a30a551e66230b/wal-000000001
> kudu-tser 49648 kudu 1022r REG 8,5 67108864 16385457
> /var/lib/kudu/tserver/wals/62b73d1b7f7a4e61a0a30a551e66230b/wal-000000001
> kudu-tser 49648 kudu 1023u REG 8,5 24000000 16385458
> /var/lib/kudu/tserver/wals/62b73d1b7f7a4e61a0a30a551e66230b/index.000000000
>
> The files do not get closed until the tables are deleted, even though no
> running process has any of those tables open.
>
> Is there a setting that will reduce the number of WAL files that get created
> or held open at any given point in time?
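
P.S. In case it helps with the replica count I asked about, here is a rough Python sketch of the arithmetic. The table layout (4 partitions, replication factor 3) and the tserver count are made-up placeholders, not your actual numbers, and the 2 to 10 retained-segment range is just my recollection of the defaults from above. The function names are hypothetical and not part of any Kudu API.

```python
from math import ceil


def total_replicas(tables):
    """Sum partitions * replication factor over all tables.

    `tables` is a list of (num_partitions, replication_factor) pairs.
    """
    return sum(partitions * rf for partitions, rf in tables)


def wal_segment_range(replicas, min_segments=2, max_segments=10):
    """Rough (low, high) bound on retained WAL segments for `replicas`.

    The 2 and 10 defaults are assumed values for
    log_min_segments_to_retain / log_max_segments_to_retain.
    """
    return replicas * min_segments, replicas * max_segments


# Hypothetical input: 334 tables, each with 4 partitions and RF 3.
tables = [(4, 3)] * 334
num_tservers = 3  # hypothetical cluster size

replicas = total_replicas(tables)
per_tserver = ceil(replicas / num_tservers)  # assumes an even spread
low, high = wal_segment_range(per_tserver)

print(f"total replicas: {replicas}")
print(f"replicas per tserver: {per_tserver} (recommended max: 100)")
print(f"retained WAL segments per tserver: {low} to {high}")
```

Plugging in your real partition counts and replication factors should show how far past the 100-tablets-per-tserver recommendation the test went, and roughly how many WAL segments each tserver was holding open.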
