Hi,

You could try setting the limit on the number of open files to unlimited and see how it goes when you start the tablet server.
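
Before changing anything, it may be worth confirming which limit the tserver process actually runs with. Limits are per process, so the `ulimit -a` output of a root shell does not necessarily apply to a daemon started by an init script or service manager. Below is a minimal sketch, assuming Linux with /proc mounted. The function name `fd_report` is made up for illustration, and `kudu-tserver` as the process name is an assumption about the package install.

```shell
#!/bin/sh
# Print the soft "open files" limit and the current descriptor count for a PID.
# Relies on Linux /proc; run as the process owner or as root.
fd_report() {
    pid="$1"
    # In /proc/<pid>/limits, field 4 of the "Max open files" row is the soft limit.
    soft=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits")
    # Each entry under /proc/<pid>/fd is one open descriptor.
    used=$(ls "/proc/$pid/fd" | wc -l | tr -d ' ')
    echo "pid=$pid soft_limit=$soft open_fds=$used"
}

# Example: inspect the current shell. For the tablet server, pass its PID
# instead, e.g. from `pidof kudu-tserver` (process name is an assumption).
fd_report "$$"
```

If the reported soft limit is lower than what the shell shows, the limit has to be raised where the service is launched rather than in the login shell.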
I think the best way forward is to add tablet servers to the cluster. Ideally, you want your data replicated: consider creating tables with replication factor 3 and having at least 4 tablet servers in your cluster. Once you have added the new tablet servers, don't forget to run the rebalancer tool (kudu cluster rebalance ...).

HTH,

Alexey

On Mon, Oct 7, 2019 at 2:31 AM Faraz Mateen <fmat...@an10.io> wrote:

> Alexey,
>
> Thank you for the response. Having too many partitions is exactly what the
> problem is. When I restart the tserver, it tries to open files against each
> tablet and eventually crashes.
>
> Is there a way to get around this and recover my data? Is there any config
> I can change to run the tserver? Or can I add a new tablet server and
> migrate existing tablets?
>
> On Sat, Oct 5, 2019 at 10:05 PM Alexey Serbin <aser...@cloudera.com>
> wrote:
>
>> Hi,
>>
>> Most likely the issue happened because of high number of tablet replicas
>> at the tablet server. In case of a high spike in the input data rate,
>> higher compaction activity might require more than usual number of file
>> descriptors, since more files are opened.
>>
>> How many tablet replicas does that tablet server have? It's not
>> recommended to have too many:
>> https://kudu.apache.org/docs/known_issues.html#_scale
>>
>> To understand what has happened, you need to take a look into the logs of
>> the tablet server. This might be useful:
>> https://kudu.apache.org/docs/troubleshooting.html
>>
>> Overall, if there is only one (?) tablet server in the whole Kudu
>> cluster, why to have 39 partitions per table? I guess that's some sort of
>> proof-of-concept/toy setup, but anyways. Since all the tablet replicas end
>> up at the same single tablet server, I don't see benefits from partitioning
>> in that setup. For the tablet server, it simply means x-times increased
>> number of open file descriptors and increased memory usage.
>>
>> Kind regards,
>>
>> Alexey
>>
>> On Fri, Oct 4, 2019 at 4:21 AM Faraz Mateen <fmat...@an10.io> wrote:
>>
>>> Hi all,
>>>
>>> I am facing a problem with my kudu setup where tablet server crashes
>>> with "too many open files" error.
>>> The setup consists of a single master and a single tablet server. Tables
>>> created are such that there are 39 partitions per table. However not all
>>> partitions have data that corresponds to them.
>>> Yesterday my tserver crashed and when I am trying to restart the
>>> tserver, it fails with the error:
>>>
>>> I1004 03:50:39.896301 5669 ts_tablet_manager.cc:1173] T
>>> cab85f15f06748d0b59161d9f3da55f7 P ee14d248ac994d0eb60dbb0db4ab3b09:
>>> Registered tablet (data state: TABLET_DATA_READY)
>>> W1004 03:50:39.923184 5687 os-util.cc:165] could not read
>>> /proc/self/status: IO error: /proc/self/status: Too many open files (error
>>> 24)
>>> I1004 03:50:39.939460 5669 ts_tablet_manager.cc:1173] T
>>> d8d68ce6f6ea49479c00d29709869f1f P ee14d248ac994d0eb60dbb0db4ab3b09:
>>> Registered tablet (data state: TABLET_DATA_READY)
>>>
>>> I have already modified ulimit of the machine:
>>>
>>> root@vm-3:~# ulimit -a
>>> core file size          (blocks, -c) 0
>>> data seg size           (kbytes, -d) unlimited
>>> scheduling priority             (-e) 0
>>> file size               (blocks, -f) unlimited
>>> pending signals                 (-i) 63923
>>> max locked memory       (kbytes, -l) 16384
>>> max memory size         (kbytes, -m) unlimited
>>> open files                      (-n) 65535
>>> pipe size            (512 bytes, -p) 8
>>> POSIX message queues     (bytes, -q) 819200
>>> real-time priority              (-r) 0
>>> stack size              (kbytes, -s) 8192
>>> cpu time               (seconds, -t) unlimited
>>> max user processes              (-u) 65535
>>> virtual memory          (kbytes, -v) unlimited
>>> file locks                      (-x) unlimited
>>>
>>> *Set up Details:*
>>> Single master and tserver setup on a single VM.
>>> 4 cores, 550GB hard disk, 16GB RAM
>>> Kudu version 1.8 on ubuntu, installed through debian packages.
>>> Before crash, data was being inserted in kudu at a very high rate. RAM
>>> usage was around 87% and disk usage was around 84 percent.
>>>
>>> Here is what I have tried so far:
>>> 1- Set ulimit -n to 65535.
>>> 2- Reboot the vm to get rid of stale processes.
>>> 3- Set block_manager_max_open_files to 32000 in tserver flag file.
>>>
>>> What I want to know now is:
>>> 1- Why am I hitting this problem? Is this due to low resources on the VM
>>> or high number of tablets on a single tserver?
>>> 2- How can I get around this problem, recover my data and kudu services?
>>>
>>> Would really appreciate some help on this.
>>> --
>>> Faraz Mateen
>>>
>>
>
> --
> Faraz Mateen