Hi,

Most likely the issue happened because of the high number of tablet replicas
on the tablet server.  When the input data rate spikes, the increased
compaction activity might require more file descriptors than usual, since
more files are opened.
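
One rough way to check this on Linux is to compare the number of file
descriptors the tablet server process actually holds with the limit that
applies to that process.  Keep in mind that `ulimit -a` in a shell reports the
limits of that shell and its children, which are not necessarily those of a
daemon started by the init system.  Below is a minimal sketch that reads
/proc directly.  It is not part of Kudu, and the function names are my own.

```python
import os


def count_open_fds(pid):
    """Count the file descriptors a process holds (entries in /proc/<pid>/fd)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def max_open_files(pid):
    """Return the (soft, hard) 'Max open files' limits of a process.

    Values come from /proc/<pid>/limits; None stands for 'unlimited'.
    """
    def parse(value):
        return None if value == "unlimited" else int(value)

    with open(f"/proc/{pid}/limits") as f:
        for line in f:
            if line.startswith("Max open files"):
                # Line layout: Max open files  <soft>  <hard>  files
                fields = line.split()
                return parse(fields[3]), parse(fields[4])
    raise RuntimeError("no 'Max open files' line in /proc/%d/limits" % pid)


def report(pid):
    """Build a one-line summary of fd usage versus the soft limit."""
    used = count_open_fds(pid)
    soft, hard = max_open_files(pid)
    soft_text = "unlimited" if soft is None else str(soft)
    return f"pid {pid}: {used} open fds, soft limit {soft_text}"
```

Calling `print(report(pid))` with the PID of the kudu-tserver process (as root
or as the user that runs it) shows how close the process is to its own limit.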

How many tablet replicas does that tablet server have?  It's not
recommended to have too many:
https://kudu.apache.org/docs/known_issues.html#_scale

To understand what happened, you need to look at the logs of the tablet
server.  This might be useful:
https://kudu.apache.org/docs/troubleshooting.html

Overall, if there is only one (?) tablet server in the whole Kudu cluster,
why have 39 partitions per table?  I guess that's some sort of
proof-of-concept/toy setup, but anyway.  Since all the tablet replicas end
up on the same single tablet server, I don't see any benefit from
partitioning in that setup.  For the tablet server, it simply means a
several-fold increase in the number of open file descriptors and in memory
usage.


Kind regards,

Alexey

On Fri, Oct 4, 2019 at 4:21 AM Faraz Mateen <fmat...@an10.io> wrote:

> Hi all,
>
> I am facing a problem with my kudu setup where tablet server crashes with
> "too many open files" error.
> The setup consists of a single master and a single tablet server. Tables
> created are such that there are 39 partitions per table. However not all
> partitions have data that corresponds to them.
> Yesterday my tserver crashed and when I am trying to restart the tserver,
> it fails with the error:
>
> I1004 03:50:39.896301  5669 ts_tablet_manager.cc:1173] T
> cab85f15f06748d0b59161d9f3da55f7 P ee14d248ac994d0eb60dbb0db4ab3b09:
> Registered tablet (data state: TABLET_DATA_READY)
> W1004 03:50:39.923184  5687 os-util.cc:165] could not read
> /proc/self/status: IO error: /proc/self/status: Too many open files (error
> 24)
> I1004 03:50:39.939460  5669 ts_tablet_manager.cc:1173] T
> d8d68ce6f6ea49479c00d29709869f1f P ee14d248ac994d0eb60dbb0db4ab3b09:
> Registered tablet (data state: TABLET_DATA_READY)
>
> I have already modified ulimit of the machine:
>
> root@vm-3:~# ulimit -a
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 63923
> max locked memory       (kbytes, -l) 16384
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 65535
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) 8192
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 65535
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
>
> *Set up Details:*
> Single master and tserver setup on a single VM.
> 4 cores, 550GB hard disk, 16GB RAM
> Kudu version 1.8 on ubuntu, installed through debian packages.
> Before crash, data was being inserted in kudu at a very high rate. RAM
> usage was around 87% and disk usage was around 84 percent.
>
> Here is what I have tried so far:
> 1- Set ulimit -n to 65535.
> 2- Reboot the vm to get rid of stale processes.
> 3- Set block_manager_max_open_files to 32000 in tserver flag file.
>
> What I want to know now is:
> 1- Why am I hitting this problem? Is this due to low resources on the VM
> or high number of tablets on a single tserver?
> 2- How can I get around this problem, recover my data and kudu services?
>
> Would really appreciate some help on this.
> --
> Faraz Mateen
>
