Hi Stephan,
I guess this is the case. Our cluster is a bit overloaded network-wise,
so sometimes a TaskManager gets disconnected, which causes a restart of
the entire job, leading to multiple segfaults in other TaskManagers and
prolonging recovery.
We're upgrading the network, hopefully the
Hi,
I would assume that those segfaults are only observed *after* a job is already
in the process of canceling? This is a known problem, but currently "accepted"
behaviour after discussions with Stephan and Aljoscha (in CC). From that
discussion, the background is that the native RocksDB