Re: Flink leaves a lot RocksDB sst files in tmp directory

2018-10-12 Thread Stefan Richter
Hi,

Can you show us what is inside one of the instance directories?
Furthermore, your TM logs show multiple OutOfMemoryErrors, which might also
be part of the problem. Also, how was the job moved? If a TM is killed, it
of course cannot clean up after itself. That is why the data goes to the tmp
dir: the OS can eventually take care of it, and in container environments
this dir should always be cleaned up anyway.

Best,
Stefan
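
For anyone who wants to answer the "what is inside" question on their own host, here is a minimal Python sketch. It assumes the layout seen in the logs of this thread (instance directories sitting directly under /tmp/flink-io-*); the function name and the output format are made up for illustration and are not part of Flink.

```python
import os
from pathlib import Path


def summarize_instance_dirs(tmp_root="/tmp"):
    """Return (path, file_count, total_bytes) per instance directory.

    Assumes instance directories live directly under <tmp_root>/flink-io-*,
    as in the log lines quoted in this thread.
    """
    rows = []
    for io_dir in sorted(Path(tmp_root).glob("flink-io-*")):
        if not io_dir.is_dir():
            continue
        for instance in sorted(p for p in io_dir.iterdir() if p.is_dir()):
            count = 0
            size = 0
            for dirpath, _dirnames, filenames in os.walk(instance):
                for name in filenames:
                    try:
                        size += os.path.getsize(os.path.join(dirpath, name))
                        count += 1
                    except OSError:
                        # The file disappeared while scanning, e.g. a live
                        # TM removed it; skip it.
                        pass
            rows.append((str(instance), count, size))
    return rows


if __name__ == "__main__":
    for path, count, size in summarize_instance_dirs():
        print(f"{size / 2**30:10.2f} GiB  {count:6d} files  {path}")
```

Running it on the affected host prints one line per instance directory, which makes it easy to see which job and operator the leftover sst files belong to.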




Re: Flink leaves a lot RocksDB sst files in tmp directory

2018-10-11 Thread Sayat Satybaldiyev
Thank you, Piotr, for the reply! We didn't run this job on a previous
version of Flink. Unfortunately, I don't have the JM log file, only the TM
logs.

https://drive.google.com/file/d/14QSVeS4c0EETT6ibK3m_TMgdLUwD6H1m/view?usp=sharing



Re: Flink leaves a lot RocksDB sst files in tmp directory

2018-10-10 Thread Piotr Nowojski
Hi,

Was this happening in older Flink versions? Could you describe under what
circumstances the job was moved to a new TM (full job manager and task
manager logs would be helpful)? I suspect that those leftover files might
have something to do with local recovery.

Piotrek 




Re: Flink leaves a lot RocksDB sst files in tmp directory

2018-10-09 Thread Sayat Satybaldiyev
After digging further into the log, I think it's more likely a bug. I
grepped the log by job ID and found that, under normal circumstances, the TM
is supposed to delete the flink-io files. For some reason, it doesn't delete
the files that were listed above.

2018-10-08 22:10:25,865 INFO  org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend  - Deleting existing instance base directory /tmp/flink-io-5eb5cae3-b194-40b2-820e-01f8f39b5bf6/job_a5b223c7aee89845f9aed24012e46b7e_op_StreamSink_92266bd138cd7d51ac7a63beeb86d5f5__1_1__uuid_bf69685b-78d3-431c-88be-b3f26db05566.
2018-10-08 22:10:25,867 INFO  org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend  - Deleting existing instance base directory /tmp/flink-io-5eb5cae3-b194-40b2-820e-01f8f39b5bf6/job_a5b223c7aee89845f9aed24012e46b7e_op_StreamSink_14630a50145935222dbee3f1bcfdc2a6__1_1__uuid_47cd6e95-144a-4c52-a905-52966a5e9381.
2018-10-08 22:10:25,874 INFO  org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend  - Deleting existing instance base directory /tmp/flink-io-5eb5cae3-b194-40b2-820e-01f8f39b5bf6/job_a5b223c7aee89845f9aed24012e46b7e_op_StreamSink_7185aa35d035b12c70cf490077378540__1_1__uuid_7c539a96-a247-4299-b1a0-01df713c3c34.
2018-10-08 22:17:38,680 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor  - Close JobManager connection for job a5b223c7aee89845f9aed24012e46b7e.
org.apache.flink.util.FlinkException: JobManager responsible for a5b223c7aee89845f9aed24012e46b7e lost the leadership.
org.apache.flink.util.FlinkException: JobManager responsible for a5b223c7aee89845f9aed24012e46b7e lost the leadership.
2018-10-08 22:17:38,686 INFO  org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend  - Deleting existing instance base directory /tmp/flink-io-5eb5cae3-b194-40b2-820e-01f8f39b5bf6/job_a5b223c7aee89845f9aed24012e46b7e_op_StreamSink_7185aa35d035b12c70cf490077378540__1_1__uuid_2e88c56a-2fc2-41f2-a1b9-3b0594f660fb.
org.apache.flink.util.FlinkException: JobManager responsible for a5b223c7aee89845f9aed24012e46b7e lost the leadership.
2018-10-08 22:17:38,691 INFO  org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend  - Deleting existing instance base directory /tmp/flink-io-5eb5cae3-b194-40b2-820e-01f8f39b5bf6/job_a5b223c7aee89845f9aed24012e46b7e_op_StreamSink_92266bd138cd7d51ac7a63beeb86d5f5__1_1__uuid_b44aecb7-ba16-4aa4-b709-31dae7f58de9.
org.apache.flink.util.FlinkException: JobManager responsible for a5b223c7aee89845f9aed24012e46b7e lost the leadership.
org.apache.flink.util.FlinkException: JobManager responsible for a5b223c7aee89845f9aed24012e46b7e lost the leadership.
org.apache.flink.util.FlinkException: JobManager responsible for a5b223c7aee89845f9aed24012e46b7e lost the leadership.
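
The grep-by-job-ID step above can be sketched in Python as follows. This is only an illustration: the regular expression is derived from the paths visible in these log lines, and the group names (op, op_id, subtask, parallelism) are my reading of the path segments rather than documented Flink naming. The helper names are hypothetical.

```python
import os
import re
from collections import defaultdict

# Pattern inferred from the "Deleting existing instance base directory"
# lines in this thread; the group names are an interpretation.
DELETED_INSTANCE = re.compile(
    r"Deleting existing instance base directory "
    r"(?P<path>/\S*?flink-io-[0-9a-f-]+/"
    r"job_(?P<job>[0-9a-f]{32})"
    r"_op_(?P<op>\w+?)_(?P<op_id>[0-9a-f]{32})"
    r"__(?P<subtask>\d+)_(?P<parallelism>\d+)"
    r"__uuid_(?P<uuid>[0-9a-f-]{36}))"
)


def deleted_instances(log_text):
    """Group logged instance directories by (job, operator, operator id)."""
    # Mail clients and terminals wrap long log lines, so collapse all
    # whitespace before matching.
    flat = " ".join(log_text.split())
    groups = defaultdict(list)
    for m in DELETED_INSTANCE.finditer(flat):
        groups[(m["job"], m["op"], m["op_id"])].append(m["path"])
    return dict(groups)


def still_on_disk(groups):
    """Return the logged paths that still exist on the local filesystem."""
    return [p for paths in groups.values() for p in paths if os.path.exists(p)]


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        found = deleted_instances(f.read())
    for key, paths in found.items():
        print(key, len(paths), "logged deletion(s)")
    for path in still_on_disk(found):
        print("still present:", path)
```

Comparing the logged deletions with what is still present in /tmp shows whether the leftover directories were ever scheduled for deletion, or whether they belong to instances the TM never tried to clean up.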


On Tue, Oct 9, 2018 at 2:33 PM Sayat Satybaldiyev wrote:

> Dear all,
>
> While running Flink 1.6.1 with RocksDB as the state backend and HDFS as the
> checkpoint FS, I've noticed that after a job has moved to a different host
> it leaves a huge amount of state in the temp folder (1.2 TB in total). The
> files are not in use, as the TM is not running the job on the current host.
>
> The job a5b223c7aee89845f9aed24012e46b7e had been running on the host, but
> then it was moved to a different TM. I'm wondering whether this is intended
> behavior or a possible bug.
>
> I've attached a screenshot of the files that are left behind and no longer
> used by the job.
>