Hi,

Can you show us what is inside one of the instance directories?
Furthermore, your TM logs show multiple OutOfMemoryErrors, so that might
also be a problem. Also, how was the job moved? If a TM is killed, it of
course cannot clean up. That is why the data goes to the tmp dir, so that the
OS can eventually take care of it. In container environments this dir should
always be cleaned up anyway.
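
To see what is left in the instance directories and how large each one is, something like the following could help. It is only a minimal sketch: the directory name pattern is inferred from the "Deleting existing instance base directory" log lines quoted below, not taken from the Flink sources, and the names `INSTANCE_DIR`, `dir_size` and `leftover_instances` are made up for illustration.

```python
import os
import re
from pathlib import Path

# Assumption: instance directories look like the paths in the quoted log lines,
#   job_<32 hex chars>_op_<operator info>__uuid_<36 char uuid>
# This pattern is inferred from those log lines only.
INSTANCE_DIR = re.compile(
    r"^job_(?P<job_id>[0-9a-f]{32})_op_(?P<rest>.+)__uuid_(?P<uuid>[0-9a-f-]{36})$"
)


def dir_size(path):
    """Return the total size in bytes of all regular files below path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.islink(file_path):
                continue  # do not follow or count symlinks
            try:
                total += os.path.getsize(file_path)
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


def leftover_instances(tmp_root, job_id):
    """List (path, size_in_bytes) for instance dirs of job_id under tmp_root/flink-io-*."""
    results = []
    for io_dir in sorted(Path(tmp_root).glob("flink-io-*")):
        if not io_dir.is_dir():
            continue
        for child in sorted(io_dir.iterdir()):
            match = INSTANCE_DIR.match(child.name)
            if child.is_dir() and match and match.group("job_id") == job_id:
                results.append((str(child), dir_size(child)))
    return results
```

Calling `leftover_instances("/tmp", "a5b223c7aee89845f9aed24012e46b7e")` on the affected host would list each remaining instance directory for that job with its size. It only reads the file system and deletes nothing.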

Best,
Stefan

> On 11. Oct 2018, at 10:15, Sayat Satybaldiyev <saya...@gmail.com> wrote:
> 
> Thank you, Piotr, for the reply! We didn't run this job on the previous version 
> of Flink. Unfortunately, I don't have a log file from the JM, only TM logs. 
> 
> https://drive.google.com/file/d/14QSVeS4c0EETT6ibK3m_TMgdLUwD6H1m/view?usp=sharing
> 
> On Wed, Oct 10, 2018 at 10:08 AM Piotr Nowojski <pi...@data-artisans.com> wrote:
> Hi,
> 
> Was this happening in older Flink versions? Could you post under what 
> circumstances the job was moved to a new TM (full job manager logs and 
> task manager logs would be helpful)? I suspect that those leftover files 
> might have something to do with local recovery.
> 
> Piotrek 
> 
>> On 9 Oct 2018, at 15:28, Sayat Satybaldiyev <saya...@gmail.com> wrote:
>> 
>> After digging more into the log, I think it's more likely a bug. I grepped the log by 
>> job ID and found that under normal circumstances the TM is supposed to delete the flink-io 
>> files. For some reason, it doesn't delete the files that were listed above.
>> 
>> 2018-10-08 22:10:25,865 INFO  
>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend  - 
>> Deleting existing instance base directory 
>> /tmp/flink-io-5eb5cae3-b194-40b2-820e-01f8f39b5bf6/job_a5b223c7aee89845f9aed24012e46b7e_op_StreamSink_92266bd138cd7d51ac7a63beeb86d5f5__1_1__uuid_bf69685b-78d3-431c-88be-b3f26db05566.
>> 2018-10-08 22:10:25,867 INFO  
>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend  - 
>> Deleting existing instance base directory 
>> /tmp/flink-io-5eb5cae3-b194-40b2-820e-01f8f39b5bf6/job_a5b223c7aee89845f9aed24012e46b7e_op_StreamSink_14630a50145935222dbee3f1bcfdc2a6__1_1__uuid_47cd6e95-144a-4c52-a905-52966a5e9381.
>> 2018-10-08 22:10:25,874 INFO  
>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend  - 
>> Deleting existing instance base directory 
>> /tmp/flink-io-5eb5cae3-b194-40b2-820e-01f8f39b5bf6/job_a5b223c7aee89845f9aed24012e46b7e_op_StreamSink_7185aa35d035b12c70cf490077378540__1_1__uuid_7c539a96-a247-4299-b1a0-01df713c3c34.
>> 2018-10-08 22:17:38,680 INFO  
>> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Close 
>> JobManager connection for job a5b223c7aee89845f9aed24012e46b7e.
>> org.apache.flink.util.FlinkException: JobManager responsible for 
>> a5b223c7aee89845f9aed24012e46b7e lost the leadership.
>> org.apache.flink.util.FlinkException: JobManager responsible for 
>> a5b223c7aee89845f9aed24012e46b7e lost the leadership.
>> 2018-10-08 22:17:38,686 INFO  
>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend  - 
>> Deleting existing instance base directory 
>> /tmp/flink-io-5eb5cae3-b194-40b2-820e-01f8f39b5bf6/job_a5b223c7aee89845f9aed24012e46b7e_op_StreamSink_7185aa35d035b12c70cf490077378540__1_1__uuid_2e88c56a-2fc2-41f2-a1b9-3b0594f660fb.
>> org.apache.flink.util.FlinkException: JobManager responsible for 
>> a5b223c7aee89845f9aed24012e46b7e lost the leadership.
>> 2018-10-08 22:17:38,691 INFO  
>> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend  - 
>> Deleting existing instance base directory 
>> /tmp/flink-io-5eb5cae3-b194-40b2-820e-01f8f39b5bf6/job_a5b223c7aee89845f9aed24012e46b7e_op_StreamSink_92266bd138cd7d51ac7a63beeb86d5f5__1_1__uuid_b44aecb7-ba16-4aa4-b709-31dae7f58de9.
>> org.apache.flink.util.FlinkException: JobManager responsible for 
>> a5b223c7aee89845f9aed24012e46b7e lost the leadership.
>> org.apache.flink.util.FlinkException: JobManager responsible for 
>> a5b223c7aee89845f9aed24012e46b7e lost the leadership.
>> org.apache.flink.util.FlinkException: JobManager responsible for 
>> a5b223c7aee89845f9aed24012e46b7e lost the leadership.
>> 
>> 
>> On Tue, Oct 9, 2018 at 2:33 PM Sayat Satybaldiyev <saya...@gmail.com> wrote:
>> Dear all,
>> 
>> While running Flink 1.6.1 with RocksDB as the state backend and HDFS as the checkpoint 
>> FS, I've noticed that after a job has moved to a different host, it leaves 
>> a large amount of state in the temp folder (1.2 TB in total). The files are not used, as 
>> the TM is not running the job on the current host. 
>> 
>> The job a5b223c7aee89845f9aed24012e46b7e had been running on this host, but 
>> then it was moved to a different TM. I'm wondering whether this is intended behavior 
>> or a possible bug.
>> 
>> I've attached a screenshot of the files that are left over and not used by the job.
> 
