Re: Checkpoints issue and job failing

2020-01-06 Thread Navneeth Krishnan
Thanks Vino & Piotr,

Sure, we will upgrade the Flink version and monitor it to see if the problem
still exists.

Thanks

On Mon, Jan 6, 2020 at 12:39 AM Piotr Nowojski wrote:

> Hi,
>
> From the top of my head I don’t remember anything particular, however
> release 1.4.0 came with quite a lot of deep change which had it’s fair
> share number of bugs, that were subsequently fixed in later releases.
>
> Because the 1.4.x tree is no longer supported, I would strongly recommend
> first upgrading to a more recent Flink version. If that’s not possible, I
> would at least upgrade to the latest release from the 1.4.x tree (1.4.2).
>
> Piotrek
>
> On 6 Jan 2020, at 07:25, vino yang wrote:
>
> Hi Navneeth,
>
> Since the file still exists, this exception is very strange.
>
> I want to ask, does it happen occasionally or frequently?
>
> Another concern is that, since the 1.4 version is quite old, maintenance
> and responses are not as timely as for recent versions. I personally
> recommend upgrading as soon as possible.
>
> I can ping @Piotr Nowojski and see if he can explain the cause of this
> problem.
>
> Best,
> Vino
>
> On Sat, Jan 4, 2020 at 1:03 AM Navneeth Krishnan wrote:
>
>> Thanks Congxian & Vino.
>>
>> Yes, the file does exist and I don't see any problem in accessing it.
>>
>> Regarding Flink 1.9, we haven't migrated yet, but we are planning to.
>> Since we have to test it, it might take some time.
>>
>> Thanks
>>
>> On Fri, Jan 3, 2020 at 2:14 AM Congxian Qiu wrote:
>>
>>> Hi
>>>
>>> Have you ever checked whether this problem exists on Flink 1.9?
>>>
>>> Best,
>>> Congxian
>>>
>>>
>>> On Fri, Jan 3, 2020 at 3:54 PM vino yang wrote:
>>>
 Hi Navneeth,

 Did you check whether the path contained in the exception really cannot
 be found?

 Best,
 Vino

 On Fri, Jan 3, 2020 at 8:23 AM Navneeth Krishnan wrote:

> Hi All,
>
> We are running into checkpoint timeout issues more frequently in
> production, and we also see the exception below. We are running Flink 1.4.0
> and the checkpoints are saved on NFS. Can someone suggest how to overcome
> this?
>
> 
>
> java.lang.IllegalStateException: Could not initialize operator state 
> backend.
>   at 
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initOperatorState(AbstractStreamOperator.java:302)
>   at 
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:249)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.initializeOperators(StreamTask.java:692)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:679)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:253)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.FileNotFoundException: 
> /mnt/checkpoints/02c4f8d5c11921f363b98c5959cc4f06/chk-101/e71d8eaf-ff4a-4783-92bd-77e3d8978e01
>  (No such file or directory)
>   at java.io.FileInputStream.open0(Native Method)
>   at java.io.FileInputStream.open(FileInputStream.java:195)
>   at java.io.FileInputStream.<init>(FileInputStream.java:138)
>   at 
> org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50)
>
>
> Thanks
>
>
>


Re: Checkpoints issue and job failing

2020-01-06 Thread Piotr Nowojski
Hi,

From the top of my head I don’t remember anything particular; however, release
1.4.0 came with quite a lot of deep changes, which had their fair share of
bugs that were subsequently fixed in later releases.

Because the 1.4.x tree is no longer supported, I would strongly recommend first
upgrading to a more recent Flink version. If that’s not possible, I would at
least upgrade to the latest release from the 1.4.x tree (1.4.2).

Piotrek

> On 6 Jan 2020, at 07:25, vino yang wrote:
> 
> Hi Navneeth,
> 
> Since the file still exists, this exception is very strange.
> 
> I want to ask, does it happen occasionally or frequently?
> 
> Another concern is that, since the 1.4 version is quite old, maintenance
> and responses are not as timely as for recent versions. I personally
> recommend upgrading as soon as possible.
> 
> I can ping @Piotr Nowojski and see if he can explain the cause of this
> problem.
> 
> Best,
> Vino
> 
> On Sat, Jan 4, 2020 at 1:03 AM Navneeth Krishnan wrote:
> Thanks Congxian & Vino.
> 
> Yes, the file does exist and I don't see any problem in accessing it.
> 
> Regarding Flink 1.9, we haven't migrated yet, but we are planning to. Since
> we have to test it, it might take some time.
> 
> Thanks
> 
> On Fri, Jan 3, 2020 at 2:14 AM Congxian Qiu wrote:
> Hi
> 
> Have you ever checked whether this problem exists on Flink 1.9?
> 
> Best,
> Congxian
> 
> 
> On Fri, Jan 3, 2020 at 3:54 PM vino yang wrote:
> Hi Navneeth,
> 
> Did you check whether the path contained in the exception really cannot be
> found?
> 
> Best,
> Vino
> 
> On Fri, Jan 3, 2020 at 8:23 AM Navneeth Krishnan wrote:
> Hi All,
> 
> We are running into checkpoint timeout issues more frequently in production,
> and we also see the exception below. We are running Flink 1.4.0 and the
> checkpoints are saved on NFS. Can someone suggest how to overcome this?
> 
> 
> 
> java.lang.IllegalStateException: Could not initialize operator state backend.
>   at 
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initOperatorState(AbstractStreamOperator.java:302)
>   at 
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:249)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.initializeOperators(StreamTask.java:692)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:679)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:253)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.FileNotFoundException: 
> /mnt/checkpoints/02c4f8d5c11921f363b98c5959cc4f06/chk-101/e71d8eaf-ff4a-4783-92bd-77e3d8978e01
>  (No such file or directory)
>   at java.io.FileInputStream.open0(Native Method)
>   at java.io.FileInputStream.open(FileInputStream.java:195)
>   at java.io.FileInputStream.<init>(FileInputStream.java:138)
>   at 
> org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50)
> 
> Thanks



Re: Checkpoints issue and job failing

2020-01-05 Thread vino yang
Hi Navneeth,

Since the file still exists, this exception is very strange.

I want to ask, does it happen occasionally or frequently?

Another concern is that, since the 1.4 version is quite old, maintenance
and responses are not as timely as for recent versions. I personally
recommend upgrading as soon as possible.

I can ping @Piotr Nowojski and see if he can explain the cause of this
problem.

Best,
Vino

On Sat, Jan 4, 2020 at 1:03 AM Navneeth Krishnan wrote:

> Thanks Congxian & Vino.
>
> Yes, the file does exist and I don't see any problem in accessing it.
>
> Regarding Flink 1.9, we haven't migrated yet, but we are planning to.
> Since we have to test it, it might take some time.
>
> Thanks
>
> On Fri, Jan 3, 2020 at 2:14 AM Congxian Qiu wrote:
>
>> Hi
>>
>> Have you ever checked whether this problem exists on Flink 1.9?
>>
>> Best,
>> Congxian
>>
>>
>> On Fri, Jan 3, 2020 at 3:54 PM vino yang wrote:
>>
>>> Hi Navneeth,
>>>
>>> Did you check whether the path contained in the exception really cannot be
>>> found?
>>>
>>> Best,
>>> Vino
>>>
>>> On Fri, Jan 3, 2020 at 8:23 AM Navneeth Krishnan wrote:
>>>
 Hi All,

 We are running into checkpoint timeout issues more frequently in
 production, and we also see the exception below. We are running Flink 1.4.0
 and the checkpoints are saved on NFS. Can someone suggest how to overcome
 this?

 [image: image.png]

 java.lang.IllegalStateException: Could not initialize operator state 
 backend.
at 
 org.apache.flink.streaming.api.operators.AbstractStreamOperator.initOperatorState(AbstractStreamOperator.java:302)
at 
 org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:249)
at 
 org.apache.flink.streaming.runtime.tasks.StreamTask.initializeOperators(StreamTask.java:692)
at 
 org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:679)
at 
 org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:253)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
at java.lang.Thread.run(Thread.java:748)
 Caused by: java.io.FileNotFoundException: 
 /mnt/checkpoints/02c4f8d5c11921f363b98c5959cc4f06/chk-101/e71d8eaf-ff4a-4783-92bd-77e3d8978e01
  (No such file or directory)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at 
 org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50)


 Thanks




Re: Checkpoints issue and job failing

2020-01-03 Thread Navneeth Krishnan
Thanks Congxian & Vino.

Yes, the file does exist and I don't see any problem in accessing it.

Regarding Flink 1.9, we haven't migrated yet, but we are planning to.
Since we have to test it, it might take some time.

Thanks

On Fri, Jan 3, 2020 at 2:14 AM Congxian Qiu wrote:

> Hi
>
> Have you ever checked whether this problem exists on Flink 1.9?
>
> Best,
> Congxian
>
>
> On Fri, Jan 3, 2020 at 3:54 PM vino yang wrote:
>
>> Hi Navneeth,
>>
>> Did you check whether the path contained in the exception really cannot be
>> found?
>>
>> Best,
>> Vino
>>
>> On Fri, Jan 3, 2020 at 8:23 AM Navneeth Krishnan wrote:
>>
>>> Hi All,
>>>
>>> We are running into checkpoint timeout issues more frequently in
>>> production, and we also see the exception below. We are running Flink 1.4.0
>>> and the checkpoints are saved on NFS. Can someone suggest how to overcome
>>> this?
>>>
>>> [image: image.png]
>>>
>>> java.lang.IllegalStateException: Could not initialize operator state 
>>> backend.
>>> at 
>>> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initOperatorState(AbstractStreamOperator.java:302)
>>> at 
>>> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:249)
>>> at 
>>> org.apache.flink.streaming.runtime.tasks.StreamTask.initializeOperators(StreamTask.java:692)
>>> at 
>>> org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:679)
>>> at 
>>> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:253)
>>> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
>>> at java.lang.Thread.run(Thread.java:748)
>>> Caused by: java.io.FileNotFoundException: 
>>> /mnt/checkpoints/02c4f8d5c11921f363b98c5959cc4f06/chk-101/e71d8eaf-ff4a-4783-92bd-77e3d8978e01
>>>  (No such file or directory)
>>> at java.io.FileInputStream.open0(Native Method)
>>> at java.io.FileInputStream.open(FileInputStream.java:195)
>>> at java.io.FileInputStream.<init>(FileInputStream.java:138)
>>> at 
>>> org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50)
>>>
>>>
>>> Thanks
>>>
>>>


Re: Checkpoints issue and job failing

2020-01-03 Thread Congxian Qiu
Hi

Have you ever checked whether this problem exists on Flink 1.9?

Best,
Congxian


On Fri, Jan 3, 2020 at 3:54 PM vino yang wrote:

> Hi Navneeth,
>
> Did you check whether the path contained in the exception really cannot be
> found?
>
> Best,
> Vino
>
> On Fri, Jan 3, 2020 at 8:23 AM Navneeth Krishnan wrote:
>
>> Hi All,
>>
>> We are running into checkpoint timeout issues more frequently in
>> production, and we also see the exception below. We are running Flink 1.4.0
>> and the checkpoints are saved on NFS. Can someone suggest how to overcome
>> this?
>>
>> [image: image.png]
>>
>> java.lang.IllegalStateException: Could not initialize operator state backend.
>>  at 
>> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initOperatorState(AbstractStreamOperator.java:302)
>>  at 
>> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:249)
>>  at 
>> org.apache.flink.streaming.runtime.tasks.StreamTask.initializeOperators(StreamTask.java:692)
>>  at 
>> org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:679)
>>  at 
>> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:253)
>>  at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
>>  at java.lang.Thread.run(Thread.java:748)
>> Caused by: java.io.FileNotFoundException: 
>> /mnt/checkpoints/02c4f8d5c11921f363b98c5959cc4f06/chk-101/e71d8eaf-ff4a-4783-92bd-77e3d8978e01
>>  (No such file or directory)
>>  at java.io.FileInputStream.open0(Native Method)
>>  at java.io.FileInputStream.open(FileInputStream.java:195)
>>  at java.io.FileInputStream.<init>(FileInputStream.java:138)
>>  at 
>> org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50)
>>
>>
>> Thanks
>>
>>


Re: Checkpoints issue and job failing

2020-01-02 Thread vino yang
Hi Navneeth,

Did you check whether the path contained in the exception really cannot be
found?

Best,
Vino
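
With a file:// checkpoint URI, a restore can only succeed if the checkpoint
directory is mounted at the identical path on the JobManager and every
TaskManager, so the first thing to verify is whether the file named in the
exception is visible and readable on each host. Below is a minimal sketch of
such a check (not from the original thread; the class name is illustrative,
and only the path is copied from the stack trace):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CheckpointPathCheck {

    public static void main(String[] args) {
        // Path copied from the FileNotFoundException in the quoted stack trace.
        Path chkFile = Paths.get(
                "/mnt/checkpoints/02c4f8d5c11921f363b98c5959cc4f06/chk-101/"
                        + "e71d8eaf-ff4a-4783-92bd-77e3d8978e01");

        // Run this on every TaskManager host and on the JobManager: with a
        // file:// checkpoint URI the same NFS mount must be visible at the
        // same path everywhere, and readable by the user running Flink.
        System.out.println("exists:   " + Files.exists(chkFile));
        System.out.println("readable: " + Files.isReadable(chkFile));
    }
}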

On Fri, Jan 3, 2020 at 8:23 AM Navneeth Krishnan wrote:

> Hi All,
>
> We are running into checkpoint timeout issues more frequently in production,
> and we also see the exception below. We are running Flink 1.4.0 and the
> checkpoints are saved on NFS. Can someone suggest how to overcome this?
>
> [image: image.png]
>
> java.lang.IllegalStateException: Could not initialize operator state backend.
>   at 
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initOperatorState(AbstractStreamOperator.java:302)
>   at 
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:249)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.initializeOperators(StreamTask.java:692)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:679)
>   at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:253)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.FileNotFoundException: 
> /mnt/checkpoints/02c4f8d5c11921f363b98c5959cc4f06/chk-101/e71d8eaf-ff4a-4783-92bd-77e3d8978e01
>  (No such file or directory)
>   at java.io.FileInputStream.open0(Native Method)
>   at java.io.FileInputStream.open(FileInputStream.java:195)
>   at java.io.FileInputStream.<init>(FileInputStream.java:138)
>   at 
> org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50)
>
>
> Thanks
>
>


Checkpoints issue and job failing

2020-01-02 Thread Navneeth Krishnan
Hi All,

We are running into checkpoint timeout issues more frequently in production,
and we also see the exception below. We are running Flink 1.4.0 and the
checkpoints are saved on NFS. Can someone suggest how to overcome this?

[image: image.png]

java.lang.IllegalStateException: Could not initialize operator state backend.
at 
org.apache.flink.streaming.api.operators.AbstractStreamOperator.initOperatorState(AbstractStreamOperator.java:302)
at 
org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:249)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.initializeOperators(StreamTask.java:692)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:679)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:253)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException:
/mnt/checkpoints/02c4f8d5c11921f363b98c5959cc4f06/chk-101/e71d8eaf-ff4a-4783-92bd-77e3d8978e01
(No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at 
org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50)


Thanks
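
For the timeout side of the problem, the checkpoint interval, timeout, and
concurrency can be tuned from the job code. Below is a minimal sketch,
assuming a DataStream job that uses FsStateBackend pointed at the NFS mount
from the stack trace; the class name and all interval/timeout values are
illustrative, not taken from this thread:

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTuningSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Keep checkpoints on the NFS mount; the same path must be mounted on
        // the JobManager and every TaskManager for a file:// URI to work.
        env.setStateBackend(new FsStateBackend("file:///mnt/checkpoints"));

        // Checkpoint once a minute, allow up to 30 minutes before a checkpoint
        // is declared timed out (the default is 10 minutes), and leave a pause
        // between attempts so slow NFS writes do not trigger back-to-back
        // checkpoints.
        env.enableCheckpointing(60_000L);
        CheckpointConfig checkpointConfig = env.getCheckpointConfig();
        checkpointConfig.setCheckpointTimeout(30 * 60_000L);
        checkpointConfig.setMinPauseBetweenCheckpoints(30_000L);
        checkpointConfig.setMaxConcurrentCheckpoints(1);

        // ... build the actual job topology here, then:
        // env.execute("checkpoint-tuning-sketch");
    }
}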