Re: pyflink keyed stream checkpoint error

2021-10-14 Thread Dian Fu
Hi Curt, Could you try if it works by reducing python.fn-execution.bundle.size to 1000 or 100? Regards, Dian On Thu, Oct 14, 2021 at 2:47 AM Curt Buechter wrote: > Hi guys, > I'm still running into this problem. I checked the logs, and there is no > evidence that the python process crashed. I
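A minimal sketch of how the suggested setting might be written down, assuming it ends up as a flat `key: value` line in flink-conf.yaml. The helper `render_flink_conf` is hypothetical and only formats text; the option name and the values 1000 and 100 come from the message above.

```python
# Sketch only: renders "key: value" lines in the flat style used by flink-conf.yaml.
# render_flink_conf is a hypothetical helper, not part of Flink or PyFlink.

def render_flink_conf(options):
    """Return one 'key: value' line per option, sorted by key."""
    return "\n".join(f"{key}: {value}" for key, value in sorted(options.items()))


# Values suggested in the thread: try 1000 first, then 100.
conf_text = render_flink_conf({"python.fn-execution.bundle.size": 1000})
print(conf_text)
```

The same key could also be set programmatically in the job, but the exact call depends on the PyFlink version and API in use, so it is left out here.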

Re: pyflink keyed stream checkpoint error

2021-10-13 Thread Curt Buechter
Hi guys, I'm still running into this problem. I checked the logs, and there is no evidence that the python process crashed. I checked the process IDs and they are still active after the error. No `killed process` messages in /var/log/messages. I don't think it's necessarily related to checkpointin

Re: pyflink keyed stream checkpoint error

2021-09-23 Thread Curt Buechter
Guess my last reply didn't go through, so here goes again... Possibly, but I don't think so. Since I submitted this, I have done some more testing. It works fine with file system or memory state backends, but not with rocksdb. I will try again and check the logs, though. I've also tested rocksdb c

Re: pyflink keyed stream checkpoint error

2021-09-23 Thread Dian Fu
PS: there is more information about this configuration in https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/dev/python/python_config/#python-fn-execution-bundle-size > On Sep 24, 2021, at 10:07 AM, Dian Fu wrote: > > I agree with Roman that it seems that the Python process has crashed. > > Be

Re: pyflink keyed stream checkpoint error

2021-09-23 Thread Dian Fu
I agree with Roman that the Python process seems to have crashed. Besides Roman's suggestions, you could also try setting the bundle size to a smaller value via “python.fn-execution.bundle.size”. Regards, Dian > On Sep 24, 2021, at 3:48 AM, Roman Khachatryan wrote: > > Hi, > > Is it

Re: pyflink keyed stream checkpoint error

2021-09-23 Thread Roman Khachatryan
Hi, Is it possible that the python process crashed or hung up? (probably performing a snapshot) Could you validate this by checking the OS logs for OOM killer messages or process status? Regards, Roman On Wed, Sep 22, 2021 at 6:30 PM Curt Buechter wrote: > > Hi, > I'm getting an error after ena
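The two checks Roman describes could be scripted roughly as follows. This is a sketch under assumptions: the regular expression guesses at how the kernel words OOM-killer messages (the wording varies by kernel version), the sample log lines are made up for illustration, and the process check relies on POSIX signal 0 semantics.

```python
import os
import re

# Assumption: OOM kills show up as lines containing "Out of memory" or
# "Killed process"; the exact wording differs between kernel versions.
OOM_PATTERN = re.compile(r"out of memory|killed process", re.IGNORECASE)


def find_oom_lines(log_lines):
    """Return the log lines that look like OOM-killer messages."""
    return [line.rstrip("\n") for line in log_lines if OOM_PATTERN.search(line)]


def pid_is_alive(pid):
    """Check whether a process exists; signal 0 delivers nothing (POSIX)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


# Made-up sample lines for illustration, not taken from the thread.
sample = [
    "Sep 22 16:18:10 host kernel: Out of memory: Killed process 4242 (python)\n",
    "Sep 22 16:18:14 host systemd: Started Session 12 of user flink.\n",
]
print(find_oom_lines(sample))
print(pid_is_alive(os.getpid()))
```

To scan a real log, pass an open file object (for example `/var/log/messages`) to `find_oom_lines`, and pass the PID of the Python worker to `pid_is_alive`.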

pyflink keyed stream checkpoint error

2021-09-22 Thread Curt Buechter
Hi, I'm getting an error after enabling checkpointing in my pyflink application that uses a keyed stream and rocksdb state. Here is the error message: 2021-09-22 16:18:14,408 INFO org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend [] - Closed RocksDB State Backend. Cleaning up Rock

Re: Checkpoint error - "The job has failed"

2021-04-28 Thread Dan Hill
3 not 1.11.1. > > [1] https://issues.apache.org/jira/browse/FLINK-16753 > > Best > Yun Tang > -- > *From:* Dan Hill > *Sent:* Tuesday, April 27, 2021 7:50 > *To:* Yun Tang > *Cc:* Robert Metzger ; user > *Subject:* Re: Checkpoint error - "The jo

Re: Checkpoint error - "The job has failed"

2021-04-28 Thread Yun Tang
n Tang Cc: Robert Metzger ; user Subject: Re: Checkpoint error - "The job has failed" Hey Yun and Robert, I'm using Flink v1.11.1. Robert, I'll send you a separate email with the logs. On Mon, Apr 26, 2021 at 12:46 AM Yun Tang wrote: Hi Dan,

Re: Checkpoint error - "The job has failed"

2021-04-26 Thread Dan Hill
Flink-1.10.3. > > > [1] https://issues.apache.org/jira/browse/FLINK-16753 > > Best > Yun Tang > -- > *From:* Robert Metzger > *Sent:* Monday, April 26, 2021 14:46 > *To:* Dan Hill > *Cc:* user > *Subject:* Re: Checkpoint error - "

Re: Checkpoint error - "The job has failed"

2021-04-26 Thread Yun Tang
Hill Cc: user Subject: Re: Checkpoint error - "The job has failed" Hi Dan, can you provide me with the JobManager logs to take a look as well? (This will also tell me which Flink version you are using) On Mon, Apr 26, 2021 at 7:20 AM Dan Hill

Re: Checkpoint error - "The job has failed"

2021-04-25 Thread Robert Metzger
Hi Dan, can you provide me with the JobManager logs to take a look as well? (This will also tell me which Flink version you are using) On Mon, Apr 26, 2021 at 7:20 AM Dan Hill wrote: > My Flink job failed to checkpoint with a "The job has failed" error. The > logs contained no other recent e

Checkpoint error - "The job has failed"

2021-04-25 Thread Dan Hill
My Flink job failed to checkpoint with a "The job has failed" error. The logs contained no other recent errors. I keep hitting the error even if I cancel the jobs and restart them. When I restarted my jobmanager and taskmanager, the error went away. What error am I hitting? It looks like there

Re: Re: Re: Checkpoint Error

2021-03-10 Thread Till Rohrmann
og? > > Also, have you enabled concurrent checkpoint? > > Best, > Yun > > > --Original Mail -- > *Sender:*Navneeth Krishnan > *Send Date:*Mon Mar 8 13:10:46 2021 > *Recipients:*Yun Gao > *CC:*user > *Subject:*Re: Re: Checkpoint

Re: Re: Re: Checkpoint Error

2021-03-08 Thread Yun Gao
:46 2021 Recipients:Yun Gao CC:user Subject:Re: Re: Checkpoint Error Hi Yun, Thanks for the response. I checked the mounts and only the JM's and TM's are mounted with this EFS. Not sure how to debug this. Thanks On Sun, Mar 7, 2021 at 8:29 PM Yun Gao wrote: Hi Navneeth, It seem

Re: Re: Checkpoint Error

2021-03-07 Thread Navneeth Krishnan
*Navneeth Krishnan > *Send Date:*Sun Mar 7 15:44:59 2021 > *Recipients:*user > *Subject:*Re: Checkpoint Error > >> Hi All, >> >> Any suggestions? >> >> Thanks >> >> On Mon, Jan 18, 2021 at 7:38 PM Navneeth Krishnan < >> reachnavnee.

Re: Re: Checkpoint Error

2021-03-07 Thread Yun Gao
--Original Mail -- Sender:Navneeth Krishnan Send Date:Sun Mar 7 15:44:59 2021 Recipients:user Subject:Re: Checkpoint Error Hi All, Any suggestions? Thanks On Mon, Jan 18, 2021 at 7:38 PM Navneeth Krishnan wrote: Hi All, We are running our streaming job on flink 1.7.2 and we are

Re: Checkpoint Error

2021-03-06 Thread Navneeth Krishnan
Hi All, Any suggestions? Thanks On Mon, Jan 18, 2021 at 7:38 PM Navneeth Krishnan wrote: > Hi All, > > We are running our streaming job on flink 1.7.2 and we are noticing the > below error. Not sure what's causing it, any pointers would help. We have > 10 TM's checkpointing to AWS EFS. > > Asy

Checkpoint Error

2021-01-18 Thread Navneeth Krishnan
Hi All, We are running our streaming job on flink 1.7.2 and we are noticing the below error. Not sure what's causing it, any pointers would help. We have 10 TM's checkpointing to AWS EFS. AsynchronousException{java.lang.Exception: Could not materialize checkpoint 11 for operator Processor -> Sink

Re: Checkpoint Error Because "Could not find any valid local directory for s3ablock-0001"

2020-04-24 Thread Robert Metzger
Thanks for opening the ticket. I've asked a committer who knows the streaming sink well to take a look at the ticket. On Fri, Apr 24, 2020 at 6:47 AM Lu Niu wrote: > Hi, Robert > > BTW, I did some field study and I think it's possible to support streaming > sink using presto s3 filesystem. I thi

Re: Checkpoint Error Because "Could not find any valid local directory for s3ablock-0001"

2020-04-23 Thread Lu Niu
Hi, Robert BTW, I did some field study and I think it's possible to support the streaming sink using the presto s3 filesystem. I think that would help users use the presto s3 fs for all access to s3. I created this jira ticket https://issues.apache.org/jira/browse/FLINK-17364 . What do you think? Best Lu O

Re: Checkpoint Error Because "Could not find any valid local directory for s3ablock-0001"

2020-04-21 Thread Lu Niu
Cool, thanks! On Tue, Apr 21, 2020 at 4:51 AM Robert Metzger wrote: > I'm not aware of anything. I think the presto s3 file system is generally > the recommended S3 FS implementation. > > On Mon, Apr 13, 2020 at 11:46 PM Lu Niu wrote: > >> Thank you both. Given the debug overhead, I might just

Re: Checkpoint Error Because "Could not find any valid local directory for s3ablock-0001"

2020-04-21 Thread Robert Metzger
I'm not aware of anything. I think the presto s3 file system is generally the recommended S3 FS implementation. On Mon, Apr 13, 2020 at 11:46 PM Lu Niu wrote: > Thank you both. Given the debug overhead, I might just try out presto s3 > file system then. Besides that presto s3 file system doesn't

Re: Checkpoint Error Because "Could not find any valid local directory for s3ablock-0001"

2020-04-13 Thread Lu Niu
Thank you both. Given the debug overhead, I might just try out presto s3 file system then. Besides that presto s3 file system doesn't support streaming sink, is there anything else I need to keep in mind? Thanks! Best Lu On Thu, Apr 9, 2020 at 12:29 AM Robert Metzger wrote: > Hey, > Others have

Re: Checkpoint Error Because "Could not find any valid local directory for s3ablock-0001"

2020-04-09 Thread Robert Metzger
Hey, Others have experienced this as well, yes: https://lists.apache.org/thread.html/5cfb48b36e2aa2b91b2102398ddf561877c28fdbabfdb59313965f0a%40%3Cuser.flink.apache.org%3EDiskErrorException I have also notified the Hadoop project about this issue: https://issues.apache.org/jira/browse/HADOOP-15915

Re: Checkpoint Error Because "Could not find any valid local directory for s3ablock-0001"

2020-04-08 Thread Congxian Qiu
Hi Lu, I'm not familiar with the S3 file system. Maybe others in the Flink community can help you with this case, or you could reach out to the S3 teams/community for help. Best, Congxian On Wed, Apr 8, 2020 at 11:05 AM, Lu Niu wrote: > Hi, Congxiao > > Thanks for replying. yeah, I also found those references. How

Re: Checkpoint Error Because "Could not find any valid local directory for s3ablock-0001"

2020-04-07 Thread Lu Niu
Hi, Congxiao Thanks for replying. Yeah, I also found those references. However, as I mentioned in the original post, there is enough capacity on all disks. Also, when I switch to the presto file system, the problem goes away. I'm wondering whether others have encountered a similar issue. Best Lu On Tue, Apr 7, 2020 a

Re: Checkpoint Error Because "Could not find any valid local directory for s3ablock-0001"

2020-04-07 Thread Congxian Qiu
Hi, From the stack, it seems the problem is "org.apache.flink.fs.shaded.hadoop3.org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for s3ablock-0001-". I googled the exception and found some related pages [1]. Could you please make sure there

Checkpoint Error Because "Could not find any valid local directory for s3ablock-0001"

2020-04-07 Thread Lu Niu
Hi, flink users Did anyone encounter such error? The error comes from S3AFileSystem. But there is no capacity issue on any disk. we are using hadoop 2.7.1. ``` Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Could not open output stream for state backend at java.u

Re: When I use flink 1.9.1 and produce data to Kafka 1.1.1, The streamTask checkpoint error .

2020-01-15 Thread Yun Tang
Hi, The root cause is a checkpoint error caused by a failure to send data to Kafka during 'preCommit'. The right fix is to make sure the data is sent to Kafka successfully, which may be a matter for Kafka itself. If you cannot guarantee the state of Kafka through its client and do not require exactly-once, yo

Re: When I use flink 1.9.1 and produce data to Kafka 1.1.1, The streamTask checkpoint error .

2020-01-15 Thread jose farfan
Hi I have the same issue. BR Jose On Thu, 9 Jan 2020 at 10:28, ouywl wrote: > Hi all: > When I use flink 1.9.1 and produce data to Kafka 1.1.1. the error was > happen as* log-1,code is::* > > input.addSink( > new FlinkKafkaProducer( > parameterTool.getRequired("bootstra

When I use flink 1.9.1 and produce data to Kafka 1.1.1, The streamTask checkpoint error .

2020-01-09 Thread ouywl
Hi all: When I use Flink 1.9.1 to produce data to Kafka 1.1.1, the error shown in log-1 occurs. The code is: input.addSink(new FlinkKafkaProducer(parameterTool.getRequired("bootstrap.servers"),parameterTool.getRequired("output-topic"),

Re: Flink 1.7 job cluster (restore from checkpoint error)

2018-12-06 Thread Hao Sun
Thanks for the tip! I did change the jobGraph this time. Hao Sun Team Lead 1019 Market St. 7F San Francisco, CA 94103 On Thu, Dec 6, 2018 at 2:47 AM Till Rohrmann wrote: > Hi Hao, > > if Flink tries to recover from a checkpoint, then the JobGraph should not > be modified and the system should

Re: Flink 1.7 job cluster (restore from checkpoint error)

2018-12-06 Thread Till Rohrmann
Hi Hao, if Flink tries to recover from a checkpoint, then the JobGraph should not be modified and the system should be able to restore the state. Have you changed the JobGraph and are you now trying to recover from the latest checkpoint which is stored in ZooKeeper? If so, then you can also start

Re: Flink 1.7 job cluster (restore from checkpoint error)

2018-12-05 Thread Hao Sun
Till, Flink is automatically trying to recover from a checkpoint not savepoint. How can I get allowNonRestoredState applied in this case? Hao Sun Team Lead 1019 Market St. 7F San Francisco, CA 94103 On Wed, Dec 5, 2018 at 10:09 AM Till Rohrmann wrote: > Hi Hao, > > I think you need to provide

Re: Flink 1.7 job cluster (restore from checkpoint error)

2018-12-05 Thread Till Rohrmann
Hi Hao, I think you need to provide a savepoint file via --fromSavepoint to resume from in order to specify --allowNonRestoredState. Otherwise this option will be ignored because it only works if you resume from a savepoint. Cheers, Till On Wed, Dec 5, 2018 at 12:29 AM Hao Sun wrote: > I am us
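Till's point, that --allowNonRestoredState only has an effect together with --fromSavepoint, can be captured in a small argument builder. This is a sketch, assuming the `docker-entrypoint.sh job-cluster -j <class>` start command quoted in the original post of this thread; the function name, the job class `com.example.MyJob`, and the savepoint path are hypothetical placeholders.

```python
def job_cluster_args(job_class, savepoint_path=None, allow_non_restored_state=False):
    """Build the job-cluster argument list, refusing the ignored flag combination."""
    args = ["docker-entrypoint.sh", "job-cluster", "-j", job_class]
    if savepoint_path is not None:
        args += ["--fromSavepoint", savepoint_path]
    if allow_non_restored_state:
        if savepoint_path is None:
            # Per the reply above, the flag is ignored without a savepoint.
            raise ValueError(
                "--allowNonRestoredState only takes effect together with --fromSavepoint"
            )
        args.append("--allowNonRestoredState")
    return args


# Hypothetical job class and savepoint location, for illustration only.
print(" ".join(job_cluster_args(
    "com.example.MyJob",
    savepoint_path="s3://my-bucket/savepoints/savepoint-abc123",
    allow_non_restored_state=True,
)))
```

Raising an error instead of silently dropping the flag makes the misconfiguration visible at deploy time rather than during recovery.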

Flink 1.7 job cluster (restore from checkpoint error)

2018-12-04 Thread Hao Sun
I am using 1.7 and job cluster on k8s. Here is how I start my job docker-entrypoint.sh job-cluster -j com.zendesk.fraud_prevention.examples.ConnectedStreams --allowNonRestoredState *Seems like --allowNonRestoredState is not honored* === Logs === java","line":"1041","message":"Restoring

Re: Checkpoint Error in flink with Rockdb state backend

2016-05-29 Thread Aljoscha Krettek
Ah yes, if you used a local filesystem for backups this certainly was the source of the problem. On Sun, 29 May 2016 at 17:57 arpit srivastava wrote: > I think the problem was that i was using local filesystem in a cluster. > Now I have switched to hdfs. > > Thanks, > Arpit > > On Sun, May 29, 2

Re: Checkpoint Error in flink with Rockdb state backend

2016-05-29 Thread arpit srivastava
I think the problem was that I was using a local filesystem in a cluster. Now I have switched to HDFS. Thanks, Arpit On Sun, May 29, 2016 at 12:57 PM, Aljoscha Krettek wrote: > Hi, > could you please provide the code of your user function that has the > Checkpointed interface and is keeping state
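A quick sanity check for this class of mistake might look like the sketch below, which flags checkpoint locations that point at a node-local filesystem. The helper name and the example URIs are hypothetical; the only assumption about Flink is that the checkpoint directory is given as a URI whose scheme identifies the filesystem.

```python
from urllib.parse import urlparse

# A missing scheme or "file" means a path on the local disk of each node.
LOCAL_SCHEMES = {"", "file"}


def is_local_checkpoint_uri(uri):
    """Return True when the checkpoint URI refers to a node-local filesystem."""
    return urlparse(uri).scheme in LOCAL_SCHEMES


# Hypothetical example locations, for illustration only.
for uri in ["file:///tmp/checkpoints", "/tmp/checkpoints", "hdfs://namenode:8020/flink/checkpoints"]:
    print(uri, is_local_checkpoint_uri(uri))
```

On a multi-node cluster, a location for which this returns True is not visible to the other nodes, which matches the failure described in this thread.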

Re: Checkpoint Error in flink with Rockdb state backend

2016-05-29 Thread Aljoscha Krettek
Hi, could you please provide the code of your user function that has the Checkpointed interface and is keeping state? This might give people a chance of understanding what is going on. Cheers, Aljoscha On Sat, 28 May 2016 at 20:55 arpit srivastava wrote: > Hi, > > I am using Flink on yarn clust

Checkpoint Error in flink with Rockdb state backend

2016-05-28 Thread arpit srivastava
Hi, I am using Flink on yarn cluster. My job was running for 2-3 days. After that it failed with two errors org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Error at remote task manager 'ip-xx.xx.xx.xxx'. at org.apache.flink.runtime.io.network.netty.PartitionR