Hi Aljoscha,

did you by any chance get around to looking at the problem again? It seems to us that by the time restoreState() is called, the part files have already been renamed to their final names and additional .valid-length files exist (hence there are neither pending nor in-progress files). Furthermore, one of the part files (11 in the example below) is missing completely. We are able to reproduce this behavior by killing a task manager. Can you make sense of that?
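To be explicit about how we read the failure: judging only from the exception quoted further down (we have not dug through the actual RollingSink source), the restore path appears to expect the checkpointed in-progress file to still exist under either its in-progress or its pending name, so a part file that has already been renamed to its final name (or, as for part 11, is gone entirely) would trip exactly this check. A rough sketch of that reading, with made-up helper and path names:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of the restore check as we understand it from the exception message.
// This is NOT the actual RollingSink code; class, method, and path names are
// illustrative only and follow the layout from the listings below.
public class RestoreCheckSketch {
    static void restorePartFile(FileSystem fs, String dir, int subtask) throws Exception {
        Path inProgress = new Path(dir + "/_part-" + subtask + "-0.in-progress");
        Path pending = new Path(dir + "/_part-" + subtask + "-0.pending");

        if (fs.exists(inProgress)) {
            // still in progress: cut the file back to the checkpointed valid length
            // (or record it in a .valid-length file when truncation is unavailable)
        } else if (fs.exists(pending)) {
            // already moved to pending: finalize it
        } else {
            // what we observe: the file is already under its final name (or missing),
            // and restore fails with the exception quoted in the thread below
            throw new RuntimeException("In-Progress file " + inProgress
                    + " was neither moved to pending nor is still in progress.");
        }
    }
}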
Cheers,
Max

—
Maximilian Bode * Junior Consultant * [email protected]
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
Sitz: Unterföhring * Amtsgericht München * HRB 135082

> On 03 Mar 2016, at 15:04, Maximilian Bode <[email protected]> wrote:
>
> Hi Aljoscha,
>
> thank you for the fast answer. The files in HDFS change as follows:
>
> -before task manager is killed:
> [user@host user]$ hdfs dfs -ls -R /hdfs/dir/outbound
> -rw-r--r-- 2 user hadoop 2435461 2016-03-03 14:51 /hdfs/dir/outbound/_part-0-0.in-progress
> -rw-r--r-- 2 user hadoop 2404604 2016-03-03 14:51 /hdfs/dir/outbound/_part-1-0.in-progress
> -rw-r--r-- 2 user hadoop 2453796 2016-03-03 14:51 /hdfs/dir/outbound/_part-10-0.in-progress
> -rw-r--r-- 2 user hadoop 2447794 2016-03-03 14:51 /hdfs/dir/outbound/_part-11-0.in-progress
> -rw-r--r-- 2 user hadoop 2453479 2016-03-03 14:51 /hdfs/dir/outbound/_part-2-0.in-progress
> -rw-r--r-- 2 user hadoop 2512413 2016-03-03 14:51 /hdfs/dir/outbound/_part-3-0.in-progress
> -rw-r--r-- 2 user hadoop 2431290 2016-03-03 14:51 /hdfs/dir/outbound/_part-4-0.in-progress
> -rw-r--r-- 2 user hadoop 2457820 2016-03-03 14:51 /hdfs/dir/outbound/_part-5-0.in-progress
> -rw-r--r-- 2 user hadoop 2430218 2016-03-03 14:51 /hdfs/dir/outbound/_part-6-0.in-progress
> -rw-r--r-- 2 user hadoop 2482970 2016-03-03 14:51 /hdfs/dir/outbound/_part-7-0.in-progress
> -rw-r--r-- 2 user hadoop 2452622 2016-03-03 14:51 /hdfs/dir/outbound/_part-8-0.in-progress
> -rw-r--r-- 2 user hadoop 2467733 2016-03-03 14:51 /hdfs/dir/outbound/_part-9-0.in-progress
>
> -shortly after task manager is killed:
> [user@host user]$ hdfs dfs -ls -R /hdfs/dir/outbound
> -rw-r--r-- 2 user hadoop 12360198 2016-03-03 14:52 /hdfs/dir/outbound/_part-0-0.pending
> -rw-r--r-- 2 user hadoop 9350654 2016-03-03 14:52 /hdfs/dir/outbound/_part-1-0.in-progress
> -rw-r--r-- 2 user hadoop 9389872 2016-03-03 14:52 /hdfs/dir/outbound/_part-10-0.in-progress
> -rw-r--r-- 2 user hadoop 12236015 2016-03-03 14:52 /hdfs/dir/outbound/_part-11-0.pending
> -rw-r--r-- 2 user hadoop 12256483 2016-03-03 14:52 /hdfs/dir/outbound/_part-2-0.pending
> -rw-r--r-- 2 user hadoop 12261730 2016-03-03 14:52 /hdfs/dir/outbound/_part-3-0.pending
> -rw-r--r-- 2 user hadoop 9418913 2016-03-03 14:52 /hdfs/dir/outbound/_part-4-0.in-progress
> -rw-r--r-- 2 user hadoop 12176987 2016-03-03 14:52 /hdfs/dir/outbound/_part-5-0.pending
> -rw-r--r-- 2 user hadoop 12165782 2016-03-03 14:52 /hdfs/dir/outbound/_part-6-0.pending
> -rw-r--r-- 2 user hadoop 9474037 2016-03-03 14:52 /hdfs/dir/outbound/_part-7-0.in-progress
> -rw-r--r-- 2 user hadoop 12136347 2016-03-03 14:52 /hdfs/dir/outbound/_part-8-0.pending
> -rw-r--r-- 2 user hadoop 12305943 2016-03-03 14:52 /hdfs/dir/outbound/_part-9-0.pending
>
> -still a bit later:
> [user@host user]$ hdfs dfs -ls -R /hdfs/dir/outbound
> -rw-r--r-- 2 user hadoop 9 2016-03-03 14:52 /hdfs/dir/outbound/_part-0-0.valid-length
> -rw-r--r-- 2 user hadoop 8026667 2016-03-03 14:52 /hdfs/dir/outbound/_part-0-1.pending
> -rw-r--r-- 2 user hadoop 9 2016-03-03 14:52 /hdfs/dir/outbound/_part-1-0.valid-length
> -rw-r--r-- 2 user hadoop 8097752 2016-03-03 14:52 /hdfs/dir/outbound/_part-1-1.pending
> -rw-r--r-- 2 user hadoop 9 2016-03-03 14:52 /hdfs/dir/outbound/_part-10-0.valid-length
> -rw-r--r-- 2 user hadoop 7109259 2016-03-03 14:52 /hdfs/dir/outbound/_part-10-1.pending
> -rw-r--r-- 2 user hadoop 0 2016-03-03 14:52 /hdfs/dir/outbound/_part-2-0.valid-length
> -rw-r--r-- 2 user hadoop 9 2016-03-03 14:52 /hdfs/dir/outbound/_part-3-0.valid-length
> -rw-r--r-- 2 user hadoop 7974655 2016-03-03 14:52 /hdfs/dir/outbound/_part-3-1.pending
> -rw-r--r-- 2 user hadoop 9 2016-03-03 14:52 /hdfs/dir/outbound/_part-4-0.valid-length
> -rw-r--r-- 2 user hadoop 5753163 2016-03-03 14:52 /hdfs/dir/outbound/_part-4-1.pending
> -rw-r--r-- 2 user hadoop 0 2016-03-03 14:52 /hdfs/dir/outbound/_part-5-0.valid-length
> -rw-r--r-- 2 user hadoop 9 2016-03-03 14:52 /hdfs/dir/outbound/_part-6-0.valid-length
> -rw-r--r-- 2 user hadoop 7904468 2016-03-03 14:52 /hdfs/dir/outbound/_part-6-1.pending
> -rw-r--r-- 2 user hadoop 9 2016-03-03 14:52 /hdfs/dir/outbound/_part-7-0.valid-length
> -rw-r--r-- 2 user hadoop 7477004 2016-03-03 14:52 /hdfs/dir/outbound/_part-7-1.pending
> -rw-r--r-- 2 user hadoop 0 2016-03-03 14:52 /hdfs/dir/outbound/_part-8-0.valid-length
> -rw-r--r-- 2 user hadoop 9 2016-03-03 14:52 /hdfs/dir/outbound/_part-9-0.valid-length
> -rw-r--r-- 2 user hadoop 7979307 2016-03-03 14:52 /hdfs/dir/outbound/_part-9-1.pending
> -rw-r--r-- 2 user hadoop 12360198 2016-03-03 14:52 /hdfs/dir/outbound/part-0-0
> -rw-r--r-- 2 user hadoop 9350654 2016-03-03 14:52 /hdfs/dir/outbound/part-1-0
> -rw-r--r-- 2 user hadoop 9389872 2016-03-03 14:52 /hdfs/dir/outbound/part-10-0
> -rw-r--r-- 2 user hadoop 12256483 2016-03-03 14:52 /hdfs/dir/outbound/part-2-0
> -rw-r--r-- 2 user hadoop 12261730 2016-03-03 14:52 /hdfs/dir/outbound/part-3-0
> -rw-r--r-- 2 user hadoop 9418913 2016-03-03 14:52 /hdfs/dir/outbound/part-4-0
> -rw-r--r-- 2 user hadoop 12176987 2016-03-03 14:52 /hdfs/dir/outbound/part-5-0
> -rw-r--r-- 2 user hadoop 12165782 2016-03-03 14:52 /hdfs/dir/outbound/part-6-0
> -rw-r--r-- 2 user hadoop 9474037 2016-03-03 14:52 /hdfs/dir/outbound/part-7-0
> -rw-r--r-- 2 user hadoop 12136347 2016-03-03 14:52 /hdfs/dir/outbound/part-8-0
> -rw-r--r-- 2 user hadoop 12305943 2016-03-03 14:52 /hdfs/dir/outbound/part-9-0
>
> Can you see from this what is going wrong?
>
> Cheers,
> Max
> —
> Maximilian Bode * Junior Consultant * [email protected]
> TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
> Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
> Sitz: Unterföhring * Amtsgericht München * HRB 135082
>
>> On 03 Mar 2016, at 14:50, Aljoscha Krettek <[email protected]> wrote:
>>
>> Hi,
>> did you check whether there are any files at your specified HDFS output
>> location? If yes, which files are there?
>>
>> Cheers,
>> Aljoscha
>>
>>> On 03 Mar 2016, at 14:29, Maximilian Bode <[email protected]> wrote:
>>>
>>> Just for the sake of completeness: this also happens when killing a task
>>> manager and is therefore probably unrelated to job manager HA.
>>>
>>>> On 03 Mar 2016, at 14:17, Maximilian Bode <[email protected]> wrote:
>>>>
>>>> Hi everyone,
>>>>
>>>> unfortunately, I am running into another problem trying to establish
>>>> exactly once guarantees (Kafka -> Flink 1.0.0-rc3 -> HDFS).
>>>>
>>>> When using
>>>>
>>>> RollingSink<Tuple3<Integer,Integer,String>> sink =
>>>>     new RollingSink<Tuple3<Integer,Integer,String>>("hdfs://our.machine.com:8020/hdfs/dir/outbound");
>>>> sink.setBucketer(new NonRollingBucketer());
>>>> output.addSink(sink);
>>>>
>>>> and then killing the job manager, the new job manager is unable to restore
>>>> the old state throwing
>>>> ---
>>>> java.lang.Exception: Could not restore checkpointed state to operators and functions
>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreState(StreamTask.java:454)
>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:209)
>>>>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
>>>>     at java.lang.Thread.run(Thread.java:744)
>>>> Caused by: java.lang.Exception: Failed to restore state to function: In-Progress file hdfs://our.machine.com:8020/hdfs/dir/outbound/part-5-0 was neither moved to pending nor is still in progress.
>>>>     at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.restoreState(AbstractUdfStreamOperator.java:168)
>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreState(StreamTask.java:446)
>>>>     ... 3 more
>>>> Caused by: java.lang.RuntimeException: In-Progress file hdfs://our.machine.com:8020/hdfs/dir/outbound/part-5-0 was neither moved to pending nor is still in progress.
>>>>     at org.apache.flink.streaming.connectors.fs.RollingSink.restoreState(RollingSink.java:686)
>>>>     at org.apache.flink.streaming.connectors.fs.RollingSink.restoreState(RollingSink.java:122)
>>>>     at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.restoreState(AbstractUdfStreamOperator.java:165)
>>>>     ... 4 more
>>>> ---
>>>> I found a resolved issue [1] concerning Hadoop 2.7.1. We are in fact using
>>>> 2.4.0 – might this be the same issue?
>>>>
>>>> Another thing I could think of is that the job is not configured correctly
>>>> and there is some sort of timing issue. The checkpoint interval is 10
>>>> seconds, everything else was left at default value. Then again, as the
>>>> NonRollingBucketer is used, there should not be any timing issues, right?
>>>>
>>>> Cheers,
>>>> Max
>>>>
>>>> [1] https://issues.apache.org/jira/browse/FLINK-2979
>>>>
>>>> —
>>>> Maximilian Bode * Junior Consultant * [email protected]
>>>> TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
>>>> Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
>>>> Sitz: Unterföhring * Amtsgericht München * HRB 135082
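PS: For completeness, this is roughly how the job around the RollingSink snippet quoted above is wired up. This is a minimal sketch rather than our production code: the Kafka 0.8 consumer, topic name, broker/ZooKeeper addresses and group id are placeholders, and the real job emits Tuple3<Integer,Integer,String> rather than plain strings.

import java.util.Properties;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.fs.NonRollingBucketer;
import org.apache.flink.streaming.connectors.fs.RollingSink;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer08;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class RollingSinkJobSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // checkpoint every 10 seconds, everything else left at defaults
        env.enableCheckpointing(10000);

        // placeholder Kafka settings
        Properties kafkaProps = new Properties();
        kafkaProps.setProperty("bootstrap.servers", "broker:9092");
        kafkaProps.setProperty("zookeeper.connect", "zookeeper:2181");
        kafkaProps.setProperty("group.id", "outbound-writer");

        DataStream<String> output = env.addSource(
                new FlinkKafkaConsumer08<String>("outbound-topic", new SimpleStringSchema(), kafkaProps));

        // same sink setup as in the quoted message above
        RollingSink<String> sink =
                new RollingSink<String>("hdfs://our.machine.com:8020/hdfs/dir/outbound");
        sink.setBucketer(new NonRollingBucketer());
        output.addSink(sink);

        env.execute("Kafka -> RollingSink");
    }
}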