Re: Regarding BucketingSink

2018-04-21 Thread Aljoscha Krettek
I would expect that to be possible as well, yes. > On 21. Apr 2018, at 17:33, Vishal Santoshi wrote: > > >> After the savepoint state has been written, the sink might start new > >> .in-progress files. These files are not part of the savepoint but renamed > >> to .pending in close(). > >> On r

Re: Regarding BucketingSink

2018-04-21 Thread Vishal Santoshi
>> After the savepoint state has been written, the sink might start new .in-progress files. These files are not part of the savepoint but renamed to .pending in close(). >> On restore all pending files that are part of the savepoint are moved into final state (and possibly truncated). See handlePen

Re: Regarding BucketingSink

2018-02-21 Thread Vishal Santoshi
Thank you Fabian, What is more important (and I think you might have addressed it in your post, so sorry for being a little obtuse) is that deleting them does not violate "at-least-once" delivery. And if that is definite, then we are fine with it, though we will test it further. Thanks and

Re: Regarding BucketingSink

2018-02-21 Thread Fabian Hueske
Hi Vishal, hi Mu, After the savepoint state has been written, the sink might start new .in-progress files. These files are not part of the savepoint but renamed to .pending in close(). On restore all pending files that are part of the savepoint are moved into final state (and possibly truncated).
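The restore behavior Fabian describes can be pictured with a small toy model. This is only a sketch under stated assumptions, not Flink code: the function name `split_pending_on_restore` and the example file names are hypothetical, and the idea that pending files not recorded in the savepoint are simply left untouched is an inference from the description above.

```python
from typing import Iterable, List, Tuple


def split_pending_on_restore(
    pending_on_disk: Iterable[str],
    pending_in_savepoint: Iterable[str],
) -> Tuple[List[str], List[str]]:
    """Split on-disk pending files into (finalized on restore, left pending).

    Toy model only: a pending file is treated as finalized on restore
    if and only if the savepoint recorded it.
    """
    recorded = set(pending_in_savepoint)
    finalized: List[str] = []
    left_pending: List[str] = []
    for name in sorted(pending_on_disk):
        if name in recorded:
            finalized.append(name)
        else:
            left_pending.append(name)
    return finalized, left_pending


# Hypothetical file names: one pending file was recorded in the savepoint,
# the other was started afterwards and renamed to .pending in close().
finalized, left_pending = split_pending_on_restore(
    pending_on_disk=["_part-0-20.pending", "_part-0-21.pending"],
    pending_in_savepoint=["_part-0-20.pending"],
)
print("finalized on restore:", finalized)
print("left pending:", left_pending)
```

In this model the second file is the kind of dangling pending file discussed elsewhere in the thread.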

Re: Regarding BucketingSink

2018-02-20 Thread Mu Kong
Hi Aljoscha, Thanks for confirming the fact that Flink doesn't clean up pending files. Is it safe to clean (remove) all the pending files after cancel (w/ or w/o savepointing) or failover? If we do that, will we lose some data? Thanks! Best, Mu On Wed, Feb 21, 2018 at 5:27 AM, Vishal Santos

Re: Regarding BucketingSink

2018-02-20 Thread Vishal Santoshi
Sorry, but just wanted to confirm that the "at-least-once" delivery assertion holds true if there is a dangling pending file? On Mon, Feb 19, 2018 at 2:21 PM, Vishal Santoshi wrote: > That is fine, as long as Flink assures at-least-once semantics? > > If the contents of a .pending file, through the turbu

Re: Regarding BucketingSink

2018-02-19 Thread Vishal Santoshi
That is fine, as long as Flink assures at-least-once semantics? If the contents of a .pending file, through the turbulence (restarts etc.), are assured to be in another file, then anything starting with "_" (underscore) will by default be ignored by Hadoop (Hive, MR, etc.). On Mon, Feb 19, 2018 at 11:03
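The point about underscore-prefixed files being ignored can be illustrated with a small filter. This is a minimal sketch that mimics the convention rather than calling any Hadoop API: the function name `visible_to_hadoop_input` and the plain `part-0-28` path are hypothetical, and treating a leading "." as hidden as well is an assumption added here, since the message above only mentions "_".

```python
import posixpath
from typing import Iterable, List


def visible_to_hadoop_input(paths: Iterable[str]) -> List[str]:
    """Keep only paths whose file name does not start with "_" or "."."""
    return [
        p for p in paths
        if not posixpath.basename(p).startswith(("_", "."))
    ]


paths = [
    "/vishal/sessionscid/dt=2018-02-09/_part-0-28.valid-length",
    "/vishal/sessionscid/dt=2018-02-09/part-0-28",  # hypothetical finalized file
]
visible = visible_to_hadoop_input(paths)
print(visible)
```

Only the file name is checked, so a directory component such as `dt=2018-02-09` does not affect the result.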

Re: Regarding BucketingSink

2018-02-19 Thread Aljoscha Krettek
Hi, Sorry for the confusion. The framework (Flink) currently does not do any cleanup of pending files, yes. Best, Aljoscha > On 19. Feb 2018, at 17:01, Vishal Santoshi wrote: > > >> You should only have these dangling pending files after a failure-recovery > >> cycle, as you noticed. My sugg

Re: Regarding BucketingSink

2018-02-19 Thread Vishal Santoshi
>> You should only have these dangling pending files after a failure-recovery cycle, as you noticed. My suggestion would be to periodically clean up older pending files. A little confused. Is that what the framework should do, or us as part of some cleanup job ? On Mon, Feb 19, 2018 at 10:47

Re: Regarding BucketingSink

2018-02-19 Thread Aljoscha Krettek
Hi, The BucketingSink does not clean up pending files on purpose. In a distributed setting, and especially with rescaling of Flink operators, it is sufficiently hard to figure out which of the pending files you actually can delete and which of them you have to leave because they will get moved
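The suggestion to periodically clean up older pending files could be sketched roughly as follows. This is a minimal sketch, not part of Flink: the function name `stale_pending_files`, the `max_age` threshold, and the input shape (pairs of path and modification time taken from a directory listing) are all assumptions made for illustration. It only selects candidates by the `.pending` suffix and age. Whether deleting them is safe for a given job is exactly the question raised in this thread, so the deletion step is deliberately left out.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


def stale_pending_files(
    listing: Iterable[Tuple[str, datetime]],
    now: datetime,
    max_age: timedelta,
) -> List[str]:
    """Return paths of .pending files last modified more than max_age ago."""
    cutoff = now - max_age
    return [
        path
        for path, modified in listing
        if path.endswith(".pending") and modified < cutoff
    ]


# Example listing, modeled on file names that appear in this thread.
listing = [
    ("/kraken/sessions_uuid_csid_v3/dt=2018-02-14/_part-0-18.valid-length",
     datetime(2018, 2, 14, 18, 48)),
    ("/kraken/sessions_uuid_csid_v3/dt=2018-02-14/_part-0-19.pending",
     datetime(2018, 2, 14, 19, 15)),
]
candidates = stale_pending_files(
    listing,
    now=datetime(2018, 2, 19, 12, 0),
    max_age=timedelta(days=2),  # hypothetical threshold
)
print(candidates)
```

A generous `max_age` leaves room for a job that is still expected to restore from an older savepoint and finalize those files itself.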

Re: Regarding BucketingSink

2018-02-19 Thread Till Rohrmann
Hi Vishal, the pending files should indeed get finalized eventually. This happens on a checkpoint complete notification. Thus, what you report seems not right. Maybe Aljoscha can shed a bit more light on the problem. In order to further debug the problem, it would be really helpful to get acce

Re: Regarding BucketingSink

2018-02-15 Thread Mu Kong
Hi Vishal, I have the same concern about savepointing in BucketingSink. As for your question, I think that before the pending files get cleared in handleRestoredBucketState, they are finalized in notifyCheckpointComplete https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-fi

Re: Regarding BucketingSink

2018-02-14 Thread Vishal Santoshi
-rw-r--r-- 3 root hadoop       11 2018-02-14 18:48 /kraken/sessions_uuid_csid_v3/dt=2018-02-14/_part-0-18.valid-length
-rw-r--r-- 3 root hadoop 54053518 2018-02-14 19:15 /kraken/sessions_uuid_csid_v3/dt=2018-02-14/_part-0-19.pending
-rw-r--r-- 3 root hadoop       11 2018-02-14 21:17 /

Re: Regarding BucketingSink

2018-02-09 Thread Vishal Santoshi
Without --allowNonRestoredState, on a suspend/resume we do see the length file along with the finalized file (finalized during resume):
-rw-r--r-- 3 root hadoop 10 2018-02-09 13:57 /vishal/sessionscid/dt=2018-02-09/_part-0-28.valid-length
That does make much more sense. I guess we sh

Re: Regarding BucketingSink

2018-02-09 Thread Vishal Santoshi
This is 1.4 BTW. I am not sure that I am reading this correctly but the lifecycle of cancel/resume is 2 steps 1. Cancel job with SP closeCurrentPartFile https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-filesystem/src/main/java/org/apache/flink/streaming/connectors

Regarding BucketingSink

2018-02-09 Thread Vishal Santoshi
What should be the behavior of BucketingSink vis-a-vis state (pending, in-progress and finalization) when we suspend and resume? So I did this: * I had a pipe writing to HDFS, and did a suspend and resume using --allowNonRestoredState, as I had added a harmless MapOperator (stateless). * I see that