[jira] [Commented] (FLINK-7266) Don't attempt to delete parent directory on S3
    [ https://issues.apache.org/jira/browse/FLINK-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16257146#comment-16257146 ]

Aljoscha Krettek commented on FLINK-7266:
-----------------------------------------

Fixed on master (to be 1.5) in b00f1b326c1ab4221a555200a4d5798e1565b821

> Don't attempt to delete parent directory on S3
> ----------------------------------------------
>
>                 Key: FLINK-7266
>                 URL: https://issues.apache.org/jira/browse/FLINK-7266
>             Project: Flink
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.3.1
>            Reporter: Stephan Ewen
>            Assignee: Aljoscha Krettek
>            Priority: Blocker
>             Fix For: 1.4.0, 1.3.2
>
>
> Currently, every attempted release of an S3 state object also checks if the
> "parent directory" is empty and then tries to delete it.
> Not only is that unnecessary on S3, but it is prohibitively expensive and for
> example causes S3 to throttle calls by the JobManager on checkpoint cleanup.
> The {{FileState}} must only attempt parent directory cleanup when operating
> against real file systems, not when operating against object stores.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
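The rule stated in the issue description (parent directory cleanup only against real file systems, never against object stores) could be sketched roughly as below. This is a minimal illustration, not Flink's actual code: the class name, the StoreKind enum, and the scheme-based classification (anything starting with "s3" counts as an object store) are all assumptions made for the example.

```java
import java.net.URI;
import java.util.Locale;

public class ParentCleanupGuard {

    /** Hypothetical classification of a storage backend. */
    public enum StoreKind { FILE_SYSTEM, OBJECT_STORE }

    /** Assumed mapping: schemes starting with "s3" are object stores, everything else is a file system. */
    public static StoreKind kindOf(URI uri) {
        String scheme = uri.getScheme() == null
                ? "file"
                : uri.getScheme().toLowerCase(Locale.ROOT);
        return scheme.startsWith("s3") ? StoreKind.OBJECT_STORE : StoreKind.FILE_SYSTEM;
    }

    /** Parent directory cleanup is only attempted for real file systems. */
    public static boolean shouldCleanUpParent(URI stateFile) {
        return kindOf(stateFile) == StoreKind.FILE_SYSTEM;
    }

    public static void main(String[] args) {
        System.out.println(shouldCleanUpParent(URI.create("s3://bucket/flink/checkpoints/chk-1/state")));
        System.out.println(shouldCleanUpParent(URI.create("hdfs:///flink/checkpoints/chk-1/state")));
    }
}
```

With a guard like this, releasing a state object on S3 would issue only the delete for the object itself and skip the emptiness check on its "parent directory".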
[jira] [Commented] (FLINK-7266) Don't attempt to delete parent directory on S3
    [ https://issues.apache.org/jira/browse/FLINK-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16256920#comment-16256920 ]

Aljoscha Krettek commented on FLINK-7266:
-----------------------------------------

We actually have to fix this for 1.4 because otherwise we regress from 1.3. The operation that deletes parent directories is too expensive and will basically DDoS S3, making Flink unusable for bigger installations with S3.

> Don't attempt to delete parent directory on S3
> ----------------------------------------------
>
>                 Key: FLINK-7266
>                 URL: https://issues.apache.org/jira/browse/FLINK-7266
>             Project: Flink
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.3.1
>            Reporter: Stephan Ewen
>            Assignee: Stephan Ewen
>            Priority: Blocker
>             Fix For: 1.4.0, 1.3.2, 1.5.0
[jira] [Commented] (FLINK-7266) Don't attempt to delete parent directory on S3
    [ https://issues.apache.org/jira/browse/FLINK-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16201823#comment-16201823 ]

Aljoscha Krettek commented on FLINK-7266:
-----------------------------------------

For 1.4, this will be resolved by only deleting the parent directory on the master (in the checkpoint coordinator).

> Don't attempt to delete parent directory on S3
> ----------------------------------------------
>
>                 Key: FLINK-7266
>                 URL: https://issues.apache.org/jira/browse/FLINK-7266
>             Project: Flink
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.3.1
>            Reporter: Stephan Ewen
>            Assignee: Stephan Ewen
>            Priority: Blocker
>             Fix For: 1.4.0, 1.3.2
[jira] [Commented] (FLINK-7266) Don't attempt to delete parent directory on S3
    [ https://issues.apache.org/jira/browse/FLINK-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16194915#comment-16194915 ]

Stephan Ewen commented on FLINK-7266:
-------------------------------------

True, this is a problem in 1.3.2. The tradeoff was either to issue a very large number of redundant requests for the directory emptiness check (which cause checkpointing to stall or be throttled) or to leave the "directories" behind.

In Flink 1.4 we want to fix this by letting the checkpoints understand the file structure and making it a single call to drop the directory, as Steve suggested. The current abstraction is overly generic (just things in arbitrary byte chunks) and does not understand that checkpoint files cluster together in directories.

> Don't attempt to delete parent directory on S3
> ----------------------------------------------
>
>                 Key: FLINK-7266
>                 URL: https://issues.apache.org/jira/browse/FLINK-7266
>             Project: Flink
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.3.1
>            Reporter: Stephan Ewen
>            Assignee: Stephan Ewen
>            Priority: Critical
>             Fix For: 1.4.0, 1.3.2
[jira] [Commented] (FLINK-7266) Don't attempt to delete parent directory on S3
    [ https://issues.apache.org/jira/browse/FLINK-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16168733#comment-16168733 ]

Elias Levy commented on FLINK-7266:
-----------------------------------

I am curious what the state of this is. It is still a problem on 1.3.2, making use of S3 with the file system state backend very imprudent in production. You end up with thousands of empty "directories" in S3 for the checkpoints:

{code}
$ sudo aws s3 ls --recursive s3://bucket/flink/checkpoints/58c7604fbc543b6df75b62601a9b4c9d/
2017-09-15 23:03:15          0 flink/checkpoints/58c7604fbc543b6df75b62601a9b4c9d/chk-1/
2017-09-15 23:04:15          0 flink/checkpoints/58c7604fbc543b6df75b62601a9b4c9d/chk-10/
2017-09-15 23:14:07          0 flink/checkpoints/58c7604fbc543b6df75b62601a9b4c9d/chk-100/
2017-09-15 23:14:14          0 flink/checkpoints/58c7604fbc543b6df75b62601a9b4c9d/chk-101/
2017-09-15 23:14:20          0 flink/checkpoints/58c7604fbc543b6df75b62601a9b4c9d/chk-102/
2017-09-15 23:15:12          0 flink/checkpoints/58c7604fbc543b6df75b62601a9b4c9d/chk-103/
2017-09-15 23:15:18          0 flink/checkpoints/58c7604fbc543b6df75b62601a9b4c9d/chk-104/
...
{code}
[jira] [Commented] (FLINK-7266) Don't attempt to delete parent directory on S3
    [ https://issues.apache.org/jira/browse/FLINK-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155245#comment-16155245 ]

Steve Loughran commented on FLINK-7266:
---------------------------------------

If you are using s3a, then delete(path, recursive=false) will stop you from trying to delete a non-empty dir.
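The behaviour Steve describes, where a non-recursive delete refuses to remove a non-empty directory so the caller needs no separate emptiness check, has a close analogue on a local file system. The sketch below uses java.nio.file rather than the Hadoop FileSystem API, so it only illustrates the pattern; the class and method names are made up for the example, and it relies on the default local file system provider reporting DirectoryNotEmptyException.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.Files;
import java.nio.file.Path;

public class NonRecursiveDelete {

    /** Attempts a non-recursive delete; returns true only if the directory was empty and is now gone. */
    public static boolean deleteIfEmpty(Path dir) throws IOException {
        try {
            Files.delete(dir); // fails on a non-empty directory instead of recursing
            return true;
        } catch (DirectoryNotEmptyException e) {
            return false; // still holds files, so leave it alone
        }
    }

    /** Runs the scenario once: first with a file inside the directory, then after removing it. */
    public static boolean[] demo() {
        try {
            Path dir = Files.createTempDirectory("chk-");
            Path state = Files.createFile(dir.resolve("state"));
            boolean whileNonEmpty = deleteIfEmpty(dir);
            Files.delete(state);
            boolean whenEmpty = deleteIfEmpty(dir);
            return new boolean[] {whileNonEmpty, whenEmpty};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = demo();
        System.out.println("deleted while non-empty: " + result[0]);
        System.out.println("deleted when empty: " + result[1]);
    }
}
```

The point of the pattern is that the emptiness check and the delete collapse into one call, although on an object store that single call may still translate into several requests.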
[jira] [Commented] (FLINK-7266) Don't attempt to delete parent directory on S3
    [ https://issues.apache.org/jira/browse/FLINK-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155151#comment-16155151 ]

Aljoscha Krettek commented on FLINK-7266:
-----------------------------------------

I think that's only part of the problem because Flink must check on its own whether the directory is empty before we can delete it. The basic problem is that each state handle is being cleaned up individually. If we had global knowledge that all state handles actually reside in one base directory then we could shoot off an asynchronous command that deletes that whole sub-directory. (Which might still be horribly slow on S3 and not solve the problem at all.)
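To make the cost difference concrete, here is a toy model that counts requests for the two strategies discussed above: cleaning up each state handle individually (delete, emptiness check, possibly delete the marker) versus dropping the whole checkpoint directory at once. Everything here is hypothetical: CountingStore is an in-memory stand-in, not an S3 or Flink API, and the directory drop is counted as one listing plus one batch delete, which assumes the directory holds fewer keys than fit into a single listing page and a single batch request.

```java
import java.util.List;
import java.util.TreeSet;

public class CleanupCost {

    /** In-memory stand-in for an object store that counts every request made against it. */
    public static class CountingStore {
        final TreeSet<String> keys = new TreeSet<>();
        int requests = 0;

        void delete(String key) {
            requests++;
            keys.remove(key);
        }

        /** One listing request: is there any key under the prefix besides the marker itself? */
        boolean isEmpty(String prefix) {
            requests++;
            String next = keys.higher(prefix);
            return next == null || !next.startsWith(prefix);
        }

        /** Assumed cost: one listing plus one batch delete for a small directory. */
        void deletePrefix(String prefix) {
            requests += 2;
            keys.removeIf(k -> k.startsWith(prefix));
        }
    }

    /** Builds a store holding the directory marker and all state objects. */
    public static CountingStore newStore(String dir, List<String> handles) {
        CountingStore store = new CountingStore();
        store.keys.add(dir);
        store.keys.addAll(handles);
        return store;
    }

    /** Every handle deletes itself, then checks whether the parent is empty and removes it if so. */
    public static int perHandleCleanup(CountingStore store, String dir, List<String> handles) {
        for (String key : handles) {
            store.delete(key);
            if (store.isEmpty(dir)) {
                store.delete(dir);
            }
        }
        return store.requests;
    }

    /** A coordinator that knows the directory layout drops it in one logical operation. */
    public static int dropDirectory(CountingStore store, String dir) {
        store.deletePrefix(dir);
        return store.requests;
    }

    public static void main(String[] args) {
        List<String> handles = List.of("chk-1/a", "chk-1/b", "chk-1/c");
        System.out.println(perHandleCleanup(newStore("chk-1/", handles), "chk-1/", handles));
        System.out.println(dropDirectory(newStore("chk-1/", handles), "chk-1/"));
    }
}
```

In this model the per-handle strategy needs 2n + 1 requests for n state objects, while the directory drop stays constant as long as the assumption about directory size holds.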
[jira] [Commented] (FLINK-7266) Don't attempt to delete parent directory on S3
    [ https://issues.apache.org/jira/browse/FLINK-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16153910#comment-16153910 ]

Steve Loughran commented on FLINK-7266:
---------------------------------------

FWIW, in s3a we create a single delete request to rm all parent paths *and don't bother doing the existence check*. That is, for a file a/b/c.txt, after the file is written in close(), POST a delete list of /a/ /a/b.

It's ~O(1) for depth and, as you don't need to wait for the response, even something you could do asynchronously.
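The list of parent paths that such a single delete request would carry can be derived from the object key alone, without asking the store whether the markers exist. A minimal sketch follows, assuming keys are written without a leading slash and that directory markers end in "/"; the class and method names are invented for the example and do not come from s3a.

```java
import java.util.ArrayList;
import java.util.List;

public class ParentMarkers {

    /** Returns the directory marker key of every ancestor of the given object key. */
    public static List<String> parentMarkerKeys(String objectKey) {
        List<String> markers = new ArrayList<>();
        int slash = objectKey.indexOf('/');
        while (slash >= 0) {
            // Each prefix that ends in '/' is a candidate marker for one ancestor.
            markers.add(objectKey.substring(0, slash + 1));
            slash = objectKey.indexOf('/', slash + 1);
        }
        return markers;
    }

    public static void main(String[] args) {
        // prints [a/, a/b/]
        System.out.println(parentMarkerKeys("a/b/c.txt"));
    }
}
```

Since the whole list goes into one batch delete, the number of requests does not grow with the nesting depth, which is what the "~O(1) for depth" remark refers to.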
[jira] [Commented] (FLINK-7266) Don't attempt to delete parent directory on S3
    [ https://issues.apache.org/jira/browse/FLINK-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16112606#comment-16112606 ]

Aljoscha Krettek commented on FLINK-7266:
-----------------------------------------

This is actually resolved on {{release-1.3}} for the s3 filesystem.
[jira] [Commented] (FLINK-7266) Don't attempt to delete parent directory on S3
    [ https://issues.apache.org/jira/browse/FLINK-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16110547#comment-16110547 ]

Stefan Richter commented on FLINK-7266:
---------------------------------------

I agree that we should not block the release on this. Just wanted to have this recorded with this issue, so that we can improve it for the future.
[jira] [Commented] (FLINK-7266) Don't attempt to delete parent directory on S3
    [ https://issues.apache.org/jira/browse/FLINK-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16110535#comment-16110535 ]

Stephan Ewen commented on FLINK-7266:
-------------------------------------

We could try and improve that by not doing the {{mkdirs()}} call in the stream factory for each state element. That might help, but I would not consider this a release blocker.

I would try and solve that in a more holistic way in 1.4.0, by extending the FileSystem abstraction and post-state release hooks in the Checkpoint Coordinator (so that there is one call to drop the directory marker file, if we cannot find a way for it to not be created in the first place).
[jira] [Commented] (FLINK-7266) Don't attempt to delete parent directory on S3
    [ https://issues.apache.org/jira/browse/FLINK-7266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16110453#comment-16110453 ]

Stefan Richter commented on FLINK-7266:
---------------------------------------

One comment about the "not necessary on S3" part: during the release 1.3.2 testing, I observed that I can see some empty directory entries remaining in S3. I added a screenshot in the testing document [here|https://docs.google.com/document/d/1dN9AM9FUPizIu4hTKAXJSbbAORRdrce-BqQ8AUHlOqE/edit?ts=59807985#].

If this is not a problem, can the issue be closed or is the merge into 1.4 still pending?