[jira] [Commented] (FLINK-8794) When using BucketingSink, it happens that one of the files is always in the [.in-progress] state
[ https://issues.apache.org/jira/browse/FLINK-8794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17336651#comment-17336651 ]

Flink Jira Bot commented on FLINK-8794:
---------------------------------------

This issue was labeled "stale-major" 7 days ago and has not received any updates so it is being deprioritized. If this ticket is actually Major, please raise the priority and ask a committer to assign you the issue or revive the public discussion.

> When using BucketingSink, it happens that one of the files is always in the [.in-progress] state
>
>                 Key: FLINK-8794
>                 URL: https://issues.apache.org/jira/browse/FLINK-8794
>             Project: Flink
>          Issue Type: Improvement
>          Components: Connectors / FileSystem
>    Affects Versions: 1.4.0, 1.4.1
>            Reporter: yanxiaobin
>            Priority: Major
>              Labels: stale-major
>
> When using BucketingSink, it happens that one of the files is always in the [.in-progress] state, and this state never changes afterwards. The underlying storage is S3.
>
> {code:java}
> // code placeholder
> {code}
> 2018-02-28 11:58:42 147341619 {color:#d04437}_part-28-0.in-progress{color}
> 2018-02-28 12:06:27 147315059 part-0-0
> 2018-02-28 12:06:27 147462359 part-1-0
> 2018-02-28 12:06:27 147316006 part-10-0
> 2018-02-28 12:06:28 147349854 part-100-0
> 2018-02-28 12:06:27 147421625 part-101-0
> 2018-02-28 12:06:27 147443830 part-102-0
> 2018-02-28 12:06:27 147372801 part-103-0
> 2018-02-28 12:06:27 147343670 part-104-0
> ..

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-8794) When using BucketingSink, it happens that one of the files is always in the [.in-progress] state
[ https://issues.apache.org/jira/browse/FLINK-8794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17328657#comment-17328657 ]

Flink Jira Bot commented on FLINK-8794:
---------------------------------------

This major issue is unassigned, and neither it nor any of its Sub-Tasks has been updated for 30 days. So, it has been labeled "stale-major". If this ticket is indeed "major", please either assign yourself or give an update. Afterwards, please remove the label. In 7 days the issue will be deprioritized.
[jira] [Commented] (FLINK-8794) When using BucketingSink, it happens that one of the files is always in the [.in-progress] state
[ https://issues.apache.org/jira/browse/FLINK-8794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417174#comment-16417174 ]

yanxiaobin commented on FLINK-8794:
-----------------------------------

Other problems have nothing to do with Flink.
[jira] [Commented] (FLINK-8794) When using BucketingSink, it happens that one of the files is always in the [.in-progress] state
[ https://issues.apache.org/jira/browse/FLINK-8794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417140#comment-16417140 ]

Steve Loughran commented on FLINK-8794:
---------------------------------------

bq. Enable consistent-view can cause other problems.

Really? What are they?
[jira] [Commented] (FLINK-8794) When using BucketingSink, it happens that one of the files is always in the [.in-progress] state
[ https://issues.apache.org/jira/browse/FLINK-8794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16416734#comment-16416734 ]

yanxiaobin commented on FLINK-8794:
-----------------------------------

Thanks! Indeed so. Enabling consistent view can cause other problems.
[jira] [Commented] (FLINK-8794) When using BucketingSink, it happens that one of the files is always in the [.in-progress] state
[ https://issues.apache.org/jira/browse/FLINK-8794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415491#comment-16415491 ]

Steve Loughran commented on FLINK-8794:
---------------------------------------

That's Amazon EMR's problem. Switch to their "consistent s3" offering for the bucket you are using as the sink.
[jira] [Commented] (FLINK-8794) When using BucketingSink, it happens that one of the files is always in the [.in-progress] state
[ https://issues.apache.org/jira/browse/FLINK-8794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415131#comment-16415131 ]

yanxiaobin commented on FLINK-8794:
-----------------------------------

The underlying implementation is com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem. I am using Hadoop 2.7.3. I haven't thought of a good solution for the time being.
[jira] [Commented] (FLINK-8794) When using BucketingSink, it happens that one of the files is always in the [.in-progress] state
[ https://issues.apache.org/jira/browse/FLINK-8794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414272#comment-16414272 ]

Steve Loughran commented on FLINK-8794:
---------------------------------------

{quote}
writing to local disks would decrease performance, since you would need to write the same data twice (first locally then copy remotely
{quote}

I don't know what FS connector you are using, but these days S3A defaults to buffering blocks to local HDD before initiating the upload, either in close() or once the block size threshold is reached. You aren't going to see a perf hit if you are writing files smaller than fs.s3a.blocksize. If they are bigger, I'm afraid you will, but it may be worth it.

The staging S3A committers coming in Hadoop 3.1 postpone all uploads until task commit, but they gain better failure semantics, and job commit is fast and trivial.
[jira] [Commented] (FLINK-8794) When using BucketingSink, it happens that one of the files is always in the [.in-progress] state
[ https://issues.apache.org/jira/browse/FLINK-8794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16414262#comment-16414262 ]

Steve Loughran commented on FLINK-8794:
---------------------------------------

It does if you turn S3Guard on with Hadoop 3.0+ and its S3A connector, as it (like Amazon's EMRFS) uses DynamoDB for that consistency.

Unless you write code on the explicit assumption that the store is eventually consistent, treating S3 "just" like a filesystem is dangerous. It'll usually work in tests, but in larger-scale production deployments you can get burned.
[jira] [Commented] (FLINK-8794) When using BucketingSink, it happens that one of the files is always in the [.in-progress] state
[ https://issues.apache.org/jira/browse/FLINK-8794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16404584#comment-16404584 ]

yanxiaobin commented on FLINK-8794:
-----------------------------------

S3 follows the *eventually consistent* principle, so there is no good solution to this problem on S3 for the time being.
[jira] [Commented] (FLINK-8794) When using BucketingSink, it happens that one of the files is always in the [.in-progress] state
[ https://issues.apache.org/jira/browse/FLINK-8794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395286#comment-16395286 ]

yanxiaobin commented on FLINK-8794:
-----------------------------------

I will open a case with AWS for the root cause of the problem.
[jira] [Commented] (FLINK-8794) When using BucketingSink, it happens that one of the files is always in the [.in-progress] state
[ https://issues.apache.org/jira/browse/FLINK-8794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394856#comment-16394856 ]

Piotr Nowojski commented on FLINK-8794:
---------------------------------------

Thanks for the update and good to hear that.
[jira] [Commented] (FLINK-8794) When using BucketingSink, it happens that one of the files is always in the [.in-progress] state
[ https://issues.apache.org/jira/browse/FLINK-8794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16393956#comment-16393956 ]

yanxiaobin commented on FLINK-8794:
-----------------------------------

About point 1: what I described above is a situation that occurs when there is no failure in the job. I found in the logs that the filesystem's rename method was executed without any exception, but the filename did not change, so I think it is S3's problem. This should not be a problem with Flink.
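Editor's sketch, not part of the thread: the comment above reports a rename that raised no exception yet left the old name in place. One way to detect that is to check the outcome explicitly after the call. Hadoop's FileSystem.rename returns a boolean rather than always throwing, so the return value is worth checking too. The class below is a hypothetical helper that uses java.nio on a local filesystem as a stand-in. It is not Flink's or EMRFS's code, and on an eventually consistent store the existence check itself may be stale.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper (an assumption for illustration, not a Flink or Hadoop API). */
final class RenameCheck {
    private RenameCheck() {}

    /** Renames source to target, then verifies the result instead of trusting the call. */
    static boolean renameAndVerify(Path source, Path target) {
        try {
            Files.move(source, target);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The rename only counts as done if the new name is visible and the old one is gone.
        return Files.exists(target) && Files.notExists(source);
    }

    /** Small local demo using the file names from the listing in the issue description. */
    static boolean demo() {
        try {
            Path dir = Files.createTempDirectory("rename-check");
            Path inProgress = Files.createFile(dir.resolve("_part-28-0.in-progress"));
            return renameAndVerify(inProgress, dir.resolve("part-28-0"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller could log or retry when renameAndVerify returns false, which would at least make a silent no-op rename visible in the job logs.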
[jira] [Commented] (FLINK-8794) When using BucketingSink, it happens that one of the files is always in the [.in-progress] state
[ https://issues.apache.org/jira/browse/FLINK-8794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16384954#comment-16384954 ]

yanxiaobin commented on FLINK-8794:
-----------------------------------

Yes. If we allow for a different directory, that should already be enough. So how do we solve this problem? And when using BucketingSink, it happens that one of the files is always in the [.in-progress] state, and this state never changes afterwards.
[jira] [Commented] (FLINK-8794) When using BucketingSink, it happens that one of the files is always in the [.in-progress] state
[ https://issues.apache.org/jira/browse/FLINK-8794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383441#comment-16383441 ]

Piotr Nowojski commented on FLINK-8794:
---------------------------------------

The temporary data is already separated from the final output: it is in different files. If we allow for a different directory, that should already be enough.

Besides, writing to local disks would decrease performance, since you would need to write the same data twice (first locally, then copy remotely, which is unnecessary, while moving files between directories is cheap). Still, "pending" files would have to be copied to the remote location, since in some cases "pending" files are committed during recovery. Thus it wouldn't solve your problem.
[jira] [Commented] (FLINK-8794) When using BucketingSink, it happens that one of the files is always in the [.in-progress] state
[ https://issues.apache.org/jira/browse/FLINK-8794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383356#comment-16383356 ]

yanxiaobin commented on FLINK-8794:
-----------------------------------

Hi, [~pnowojski]! Thank you for your suggestion!

The downstream processor can ignore the files with "*pending" or "*in-progress" suffixes and the "_" prefix, but I don't think it's a good way to deal with it.

We can change this behaviour/add an option for BucketingSink to use temporary "in-progress" and "pending" directories instead of prefixes, but the temporary "in-progress" and "pending" directories are still subdirectories of the base directory, and the downstream processor may still read the base directory recursively. That also results in reading redundant dirty data. I think the temporary data produced while the program runs should be isolated from the final output data.

Thanks!

Also [~kkl0u] could you elaborate why rescaling forced us to keep lingering files?
[jira] [Commented] (FLINK-8794) When using BucketingSink, it happens that one of the files is always in the [.in-progress] state
[ https://issues.apache.org/jira/browse/FLINK-8794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383295#comment-16383295 ]

Piotr Nowojski commented on FLINK-8794:
---------------------------------------

[~Backlight]

3. What kind of downstream processor? Can't it just ignore the files with "*pending" or "*in-progress" suffixes and the "_" prefix? I also don't think this is a bug, but a designed feature ( https://issues.apache.org/jira/browse/FLINK-5054 ) of the BucketingSink. On the other hand, we could change this behaviour/add an option for BucketingSink to use temporary "in-progress" and "pending" directories instead of prefixes.

Also [~kkl0u] could you elaborate why rescaling forced us to keep lingering files?
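Editor's sketch, not part of the thread: the suggestion above, that a downstream processor skip files carrying the temporary markers, could look roughly like the following. The class and method names are hypothetical, and the markers are the ones quoted in the comment (the "_" prefix and the "in-progress" and "pending" suffixes). BucketingSink lets these be configured, so a real filter would need to match the job's actual settings.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical downstream-side filter (an assumption for illustration, not part of Flink). */
final class PartFileFilter {
    private PartFileFilter() {}

    /** True if the file name carries none of the temporary markers named in the comment. */
    static boolean isFinal(String fileName) {
        return !fileName.startsWith("_")
                && !fileName.endsWith(".in-progress")
                && !fileName.endsWith(".pending");
    }

    /** Keeps only the finalized part files from a directory listing. */
    static List<String> finalFiles(List<String> fileNames) {
        return fileNames.stream()
                .filter(PartFileFilter::isFinal)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> listing = Arrays.asList(
                "_part-28-0.in-progress", "part-0-0", "part-1-0", "_part-3-0.pending");
        System.out.println(finalFiles(listing)); // prints [part-0-0, part-1-0]
    }
}
```

Filtering on the name alone keeps the reader independent of the sink's internals, at the cost of coupling it to the naming convention, which is the concern raised in the reply to this comment.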
[jira] [Commented] (FLINK-8794) When using BucketingSink, it happens that one of the files is always in the [.in-progress] state
[ https://issues.apache.org/jira/browse/FLINK-8794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383209#comment-16383209 ]

yanxiaobin commented on FLINK-8794:
-----------------------------------

Hi, [~aljoscha]! Thank you for your reply! There are the following points:

1. The situation I described above occurs when there is no failure in the job.
2. This happens when a job has a failure (because one of the TaskManager nodes goes down) and recovers. Tolerating the failure of a node is necessary in distributed computing, so this case is a problem.
3. On recovery, the previous in-progress and pending files are not cleared, which causes the downstream processor to read excess dirty data.
5. I think we should first write data to local files on the computing nodes, then upload them to the distributed file system (for example S3 or HDFS) after the local file has been written completely.

We are blocked by this problem at the moment, and because of it we can't use this job.
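Editor's sketch, not part of the thread: the proposal in point 5 (write locally first, publish only complete files) can be illustrated as below. The class is hypothetical and uses java.nio with a local target directory standing in for S3 or HDFS. Note that Piotr Nowojski's reply in this thread points out the cost of this approach: the data is written twice.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of "stage locally, then publish" (not how BucketingSink works). */
final class LocalStagingWriter {
    private LocalStagingWriter() {}

    /** Writes all records to a local temp file, then copies the finished file to targetDir. */
    static Path writeThenPublish(List<String> records, Path targetDir, String finalName) {
        try {
            Path staged = Files.createTempFile("staged-", ".tmp");
            try {
                Files.write(staged, records, StandardCharsets.UTF_8);
                Files.createDirectories(targetDir);
                Path target = targetDir.resolve(finalName);
                // Only a fully written file is copied, so no temporary suffix appears in targetDir.
                Files.copy(staged, target, StandardCopyOption.REPLACE_EXISTING);
                return target;
            } finally {
                Files.deleteIfExists(staged);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Local demo: publishes two records and reads them back. */
    static List<String> demo() {
        try {
            Path dir = Files.createTempDirectory("bucket");
            Path published = writeThenPublish(Arrays.asList("a", "b"), dir, "part-0-0");
            return Files.readAllLines(published, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

On a local filesystem the copy step is not atomic, so this only sketches the ordering of the steps, not a guarantee about what concurrent readers see.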
[jira] [Commented] (FLINK-8794) When using BucketingSink, it happens that one of the files is always in the [.in-progress] state
[ https://issues.apache.org/jira/browse/FLINK-8794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381987#comment-16381987 ]

Aljoscha Krettek commented on FLINK-8794:
-----------------------------------------

Did you have a failure and recovery in this job? In those cases it can happen that there are lingering {{in-progress}} files because we cannot decide on whether we can clean them up or not when restoring.