[jira] [Commented] (FLINK-22472) The real partition data produced time is behind meta(_SUCCESS) file produced

2021-06-02 Thread Jingsong Lee (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1733#comment-1733
 ] 

Jingsong Lee commented on FLINK-22472:
--

We may need to do a little bit of surgery on the partition commit trigger.

> The real partition data produced time is behind meta(_SUCCESS) file produced
> 
>
> Key: FLINK-22472
> URL: https://issues.apache.org/jira/browse/FLINK-22472
> Project: Flink
>  Issue Type: Improvement
>  Components: Connectors / FileSystem, Connectors / Hive
>Reporter: Leonard Xu
>Assignee: luoyuxia
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2021-05-25-14-27-40-563.png
>
>
> I tested writing some data to a CSV file with the Flink filesystem connector, 
> but after the success file was produced the data file was still uncommitted, 
> which seems very odd to me.
> {code:java}
> bang@mac db1.db $ ll /var/folders/55/cw682b314gn8jhfh565hp7q0gp/T/junit8642959834366044048/junit484868942580135598/test-partition-time-commit/d\=2020-05-03/e\=12/
> total 8
> drwxr-xr-x  4 bang  staff  128  4 25 19:57 ./
> drwxr-xr-x  8 bang  staff  256  4 25 19:57 ../
> -rw-r--r--  1 bang  staff   12  4 25 19:57 .part-b703d4b9-067a-4dfe-935e-3afc723aed56-0-4.inprogress.b7d9cf09-0f72-4dce-8591-b61b1d23ae9b
> -rw-r--r--  1 bang  staff    0  4 25 19:57 _MY_SUCCESS
> {code}
>  
> After some debugging I found that I have to set 
> {{sink.rolling-policy.file-size}} or 
> {{sink.rolling-policy.rollover-interval}}. The default values of the two 
> parameters are pretty big (128M and 30min), which is not convenient for 
> tests and demos. I think we can improve this.
>  
> As the doc [1] describes, for row formats (csv, json), you can set the 
> parameter {{sink.rolling-policy.file-size}} or 
> {{sink.rolling-policy.rollover-interval}} in the connector properties, 
> together with the parameter {{execution.checkpointing.interval}} in 
> flink-conf.yaml, if you don't want to wait a long time before the data 
> becomes visible in the file system.
> [1] 
> https://ci.apache.org/projects/flink/flink-docs-master/docs/connectors/table/filesystem/#rolling-policy



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-22472) The real partition data produced time is behind meta(_SUCCESS) file produced

2021-05-25 Thread luoyuxia (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350850#comment-17350850
 ] 

luoyuxia commented on FLINK-22472:
--

[~ykt836] Yes, with pleasure.



[jira] [Commented] (FLINK-22472) The real partition data produced time is behind meta(_SUCCESS) file produced

2021-05-25 Thread Kurt Young (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350842#comment-17350842
 ] 

Kurt Young commented on FLINK-22472:


[~luoyuxia] Would you like to take this issue?



[jira] [Commented] (FLINK-22472) The real partition data produced time is behind meta(_SUCCESS) file produced

2021-05-25 Thread luoyuxia (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17350817#comment-17350817
 ] 

luoyuxia commented on FLINK-22472:
--

I think this problem can have two causes:

1: Although the partition is committable according to the partition commit 
policy you configured, there is still data that needs to be written to this 
partition. In this case, you may need to check your partition commit policy. 

2: Currently, the StreamingFileWriter is not aware that the partition has 
become committable, so it does not commit all files in this partition at that 
point. As a result, although the partition has been committed, the files in 
this partition have not been committed. In this case, we can modify the logic 
of StreamingFileWriter to alleviate the problem.
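
The state described in point 2 can be observed from outside the job. The 
sketch below is a minimal illustration in Python, not Flink code. It assumes 
that uncommitted part files are hidden files whose names contain 
".inprogress.", as in the directory listing quoted in this issue; the helper 
name in_progress_files and the shortened file name are hypothetical.

```python
import os
import tempfile


def in_progress_files(partition_dir):
    """Return the names of files that look like uncommitted part files.

    Assumption: in-progress files start with "." and contain ".inprogress.",
    matching the listing in the issue description.
    """
    return sorted(
        name
        for name in os.listdir(partition_dir)
        if name.startswith(".") and ".inprogress." in name
    )


# Recreate the layout from the issue description in a temporary directory.
with tempfile.TemporaryDirectory() as partition_dir:
    # The success marker is already present ...
    open(os.path.join(partition_dir, "_MY_SUCCESS"), "w").close()
    # ... while the data is still in an in-progress file (name shortened).
    pending = ".part-b703d4b9-0-4.inprogress.b7d9cf09"
    with open(os.path.join(partition_dir, pending), "w") as f:
        f.write("a,b,c\n")

    has_marker = os.path.exists(os.path.join(partition_dir, "_MY_SUCCESS"))
    leftover = in_progress_files(partition_dir)
    print(has_marker, leftover)
```

A non-empty result next to an existing success file is the situation reported 
here: the partition is marked as committed while its data is not yet visible.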



[jira] [Commented] (FLINK-22472) The real partition data produced time is behind meta(_SUCCESS) file produced

2021-05-13 Thread forideal (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344277#comment-17344277
 ] 

forideal commented on FLINK-22472:
--

[~Leonard Xu] Thank you for your suggestion. I look forward to this problem 
being fixed.



[jira] [Commented] (FLINK-22472) The real partition data produced time is behind meta(_SUCCESS) file produced

2021-05-13 Thread Leonard Xu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344273#comment-17344273
 ] 

Leonard Xu commented on FLINK-22472:


Yes, this may lead to your downstream tasks reading empty data. You can set a 
proper sink.rolling-policy.rollover-interval and adjust the monitor interval 
of your downstream tasks to work around this case temporarily.
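
On the downstream side, the workaround could look roughly like the following 
Python sketch. It is an illustration under assumptions, not part of Flink: 
the function names, the default marker name "_SUCCESS", and the rule that any 
file name containing ".inprogress." counts as uncommitted are all 
hypothetical choices based on the listing in this issue.

```python
import os
import time


def partition_ready(partition_dir, success_name="_SUCCESS"):
    """True if the success marker exists and no in-progress file remains."""
    names = os.listdir(partition_dir)
    if success_name not in names:
        return False
    # Assumption: uncommitted part files contain ".inprogress." in their name.
    return not any(".inprogress." in name for name in names)


def wait_for_partition(partition_dir, success_name="_SUCCESS",
                       poll_seconds=1.0, timeout_seconds=60.0):
    """Poll the partition directory until it is ready or the timeout expires.

    poll_seconds plays the role of the downstream monitor interval.
    Returns True if the partition became ready, False on timeout.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        if partition_ready(partition_dir, success_name):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
```

For the directory shown in the issue, a call such as 
wait_for_partition(path, success_name="_MY_SUCCESS") would keep polling until 
the in-progress file has been rolled and committed. The check only reduces 
the chance of reading an empty partition; it cannot tell whether more data 
will still arrive for the partition later.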



[jira] [Commented] (FLINK-22472) The real partition data produced time is behind meta(_SUCCESS) file produced

2021-05-13 Thread forideal (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344262#comment-17344262
 ] 

forideal commented on FLINK-22472:
--

[~Leonard Xu]

If the success file has been committed but the data has not been committed, 
is there a risk of data loss?
My downstream tasks rely on this success file.



[jira] [Commented] (FLINK-22472) The real partition data produced time is behind meta(_SUCCESS) file produced

2021-04-26 Thread Leonard Xu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17331864#comment-17331864
 ] 

Leonard Xu commented on FLINK-22472:


CC: [~lzljs3620320]
