[jira] [Commented] (FLINK-8500) Get the timestamp of the Kafka message from kafka consumer(Kafka010Fetcher)

2018-08-13 Thread Fred Teunissen (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-8500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579276#comment-16579276
 ] 

Fred Teunissen commented on FLINK-8500:
---

I rebased this PR onto the latest master branch yesterday evening.

> Get the timestamp of the Kafka message from kafka consumer(Kafka010Fetcher)
> ---
>
> Key: FLINK-8500
> URL: https://issues.apache.org/jira/browse/FLINK-8500
> Project: Flink
>  Issue Type: Improvement
>  Components: Kafka Connector
>Affects Versions: 1.4.0
>Reporter: yanxiaobin
>Priority: Major
> Fix For: 1.7.0
>
> Attachments: image-2018-01-30-14-58-58-167.png, 
> image-2018-01-31-10-48-59-633.png
>
>
> The deserialize method of KeyedDeserializationSchema needs a 'Kafka message 
> timestamp' parameter (taken from ConsumerRecord). In some business scenarios, 
> this is useful!
>  
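
A minimal sketch of what a timestamp-aware deserialize could look like (a hypothetical 
interface for illustration only, not the actual FLINK-8500 change; the extra long 
timestamp parameter is the point):

{code:java}
import java.io.IOException;
import java.io.Serializable;

// Hypothetical sketch only: a deserialize variant that also receives the record
// timestamp from Kafka's ConsumerRecord. Parameter order and the interface name
// are assumptions for illustration, not the actual FLINK-8500 API.
public interface TimestampedKeyedDeserializationSchema<T> extends Serializable {

    T deserialize(byte[] messageKey, byte[] message, String topic,
                  int partition, long offset, long timestamp) throws IOException;

    boolean isEndOfStream(T nextElement);
}
{code}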



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10074) Allowable number of checkpoint failures

2018-08-13 Thread Thomas Weise (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579269#comment-16579269
 ] 

Thomas Weise commented on FLINK-10074:
--

I think configuring the behavior as a count of allowable consecutive failures 
would work well. Would this replace the existing setFailOnCheckpointingErrors 
(will that setting become irrelevant when the user already sets the count)?

[https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/streaming/api/environment/CheckpointConfig.html]

Regarding what happens once the job was allowed to fail and recovers only to 
fail again: Shouldn't the counter only be reset after the next successful 
checkpoint vs. on restart? 
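
A minimal sketch of how this could look on CheckpointConfig (setFailOnCheckpointingErrors 
exists today; the count-based setter below is only a placeholder name for the FLINK-10074 
proposal, not a real method):

{code:java}
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointFailureConfigSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000); // checkpoint every 60 seconds

        // Existing all-or-nothing switch: either any checkpoint error fails the job,
        // or checkpoint errors are ignored entirely.
        env.getCheckpointConfig().setFailOnCheckpointingErrors(false);

        // Hypothetical method name for the FLINK-10074 proposal (not a real API at the
        // time of this discussion): tolerate up to N consecutive checkpoint failures.
        // env.getCheckpointConfig().setTolerableConsecutiveCheckpointFailures(3);
    }
}
{code}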

> Allowable number of checkpoint failures 
> 
>
> Key: FLINK-10074
> URL: https://issues.apache.org/jira/browse/FLINK-10074
> Project: Flink
>  Issue Type: Improvement
>  Components: State Backends, Checkpointing
>Reporter: Thomas Weise
>Assignee: vinoyang
>Priority: Major
>
> For intermittent checkpoint failures it is desirable to have a mechanism to 
> avoid restarts. If, for example, a transient S3 error prevents checkpoint 
> completion, the next checkpoint may very well succeed. The user may wish to 
> not incur the expense of restart under such scenario and this could be 
> expressed with a failure threshold (number of subsequent checkpoint 
> failures), possibly combined with a list of exceptions to tolerate.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10133) finished job's jobgraph never been cleaned up in zookeeper for standalone clusters (HA mode with multiple masters)

2018-08-13 Thread Xiangyu Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579248#comment-16579248
 ] 

Xiangyu Zhu commented on FLINK-10133:
-

[~Wosinsan] [~elevy] I have uploaded the logs with some sensitive info 
modified. If the logs look OK to you, this issue can be closed as a duplicate. 
Thanks!

> finished job's jobgraph never been cleaned up in zookeeper for standalone 
> clusters (HA mode with multiple masters)
> --
>
> Key: FLINK-10133
> URL: https://issues.apache.org/jira/browse/FLINK-10133
> Project: Flink
>  Issue Type: Bug
>  Components: JobManager
>Affects Versions: 1.5.0, 1.5.2, 1.6.0
>Reporter: Xiangyu Zhu
>Priority: Major
> Attachments: client.log, namenode.log, standalonesession.log, 
> zookeeper.log
>
>
> Hi,
> We have 3 servers in our test environment, noted as node1-3. Setup is as 
> following:
>  * hadoop hdfs: node1 as namenode, node2,3 as datanode
>  * zookeeper: node1-3 as a quorum (but also tried node1 alone)
>  * flink: node1,2 as masters, node2,3 as slaves
> As I understand it, when a job finishes, the corresponding job's blob data is 
> expected to be deleted from the HDFS path, and the node under ZooKeeper's path 
> `/\{zk path root}/\{cluster-id}/jobgraphs/\{job id}` should be deleted after 
> that. However, we observe that whenever we submit a job and it finishes (via 
> `bin/flink run WordCount.jar`), the blob data is gone whereas the job id node 
> under ZooKeeper is still there, with a UUID-style lock node inside it. In 
> ZooKeeper's debug output we observed something like "cannot be deleted because 
> non empty". Because of this, as long as a job has finished and its jobgraph 
> node persists, restarting the cluster or killing one job manager (to test HA 
> mode) makes Flink try to recover the finished job, fail to find the blob data 
> under HDFS, and bring the whole cluster down.
> If we use only node1 as master and node2,3 as slaves, the jobgraphs node is 
> deleted successfully. When the jobgraphs path is clean, killing one job 
> manager promotes another standby JM to leader, so it is only this jobgraphs 
> issue that prevents HA from working.
> I'm not sure whether something is wrong with our configs, because this happens 
> every time for a finished job (we have only tested with WordCount.jar though). 
> I'm aware of FLINK-10011 and FLINK-10029, but unlike FLINK-10011 this happens 
> every time, rendering HA mode unusable for us.
> Any idea what might cause this?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10133) finished job's jobgraph never been cleaned up in zookeeper for standalone clusters (HA mode with multiple masters)

2018-08-13 Thread Xiangyu Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangyu Zhu updated FLINK-10133:

Attachment: client.log
namenode.log
zookeeper.log
standalonesession.log

> finished job's jobgraph never been cleaned up in zookeeper for standalone 
> clusters (HA mode with multiple masters)
> --
>
> Key: FLINK-10133
> URL: https://issues.apache.org/jira/browse/FLINK-10133
> Project: Flink
>  Issue Type: Bug
>  Components: JobManager
>Affects Versions: 1.5.0, 1.5.2, 1.6.0
>Reporter: Xiangyu Zhu
>Priority: Major
> Attachments: client.log, namenode.log, standalonesession.log, 
> zookeeper.log
>
>
> Hi,
> We have 3 servers in our test environment, noted as node1-3. Setup is as 
> following:
>  * hadoop hdfs: node1 as namenode, node2,3 as datanode
>  * zookeeper: node1-3 as a quorum (but also tried node1 alone)
>  * flink: node1,2 as masters, node2,3 as slaves
> As I understand it, when a job finishes, the corresponding job's blob data is 
> expected to be deleted from the HDFS path, and the node under ZooKeeper's path 
> `/\{zk path root}/\{cluster-id}/jobgraphs/\{job id}` should be deleted after 
> that. However, we observe that whenever we submit a job and it finishes (via 
> `bin/flink run WordCount.jar`), the blob data is gone whereas the job id node 
> under ZooKeeper is still there, with a UUID-style lock node inside it. In 
> ZooKeeper's debug output we observed something like "cannot be deleted because 
> non empty". Because of this, as long as a job has finished and its jobgraph 
> node persists, restarting the cluster or killing one job manager (to test HA 
> mode) makes Flink try to recover the finished job, fail to find the blob data 
> under HDFS, and bring the whole cluster down.
> If we use only node1 as master and node2,3 as slaves, the jobgraphs node is 
> deleted successfully. When the jobgraphs path is clean, killing one job 
> manager promotes another standby JM to leader, so it is only this jobgraphs 
> issue that prevents HA from working.
> I'm not sure whether something is wrong with our configs, because this happens 
> every time for a finished job (we have only tested with WordCount.jar though). 
> I'm aware of FLINK-10011 and FLINK-10029, but unlike FLINK-10011 this happens 
> every time, rendering HA mode unusable for us.
> Any idea what might cause this?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] TisonKun commented on issue #6339: [FLINK-9859] [runtime] More Akka config

2018-08-13 Thread GitBox
TisonKun commented on issue #6339: [FLINK-9859] [runtime] More Akka config
URL: https://github.com/apache/flink/pull/6339#issuecomment-412743884
 
 
   ping @tillrohrmann  :-)


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (FLINK-9859) More Akka config

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579207#comment-16579207
 ] 

ASF GitHub Bot commented on FLINK-9859:
---

TisonKun commented on issue #6339: [FLINK-9859] [runtime] More Akka config
URL: https://github.com/apache/flink/pull/6339#issuecomment-412743884
 
 
   ping @tillrohrmann  :-)


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> More Akka config
> 
>
> Key: FLINK-9859
> URL: https://issues.apache.org/jira/browse/FLINK-9859
> Project: Flink
>  Issue Type: Improvement
>  Components: Local Runtime
>Affects Versions: 1.5.1
>Reporter: 陈梓立
>Assignee: 陈梓立
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.5.3
>
>
> Add more Akka config options.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (FLINK-10135) The JobManager doesn't report the cluster-level metrics

2018-08-13 Thread vinoyang (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang reassigned FLINK-10135:


Assignee: vinoyang

> The JobManager doesn't report the cluster-level metrics
> ---
>
> Key: FLINK-10135
> URL: https://issues.apache.org/jira/browse/FLINK-10135
> Project: Flink
>  Issue Type: Bug
>  Components: JobManager
>Affects Versions: 1.5.0
>Reporter: Joey Echeverria
>Assignee: vinoyang
>Priority: Major
>
> In [the documentation for 
> metrics|https://ci.apache.org/projects/flink/flink-docs-release-1.5/monitoring/metrics.html#cluster]
>  in the Flink 1.5.0 release, it says that the following metrics are reported 
> by the JobManager:
> {noformat}
> numRegisteredTaskManagers
> numRunningJobs
> taskSlotsAvailable
> taskSlotsTotal
> {noformat}
> In the job manager REST endpoint 
> ({{http://:8081/jobmanager/metrics}}), those metrics don't 
> appear.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-6437) Move history server configuration to a separate file

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-6437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579157#comment-16579157
 ] 

ASF GitHub Bot commented on FLINK-6437:
---

yanghua edited a comment on issue #6542: [FLINK-6437][History Server] Move 
history server configuration to a separate file
URL: https://github.com/apache/flink/pull/6542#issuecomment-412736043
 
 
   hi @StephanEwen @zentol ,
   
   Currently, the history server runs as a single JVM process, which means it 
is a separate component, so I think the configuration split is reasonable. If I 
only need to start the history server, I don't rely on other configurations and 
don't need to pay attention to them.
   
   Regarding the discussion of compatibility, I basically agree with @zentol's 
point of view.
   
   Based on the existing implementation, I think we can add more log warnings 
in the new method of `GlobalConfiguration`. The general idea is as follows:
   
   * Check for the new configuration file; if it does not exist, log a message 
informing the user that they are using the old configuration;
   * If both the new and the old history server configuration exist at the same 
time, warn the user that there are two configurations and that 
`flink-historyserver-conf.yaml` takes precedence;
   
   In addition, I will log a warning when the configuration file is loaded by 
the `HistoryServer#main` method, for example:
   
   > the history server configuration in `flink-conf.yaml` is kept only for 
backward compatibility; it is recommended to use the new configuration file 
`flink-historyserver-conf.yaml`.
   
   I will also comment out the configuration items in 
`flink-historyserver-conf.yaml`; given all this guidance, I believe users will 
be more likely to follow the correct path.
   
   A setup that relies only on the configuration already in `flink-conf.yaml` 
will still work with the current implementation. But if there are 
configurations in both files, we seem to have no better option than to warn 
that `flink-historyserver-conf.yaml` takes precedence.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Move history server configuration to a separate file
> 
>
> Key: FLINK-6437
> URL: https://issues.apache.org/jira/browse/FLINK-6437
> Project: Flink
>  Issue Type: Improvement
>  Components: History Server
>Affects Versions: 1.3.0
>Reporter: Stephan Ewen
>Assignee: vinoyang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>
> I suggest to keep the {{flink-conf.yaml}} leaner by moving configuration of 
> the History Server to a different file.
> In general, I would propose to move configurations of separate, independent 
> and optional components to individual config files.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] yanghua edited a comment on issue #6542: [FLINK-6437][History Server] Move history server configuration to a separate file

2018-08-13 Thread GitBox
yanghua edited a comment on issue #6542: [FLINK-6437][History Server] Move 
history server configuration to a separate file
URL: https://github.com/apache/flink/pull/6542#issuecomment-412736043
 
 
   hi @StephanEwen @zentol ,
   
   Currently, the history server runs as a single JVM process, which means it 
is a separate component, so I think the configuration split is reasonable. If I 
only need to start the history server, I don't rely on other configurations and 
don't need to pay attention to them.
   
   Regarding the discussion of compatibility, I basically agree with @zentol's 
point of view.
   
   Based on the existing implementation, I think we can add more log warnings 
in the new method of `GlobalConfiguration`. The general idea is as follows:
   
   * Check for the new configuration file; if it does not exist, log a message 
informing the user that they are using the old configuration;
   * If both the new and the old history server configuration exist at the same 
time, warn the user that there are two configurations and that 
`flink-historyserver-conf.yaml` takes precedence;
   
   In addition, I will log a warning when the configuration file is loaded by 
the `HistoryServer#main` method, for example:
   
   > the history server configuration in `flink-conf.yaml` is kept only for 
backward compatibility; it is recommended to use the new configuration file 
`flink-historyserver-conf.yaml`.
   
   I will also comment out the configuration items in 
`flink-historyserver-conf.yaml`; given all this guidance, I believe users will 
be more likely to follow the correct path.
   
   A setup that relies only on the configuration already in `flink-conf.yaml` 
will still work with the current implementation. But if there are 
configurations in both files, we seem to have no better option than to warn 
that `flink-historyserver-conf.yaml` takes precedence.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (FLINK-6437) Move history server configuration to a separate file

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-6437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579156#comment-16579156
 ] 

ASF GitHub Bot commented on FLINK-6437:
---

yanghua commented on issue #6542: [FLINK-6437][History Server] Move history 
server configuration to a separate file
URL: https://github.com/apache/flink/pull/6542#issuecomment-412736043
 
 
   hi @StephanEwen @zentol ,
   
   Currently, the history server runs as a standalone JVM process, which means 
it is a separate component, so I think the configuration split is reasonable. 
If I only need to start the history server, I don't rely on other 
configurations and don't need to pay attention to them.
   
   Regarding the discussion of compatibility, I basically agree with @zentol's 
point of view.
   
   Based on the existing implementation, I think we can add more log warnings 
in the new method of `GlobalConfiguration`. The general idea is as follows:
   
   * Check for the new configuration file; if it does not exist, log a message 
informing the user that they are using the old configuration;
   * If both the new and the old history server configuration exist at the same 
time, warn the user that there are two configurations and that 
`flink-historyserver-conf.yaml` takes precedence;
   
   In addition, I will log a warning when the configuration file is loaded by 
the `HistoryServer#main` method, for example:
   
   > the history server configuration in `flink-conf.yaml` is kept only for 
backward compatibility; it is recommended to use the new configuration file 
`flink-historyserver-conf.yaml`.
   
   I will also comment out the configuration items in 
`flink-historyserver-conf.yaml`; given all this guidance, I believe users will 
be more likely to follow the correct path.
   
   A setup that relies only on the configuration already in `flink-conf.yaml` 
will still work with the current implementation. But if there are 
configurations in both files, we seem to have no better option than to warn 
that `flink-historyserver-conf.yaml` takes precedence.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Move history server configuration to a separate file
> 
>
> Key: FLINK-6437
> URL: https://issues.apache.org/jira/browse/FLINK-6437
> Project: Flink
>  Issue Type: Improvement
>  Components: History Server
>Affects Versions: 1.3.0
>Reporter: Stephan Ewen
>Assignee: vinoyang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>
> I suggest to keep the {{flink-conf.yaml}} leaner by moving configuration of 
> the History Server to a different file.
> In general, I would propose to move configurations of separate, independent 
> and optional components to individual config files.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] yanghua commented on issue #6542: [FLINK-6437][History Server] Move history server configuration to a separate file

2018-08-13 Thread GitBox
yanghua commented on issue #6542: [FLINK-6437][History Server] Move history 
server configuration to a separate file
URL: https://github.com/apache/flink/pull/6542#issuecomment-412736043
 
 
   hi @StephanEwen @zentol ,
   
   Currently, the history server runs as a standalone JVM process, which means 
it is a separate component, so I think the configuration split is reasonable. 
If I only need to start the history server, I don't rely on other 
configurations and don't need to pay attention to them.
   
   Regarding the discussion of compatibility, I basically agree with @zentol's 
point of view.
   
   Based on the existing implementation, I think we can add more log warnings 
in the new method of `GlobalConfiguration`. The general idea is as follows:
   
   * Check for the new configuration file; if it does not exist, log a message 
informing the user that they are using the old configuration;
   * If both the new and the old history server configuration exist at the same 
time, warn the user that there are two configurations and that 
`flink-historyserver-conf.yaml` takes precedence;
   
   In addition, I will log a warning when the configuration file is loaded by 
the `HistoryServer#main` method, for example:
   
   > the history server configuration in `flink-conf.yaml` is kept only for 
backward compatibility; it is recommended to use the new configuration file 
`flink-historyserver-conf.yaml`.
   
   I will also comment out the configuration items in 
`flink-historyserver-conf.yaml`; given all this guidance, I believe users will 
be more likely to follow the correct path.
   
   A setup that relies only on the configuration already in `flink-conf.yaml` 
will still work with the current implementation. But if there are 
configurations in both files, we seem to have no better option than to warn 
that `flink-historyserver-conf.yaml` takes precedence.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] xccui commented on issue #6535: [FLINK-9977] [table][doc] Refine the SQL/Table built-in function docs

2018-08-13 Thread GitBox
xccui commented on issue #6535: [FLINK-9977] [table][doc] Refine the SQL/Table 
built-in function docs
URL: https://github.com/apache/flink/pull/6535#issuecomment-412717935
 
 
   Thanks for the review, @fhueske. Will merge this.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (FLINK-9977) Refine the docs for Table/SQL built-in functions

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579094#comment-16579094
 ] 

ASF GitHub Bot commented on FLINK-9977:
---

xccui commented on issue #6535: [FLINK-9977] [table][doc] Refine the SQL/Table 
built-in function docs
URL: https://github.com/apache/flink/pull/6535#issuecomment-412717935
 
 
   Thanks for the review, @fhueske. Will merge this.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Refine the docs for Table/SQL built-in functions
> 
>
> Key: FLINK-9977
> URL: https://issues.apache.org/jira/browse/FLINK-9977
> Project: Flink
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Xingcan Cui
>Assignee: Xingcan Cui
>Priority: Minor
>  Labels: pull-request-available
> Attachments: Java.jpg, SQL.jpg, Scala.jpg
>
>
> There are some syntax errors and inconsistencies in the documentation and 
> Scala docs of the Table/SQL built-in functions. This issue aims to make some 
> improvements to them.
> Also, according to FLINK-10103, we should use single quotes to express 
> strings in SQL. For example, CONCAT("AA", "BB", "CC") should be replaced with 
> CONCAT('AA', 'BB', 'CC'). 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10135) The JobManager doesn't report the cluster-level metrics

2018-08-13 Thread Joey Echeverria (JIRA)
Joey Echeverria created FLINK-10135:
---

 Summary: The JobManager doesn't report the cluster-level metrics
 Key: FLINK-10135
 URL: https://issues.apache.org/jira/browse/FLINK-10135
 Project: Flink
  Issue Type: Bug
  Components: JobManager
Affects Versions: 1.5.0
Reporter: Joey Echeverria


In [the documentation for 
metrics|https://ci.apache.org/projects/flink/flink-docs-release-1.5/monitoring/metrics.html#cluster]
 in the Flink 1.5.0 release, it says that the following metrics are reported by 
the JobManager:
{noformat}
numRegisteredTaskManagers
numRunningJobs
taskSlotsAvailable
taskSlotsTotal
{noformat}

In the job manager REST endpoint 
({{http://:8081/jobmanager/metrics}}), those metrics don't appear.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-10134) UTF-16 support for TextInputFormat

2018-08-13 Thread David Dreyfus (JIRA)
David Dreyfus created FLINK-10134:
-

 Summary: UTF-16 support for TextInputFormat
 Key: FLINK-10134
 URL: https://issues.apache.org/jira/browse/FLINK-10134
 Project: Flink
  Issue Type: Bug
  Components: Core
Affects Versions: 1.4.2
Reporter: David Dreyfus


It does not appear that Flink supports a charset encoding of "UTF-16". In 
particular, it doesn't appear that Flink consumes the Byte Order Mark (BOM) to 
establish whether a UTF-16 file is UTF-16LE or UTF-16BE.
 
TextInputFormat.setCharset("UTF-16") calls DelimitedInputFormat.setCharset(), 
which sets TextInputFormat.charsetName and then modifies the previously set 
delimiterString to construct the proper byte string encoding of the delimiter. 
This same charsetName is also used in TextInputFormat.readRecord() to interpret 
the bytes read from the file.
 
There are two problems that this implementation would seem to have when using 
UTF-16.
 # delimiterString.getBytes(getCharset()) in DelimitedInputFormat.java will 
return a Big Endian byte sequence including the Byte Order Mark (BOM). The 
actual text file will not contain a BOM at each line ending, so the delimiter 
will never be read. Moreover, if the actual byte encoding of the file is Little 
Endian, the bytes will be interpreted incorrectly.
 # TextInputFormat.readRecord() will not see a BOM each time it decodes a byte 
sequence with the String(bytes, offset, numBytes, charset) call. Therefore, it 
will assume Big Endian, which may not always be correct. [1] 
[https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/io/TextInputFormat.java#L95]

 
While there are likely many solutions, I would think that all of them would 
have to start by reading the BOM from the file when a Split is opened and then 
using that BOM to pick a byte-order-specific encoding when the caller doesn't 
specify one, or to override the caller's specification if the BOM conflicts 
with it. That is, if the BOM indicates Little Endian and the caller indicates 
UTF-16BE, Flink should rewrite the charsetName as UTF-16LE.
 I hope this makes sense and that I haven't been testing incorrectly or 
misreading the code.
 
I've verified the problem on version 1.4.2. I believe the problem exists on all 
versions. 
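
A minimal sketch of the BOM-based charset resolution described above; it only illustrates 
the idea and is not a proposed patch (the stream must be constructed with a pushback 
buffer of at least two bytes):

{code:java}
import java.io.IOException;
import java.io.PushbackInputStream;

public final class Utf16BomSniffer {

    // Returns "UTF-16LE" or "UTF-16BE" based on the BOM, consuming it; if no BOM is
    // present, pushes the bytes back and returns the caller's requested charset.
    static String resolveCharset(PushbackInputStream in, String requestedCharset) throws IOException {
        byte[] bom = new byte[2];
        int read = in.read(bom);
        if (read == 2 && (bom[0] & 0xFF) == 0xFF && (bom[1] & 0xFF) == 0xFE) {
            return "UTF-16LE"; // FF FE: little endian, overrides the caller's choice
        }
        if (read == 2 && (bom[0] & 0xFF) == 0xFE && (bom[1] & 0xFF) == 0xFF) {
            return "UTF-16BE"; // FE FF: big endian, overrides the caller's choice
        }
        if (read > 0) {
            in.unread(bom, 0, read); // no BOM: keep the bytes and the requested charset
        }
        return requestedCharset;
    }
}
{code}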



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10127) Add TypeInformation and serializers for JDK8 Instant

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578935#comment-16578935
 ] 

ASF GitHub Bot commented on FLINK-10127:


alexeyt820 opened a new pull request #6549: [FLINK-10127] Add Instant to basic 
types
URL: https://github.com/apache/flink/pull/6549
 
 
   ## What is the purpose of the change
   This pull request adds the JDK8 Instant type as a basic type to the Flink type system
   ## Brief change log
 - *InstantSerializer* added 
 - *InstantComparator* added
 - *BasicTypeInfo* modified to include *INSTANT_TYPE_INFO*
 - "Types" modified to include *INSTANT*
   
   ## Verifying this change
   
   This change added tests and can be verified as follows:
   - *Added unit tests for InstantSerializer and InstantComparator*
   - *Modified BasicTypeInfoTest to include Instant*
   
   ## Does this pull request potentially affect one of the following parts:
   
 - Dependencies (does it add or upgrade a dependency): no
 - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: yes
 - The serializers: yes
 - The runtime per-record code paths (performance sensitive):  don't know
 - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: don't know
 - The S3 file system connector: no
   
   ## Documentation
   
 - Does this pull request introduce a new feature? no
 - 
https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/types_serialization.html#flinks-typeinformation-class
 should be modified to include Instant  
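
A sketch of one possible encoding for Instant on Flink's DataOutputView/DataInputView 
(epoch seconds plus nanos); the actual InstantSerializer in this PR may differ, and the 
full TypeSerializer contract is intentionally omitted:

{code:java}
import java.io.IOException;
import java.time.Instant;

import org.apache.flink.core.memory.DataInputView;
import org.apache.flink.core.memory.DataOutputView;

public final class InstantCodecSketch {

    static void write(Instant value, DataOutputView target) throws IOException {
        target.writeLong(value.getEpochSecond()); // seconds since the epoch
        target.writeInt(value.getNano());         // nanosecond adjustment within the second
    }

    static Instant read(DataInputView source) throws IOException {
        long seconds = source.readLong();
        int nanos = source.readInt();
        return Instant.ofEpochSecond(seconds, nanos);
    }
}
{code}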
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add TypeInformation  and serializers for JDK8 Instant
> -
>
> Key: FLINK-10127
> URL: https://issues.apache.org/jira/browse/FLINK-10127
> Project: Flink
>  Issue Type: Improvement
>Reporter: Alexey Trenikhin
>Priority: Minor
>  Labels: pull-request-available
>
> Currently, Flink's basic types include all Java primitives and their boxed 
> forms, plus {{void}}, {{String}}, {{Date}}, {{BigDecimal}}, and 
> {{BigInteger}}. The new JDK8 {{Instant}} type should be added as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-10127) Add TypeInformation and serializers for JDK8 Instant

2018-08-13 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-10127:
---
Labels: pull-request-available  (was: )

> Add TypeInformation  and serializers for JDK8 Instant
> -
>
> Key: FLINK-10127
> URL: https://issues.apache.org/jira/browse/FLINK-10127
> Project: Flink
>  Issue Type: Improvement
>Reporter: Alexey Trenikhin
>Priority: Minor
>  Labels: pull-request-available
>
> Currently, Flink's basic types include all Java primitives and their boxed 
> forms, plus {{void}}, {{String}}, {{Date}}, {{BigDecimal}}, and 
> {{BigInteger}}. The new JDK8 {{Instant}} type should be added as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] alexeyt820 opened a new pull request #6549: [FLINK-10127] Add Instant to basic types

2018-08-13 Thread GitBox
alexeyt820 opened a new pull request #6549: [FLINK-10127] Add Instant to basic 
types
URL: https://github.com/apache/flink/pull/6549
 
 
   ## What is the purpose of the change
   This pull request adds the JDK8 Instant type as a basic type to the Flink type system
   ## Brief change log
 - *InstantSerializer* added 
 - *InstantComparator* added
 - *BasicTypeInfo* modified to include *INSTANT_TYPE_INFO*
 - "Types" modified to include *INSTANT*
   
   ## Verifying this change
   
   This change added tests and can be verified as follows:
   - *Added unit tests for InstantSerializer and InstantComparator*
   - *Modified BasicTypeInfoTest to include Instant*
   
   ## Does this pull request potentially affect one of the following parts:
   
 - Dependencies (does it add or upgrade a dependency): no
 - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: yes
 - The serializers: yes
 - The runtime per-record code paths (performance sensitive):  don't know
 - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: don't know
 - The S3 file system connector: no
   
   ## Documentation
   
 - Does this pull request introduce a new feature? no
 - 
https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/types_serialization.html#flinks-typeinformation-class
 should be modified to include Instant  
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Comment Edited] (FLINK-10019) Fix Composite getResultType of UDF cannot be chained with other operators

2018-08-13 Thread Rong Rong (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578793#comment-16578793
 ] 

Rong Rong edited comment on FLINK-10019 at 8/13/18 7:00 PM:


Dug a little deeper in Calcite and found that in:
https://github.com/apache/calcite/blob/master/core/src/main/java/org/apache/calcite/sql/type/InferTypes.java#L68

It seems that if the return type is inferred as a record, i.e. {{isStruct()}}, "it 
must have the same number of fields as the number of operands", which is clearly 
not the case here, since the expression **{{AS(func(a), "myRow")}}** only passes 
**{{func(a)}}** over for type inference, but not the alias **{{"myRow"}}**.


was (Author: walterddr):
Dug a little deeper in Calcite and found that in:
https://github.com/apache/calcite/blob/master/core/src/main/java/org/apache/calcite/sql/type/InferTypes.java#L68

Seems like if the return type is inferred as a record, or {{isStruct()}}, "it 
must have the same number of fields as the number of operands." which clearly 
is not the case here since the following expression: {{AS(func(a), "myRow")}} 
only passes over the {{func(a)}} for type inference, but not the alias 
{{"myRow"}}

> Fix Composite getResultType of UDF cannot be chained with other operators
> -
>
> Key: FLINK-10019
> URL: https://issues.apache.org/jira/browse/FLINK-10019
> Project: Flink
>  Issue Type: Bug
>Reporter: Rong Rong
>Assignee: Rong Rong
>Priority: Major
>
> If a CompositeType is explicitly returned from {{udf.getResultType}}, it will 
> result in failures in chained operators.
> For example: consider a simple UDF,
> {code:scala}
> object Func extends ScalarFunction {
>   def eval(row: Row): Row = {
> row
>   }
>   override def getParameterTypes(signature: Array[Class[_]]): 
> Array[TypeInformation[_]] =
> Array(Types.ROW(Types.INT))
>   override def getResultType(signature: Array[Class[_]]): TypeInformation[_] =
> Types.ROW(Types.INT)
> }
> {code}
> This should work perfectly since it's just a simple pass through, however
> {code:scala}
>   @Test
>   def testRowType(): Unit = {
> val data = List(
>   Row.of(Row.of(12.asInstanceOf[Integer]), "1")
> )
> val env = StreamExecutionEnvironment.getExecutionEnvironment
> val stream = env.fromCollection(data)(Types.ROW(Types.ROW(Types.INT), 
> Types.STRING))
> val tEnv = TableEnvironment.getTableEnvironment(env)
> val table = stream.toTable(tEnv, 'a, 'b)
> tEnv.registerFunction("func", Func)
> tEnv.registerTable("t", table)
> // This works perfectly
> val result1 = tEnv.sqlQuery("SELECT func(a) FROM t").toAppendStream[Row]
> result1.addSink(new StreamITCase.StringSink[Row])
> // This throws exception
> val result2 = tEnv.sqlQuery("SELECT func(a) as myRow FROM 
> t").toAppendStream[Row]
> result2.addSink(new StreamITCase.StringSink[Row])
> env.execute()
>   }
> {code}
> Exception code:
> {code:java}
> java.lang.IndexOutOfBoundsException: index (1) must be less than size (1)
>   at 
> com.google.common.base.Preconditions.checkElementIndex(Preconditions.java:310)
>   at 
> com.google.common.base.Preconditions.checkElementIndex(Preconditions.java:293)
>   at 
> com.google.common.collect.SingletonImmutableList.get(SingletonImmutableList.java:41)
>   at 
> org.apache.calcite.sql.type.InferTypes$2.inferOperandTypes(InferTypes.java:83)
>   at 
> org.apache.calcite.sql.validate.SqlValidatorImpl.inferUnknownTypes(SqlValidatorImpl.java:1777)
>   at 
> org.apache.calcite.sql.validate.SqlValidatorImpl.expandSelectItem(SqlValidatorImpl.java:459)
>   at 
> org.apache.calcite.sql.validate.SqlValidatorImpl.expandStar(SqlValidatorImpl.java:349)
> ...
> {code}
> This is due to the fact that Calcite inferOperandTypes does not expect to 
> infer a struct RelDataType.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10019) Fix Composite getResultType of UDF cannot be chained with other operators

2018-08-13 Thread Rong Rong (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578793#comment-16578793
 ] 

Rong Rong commented on FLINK-10019:
---

Dug a little deeper in Calcite and found that in:
https://github.com/apache/calcite/blob/master/core/src/main/java/org/apache/calcite/sql/type/InferTypes.java#L68

It seems that if the return type is inferred as a record, i.e. {{isStruct()}}, "it 
must have the same number of fields as the number of operands", which is clearly 
not the case here, since the expression {{AS(func(a), "myRow")}} only passes 
{{func(a)}} over for type inference, but not the alias {{"myRow"}}.

> Fix Composite getResultType of UDF cannot be chained with other operators
> -
>
> Key: FLINK-10019
> URL: https://issues.apache.org/jira/browse/FLINK-10019
> Project: Flink
>  Issue Type: Bug
>Reporter: Rong Rong
>Assignee: Rong Rong
>Priority: Major
>
> If a CompositeType is explicitly returned from {{udf.getResultType}}, it will 
> result in failures in chained operators.
> For example: consider a simple UDF,
> {code:scala}
> object Func extends ScalarFunction {
>   def eval(row: Row): Row = {
> row
>   }
>   override def getParameterTypes(signature: Array[Class[_]]): 
> Array[TypeInformation[_]] =
> Array(Types.ROW(Types.INT))
>   override def getResultType(signature: Array[Class[_]]): TypeInformation[_] =
> Types.ROW(Types.INT)
> }
> {code}
> This should work perfectly since it's just a simple pass through, however
> {code:scala}
>   @Test
>   def testRowType(): Unit = {
> val data = List(
>   Row.of(Row.of(12.asInstanceOf[Integer]), "1")
> )
> val env = StreamExecutionEnvironment.getExecutionEnvironment
> val stream = env.fromCollection(data)(Types.ROW(Types.ROW(Types.INT), 
> Types.STRING))
> val tEnv = TableEnvironment.getTableEnvironment(env)
> val table = stream.toTable(tEnv, 'a, 'b)
> tEnv.registerFunction("func", Func)
> tEnv.registerTable("t", table)
> // This works perfectly
> val result1 = tEnv.sqlQuery("SELECT func(a) FROM t").toAppendStream[Row]
> result1.addSink(new StreamITCase.StringSink[Row])
> // This throws exception
> val result2 = tEnv.sqlQuery("SELECT func(a) as myRow FROM 
> t").toAppendStream[Row]
> result2.addSink(new StreamITCase.StringSink[Row])
> env.execute()
>   }
> {code}
> Exception code:
> {code:java}
> java.lang.IndexOutOfBoundsException: index (1) must be less than size (1)
>   at 
> com.google.common.base.Preconditions.checkElementIndex(Preconditions.java:310)
>   at 
> com.google.common.base.Preconditions.checkElementIndex(Preconditions.java:293)
>   at 
> com.google.common.collect.SingletonImmutableList.get(SingletonImmutableList.java:41)
>   at 
> org.apache.calcite.sql.type.InferTypes$2.inferOperandTypes(InferTypes.java:83)
>   at 
> org.apache.calcite.sql.validate.SqlValidatorImpl.inferUnknownTypes(SqlValidatorImpl.java:1777)
>   at 
> org.apache.calcite.sql.validate.SqlValidatorImpl.expandSelectItem(SqlValidatorImpl.java:459)
>   at 
> org.apache.calcite.sql.validate.SqlValidatorImpl.expandStar(SqlValidatorImpl.java:349)
> ...
> {code}
> This is due to the fact that Calcite inferOperandTypes does not expect to 
> infer a struct RelDataType.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10133) finished job's jobgraph never been cleaned up in zookeeper for standalone clusters (HA mode with multiple masters)

2018-08-13 Thread Elias Levy (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578792#comment-16578792
 ] 

Elias Levy commented on FLINK-10133:


[~Frefreak] this is likely the same issue as FLINK-10011.  If so, mark this one 
as a duplicate.

> finished job's jobgraph never been cleaned up in zookeeper for standalone 
> clusters (HA mode with multiple masters)
> --
>
> Key: FLINK-10133
> URL: https://issues.apache.org/jira/browse/FLINK-10133
> Project: Flink
>  Issue Type: Bug
>  Components: JobManager
>Affects Versions: 1.5.0, 1.5.2, 1.6.0
>Reporter: Xiangyu Zhu
>Priority: Major
>
> Hi,
> We have 3 servers in our test environment, noted as node1-3. Setup is as 
> following:
>  * hadoop hdfs: node1 as namenode, node2,3 as datanode
>  * zookeeper: node1-3 as a quorum (but also tried node1 alone)
>  * flink: node1,2 as masters, node2,3 as slaves
> As I understand it, when a job finishes, the corresponding job's blob data is 
> expected to be deleted from the HDFS path, and the node under ZooKeeper's path 
> `/\{zk path root}/\{cluster-id}/jobgraphs/\{job id}` should be deleted after 
> that. However, we observe that whenever we submit a job and it finishes (via 
> `bin/flink run WordCount.jar`), the blob data is gone whereas the job id node 
> under ZooKeeper is still there, with a UUID-style lock node inside it. In 
> ZooKeeper's debug output we observed something like "cannot be deleted because 
> non empty". Because of this, as long as a job has finished and its jobgraph 
> node persists, restarting the cluster or killing one job manager (to test HA 
> mode) makes Flink try to recover the finished job, fail to find the blob data 
> under HDFS, and bring the whole cluster down.
> If we use only node1 as master and node2,3 as slaves, the jobgraphs node is 
> deleted successfully. When the jobgraphs path is clean, killing one job 
> manager promotes another standby JM to leader, so it is only this jobgraphs 
> issue that prevents HA from working.
> I'm not sure whether something is wrong with our configs, because this happens 
> every time for a finished job (we have only tested with WordCount.jar though). 
> I'm aware of FLINK-10011 and FLINK-10029, but unlike FLINK-10011 this happens 
> every time, rendering HA mode unusable for us.
> Any idea what might cause this?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9999) Add ISNUMERIC supported in Table API/SQL

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578781#comment-16578781
 ] 

ASF GitHub Bot commented on FLINK-:
---

walterddr commented on a change in pull request #6473: [FLINK-] [table] Add 
ISNUMERIC supported in Table API/SQL
URL: https://github.com/apache/flink/pull/6473#discussion_r209719307
 
 

 ##
 File path: 
flink-libraries/flink-table/src/test/scala/org/apache/flink/table/expressions/ScalarFunctionsTest.scala
 ##
 @@ -450,6 +450,63 @@ class ScalarFunctionsTest extends ScalarTypesTestBase {
   "")
   }
 
+  @Test
+  def testIsNumeric(): Unit = {
 
 Review comment:
   Originally what I meant is, since this function only supports 
string/varchar, let's have a test that specifies `ISNUMERIC(1L)` throws 
`ValidationException`. 
   
   Regarding the usage of this in general, I think this is useful when chained 
with many other operators that have strict type constraints, such as `CASE WHEN 
ISNUMERIC(...) THEN ... ELSE ...`, where the `THEN` clause requires strict 
numeric values. That's why I was wondering if we should provide better support 
beyond just the STRING/VARCHAR type. 
   
   I will comment on the JIRA. Thanks for bringing this up. 
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add ISNUMERIC supported in Table API/SQL
> 
>
> Key: FLINK-
> URL: https://issues.apache.org/jira/browse/FLINK-
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table API  SQL
>Reporter: vinoyang
>Assignee: vinoyang
>Priority: Major
>  Labels: pull-request-available
>
> The ISNUMERIC function is used to verify that an expression is a valid numeric type.
> documentation: 
> https://docs.microsoft.com/en-us/sql/t-sql/functions/isnumeric-transact-sql?view=sql-server-2017



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] walterddr commented on a change in pull request #6473: [FLINK-9999] [table] Add ISNUMERIC supported in Table API/SQL

2018-08-13 Thread GitBox
walterddr commented on a change in pull request #6473: [FLINK-] [table] Add 
ISNUMERIC supported in Table API/SQL
URL: https://github.com/apache/flink/pull/6473#discussion_r209719307
 
 

 ##
 File path: 
flink-libraries/flink-table/src/test/scala/org/apache/flink/table/expressions/ScalarFunctionsTest.scala
 ##
 @@ -450,6 +450,63 @@ class ScalarFunctionsTest extends ScalarTypesTestBase {
   "")
   }
 
+  @Test
+  def testIsNumeric(): Unit = {
 
 Review comment:
   Originally what I meant is, since this function only supports 
string/varchar, let's have a test that specifies `ISNUMERIC(1L)` throws 
`ValidationException`. 
   
   Regarding the usage of this in general, I think this is useful when chained 
with many other operators that have strict type constraints, such as `CASE WHEN 
ISNUMERIC(...) THEN ... ELSE ...`, where the `THEN` clause requires strict 
numeric values. That's why I was wondering if we should provide better support 
beyond just the STRING/VARCHAR type. 
   
   I will comment on the JIRA. Thanks for bringing this up. 
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (FLINK-6437) Move history server configuration to a separate file

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-6437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578778#comment-16578778
 ] 

ASF GitHub Bot commented on FLINK-6437:
---

zentol commented on issue #6542: [FLINK-6437][History Server] Move history 
server configuration to a separate file
URL: https://github.com/apache/flink/pull/6542#issuecomment-412622673
 
 
   Well, I still like the idea of separate config files, but the JIRA 
discussion happened more than a year ago, _before we had even released the 
HistoryServer_. Now we have to think about backwards compatibility and will 
thus naturally end up adding complexity. I'm not sure this is really worth it, 
especially since this issue has never been raised again since the HS was 
released.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Move history server configuration to a separate file
> 
>
> Key: FLINK-6437
> URL: https://issues.apache.org/jira/browse/FLINK-6437
> Project: Flink
>  Issue Type: Improvement
>  Components: History Server
>Affects Versions: 1.3.0
>Reporter: Stephan Ewen
>Assignee: vinoyang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>
> I suggest to keep the {{flink-conf.yaml}} leaner by moving configuration of 
> the History Server to a different file.
> In general, I would propose to move configurations of separate, independent 
> and optional components to individual config files.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] zentol commented on issue #6542: [FLINK-6437][History Server] Move history server configuration to a separate file

2018-08-13 Thread GitBox
zentol commented on issue #6542: [FLINK-6437][History Server] Move history 
server configuration to a separate file
URL: https://github.com/apache/flink/pull/6542#issuecomment-412622673
 
 
   Well, I still like the idea of separate config files, but the JIRA 
discussion happened more than a year ago, _before we had even released the 
HistoryServer_. Now we have to think about backwards compatibility and will 
thus naturally end up adding complexity. I'm not sure this is really worth it, 
especially since this issue has never been raised again since the HS was 
released.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (FLINK-6437) Move history server configuration to a separate file

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-6437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578759#comment-16578759
 ] 

ASF GitHub Bot commented on FLINK-6437:
---

StephanEwen commented on issue #6542: [FLINK-6437][History Server] Move history 
server configuration to a separate file
URL: https://github.com/apache/flink/pull/6542#issuecomment-412618428
 
 
   Sure, so the compatibility story needs a more detailed design, and that 
needs to be discussed before looking at concrete code. Agreed.
   
   On whether we want this feature or not - the discussion in Jira was in favor 
of doing this. Curious what is the reasoning for the push back now, @zentol ?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Move history server configuration to a separate file
> 
>
> Key: FLINK-6437
> URL: https://issues.apache.org/jira/browse/FLINK-6437
> Project: Flink
>  Issue Type: Improvement
>  Components: History Server
>Affects Versions: 1.3.0
>Reporter: Stephan Ewen
>Assignee: vinoyang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>
> I suggest to keep the {{flink-conf.yaml}} leaner by moving configuration of 
> the History Server to a different file.
> In general, I would propose to move configurations of separate, independent 
> and optional components to individual config files.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] bowenli86 commented on a change in pull request #6544: [FLINK-8532] [Streaming] modify RebalancePartitioner to use a random partition as its first partition

2018-08-13 Thread GitBox
bowenli86 commented on a change in pull request #6544: [FLINK-8532] [Streaming] 
modify RebalancePartitioner to use a random partition as its first partition
URL: https://github.com/apache/flink/pull/6544#discussion_r209712274
 
 

 ##
 File path: 
flink-tests/src/test/java/org/apache/flink/test/streaming/runtime/PartitionerITCase.java
 ##
 @@ -228,9 +229,29 @@ private static void 
verifyRebalancePartitioning(List> re
new Tuple2(2, "c"),
new Tuple2(0, "a"));
 
-   assertEquals(
-   new HashSet>(expected),
-   new HashSet>(rebalancePartitionResult));
+   List> expected1 = Arrays.asList(
+   new Tuple2(1, "a"),
+   new Tuple2(2, "b"),
+   new Tuple2(0, "b"),
+   new Tuple2(1, "a"),
+   new Tuple2(2, "a"),
+   new Tuple2(0, "c"),
+   new Tuple2(1, "a"));
+
+   List> expected2 = Arrays.asList(
+   new Tuple2(2, "a"),
+   new Tuple2(0, "b"),
+   new Tuple2(1, "b"),
+   new Tuple2(2, "a"),
+   new Tuple2(0, "a"),
+   new Tuple2(1, "c"),
+   new Tuple2(2, "a"));
+
+   assertTrue(
 
 Review comment:
   to be more robust for people to modify `expectedX`, shall this assert that 
exactly one case match and the other two unmatch? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (FLINK-8532) RebalancePartitioner should use Random value for its first partition

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578760#comment-16578760
 ] 

ASF GitHub Bot commented on FLINK-8532:
---

bowenli86 commented on a change in pull request #6544: [FLINK-8532] [Streaming] 
modify RebalancePartitioner to use a random partition as its first partition
URL: https://github.com/apache/flink/pull/6544#discussion_r209712274
 
 

 ##
 File path: 
flink-tests/src/test/java/org/apache/flink/test/streaming/runtime/PartitionerITCase.java
 ##
 @@ -228,9 +229,29 @@ private static void 
verifyRebalancePartitioning(List> re
new Tuple2(2, "c"),
new Tuple2(0, "a"));
 
-   assertEquals(
-   new HashSet>(expected),
-   new HashSet>(rebalancePartitionResult));
+   List> expected1 = Arrays.asList(
+   new Tuple2(1, "a"),
+   new Tuple2(2, "b"),
+   new Tuple2(0, "b"),
+   new Tuple2(1, "a"),
+   new Tuple2(2, "a"),
+   new Tuple2(0, "c"),
+   new Tuple2(1, "a"));
+
+   List> expected2 = Arrays.asList(
+   new Tuple2(2, "a"),
+   new Tuple2(0, "b"),
+   new Tuple2(1, "b"),
+   new Tuple2(2, "a"),
+   new Tuple2(0, "a"),
+   new Tuple2(1, "c"),
+   new Tuple2(2, "a"));
+
+   assertTrue(
 
 Review comment:
   to be more robust for people to modify `expectedX`, shall this assert that 
exactly one case match and the other two unmatch? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> RebalancePartitioner should use Random value for its first partition
> 
>
> Key: FLINK-8532
> URL: https://issues.apache.org/jira/browse/FLINK-8532
> Project: Flink
>  Issue Type: Improvement
>  Components: DataStream API
>Reporter: Yuta Morisawa
>Priority: Minor
>  Labels: pull-request-available
>
> In some conditions, RebalancePartitioner doesn't balance data correctly 
> because it uses the same value for selecting the next operator.
> RebalancePartitioner initializes its partition id with the same value in 
> every thread, so it does balance data overall, but at any given moment the 
> amount of data in each operator is skewed.
> Particularly, when the data rates of the upstream operators are equal, the 
> skew becomes severe.
>  
> Example:
> Consider a simple operator chain.
> -> map1 -> rebalance -> map2 ->
> Each map operator (map1, map2) contains three subtasks (subtasks 1-3 and 4-6 
> respectively).
> map1          map2
>  st1              st4
>  st2              st5
>  st3              st6
>  
> At the beginning, every subtask in map1 sends data to st4 in map2 because 
> they all use the same initial partition id.
> The next time map1 receives data, st1, st2, and st3 send it to st5, because 
> each increments its partition id after processing the previous record.
> In my environment, processing takes twice as long with RebalancePartitioner 
> as with other partitioners (rescale, keyBy).
>  
> To solve this problem, in my opinion, RebalancePartitioner should use its own 
> operator id for the initial value.
>  
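
A standalone sketch of the behavior discussed above: round-robin channel selection that 
starts at a random channel instead of a fixed one, so parallel senders do not all hit the 
same downstream subtask first. The class and method names are made up for illustration; 
this is not Flink's RebalancePartitioner implementation:

{code:java}
import java.util.concurrent.ThreadLocalRandom;

public final class RandomStartRoundRobin {

    private int nextChannel = -1; // lazily initialized with a random starting channel

    int selectChannel(int numberOfChannels) {
        if (nextChannel < 0) {
            nextChannel = ThreadLocalRandom.current().nextInt(numberOfChannels);
        } else {
            nextChannel = (nextChannel + 1) % numberOfChannels;
        }
        return nextChannel;
    }
}
{code}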



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-8532) RebalancePartitioner should use Random value for its first partition

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578757#comment-16578757
 ] 

ASF GitHub Bot commented on FLINK-8532:
---

bowenli86 commented on a change in pull request #6544: [FLINK-8532] [Streaming] 
modify RebalancePartitioner to use a random partition as its first partition
URL: https://github.com/apache/flink/pull/6544#discussion_r209712274
 
 

 ##
 File path: 
flink-tests/src/test/java/org/apache/flink/test/streaming/runtime/PartitionerITCase.java
 ##
 @@ -228,9 +229,29 @@ private static void 
verifyRebalancePartitioning(List> re
new Tuple2(2, "c"),
new Tuple2(0, "a"));
 
-   assertEquals(
-   new HashSet>(expected),
-   new HashSet>(rebalancePartitionResult));
+   List> expected1 = Arrays.asList(
+   new Tuple2(1, "a"),
+   new Tuple2(2, "b"),
+   new Tuple2(0, "b"),
+   new Tuple2(1, "a"),
+   new Tuple2(2, "a"),
+   new Tuple2(0, "c"),
+   new Tuple2(1, "a"));
+
+   List> expected2 = Arrays.asList(
+   new Tuple2(2, "a"),
+   new Tuple2(0, "b"),
+   new Tuple2(1, "b"),
+   new Tuple2(2, "a"),
+   new Tuple2(0, "a"),
+   new Tuple2(1, "c"),
+   new Tuple2(2, "a"));
+
+   assertTrue(
 
 Review comment:
   To be more robust, shall this assert that exactly one case matches and the 
other two do not? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> RebalancePartitioner should use Random value for its first partition
> 
>
> Key: FLINK-8532
> URL: https://issues.apache.org/jira/browse/FLINK-8532
> Project: Flink
>  Issue Type: Improvement
>  Components: DataStream API
>Reporter: Yuta Morisawa
>Priority: Minor
>  Labels: pull-request-available
>
> In some conditions, RebalancePartitioner doesn't balance data correctly 
> because it uses the same value for selecting the next operator.
> RebalancePartitioner initializes its partition id with the same value in 
> every thread, so it does balance data overall, but at any given moment the 
> amount of data in each operator is skewed.
> The skew becomes particularly severe when the data rates of the upstream 
> operators are equal.
>  
>  
> Example:
> Consider a simple operator chain.
> -> map1 -> rebalance -> map2 ->
> Each map operator (map1, map2) contains three subtasks (subtasks 1-6 in 
> total).
> map1          map2
>  st1              st4
>  st2              st5
>  st3              st6
>  
> At the beginning, every subtask in map1 sends data to st4 in map2 because 
> they all use the same initial partition id.
> When map1 receives the next records, st1, st2, and st3 all send them to st5 
> because each incremented its partition id after processing the previous 
> record.
> In my environment, processing takes twice as long with RebalancePartitioner 
> as with other partitioners (rescale, keyBy).
>  
> To solve this problem, in my opinion, RebalancePartitioner should use its own 
> operator id for the initial value.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] StephanEwen commented on issue #6542: [FLINK-6437][History Server] Move history server configuration to a separate file

2018-08-13 Thread GitBox
StephanEwen commented on issue #6542: [FLINK-6437][History Server] Move history 
server configuration to a separate file
URL: https://github.com/apache/flink/pull/6542#issuecomment-412618428
 
 
   Sure, so the compatibility story needs a more detailed design, and that 
needs to be discussed before looking at concrete code. Agreed.
   
   On whether we want this feature or not - the discussion in Jira was in favor 
of doing this. Curious what the reasoning for the pushback is now, @zentol?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] bowenli86 commented on a change in pull request #6544: [FLINK-8532] [Streaming] modify RebalancePartitioner to use a random partition as its first partition

2018-08-13 Thread GitBox
bowenli86 commented on a change in pull request #6544: [FLINK-8532] [Streaming] 
modify RebalancePartitioner to use a random partition as its first partition
URL: https://github.com/apache/flink/pull/6544#discussion_r209712274
 
 

 ##
 File path: 
flink-tests/src/test/java/org/apache/flink/test/streaming/runtime/PartitionerITCase.java
 ##
 @@ -228,9 +229,29 @@ private static void 
verifyRebalancePartitioning(List> re
new Tuple2(2, "c"),
new Tuple2(0, "a"));
 
-   assertEquals(
-   new HashSet>(expected),
-   new HashSet>(rebalancePartitionResult));
+   List> expected1 = Arrays.asList(
+   new Tuple2(1, "a"),
+   new Tuple2(2, "b"),
+   new Tuple2(0, "b"),
+   new Tuple2(1, "a"),
+   new Tuple2(2, "a"),
+   new Tuple2(0, "c"),
+   new Tuple2(1, "a"));
+
+   List> expected2 = Arrays.asList(
+   new Tuple2(2, "a"),
+   new Tuple2(0, "b"),
+   new Tuple2(1, "b"),
+   new Tuple2(2, "a"),
+   new Tuple2(0, "a"),
+   new Tuple2(1, "c"),
+   new Tuple2(2, "a"));
+
+   assertTrue(
 
 Review comment:
   To be more robust, shall this assert that exactly one case matches and the 
other two do not? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Closed] (FLINK-10066) Keep only archived version of previous executions

2018-08-13 Thread Stefan Richter (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefan Richter closed FLINK-10066.
--
Resolution: Fixed

Merged in:
master: 160dc56fdf
release-1.6: 74323d50b0
release-1.5: 2217c09c88

> Keep only archived version of previous executions
> -
>
> Key: FLINK-10066
> URL: https://issues.apache.org/jira/browse/FLINK-10066
> Project: Flink
>  Issue Type: Improvement
>  Components: JobManager
>Affects Versions: 1.4.3, 1.5.2, 1.6.0
>Reporter: Stefan Richter
>Assignee: Stefan Richter
>Priority: Major
>  Labels: pull-request-available
>
> Currently, the execution vertex stores a limited amount of previous 
> executions in a bounded list. This happens primarily for archiving purposes 
> and to remember previous locations and allocation ids. We remember the whole 
> execution to eventually convert it into an archived execution.
> This seems unnecessary and dangerous as we have observed that this strategy 
> is prone to memory leaks in the job manager. With a very high vertex count or 
> parallelism, remembering complete executions can become very memory 
> intensive. Instead I suggest to eagerly transform the executions into the 
> archived version before adding them to the list, i.e. only the archived 
> version is ever still referenced after the execution becomes obsolete. This 
> gives better control over which information about the execution should really 
> be kept in memory.
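A rough sketch of that idea, keeping only an eagerly archived copy of each 
finished attempt in a bounded history; the class and method names below are 
illustrative, not Flink's actual code:

import java.util.ArrayDeque;
import java.util.function.Function;

public class BoundedArchivedHistory<E, A> {

	private final int maxSize;
	private final Function<E, A> archiver;
	private final ArrayDeque<A> history = new ArrayDeque<>();

	BoundedArchivedHistory(int maxSize, Function<E, A> archiver) {
		this.maxSize = maxSize;
		this.archiver = archiver;
	}

	// archive eagerly, so the bounded history never holds a reference
	// to the full (memory-heavy) execution object
	void add(E execution) {
		if (history.size() == maxSize) {
			history.removeFirst(); // drop the oldest archived entry
		}
		history.addLast(archiver.apply(execution));
	}
}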



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10066) Keep only archived version of previous executions

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578712#comment-16578712
 ] 

ASF GitHub Bot commented on FLINK-10066:


asfgit closed pull request #6500: [FLINK-10066] Keep only archived version of 
previous executions
URL: https://github.com/apache/flink/pull/6500
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git 
a/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ArchivedExecution.java
 
b/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ArchivedExecution.java
index 4b1c62fe707..ab8c94c0188 100644
--- 
a/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ArchivedExecution.java
+++ 
b/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ArchivedExecution.java
@@ -18,6 +18,7 @@
 package org.apache.flink.runtime.executiongraph;
 
 import org.apache.flink.runtime.accumulators.StringifiedAccumulatorResult;
+import org.apache.flink.runtime.clusterframework.types.AllocationID;
 import org.apache.flink.runtime.execution.ExecutionState;
 import org.apache.flink.runtime.taskmanager.TaskManagerLocation;
 import org.apache.flink.util.ExceptionUtils;
@@ -40,6 +41,8 @@
 
private final TaskManagerLocation assignedResourceLocation; // for the 
archived execution
 
+   private final AllocationID assignedAllocationID;
+
/* Continuously updated map of user-defined accumulators */
private final StringifiedAccumulatorResult[] userAccumulators;
 
@@ -48,21 +51,24 @@
private final IOMetrics ioMetrics;
 
public ArchivedExecution(Execution execution) {
-   this.userAccumulators = 
execution.getUserAccumulatorsStringified();
-   this.attemptId = execution.getAttemptId();
-   this.attemptNumber = execution.getAttemptNumber();
-   this.stateTimestamps = execution.getStateTimestamps();
-   this.parallelSubtaskIndex = 
execution.getVertex().getParallelSubtaskIndex();
-   this.state = execution.getState();
-   this.failureCause = 
ExceptionUtils.stringifyException(execution.getFailureCause());
-   this.assignedResourceLocation = 
execution.getAssignedResourceLocation();
-   this.ioMetrics = execution.getIOMetrics();
+   this(
+   execution.getUserAccumulatorsStringified(),
+   execution.getIOMetrics(),
+   execution.getAttemptId(),
+   execution.getAttemptNumber(),
+   execution.getState(),
+   
ExceptionUtils.stringifyException(execution.getFailureCause()),
+   execution.getAssignedResourceLocation(),
+   execution.getAssignedAllocationID(),
+   execution.getVertex().getParallelSubtaskIndex(),
+   execution.getStateTimestamps());
}
 
public ArchivedExecution(
StringifiedAccumulatorResult[] userAccumulators, 
IOMetrics ioMetrics,
ExecutionAttemptID attemptId, int attemptNumber, 
ExecutionState state, String failureCause,
-   TaskManagerLocation assignedResourceLocation, int 
parallelSubtaskIndex, long[] stateTimestamps) {
+   TaskManagerLocation assignedResourceLocation, 
AllocationID assignedAllocationID,  int parallelSubtaskIndex,
+   long[] stateTimestamps) {
this.userAccumulators = userAccumulators;
this.ioMetrics = ioMetrics;
this.failureCause = failureCause;
@@ -72,6 +78,7 @@ public ArchivedExecution(
this.state = state;
this.stateTimestamps = stateTimestamps;
this.parallelSubtaskIndex = parallelSubtaskIndex;
+   this.assignedAllocationID = assignedAllocationID;
}
 
// 

@@ -103,6 +110,10 @@ public TaskManagerLocation getAssignedResourceLocation() {
return assignedResourceLocation;
}
 
+   public AllocationID getAssignedAllocationID() {
+   return assignedAllocationID;
+   }
+
@Override
public String getFailureCauseAsString() {
return failureCause;
diff --git 
a/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ArchivedExecutionVertex.java
 
b/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ArchivedExecutionVertex.java
index 36669d34be7..04efa048fb6 100644
--- 

[GitHub] asfgit closed pull request #6500: [FLINK-10066] Keep only archived version of previous executions

2018-08-13 Thread GitBox
asfgit closed pull request #6500: [FLINK-10066] Keep only archived version of 
previous executions
URL: https://github.com/apache/flink/pull/6500
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git 
a/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ArchivedExecution.java
 
b/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ArchivedExecution.java
index 4b1c62fe707..ab8c94c0188 100644
--- 
a/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ArchivedExecution.java
+++ 
b/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ArchivedExecution.java
@@ -18,6 +18,7 @@
 package org.apache.flink.runtime.executiongraph;
 
 import org.apache.flink.runtime.accumulators.StringifiedAccumulatorResult;
+import org.apache.flink.runtime.clusterframework.types.AllocationID;
 import org.apache.flink.runtime.execution.ExecutionState;
 import org.apache.flink.runtime.taskmanager.TaskManagerLocation;
 import org.apache.flink.util.ExceptionUtils;
@@ -40,6 +41,8 @@
 
private final TaskManagerLocation assignedResourceLocation; // for the 
archived execution
 
+   private final AllocationID assignedAllocationID;
+
/* Continuously updated map of user-defined accumulators */
private final StringifiedAccumulatorResult[] userAccumulators;
 
@@ -48,21 +51,24 @@
private final IOMetrics ioMetrics;
 
public ArchivedExecution(Execution execution) {
-   this.userAccumulators = 
execution.getUserAccumulatorsStringified();
-   this.attemptId = execution.getAttemptId();
-   this.attemptNumber = execution.getAttemptNumber();
-   this.stateTimestamps = execution.getStateTimestamps();
-   this.parallelSubtaskIndex = 
execution.getVertex().getParallelSubtaskIndex();
-   this.state = execution.getState();
-   this.failureCause = 
ExceptionUtils.stringifyException(execution.getFailureCause());
-   this.assignedResourceLocation = 
execution.getAssignedResourceLocation();
-   this.ioMetrics = execution.getIOMetrics();
+   this(
+   execution.getUserAccumulatorsStringified(),
+   execution.getIOMetrics(),
+   execution.getAttemptId(),
+   execution.getAttemptNumber(),
+   execution.getState(),
+   
ExceptionUtils.stringifyException(execution.getFailureCause()),
+   execution.getAssignedResourceLocation(),
+   execution.getAssignedAllocationID(),
+   execution.getVertex().getParallelSubtaskIndex(),
+   execution.getStateTimestamps());
}
 
public ArchivedExecution(
StringifiedAccumulatorResult[] userAccumulators, 
IOMetrics ioMetrics,
ExecutionAttemptID attemptId, int attemptNumber, 
ExecutionState state, String failureCause,
-   TaskManagerLocation assignedResourceLocation, int 
parallelSubtaskIndex, long[] stateTimestamps) {
+   TaskManagerLocation assignedResourceLocation, 
AllocationID assignedAllocationID,  int parallelSubtaskIndex,
+   long[] stateTimestamps) {
this.userAccumulators = userAccumulators;
this.ioMetrics = ioMetrics;
this.failureCause = failureCause;
@@ -72,6 +78,7 @@ public ArchivedExecution(
this.state = state;
this.stateTimestamps = stateTimestamps;
this.parallelSubtaskIndex = parallelSubtaskIndex;
+   this.assignedAllocationID = assignedAllocationID;
}
 
// 

@@ -103,6 +110,10 @@ public TaskManagerLocation getAssignedResourceLocation() {
return assignedResourceLocation;
}
 
+   public AllocationID getAssignedAllocationID() {
+   return assignedAllocationID;
+   }
+
@Override
public String getFailureCauseAsString() {
return failureCause;
diff --git 
a/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ArchivedExecutionVertex.java
 
b/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ArchivedExecutionVertex.java
index 36669d34be7..04efa048fb6 100644
--- 
a/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ArchivedExecutionVertex.java
+++ 
b/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ArchivedExecutionVertex.java
@@ -40,7 +40,7 @@
 
public ArchivedExecutionVertex(ExecutionVertex vertex) 

[GitHub] Guibo-Pan commented on issue #6544: [FLINK-8532] [Streaming] modify RebalancePartitioner to use a random partition as its first partition

2018-08-13 Thread GitBox
Guibo-Pan commented on issue #6544: [FLINK-8532] [Streaming] modify 
RebalancePartitioner to use a random partition as its first partition
URL: https://github.com/apache/flink/pull/6544#issuecomment-412602068
 
 
   @yanghua thanks for your suggestion, I have made an update. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (FLINK-8532) RebalancePartitioner should use Random value for its first partition

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578678#comment-16578678
 ] 

ASF GitHub Bot commented on FLINK-8532:
---

Guibo-Pan commented on issue #6544: [FLINK-8532] [Streaming] modify 
RebalancePartitioner to use a random partition as its first partition
URL: https://github.com/apache/flink/pull/6544#issuecomment-412602068
 
 
   @yanghua thanks for your suggestion, I have made an update. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> RebalancePartitioner should use Random value for its first partition
> 
>
> Key: FLINK-8532
> URL: https://issues.apache.org/jira/browse/FLINK-8532
> Project: Flink
>  Issue Type: Improvement
>  Components: DataStream API
>Reporter: Yuta Morisawa
>Priority: Minor
>  Labels: pull-request-available
>
> In some conditions, RebalancePartitioner doesn't balance data correctly 
> because it uses the same value for selecting the next operator.
> RebalancePartitioner initializes its partition id with the same value in 
> every thread, so it does balance data overall, but at any given moment the 
> amount of data in each operator is skewed.
> The skew becomes particularly severe when the data rates of the upstream 
> operators are equal.
>  
>  
> Example:
> Consider a simple operator chain.
> -> map1 -> rebalance -> map2 ->
> Each map operator (map1, map2) contains three subtasks (subtasks 1-6 in 
> total).
> map1          map2
>  st1              st4
>  st2              st5
>  st3              st6
>  
> At the beginning, every subtask in map1 sends data to st4 in map2 because 
> they all use the same initial partition id.
> When map1 receives the next records, st1, st2, and st3 all send them to st5 
> because each incremented its partition id after processing the previous 
> record.
> In my environment, processing takes twice as long with RebalancePartitioner 
> as with other partitioners (rescale, keyBy).
>  
> To solve this problem, in my opinion, RebalancePartitioner should use its own 
> operator id for the initial value.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9878) IO worker threads BLOCKED on SSL Session Cache while CMS full gc

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578653#comment-16578653
 ] 

ASF GitHub Bot commented on FLINK-9878:
---

NicoK commented on issue #6355: [FLINK-9878][network][ssl] add more low-level 
ssl options
URL: https://github.com/apache/flink/pull/6355#issuecomment-412597634
 
 
   I pushed a rework of this PR which has a lighter footprint on the changes in 
SSLUtils by using a wrapper around `SSLContext` as @pnowojski suggested.
   
   I kept all existing logic though, including the `@Nullable` fields (vs. 
`Optional`) for these reasons:
   1) there are already conflicts when applying this to `release-1.6` and I'd 
like to keep the footprint small (some of the suggestions already make the diff 
bigger)
   2) there are several `null` checks which would need refactoring
   3) this seems to be out of scope of this PR, especially since no nullable 
field is added (any more)


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> IO worker threads BLOCKED on SSL Session Cache while CMS full gc
> 
>
> Key: FLINK-9878
> URL: https://issues.apache.org/jira/browse/FLINK-9878
> Project: Flink
>  Issue Type: Bug
>  Components: Network
>Affects Versions: 1.5.0, 1.5.1, 1.6.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.5.3, 1.6.1, 1.7.0
>
>
> According to https://github.com/netty/netty/issues/832, there is a JDK issue 
> during garbage collection when the SSL session cache is not limited. We 
> should allow the user to configure this and further (advanced) SSL parameters 
> for fine-tuning to fix this and similar issues. In particular, the following 
> parameters should be configurable:
> - SSL session cache size
> - SSL session timeout
> - SSL handshake timeout
> - SSL close notify flush timeout
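A rough sketch of where these four parameters could be applied, using the 
standard JSSE session-context settings plus Netty's SslHandler timeouts; the 
values are placeholders and this is not Flink's actual SSLUtils code:

import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;

import io.netty.handler.ssl.SslHandler;

public class SslTuningSketch {

	static SslHandler clientSslHandler(SSLContext sslContext, String host, int port) {
		// bound the JDK-side session cache instead of leaving it unlimited
		sslContext.getClientSessionContext().setSessionCacheSize(10_000);     // entries
		sslContext.getClientSessionContext().setSessionTimeout(24 * 60 * 60); // seconds

		SSLEngine engine = sslContext.createSSLEngine(host, port);
		engine.setUseClientMode(true);

		SslHandler handler = new SslHandler(engine);
		handler.setHandshakeTimeoutMillis(10_000);  // SSL handshake timeout
		handler.setCloseNotifyTimeoutMillis(3_000); // close_notify flush timeout
		return handler;
	}
}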



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] NicoK commented on issue #6355: [FLINK-9878][network][ssl] add more low-level ssl options

2018-08-13 Thread GitBox
NicoK commented on issue #6355: [FLINK-9878][network][ssl] add more low-level 
ssl options
URL: https://github.com/apache/flink/pull/6355#issuecomment-412597634
 
 
   I pushed a rework of this PR which has a lighter footprint on the changes in 
SSLUtils by using a wrapper around `SSLContext` as @pnowojski suggested.
   
   I kept all existing logic though, including the `@Nullable` fields (vs. 
`Optional`) for these reasons:
   1) there are already conflicts when applying this to `release-1.6` and I'd 
like to keep the footprint small (some of the suggestions already make the diff 
bigger)
   2) there are several `null` checks which would need refactoring
   3) this seems to be out of scope of this PR, especially since no nullable 
field is added (any more)


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (FLINK-9878) IO worker threads BLOCKED on SSL Session Cache while CMS full gc

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578646#comment-16578646
 ] 

ASF GitHub Bot commented on FLINK-9878:
---

NicoK commented on a change in pull request #6355: [FLINK-9878][network][ssl] 
add more low-level ssl options
URL: https://github.com/apache/flink/pull/6355#discussion_r209690587
 
 

 ##
 File path: 
flink-runtime/src/main/java/org/apache/flink/runtime/io/network/netty/NettyClient.java
 ##
 @@ -175,7 +183,6 @@ ChannelFuture connect(final InetSocketAddress 
serverSocketAddress) {
bootstrap.handler(new ChannelInitializer() {
@Override
public void initChannel(SocketChannel channel) throws 
Exception {
-
// SSL handler should be added first in the 
pipeline
if (clientSSLContext != null) {
 
 Review comment:
   if SSL is disabled, for example


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> IO worker threads BLOCKED on SSL Session Cache while CMS full gc
> 
>
> Key: FLINK-9878
> URL: https://issues.apache.org/jira/browse/FLINK-9878
> Project: Flink
>  Issue Type: Bug
>  Components: Network
>Affects Versions: 1.5.0, 1.5.1, 1.6.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.5.3, 1.6.1, 1.7.0
>
>
> According to https://github.com/netty/netty/issues/832, there is a JDK issue 
> during garbage collection when the SSL session cache is not limited. We 
> should allow the user to configure this and further (advanced) SSL parameters 
> for fine-tuning to fix this and similar issues. In particular, the following 
> parameters should be configurable:
> - SSL session cache size
> - SSL session timeout
> - SSL handshake timeout
> - SSL close notify flush timeout



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] NicoK commented on a change in pull request #6355: [FLINK-9878][network][ssl] add more low-level ssl options

2018-08-13 Thread GitBox
NicoK commented on a change in pull request #6355: [FLINK-9878][network][ssl] 
add more low-level ssl options
URL: https://github.com/apache/flink/pull/6355#discussion_r209690587
 
 

 ##
 File path: 
flink-runtime/src/main/java/org/apache/flink/runtime/io/network/netty/NettyClient.java
 ##
 @@ -175,7 +183,6 @@ ChannelFuture connect(final InetSocketAddress 
serverSocketAddress) {
bootstrap.handler(new ChannelInitializer() {
@Override
public void initChannel(SocketChannel channel) throws 
Exception {
-
// SSL handler should be added first in the 
pipeline
if (clientSSLContext != null) {
 
 Review comment:
   if SSL is disabled, for example


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (FLINK-9878) IO worker threads BLOCKED on SSL Session Cache while CMS full gc

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578643#comment-16578643
 ] 

ASF GitHub Bot commented on FLINK-9878:
---

NicoK commented on a change in pull request #6355: [FLINK-9878][network][ssl] 
add more low-level ssl options
URL: https://github.com/apache/flink/pull/6355#discussion_r209690309
 
 

 ##
 File path: 
flink-runtime/src/main/java/org/apache/flink/runtime/io/network/netty/NettyClient.java
 ##
 @@ -52,6 +56,9 @@
 
private Bootstrap bootstrap;
 
 Review comment:
   out of scope of this PR - there's also more around this package, if you 
wanted to mark/change these accordingly


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> IO worker threads BLOCKED on SSL Session Cache while CMS full gc
> 
>
> Key: FLINK-9878
> URL: https://issues.apache.org/jira/browse/FLINK-9878
> Project: Flink
>  Issue Type: Bug
>  Components: Network
>Affects Versions: 1.5.0, 1.5.1, 1.6.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.5.3, 1.6.1, 1.7.0
>
>
> According to https://github.com/netty/netty/issues/832, there is a JDK issue 
> during garbage collection when the SSL session cache is not limited. We 
> should allow the user to configure this and further (advanced) SSL parameters 
> for fine-tuning to fix this and similar issues. In particular, the following 
> parameters should be configurable:
> - SSL session cache size
> - SSL session timeout
> - SSL handshake timeout
> - SSL close notify flush timeout



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9878) IO worker threads BLOCKED on SSL Session Cache while CMS full gc

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578644#comment-16578644
 ] 

ASF GitHub Bot commented on FLINK-9878:
---

NicoK commented on a change in pull request #6355: [FLINK-9878][network][ssl] 
add more low-level ssl options
URL: https://github.com/apache/flink/pull/6355#discussion_r209690309
 
 

 ##
 File path: 
flink-runtime/src/main/java/org/apache/flink/runtime/io/network/netty/NettyClient.java
 ##
 @@ -52,6 +56,9 @@
 
private Bootstrap bootstrap;
 
 Review comment:
   out of scope of this PR - there's also even more around this package, if you 
wanted to mark/change these accordingly


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> IO worker threads BLOCKED on SSL Session Cache while CMS full gc
> 
>
> Key: FLINK-9878
> URL: https://issues.apache.org/jira/browse/FLINK-9878
> Project: Flink
>  Issue Type: Bug
>  Components: Network
>Affects Versions: 1.5.0, 1.5.1, 1.6.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.5.3, 1.6.1, 1.7.0
>
>
> According to https://github.com/netty/netty/issues/832, there is a JDK issue 
> during garbage collection when the SSL session cache is not limited. We 
> should allow the user to configure this and further (advanced) SSL parameters 
> for fine-tuning to fix this and similar issues. In particular, the following 
> parameters should be configurable:
> - SSL session cache size
> - SSL session timeout
> - SSL handshake timeout
> - SSL close notify flush timeout



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] NicoK commented on a change in pull request #6355: [FLINK-9878][network][ssl] add more low-level ssl options

2018-08-13 Thread GitBox
NicoK commented on a change in pull request #6355: [FLINK-9878][network][ssl] 
add more low-level ssl options
URL: https://github.com/apache/flink/pull/6355#discussion_r209690309
 
 

 ##
 File path: 
flink-runtime/src/main/java/org/apache/flink/runtime/io/network/netty/NettyClient.java
 ##
 @@ -52,6 +56,9 @@
 
private Bootstrap bootstrap;
 
 Review comment:
   out of scope of this PR - there's also even more around this package, if you 
wanted to mark/change these accordingly


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] NicoK commented on a change in pull request #6355: [FLINK-9878][network][ssl] add more low-level ssl options

2018-08-13 Thread GitBox
NicoK commented on a change in pull request #6355: [FLINK-9878][network][ssl] 
add more low-level ssl options
URL: https://github.com/apache/flink/pull/6355#discussion_r209690309
 
 

 ##
 File path: 
flink-runtime/src/main/java/org/apache/flink/runtime/io/network/netty/NettyClient.java
 ##
 @@ -52,6 +56,9 @@
 
private Bootstrap bootstrap;
 
 Review comment:
   out of scope of this PR - there's also more around this package, if you 
wanted to mark/change these accordingly


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (FLINK-9977) Refine the docs for Table/SQL built-in functions

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578630#comment-16578630
 ] 

ASF GitHub Bot commented on FLINK-9977:
---

fhueske commented on issue #6535: [FLINK-9977] [table][doc] Refine the 
SQL/Table built-in function docs
URL: https://github.com/apache/flink/pull/6535#issuecomment-412593462
 
 
   Thanks for the update @xccui.
   
   +1 to merge


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Refine the docs for Table/SQL built-in functions
> 
>
> Key: FLINK-9977
> URL: https://issues.apache.org/jira/browse/FLINK-9977
> Project: Flink
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Xingcan Cui
>Assignee: Xingcan Cui
>Priority: Minor
>  Labels: pull-request-available
> Attachments: Java.jpg, SQL.jpg, Scala.jpg
>
>
> There exist some syntax errors or inconsistencies in documents and Scala docs 
> of the Table/SQL built-in functions. This issue aims to make some 
> improvements to them.
> Also, according to FLINK-10103, we should use single quotes to express 
> strings in SQL. For example, CONCAT("AA", "BB", "CC") should be replaced with 
> CONCAT('AA', 'BB', 'CC'). 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9878) IO worker threads BLOCKED on SSL Session Cache while CMS full gc

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578629#comment-16578629
 ] 

ASF GitHub Bot commented on FLINK-9878:
---

NicoK commented on a change in pull request #6355: [FLINK-9878][network][ssl] 
add more low-level ssl options
URL: https://github.com/apache/flink/pull/6355#discussion_r209682805
 
 

 ##
 File path: 
flink-runtime/src/main/java/org/apache/flink/runtime/net/SSLUtils.java
 ##
 @@ -163,80 +163,188 @@ public static void setSSLVerifyHostname(Configuration 
sslConfig, SSLParameters s
}
 
/**
-* Creates the SSL Context for the client if SSL is configured.
+* Configuration settings and key/trustmanager instances to set up an 
SSL client connection.
+*/
+   public static class SSLClientConfiguration {
 
 Review comment:
   good idea - that makes the change even smaller...well, at least the 
important parts of the change ;)


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> IO worker threads BLOCKED on SSL Session Cache while CMS full gc
> 
>
> Key: FLINK-9878
> URL: https://issues.apache.org/jira/browse/FLINK-9878
> Project: Flink
>  Issue Type: Bug
>  Components: Network
>Affects Versions: 1.5.0, 1.5.1, 1.6.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.5.3, 1.6.1, 1.7.0
>
>
> According to https://github.com/netty/netty/issues/832, there is a JDK issue 
> during garbage collection when the SSL session cache is not limited. We 
> should allow the user to configure this and further (advanced) SSL parameters 
> for fine-tuning to fix this and similar issues. In particular, the following 
> parameters should be configurable:
> - SSL session cache size
> - SSL session timeout
> - SSL handshake timeout
> - SSL close notify flush timeout



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] NicoK commented on a change in pull request #6355: [FLINK-9878][network][ssl] add more low-level ssl options

2018-08-13 Thread GitBox
NicoK commented on a change in pull request #6355: [FLINK-9878][network][ssl] 
add more low-level ssl options
URL: https://github.com/apache/flink/pull/6355#discussion_r209682805
 
 

 ##
 File path: 
flink-runtime/src/main/java/org/apache/flink/runtime/net/SSLUtils.java
 ##
 @@ -163,80 +163,188 @@ public static void setSSLVerifyHostname(Configuration 
sslConfig, SSLParameters s
}
 
/**
-* Creates the SSL Context for the client if SSL is configured.
+* Configuration settings and key/trustmanager instances to set up an 
SSL client connection.
+*/
+   public static class SSLClientConfiguration {
 
 Review comment:
   good idea - that makes the change even smaller...well, at least the 
important parts of the change ;)


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] fhueske commented on issue #6535: [FLINK-9977] [table][doc] Refine the SQL/Table built-in function docs

2018-08-13 Thread GitBox
fhueske commented on issue #6535: [FLINK-9977] [table][doc] Refine the 
SQL/Table built-in function docs
URL: https://github.com/apache/flink/pull/6535#issuecomment-412593462
 
 
   Thanks for the update @xccui.
   
   +1 to merge


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] fhueske commented on a change in pull request #6535: [FLINK-9977] [table][doc] Refine the SQL/Table built-in function docs

2018-08-13 Thread GitBox
fhueske commented on a change in pull request #6535: [FLINK-9977] [table][doc] 
Refine the SQL/Table built-in function docs
URL: https://github.com/apache/flink/pull/6535#discussion_r209687761
 
 

 ##
 File path: docs/dev/table/functions.md
 ##
 @@ -24,7 +24,7 @@ under the License.
 
 Flink Table API & SQL provides users with a set of built-in functions for data 
transformations. This page gives a brief overview of them.
 If a function that you need is not supported yet, you can implement a user-defined function.
-Or if you think the function is general enough, please <a href="https://issues.apache.org/jira/secure/CreateIssue!default.jspa">open a 
JIRA issue</a> for it.
+If you think that the function is general enough, please open a Jira issue for 
it with a detailed description.
 
 Review comment:
   Oh, please keep the link :-)


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (FLINK-9977) Refine the docs for Table/SQL built-in functions

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578625#comment-16578625
 ] 

ASF GitHub Bot commented on FLINK-9977:
---

fhueske commented on a change in pull request #6535: [FLINK-9977] [table][doc] 
Refine the SQL/Table built-in function docs
URL: https://github.com/apache/flink/pull/6535#discussion_r209687761
 
 

 ##
 File path: docs/dev/table/functions.md
 ##
 @@ -24,7 +24,7 @@ under the License.
 
 Flink Table API & SQL provides users with a set of built-in functions for data 
transformations. This page gives a brief overview of them.
 If a function that you need is not supported yet, you can implement a user-defined function.
-Or if you think the function is general enough, please <a href="https://issues.apache.org/jira/secure/CreateIssue!default.jspa">open a 
JIRA issue</a> for it.
+If you think that the function is general enough, please open a Jira issue for 
it with a detailed description.
 
 Review comment:
   Oh, please keep the link :-)


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Refine the docs for Table/SQL built-in functions
> 
>
> Key: FLINK-9977
> URL: https://issues.apache.org/jira/browse/FLINK-9977
> Project: Flink
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Xingcan Cui
>Assignee: Xingcan Cui
>Priority: Minor
>  Labels: pull-request-available
> Attachments: Java.jpg, SQL.jpg, Scala.jpg
>
>
> There exist some syntax errors or inconsistencies in documents and Scala docs 
> of the Table/SQL built-in functions. This issue aims to make some 
> improvements to them.
> Also, according to FLINK-10103, we should use single quotes to express 
> strings in SQL. For example, CONCAT("AA", "BB", "CC") should be replaced with 
> CONCAT('AA', 'BB', 'CC'). 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-6437) Move history server configuration to a separate file

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-6437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578620#comment-16578620
 ] 

ASF GitHub Bot commented on FLINK-6437:
---

zentol commented on issue #6542: [FLINK-6437][History Server] Move history 
server configuration to a separate file
URL: https://github.com/apache/flink/pull/6542#issuecomment-412590513
 
 
   That strategy is already implemented, but it doesn't really address 
backwards compatibility imo. I would assume that when people upgrade they'll 
end up with the default `flink-historyserver-conf.yaml` being present in `conf` 
overwriting everything in `flink-conf.yaml`.
   
   We could comment out everything in the HS config file, always read both 
and prioritize the contents of the HS config. This wouldn't affect old users 
(we could also guide them with logging messages if settings are found in 
`flink-conf.yaml`), nor should it affect new users, as either a) they have to 
set a key anyway or b) a sane default handles this case.
   
   Still, I'm wondering whether there's really a benefit here. If we start 
splitting config files, I'd prefer we do the same for the client.
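A tiny sketch of the "read both files, history-server config wins" idea, using 
plain java.util.Properties as a stand-in for Flink's Configuration; the file 
names in the comments are illustrative only:

import java.util.Properties;

public class LayeredConfigSketch {

	// merges the common config with the history-server-specific config,
	// letting keys from the latter override the former
	static Properties merge(Properties flinkConf, Properties historyServerConf) {
		Properties merged = new Properties();
		merged.putAll(flinkConf);         // e.g. loaded from flink-conf.yaml
		merged.putAll(historyServerConf); // e.g. loaded from flink-historyserver-conf.yaml
		return merged;
	}
}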


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Move history server configuration to a separate file
> 
>
> Key: FLINK-6437
> URL: https://issues.apache.org/jira/browse/FLINK-6437
> Project: Flink
>  Issue Type: Improvement
>  Components: History Server
>Affects Versions: 1.3.0
>Reporter: Stephan Ewen
>Assignee: vinoyang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>
> I suggest to keep the {{flink-conf.yaml}} leaner by moving configuration of 
> the History Server to a different file.
> In general, I would propose to move configurations of separate, independent 
> and optional components to individual config files.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] zentol commented on issue #6542: [FLINK-6437][History Server] Move history server configuration to a separate file

2018-08-13 Thread GitBox
zentol commented on issue #6542: [FLINK-6437][History Server] Move history 
server configuration to a separate file
URL: https://github.com/apache/flink/pull/6542#issuecomment-412590513
 
 
   That strategy is already implemented, but it doesn't really address 
backwards compatibility imo. I would assume that when people upgrade they'll 
end up with the default `flink-historyserver-conf.yaml` being present in `conf` 
overwriting everything in `flink-conf.yaml`.
   
   We could comment out everything in the HS config file, always read both 
and prioritize the contents of the HS config. This wouldn't affect old users 
(we could also guide them with logging messages if settings are found in 
`flink-conf.yaml`), nor should it affect new users, as either a) they have to 
set a key anyway or b) a sane default handles this case.
   
   Still, I'm wondering whether there's really a benefit here. If we start 
splitting config files, I'd prefer we do the same for the client.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (FLINK-9853) add hex support in table api and sql

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578616#comment-16578616
 ] 

ASF GitHub Bot commented on FLINK-9853:
---

xueyumusic commented on a change in pull request #6337: [FLINK-9853] [table] 
Add HEX support 
URL: https://github.com/apache/flink/pull/6337#discussion_r209684500
 
 

 ##
 File path: 
flink-libraries/flink-table/src/test/scala/org/apache/flink/table/expressions/ScalarFunctionsTest.scala
 ##
 @@ -392,6 +392,93 @@ class ScalarFunctionsTest extends ScalarTypesTestBase {
   "äää1234512345")
   }
 
+  @Test
+  def testHex(): Unit = {
+testAllApis(
+  100.hex(),
+  "100.hex()",
+  "HEX(100)",
+  "64")
+
+testAllApis(
+  'f2.hex(),
+  "f2.hex()",
+  "HEX(f2)",
+  "2a")
 
 Review comment:
   Yes, we should do it as well. Thanks @twalthr, I updated the code.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> add hex support in table api and sql
> 
>
> Key: FLINK-9853
> URL: https://issues.apache.org/jira/browse/FLINK-9853
> Project: Flink
>  Issue Type: Improvement
>  Components: Table API  SQL
>Reporter: xueyu
>Priority: Major
>  Labels: pull-request-available
>
> Like in MySQL, HEX can take an int or string argument. For an integer argument 
> N, it returns a hexadecimal string representation of the value of N. For a 
> string argument str, it returns a hexadecimal string representation of str 
> where each byte of each character in str is converted to two hexadecimal 
> digits. 
> Syntax:
> HEX(100) = 64
> HEX('This is a test String.') = '546869732069732061207465737420537472696e672e'
> See more: [link 
> MySQL|https://dev.mysql.com/doc/refman/8.0/en/string-functions.html#function_hex]
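An illustrative sketch of these MySQL-style HEX semantics (not the PR's actual 
Flink implementation); the lowercase output matches the string example above:

import java.nio.charset.StandardCharsets;

public class HexSketch {

	// HEX(100) -> "64": hexadecimal representation of the integer value
	static String hex(long n) {
		return Long.toHexString(n);
	}

	// HEX('This is a test String.') -> "546869732069732061207465737420537472696e672e":
	// each byte of the string becomes two hexadecimal digits
	static String hex(String s) {
		StringBuilder sb = new StringBuilder();
		for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
			sb.append(String.format("%02x", b));
		}
		return sb.toString();
	}
}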



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] xueyumusic commented on a change in pull request #6337: [FLINK-9853] [table] Add HEX support

2018-08-13 Thread GitBox
xueyumusic commented on a change in pull request #6337: [FLINK-9853] [table] 
Add HEX support 
URL: https://github.com/apache/flink/pull/6337#discussion_r209684500
 
 

 ##
 File path: 
flink-libraries/flink-table/src/test/scala/org/apache/flink/table/expressions/ScalarFunctionsTest.scala
 ##
 @@ -392,6 +392,93 @@ class ScalarFunctionsTest extends ScalarTypesTestBase {
   "äää1234512345")
   }
 
+  @Test
+  def testHex(): Unit = {
+testAllApis(
+  100.hex(),
+  "100.hex()",
+  "HEX(100)",
+  "64")
+
+testAllApis(
+  'f2.hex(),
+  "f2.hex()",
+  "HEX(f2)",
+  "2a")
 
 Review comment:
   Yes, we should do it as well. Thanks @twalthr, I updated the code.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (FLINK-9878) IO worker threads BLOCKED on SSL Session Cache while CMS full gc

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578611#comment-16578611
 ] 

ASF GitHub Bot commented on FLINK-9878:
---

NicoK commented on a change in pull request #6355: [FLINK-9878][network][ssl] 
add more low-level ssl options
URL: https://github.com/apache/flink/pull/6355#discussion_r209682805
 
 

 ##
 File path: 
flink-runtime/src/main/java/org/apache/flink/runtime/net/SSLUtils.java
 ##
 @@ -163,80 +163,188 @@ public static void setSSLVerifyHostname(Configuration 
sslConfig, SSLParameters s
}
 
/**
-* Creates the SSL Context for the client if SSL is configured.
+* Configuration settings and key/trustmanager instances to set up an 
SSL client connection.
+*/
+   public static class SSLClientConfiguration {
 
 Review comment:
   good idea - that makes the change even smaller


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> IO worker threads BLOCKED on SSL Session Cache while CMS full gc
> 
>
> Key: FLINK-9878
> URL: https://issues.apache.org/jira/browse/FLINK-9878
> Project: Flink
>  Issue Type: Bug
>  Components: Network
>Affects Versions: 1.5.0, 1.5.1, 1.6.0
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.5.3, 1.6.1, 1.7.0
>
>
> According to https://github.com/netty/netty/issues/832, there is a JDK issue 
> during garbage collection when the SSL session cache is not limited. We 
> should allow the user to configure this and further (advanced) SSL parameters 
> for fine-tuning to fix this and similar issues. In particular, the following 
> parameters should be configurable:
> - SSL session cache size
> - SSL session timeout
> - SSL handshake timeout
> - SSL close notify flush timeout



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] NicoK commented on a change in pull request #6355: [FLINK-9878][network][ssl] add more low-level ssl options

2018-08-13 Thread GitBox
NicoK commented on a change in pull request #6355: [FLINK-9878][network][ssl] 
add more low-level ssl options
URL: https://github.com/apache/flink/pull/6355#discussion_r209682805
 
 

 ##
 File path: 
flink-runtime/src/main/java/org/apache/flink/runtime/net/SSLUtils.java
 ##
 @@ -163,80 +163,188 @@ public static void setSSLVerifyHostname(Configuration 
sslConfig, SSLParameters s
}
 
/**
-* Creates the SSL Context for the client if SSL is configured.
+* Configuration settings and key/trustmanager instances to set up an 
SSL client connection.
+*/
+   public static class SSLClientConfiguration {
 
 Review comment:
   good idea - that makes the change even smaller


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] NicoK commented on a change in pull request #6355: [FLINK-9878][network][ssl] add more low-level ssl options

2018-08-13 Thread GitBox
NicoK commented on a change in pull request #6355: [FLINK-9878][network][ssl] 
add more low-level ssl options
URL: https://github.com/apache/flink/pull/6355#discussion_r209682663
 
 

 ##
 File path: 
flink-runtime/src/test/java/org/apache/flink/runtime/io/network/netty/NettyClientServerSslTest.java
 ##
 @@ -65,6 +68,60 @@ public void testValidSslConnection() throws Exception {
 
Channel ch = NettyTestUtil.connect(serverAndClient);
 
+   SslHandler sslHandler = (SslHandler) ch.pipeline().get("ssl");
+   assertTrue("default value should not be propagated", 
sslHandler.getHandshakeTimeoutMillis() >= 0);
+   assertTrue("default value should not be propagated", 
sslHandler.getCloseNotifyTimeoutMillis() >= 0);
+
+   // should be able to send text data
+   ch.pipeline().addLast(new StringDecoder()).addLast(new 
StringEncoder());
+   assertTrue(ch.writeAndFlush("test").await().isSuccess());
+
+   NettyTestUtil.shutdown(serverAndClient);
+   }
+
+   /**
+* Verify valid (advanced) ssl configuration and connection.
+*/
+   @Test
+   public void testValidSslConnectionAdvanced() throws Exception {
 
 Review comment:
   Actually, this test verifies that the session cache size and session timeout 
are set in Netty's ssl handler and that should be enough. Whether they are 
actually being used should be tested in Netty's tests.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (FLINK-9977) Refine the docs for Table/SQL built-in functions

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578590#comment-16578590
 ] 

ASF GitHub Bot commented on FLINK-9977:
---

xccui commented on issue #6535: [FLINK-9977] [table][doc] Refine the SQL/Table 
built-in function docs
URL: https://github.com/apache/flink/pull/6535#issuecomment-412582564
 
 
   Thanks for looking into this, @fhueske. I've just updated the PR.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Refine the docs for Table/SQL built-in functions
> 
>
> Key: FLINK-9977
> URL: https://issues.apache.org/jira/browse/FLINK-9977
> Project: Flink
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Xingcan Cui
>Assignee: Xingcan Cui
>Priority: Minor
>  Labels: pull-request-available
> Attachments: Java.jpg, SQL.jpg, Scala.jpg
>
>
> There exist some syntax errors or inconsistencies in documents and Scala docs 
> of the Table/SQL built-in functions. This issue aims to make some 
> improvements to them.
> Also, according to FLINK-10103, we should use single quotes to express 
> strings in SQL. For example, CONCAT("AA", "BB", "CC") should be replaced with 
> CONCAT('AA', 'BB', 'CC'). 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] xccui commented on issue #6535: [FLINK-9977] [table][doc] Refine the SQL/Table built-in function docs

2018-08-13 Thread GitBox
xccui commented on issue #6535: [FLINK-9977] [table][doc] Refine the SQL/Table 
built-in function docs
URL: https://github.com/apache/flink/pull/6535#issuecomment-412582564
 
 
   Thanks for looking into this, @fhueske. I've just updated the PR.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (FLINK-10061) Fix unsupported reconfiguration in KafkaTableSink

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578548#comment-16578548
 ] 

ASF GitHub Bot commented on FLINK-10061:


tragicjun commented on issue #6495: [FLINK-10061] [table] [kafka] Fix 
unsupported reconfiguration in KafkaTableSink
URL: https://github.com/apache/flink/pull/6495#issuecomment-412575692
 
 
   @fhueske @twalthr have you decided that `Table.writeToSink()` will be 
dropped? If so, this pull request could be closed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Fix unsupported reconfiguration in KafkaTableSink
> -
>
> Key: FLINK-10061
> URL: https://issues.apache.org/jira/browse/FLINK-10061
> Project: Flink
>  Issue Type: Bug
>  Components: Kafka Connector
>Affects Versions: 1.6.0
>Reporter: Jun Zhang
>Assignee: Jun Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.6.1, 1.7.0
>
>
> When using KafkaTableSink in "table.writeToSink(), the following exception is 
> thrown:
> {quote} java.lang.UnsupportedOperationException: Reconfiguration of this sink 
> is not supported.
> {quote}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] tragicjun commented on issue #6495: [FLINK-10061] [table] [kafka] Fix unsupported reconfiguration in KafkaTableSink

2018-08-13 Thread GitBox
tragicjun commented on issue #6495: [FLINK-10061] [table] [kafka] Fix 
unsupported reconfiguration in KafkaTableSink
URL: https://github.com/apache/flink/pull/6495#issuecomment-412575692
 
 
   @fhueske @twalthr have you decided that `Table.writeToSink()` will be 
dropped? If so, this pull request could be closed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (FLINK-9739) Regression in supported filesystems for RocksDB

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578516#comment-16578516
 ] 

ASF GitHub Bot commented on FLINK-9739:
---

sampathBhat commented on issue #6452: [FLINK-9739] Regression in supported 
filesystems for RocksDB
URL: https://github.com/apache/flink/pull/6452#issuecomment-412570942
 
 
   @stephanEwen I agree with your point, but from the user's perspective wouldn't 
it be odd to provide a scheme for some config options and not for others? Moreover, in 
the proposed solution an exception is thrown for selected configurations if a 
scheme is provided.
   As an example, for the checkpoint dir the user can give the scheme as either hdfs or 
file. So if the user has to provide the scheme "file" for the checkpoint dir but 
must not provide a scheme for other configurations, that would only lead to 
confusion.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Regression in supported filesystems for RocksDB
> ---
>
> Key: FLINK-9739
> URL: https://issues.apache.org/jira/browse/FLINK-9739
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystem, State Backends, Checkpointing
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Chesnay Schepler
>Assignee: Sampath Bhat
>Priority: Major
>  Labels: pull-request-available
>
> A user reported on the [mailing 
> list|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Checkpointing-in-Flink-1-5-0-tt21173.html]
>  that the {{RocksDBStateBackend}} no longer supports GlusterFS 
> mounted volumes.
> Configuring {{file:///home/abc/share}} led to an exception claiming that the 
> path is not absolute.
> This was working fine in 1.4.2.
> In FLINK-6557 the {{RocksDBStateBackend}} was refactored to use java 
> {{Files}} instead of Flink {{Paths}}, potentially causing the issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] sampathBhat commented on issue #6452: [FLINK-9739] Regression in supported filesystems for RocksDB

2018-08-13 Thread GitBox
sampathBhat commented on issue #6452: [FLINK-9739] Regression in supported 
filesystems for RocksDB
URL: https://github.com/apache/flink/pull/6452#issuecomment-412570942
 
 
   @stephanEwen I agree with your point, but from the user's perspective wouldn't 
it be odd to provide a scheme for some config options and not for others? Moreover, in 
the proposed solution an exception is thrown for selected configurations if a 
scheme is provided.
   As an example, for the checkpoint dir the user can give the scheme as either hdfs or 
file. So if the user has to provide the scheme "file" for the checkpoint dir but 
must not provide a scheme for other configurations, that would only lead to 
confusion.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (FLINK-6437) Move history server configuration to a separate file

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-6437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578511#comment-16578511
 ] 

ASF GitHub Bot commented on FLINK-6437:
---

StephanEwen commented on issue #6542: [FLINK-6437][History Server] Move history 
server configuration to a separate file
URL: https://github.com/apache/flink/pull/6542#issuecomment-412570277
 
 
   @yanghua You are right, this was not marked as "later" or anything in JIRA.
   There is never really a time where we don't need to worry about backwards 
compatibility, so any change that addresses this needs to take this into 
account.
   
   One way to do the backwards compatibility would be to look for the dedicated 
`flink-historyserver-conf.yaml` config file and fall back to the 
`flink-conf.yaml` file if the former does not exist.
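   For illustration, a minimal sketch of that fallback lookup, assuming a 
hypothetical helper class and method (this is not existing Flink code):

```java
import java.io.File;

// Sketch only: illustrates the proposed lookup order for the history server
// configuration. The class and method names here are hypothetical.
public final class HistoryServerConfigResolver {

    /**
     * Returns the configuration file the history server should load:
     * the dedicated flink-historyserver-conf.yaml if it exists,
     * otherwise the legacy flink-conf.yaml for backwards compatibility.
     */
    public static File resolveConfigFile(File confDir) {
        File dedicated = new File(confDir, "flink-historyserver-conf.yaml");
        if (dedicated.exists()) {
            return dedicated;
        }
        // Backwards compatibility: fall back to the combined configuration file.
        return new File(confDir, "flink-conf.yaml");
    }

    private HistoryServerConfigResolver() {}
}
```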
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Move history server configuration to a separate file
> 
>
> Key: FLINK-6437
> URL: https://issues.apache.org/jira/browse/FLINK-6437
> Project: Flink
>  Issue Type: Improvement
>  Components: History Server
>Affects Versions: 1.3.0
>Reporter: Stephan Ewen
>Assignee: vinoyang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.7.0
>
>
> I suggest to keep the {{flink-conf.yaml}} leaner by moving configuration of 
> the History Server to a different file.
> In general, I would propose to move configurations of separate, independent 
> and optional components to individual config files.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] StephanEwen commented on issue #6542: [FLINK-6437][History Server] Move history server configuration to a separate file

2018-08-13 Thread GitBox
StephanEwen commented on issue #6542: [FLINK-6437][History Server] Move history 
server configuration to a separate file
URL: https://github.com/apache/flink/pull/6542#issuecomment-412570277
 
 
   @yanghua You are right, this was not marked as "later" or anything in JIRA.
   There is never really a time where we don't need to worry about backwards 
compatibility, so any change that addresses this needs to take this into 
account.
   
   One way to do the backwards compatibility would be to look for the dedicated 
`flink-historyserver-conf.yaml` config file and fall back to the 
`flink-conf.yaml` file if the former does not exist.
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (FLINK-9739) Regression in supported filesystems for RocksDB

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578496#comment-16578496
 ] 

ASF GitHub Bot commented on FLINK-9739:
---

StephanEwen commented on issue #6452: [FLINK-9739] Regression in supported 
filesystems for RocksDB
URL: https://github.com/apache/flink/pull/6452#issuecomment-412566310
 
 
   @sampathBhat There are two very different config parameter types: URIs 
across file systems (like where checkpoints go to) and local directories on the 
local file system (like temp dirs). We cannot have one way to handle both; they 
have conflicting requirements.
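   To illustrate the distinction (a sketch only, not Flink's actual option 
parsing; the values shown are made-up examples):

```java
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch: the two kinds of directory options discussed above.
public class ConfigParamKinds {
    public static void main(String[] args) {
        // 1) A URI across file systems, e.g. where checkpoints go to.
        URI checkpointDir = URI.create("hdfs://namenode:8020/flink/checkpoints");

        // 2) A local directory list, e.g. temp/RocksDB dirs: scheme-less paths.
        //    ':' is commonly used as the list separator, which is why URIs
        //    (which also contain ':') clash with it.
        String localDirList = "/data1/tmp:/data2/tmp";
        for (String dir : localDirList.split(":")) {
            Path localDir = Paths.get(dir);
            System.out.println(checkpointDir + " vs " + localDir);
        }
    }
}
```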


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Regression in supported filesystems for RocksDB
> ---
>
> Key: FLINK-9739
> URL: https://issues.apache.org/jira/browse/FLINK-9739
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystem, State Backends, Checkpointing
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Chesnay Schepler
>Assignee: Sampath Bhat
>Priority: Major
>  Labels: pull-request-available
>
> A user reported on the [mailing 
> list|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Checkpointing-in-Flink-1-5-0-tt21173.html]
>  that the {{RocksDBStateBackend}} no longer supports GlusterFS 
> mounted volumes.
> Configuring {{file:///home/abc/share}} led to an exception claiming that the 
> path is not absolute.
> This was working fine in 1.4.2.
> In FLINK-6557 the {{RocksDBStateBackend}} was refactored to use java 
> {{Files}} instead of Flink {{Paths}}, potentially causing the issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] StephanEwen commented on issue #6452: [FLINK-9739] Regression in supported filesystems for RocksDB

2018-08-13 Thread GitBox
StephanEwen commented on issue #6452: [FLINK-9739] Regression in supported 
filesystems for RocksDB
URL: https://github.com/apache/flink/pull/6452#issuecomment-412566310
 
 
   @sampathBhat There are two very different config parameter types: URIs 
across file systems (like where checkpoints go to) and local directories on the 
local file system (like temp dirs). We cannot have one way to handle both; they 
have conflicting requirements.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (FLINK-9739) Regression in supported filesystems for RocksDB

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578483#comment-16578483
 ] 

ASF GitHub Bot commented on FLINK-9739:
---

sampathBhat commented on issue #6452: [FLINK-9739] Regression in supported 
filesystems for RocksDB
URL: https://github.com/apache/flink/pull/6452#issuecomment-412565010
 
 
   If we decide to provide file paths without the schema then the same 
provision must be provided for all other configuration options that involves 
file path. To maintain uniformity. Also the same must be documented.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Regression in supported filesystems for RocksDB
> ---
>
> Key: FLINK-9739
> URL: https://issues.apache.org/jira/browse/FLINK-9739
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystem, State Backends, Checkpointing
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Chesnay Schepler
>Assignee: Sampath Bhat
>Priority: Major
>  Labels: pull-request-available
>
> A user reported on the [mailing 
> list|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Checkpointing-in-Flink-1-5-0-tt21173.html]
>  that the {{RocksDBStateBackend}} no longer supports GlusterFS 
> mounted volumes.
> Configuring {{file:///home/abc/share}} led to an exception claiming that the 
> path is not absolute.
> This was working fine in 1.4.2.
> In FLINK-6557 the {{RocksDBStateBackend}} was refactored to use java 
> {{Files}} instead of Flink {{Paths}}, potentially causing the issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9739) Regression in supported filesystems for RocksDB

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578484#comment-16578484
 ] 

ASF GitHub Bot commented on FLINK-9739:
---

StephanEwen commented on issue #6452: [FLINK-9739] Regression in supported 
filesystems for RocksDB
URL: https://github.com/apache/flink/pull/6452#issuecomment-412565039
 
 
   Given this implication on Yarn / Mesos, I am +1 to remove the absolute path 
requirement (my mistake to introduce that in the first place, I was also 
motivated by better predictability here). Would even merge that back to 1.5 and 
1.6


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Regression in supported filesystems for RocksDB
> ---
>
> Key: FLINK-9739
> URL: https://issues.apache.org/jira/browse/FLINK-9739
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystem, State Backends, Checkpointing
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Chesnay Schepler
>Assignee: Sampath Bhat
>Priority: Major
>  Labels: pull-request-available
>
> A user reported on the [mailing 
> list|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Checkpointing-in-Flink-1-5-0-tt21173.html]
>  that the {{RocksDBStateBackend}} no longer supports GlusterFS 
> mounted volumes.
> Configuring {{file:///home/abc/share}} led to an exception claiming that the 
> path is not absolute.
> This was working fine in 1.4.2.
> In FLINK-6557 the {{RocksDBStateBackend}} was refactored to use java 
> {{Files}} instead of Flink {{Paths}}, potentially causing the issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] sampathBhat commented on issue #6452: [FLINK-9739] Regression in supported filesystems for RocksDB

2018-08-13 Thread GitBox
sampathBhat commented on issue #6452: [FLINK-9739] Regression in supported 
filesystems for RocksDB
URL: https://github.com/apache/flink/pull/6452#issuecomment-412565010
 
 
   If we decide to provide file paths without the schema then the same 
provision must be provided for all other configuration options that involves 
file path. To maintain uniformity. Also the same must be documented.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] StephanEwen commented on issue #6452: [FLINK-9739] Regression in supported filesystems for RocksDB

2018-08-13 Thread GitBox
StephanEwen commented on issue #6452: [FLINK-9739] Regression in supported 
filesystems for RocksDB
URL: https://github.com/apache/flink/pull/6452#issuecomment-412565039
 
 
   Given this implication on Yarn / Mesos, I am +1 to remove the absolute path 
requirement (my mistake to introduce that in the first place, I was also 
motivated by better predictability here). Would even merge that back to 1.5 and 
1.6


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (FLINK-9739) Regression in supported filesystems for RocksDB

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578480#comment-16578480
 ] 

ASF GitHub Bot commented on FLINK-9739:
---

StephanEwen commented on issue #6452: [FLINK-9739] Regression in supported 
filesystems for RocksDB
URL: https://github.com/apache/flink/pull/6452#issuecomment-412564568
 
 
   My understanding is that relative paths may make sense in setups like YARN 
or Mesos, where the exact temp directories are filled in dynamically by the 
Yarn/Mesos node upon starting the TaskManager.
   
   It may even be that every TM has different temp dirs, because they look like 
`/data/yarn/node-id/executions-attempt-uuid/local/temp/` or so.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Regression in supported filesystems for RocksDB
> ---
>
> Key: FLINK-9739
> URL: https://issues.apache.org/jira/browse/FLINK-9739
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystem, State Backends, Checkpointing
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Chesnay Schepler
>Assignee: Sampath Bhat
>Priority: Major
>  Labels: pull-request-available
>
> A user reported on the [mailing 
> list|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Checkpointing-in-Flink-1-5-0-tt21173.html]
>  that the {{RocksDBStateBackend}} no longer supports GlusterFS 
> mounted volumes.
> Configuring {{file:///home/abc/share}} led to an exception claiming that the 
> path is not absolute.
> This was working fine in 1.4.2.
> In FLINK-6557 the {{RocksDBStateBackend}} was refactored to use java 
> {{Files}} instead of Flink {{Paths}}, potentially causing the issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] StephanEwen commented on issue #6452: [FLINK-9739] Regression in supported filesystems for RocksDB

2018-08-13 Thread GitBox
StephanEwen commented on issue #6452: [FLINK-9739] Regression in supported 
filesystems for RocksDB
URL: https://github.com/apache/flink/pull/6452#issuecomment-412564568
 
 
   My understanding is that relative paths may make sense in setups like YARN 
or Mesos, where the exact temp directories are filled in dynamically by the 
Yarn/Mesos node upon starting the TaskManager.
   
   It may even be that every TM has different temp dirs, because they look like 
`/data/yarn/node-id/executions-attempt-uuid/local/temp/` or so.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (FLINK-9222) Add a Gradle Quickstart

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578467#comment-16578467
 ] 

ASF GitHub Bot commented on FLINK-9222:
---

twalthr commented on a change in pull request #105: [FLINK-9222] add gradle 
quickstart script
URL: https://github.com/apache/flink-web/pull/105#discussion_r209653708
 
 

 ##
 File path: q/gradle-quickstart.sh
 ##
 @@ -0,0 +1,226 @@
+#!/bin/bash
+
+
+#  Licensed to the Apache Software Foundation (ASF) under one
+#  or more contributor license agreements.  See the NOTICE file
+#  distributed with this work for additional information
+#  regarding copyright ownership.  The ASF licenses this file
+#  to you under the Apache License, Version 2.0 (the
+#  "License"); you may not use this file except in compliance
+#  with the License.  You may obtain a copy of the License at
+#
+#  http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing, software
+#  distributed under the License is distributed on an "AS IS" BASIS,
+#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#  See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+declare -r TRUE=0
+declare -r FALSE=1
+
+# takes a string and returns true if it seems to represent "yes"
+function isYes() {
+  local x=$1
+  [ $x = "y" -o $x = "Y" -o $x = "yes" ] && echo $TRUE; return
+  echo $FALSE
+}
+
+function mkDir() {
+  local x=$1
+  echo ${x// /-} | tr '[:upper:]' '[:lower:]'
+}
+
+function mkPackage() {
+  local x=$1
+  echo ${x//./\/}
+}
+
+defaultProjectName="quickstart"
+defaultOrganization="org.myorg.quickstart"
+defaultVersion="0.1-SNAPSHOT"
+defaultScalaBinaryVersion="2.11"
+defaultFlinkVersion="1.6-SNAPSHOT"
+
+echo "This script creates a Flink project using Java and Gradle."
+
+while [ $TRUE ]; do
+
+  echo ""
+  read -p "Project name ($defaultProjectName): " projectName
+  projectName=${projectName:-$defaultProjectName}
+  read -p "Organization ($defaultOrganization): " organization
+  organization=${organization:-$defaultOrganization}
+  read -p "Version ($defaultVersion): " version
+  version=${version:-$defaultVersion}
+  read -p "Scala version ($defaultScalaBinaryVersion): " scalaBinaryVersion
+  scalaBinaryVersion=${scalaBinaryVersion:-$defaultScalaBinaryVersion}
+  read -p "Flink version ($defaultFlinkVersion): " flinkVersion
+  flinkVersion=${flinkVersion:-$defaultFlinkVersion}
+
+  echo ""
+  echo "---"
+  echo "Project Name: ${projectName}"
+  echo "Organization: ${organization}"
+  echo "Version: ${version}"
+  echo "Scala binary version: ${scalaBinaryVersion}"
+  echo "Flink version: ${flinkVersion}"
+  echo "---"
+  read -p "Create Project? (Y/n): " createProject
+  createProject=${createProject:-y}
+
+  [ "$(isYes "${createProject}")" = "$TRUE" ] && break
+
+done
+
+directoryName=$(mkDir "${projectName}")
+
+echo "Creating Flink project under ${directoryName}"
+
+mkdir -p ${directoryName}
+cd ${directoryName}
+
+# Create the README file
+
+cat > README <
+cat > settings.gradle <
+cat > build.gradle <
+// Explicitly define the libraries we want to be included in the 
+// "flinkShadowJar" configuration!
+configurations {
+flinkShadowJar // dependencies which go into the shadowJar
+
+// always exclude these (also from transitive dependencies) since they are 
provided by Flink
+flinkShadowJar.exclude group: 'org.apache.flink', module: 'force-shading'
+flinkShadowJar.exclude group: 'com.google.code.findbugs', module: 'jsr305'
+flinkShadowJar.exclude group: 'org.slf4j'
+flinkShadowJar.exclude group: 'log4j'
+}
+
+// declare the dependencies for your production and test code
+dependencies {
+compile "org.apache.flink:flink-java:\${flinkVersion}"
+compile 
"org.apache.flink:flink-streaming-java_\${scalaBinaryVersion}:\${flinkVersion}"
+
+// Add connector dependencies here.
 
 Review comment:
   Not only `connector`. In general I would make this section more prominent. 
Maybe using
   ```
   // --
   // Add dependencies here that should NOT be part of the 
   // shadow jar and are provided in the lib folder of Flink
   // --
   ...
   // 

[jira] [Commented] (FLINK-9222) Add a Gradle Quickstart

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578466#comment-16578466
 ] 

ASF GitHub Bot commented on FLINK-9222:
---

twalthr commented on a change in pull request #105: [FLINK-9222] add gradle 
quickstart script
URL: https://github.com/apache/flink-web/pull/105#discussion_r209652918
 
 

 ##
 File path: q/gradle-quickstart.sh
 ##
 @@ -0,0 +1,226 @@
+#!/bin/bash
+
+
+#  Licensed to the Apache Software Foundation (ASF) under one
+#  or more contributor license agreements.  See the NOTICE file
+#  distributed with this work for additional information
+#  regarding copyright ownership.  The ASF licenses this file
+#  to you under the Apache License, Version 2.0 (the
+#  "License"); you may not use this file except in compliance
+#  with the License.  You may obtain a copy of the License at
+#
+#  http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing, software
+#  distributed under the License is distributed on an "AS IS" BASIS,
+#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#  See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+declare -r TRUE=0
+declare -r FALSE=1
+
+# takes a string and returns true if it seems to represent "yes"
+function isYes() {
+  local x=$1
+  [ $x = "y" -o $x = "Y" -o $x = "yes" ] && echo $TRUE; return
+  echo $FALSE
+}
+
+function mkDir() {
+  local x=$1
+  echo ${x// /-} | tr '[:upper:]' '[:lower:]'
+}
+
+function mkPackage() {
+  local x=$1
+  echo ${x//./\/}
+}
+
+defaultProjectName="quickstart"
+defaultOrganization="org.myorg.quickstart"
+defaultVersion="0.1-SNAPSHOT"
+defaultScalaBinaryVersion="2.11"
+defaultFlinkVersion="1.6-SNAPSHOT"
+
+echo "This script creates a Flink project using Java and Gradle."
+
+while [ $TRUE ]; do
+
+  echo ""
+  read -p "Project name ($defaultProjectName): " projectName
+  projectName=${projectName:-$defaultProjectName}
+  read -p "Organization ($defaultOrganization): " organization
+  organization=${organization:-$defaultOrganization}
+  read -p "Version ($defaultVersion): " version
+  version=${version:-$defaultVersion}
+  read -p "Scala version ($defaultScalaBinaryVersion): " scalaBinaryVersion
+  scalaBinaryVersion=${scalaBinaryVersion:-$defaultScalaBinaryVersion}
+  read -p "Flink version ($defaultFlinkVersion): " flinkVersion
+  flinkVersion=${flinkVersion:-$defaultFlinkVersion}
+
+  echo ""
+  echo "---"
+  echo "Project Name: ${projectName}"
+  echo "Organization: ${organization}"
+  echo "Version: ${version}"
+  echo "Scala binary version: ${scalaBinaryVersion}"
+  echo "Flink version: ${flinkVersion}"
+  echo "---"
+  read -p "Create Project? (Y/n): " createProject
+  createProject=${createProject:-y}
+
+  [ "$(isYes "${createProject}")" = "$TRUE" ] && break
+
+done
+
+directoryName=$(mkDir "${projectName}")
+
+echo "Creating Flink project under ${directoryName}"
+
+mkdir -p ${directoryName}
+cd ${directoryName}
+
+# Create the README file
+
+cat > README <

> Add a Gradle Quickstart
> ---
>
> Key: FLINK-9222
> URL: https://issues.apache.org/jira/browse/FLINK-9222
> Project: Flink
>  Issue Type: Improvement
>  Components: Project Website, Quickstarts
>Reporter: Nico Kruber
>Assignee: Nico Kruber
>Priority: Critical
>  Labels: pull-request-available
>
> Having a proper project template helps a lot in getting dependencies right. 
> For example, setting the core dependencies to "provided", the connector / 
> library dependencies to "compile", etc.
> The Maven quickstarts are in good shape by now, but there is none for Gradle, 
> and Gradle users get this wrong quite often.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9739) Regression in supported filesystems for RocksDB

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578456#comment-16578456
 ] 

ASF GitHub Bot commented on FLINK-9739:
---

zentol commented on issue #6452: [FLINK-9739] Regression in supported 
filesystems for RocksDB
URL: https://github.com/apache/flink/pull/6452#issuecomment-412561511
 
 
   When would a relative path make sense, and wouldn't this always result in 
effectively random behavior?
   
   I would be fine with dropping support for schemes; if we either reject or 
ignore schemes anyway we may as well be explicit about it up-front.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Regression in supported filesystems for RocksDB
> ---
>
> Key: FLINK-9739
> URL: https://issues.apache.org/jira/browse/FLINK-9739
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystem, State Backends, Checkpointing
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Chesnay Schepler
>Assignee: Sampath Bhat
>Priority: Major
>  Labels: pull-request-available
>
> A user reported on the [mailing 
> list|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Checkpointing-in-Flink-1-5-0-tt21173.html]
>  that the {{RocksDBStateBackend}} no longer supports GlusterFS 
> mounted volumes.
> Configuring {{file:///home/abc/share}} led to an exception claiming that the 
> path is not absolute.
> This was working fine in 1.4.2.
> In FLINK-6557 the {{RocksDBStateBackend}} was refactored to use java 
> {{Files}} instead of Flink {{Paths}}, potentially causing the issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] zentol commented on issue #6452: [FLINK-9739] Regression in supported filesystems for RocksDB

2018-08-13 Thread GitBox
zentol commented on issue #6452: [FLINK-9739] Regression in supported 
filesystems for RocksDB
URL: https://github.com/apache/flink/pull/6452#issuecomment-412561511
 
 
   When would a relative path make sense, and wouldn't this always result in 
effectively random behavior?
   
   I would be fine with dropping support for schemes; if we either reject or 
ignore schemes anyway we may as well be explicit about it up-front.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (FLINK-10133) finished job's jobgraph never been cleaned up in zookeeper for standalone clusters (HA mode with multiple masters)

2018-08-13 Thread Xiangyu Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578452#comment-16578452
 ] 

Xiangyu Zhu commented on FLINK-10133:
-

[~Wosinsan] Sure, but I have no access to my servers right now. I will post the 
logs tomorrow.

> finished job's jobgraph never been cleaned up in zookeeper for standalone 
> clusters (HA mode with multiple masters)
> --
>
> Key: FLINK-10133
> URL: https://issues.apache.org/jira/browse/FLINK-10133
> Project: Flink
>  Issue Type: Bug
>  Components: JobManager
>Affects Versions: 1.5.0, 1.5.2, 1.6.0
>Reporter: Xiangyu Zhu
>Priority: Major
>
> Hi,
> We have 3 servers in our test environment, noted as node1-3. Setup is as 
> following:
>  * hadoop hdfs: node1 as namenode, node2,3 as datanode
>  * zookeeper: node1-3 as a quorum (but also tried node1 alone)
>  * flink: node1,2 as masters, node2,3 as slaves
> My understanding is that when a job finishes, the corresponding job's blob data is 
> expected to be deleted from the hdfs path, and the node under zookeeper's path `/\{zk 
> path root}/\{cluster-id}/jobgraphs/\{job id}` should be deleted after that. 
> However we observe that whenever we submitted a job and it finished (via 
> `bin/flink run WordCount.jar`), the blob data is gone whereas job id node 
> under zookeeper is still there, with a uuid style lock node inside it. From 
> the debug node in zookeeper we observed something like "cannot be deleted 
> because non empty". Because of this, as long as a job has finished and the 
> jobgraph node persists, if we restart the clusters or kill one manager (to test 
> HA mode), it tries to recover a finished job, cannot find the blob data 
> under hdfs, and the whole cluster is down.
> If we use only node1 as master and node2,3 as slaves, the jobgraphs node can 
> be deleted successfully. If the jobgraphs node is clean, killing one job manager 
> makes another stand-by JM become the leader, so it is only this jobgraphs 
> issue preventing HA from working.
> I'm not sure if there's something wrong with our configs because this happens 
> every time for finished jobs (we only tested with wordcount.jar though). I'm 
> aware of FLINK-10011 and FLINK-10029, but unlike FLINK-10011 this happens 
> every time, rendering HA mode unusable for us.
> Any idea what might cause this?
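As a diagnostic aid for this report, a small sketch using Apache Curator to list 
the leftover job-graph nodes and their lock children; the connection string and 
the `/flink/default/jobgraphs` path are placeholders that depend on the 
`high-availability.zookeeper.*` settings of the cluster:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Diagnostic sketch only: lists children of the HA jobgraphs path so leftover
// job ids and their lock nodes can be inspected.
public class JobGraphNodeInspector {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "node1:2181,node2:2181,node3:2181",
                new ExponentialBackoffRetry(1000, 3));
        client.start();
        String jobGraphsPath = "/flink/default/jobgraphs";
        for (String jobId : client.getChildren().forPath(jobGraphsPath)) {
            System.out.println("leftover job graph node: " + jobId);
            // A remaining UUID-style child here is the lock node that keeps
            // the job graph from being removed.
            for (String lock : client.getChildren().forPath(jobGraphsPath + "/" + jobId)) {
                System.out.println("  lock node: " + lock);
            }
        }
        client.close();
    }
}
```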



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9739) Regression in supported filesystems for RocksDB

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578448#comment-16578448
 ] 

ASF GitHub Bot commented on FLINK-9739:
---

StephanEwen commented on issue #6452: [FLINK-9739] Regression in supported 
filesystems for RocksDB
URL: https://github.com/apache/flink/pull/6452#issuecomment-412559130
 
 
   Do we want to restore the previous behavior? I mean, RocksDB paths must be 
local file paths. Having URIs with file system schemes in there seems like a 
mistake of the past. That is also the reason why splitting at `:` causes 
trouble, because URIs and directory lists don't go together well.
   
   We could simply drop the old behavior and help smooth the transition.
   
 - Drop the check for absolute path and allow relative paths
 - If "file" occurs as a segment in the path list, abort with the proper 
exception that mentions that URIs are not supported any more and the user 
should switch to file paths only.
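
   A minimal sketch of that proposed parsing, assuming the directory list stays 
':'-separated; the class and method names are illustrative, not the actual 
RocksDBStateBackend code:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Sketch of the proposed behavior: relative paths are allowed, but a "file"
// segment (i.e. a leftover "file://..." URI split at ':') aborts with a hint.
public class LocalDirListParser {
    public static List<File> parse(String dirList) {
        List<File> dirs = new ArrayList<>();
        for (String segment : dirList.split(":")) {
            if (segment.isEmpty()) {
                continue;
            }
            if (segment.equals("file")) {
                throw new IllegalArgumentException(
                    "URIs such as 'file://...' are no longer supported for local "
                        + "directories; please configure plain file paths instead.");
            }
            dirs.add(new File(segment)); // no absolute-path check any more
        }
        return dirs;
    }
}
```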


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Regression in supported filesystems for RocksDB
> ---
>
> Key: FLINK-9739
> URL: https://issues.apache.org/jira/browse/FLINK-9739
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystem, State Backends, Checkpointing
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Chesnay Schepler
>Assignee: Sampath Bhat
>Priority: Major
>  Labels: pull-request-available
>
> A user reported on the [mailing 
> list|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Checkpointing-in-Flink-1-5-0-tt21173.html]
>  that the {{RocksDBStateBackend}} no longer supports GlusterFS 
> mounted volumes.
> Configuring {{file:///home/abc/share}} led to an exception claiming that the 
> path is not absolute.
> This was working fine in 1.4.2.
> In FLINK-6557 the {{RocksDBStateBackend}} was refactored to use java 
> {{Files}} instead of Flink {{Paths}}, potentially causing the issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] StephanEwen commented on issue #6452: [FLINK-9739] Regression in supported filesystems for RocksDB

2018-08-13 Thread GitBox
StephanEwen commented on issue #6452: [FLINK-9739] Regression in supported 
filesystems for RocksDB
URL: https://github.com/apache/flink/pull/6452#issuecomment-412559130
 
 
   Do we want to restore the previous behavior? I mean, RocksDB paths must be 
local file paths. Having URIs with file system schemes in there seems like a 
mistake of the past. That is also the reason why splitting at `:` causes 
trouble, because URIs and directory lists don't go together well.
   
   We could simply drop the old behavior and help smooth the transition.
   
 - Drop the check for absolute path and allow relative paths
 - If "file" occurs as a segment in the path list, abort with the proper 
exception that mentions that URIs are not supported any more and the user 
should switch to file paths only.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (FLINK-9400) change import statement for flink-scala

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578401#comment-16578401
 ] 

ASF GitHub Bot commented on FLINK-9400:
---

zentol closed pull request #6042: [FLINK-9400] normalize import statement style 
for flink-scala
URL: https://github.com/apache/flink/pull/6042
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git 
a/flink-scala/src/main/scala/org/apache/flink/api/scala/CoGroupDataSet.scala 
b/flink-scala/src/main/scala/org/apache/flink/api/scala/CoGroupDataSet.scala
index aa6b47b9649..4510e52e6fd 100644
--- a/flink-scala/src/main/scala/org/apache/flink/api/scala/CoGroupDataSet.scala
+++ b/flink-scala/src/main/scala/org/apache/flink/api/scala/CoGroupDataSet.scala
@@ -23,9 +23,9 @@ import org.apache.flink.annotation.{Internal, Public}
 import org.apache.flink.api.common.InvalidProgramException
 import org.apache.flink.api.common.functions.{CoGroupFunction, Partitioner, 
RichCoGroupFunction}
 import org.apache.flink.api.common.operators.{Keys, Order}
+import org.apache.flink.api.common.operators.Keys.ExpressionKeys
 import org.apache.flink.api.common.typeinfo.TypeInformation
 import org.apache.flink.api.common.typeutils.CompositeType
-import Keys.ExpressionKeys
 import org.apache.flink.api.java.operators._
 import org.apache.flink.util.Collector
 
diff --git 
a/flink-scala/src/main/scala/org/apache/flink/api/scala/DataSet.scala 
b/flink-scala/src/main/scala/org/apache/flink/api/scala/DataSet.scala
index 71037c3ae1d..6ba6455c17f 100644
--- a/flink-scala/src/main/scala/org/apache/flink/api/scala/DataSet.scala
+++ b/flink-scala/src/main/scala/org/apache/flink/api/scala/DataSet.scala
@@ -23,7 +23,8 @@ import 
org.apache.flink.api.common.accumulators.SerializedListAccumulator
 import org.apache.flink.api.common.aggregators.Aggregator
 import org.apache.flink.api.common.functions._
 import org.apache.flink.api.common.io.{FileOutputFormat, OutputFormat}
-import org.apache.flink.api.common.operators.{Keys, Order, ResourceSpec}
+import org.apache.flink.api.common.operators.{Order, ResourceSpec}
+import org.apache.flink.api.common.operators.Keys.{ExpressionKeys, 
SelectorFunctionKeys}
 import org.apache.flink.api.common.operators.base.JoinOperatorBase.JoinHint
 import org.apache.flink.api.common.operators.base.CrossOperatorBase.CrossHint
 import 
org.apache.flink.api.common.operators.base.PartitionOperatorBase.PartitionMethod
@@ -32,7 +33,6 @@ import org.apache.flink.api.java.Utils.CountHelper
 import org.apache.flink.api.java.aggregation.Aggregations
 import org.apache.flink.api.java.functions.{FirstReducer, KeySelector}
 import org.apache.flink.api.java.io.{PrintingOutputFormat, TextOutputFormat}
-import Keys.ExpressionKeys
 import org.apache.flink.api.java.operators._
 import org.apache.flink.api.java.operators.join.JoinType
 import org.apache.flink.api.java.typeutils.TupleTypeInfoBase
@@ -837,7 +837,7 @@ class DataSet[T: ClassTag](set: JavaDataSet[T]) {
 }
 wrap(new DistinctOperator[T](
   javaSet,
-  new Keys.SelectorFunctionKeys[T, K](
+  new SelectorFunctionKeys[T, K](
 keyExtractor, javaSet.getType, implicitly[TypeInformation[K]]),
 getCallLocationName()))
   }
@@ -865,7 +865,7 @@ class DataSet[T: ClassTag](set: JavaDataSet[T]) {
   def distinct(fields: Int*): DataSet[T] = {
 wrap(new DistinctOperator[T](
   javaSet,
-  new Keys.ExpressionKeys[T](fields.toArray, javaSet.getType),
+  new ExpressionKeys[T](fields.toArray, javaSet.getType),
   getCallLocationName()))
   }
 
@@ -888,7 +888,7 @@ class DataSet[T: ClassTag](set: JavaDataSet[T]) {
   def distinct(firstField: String, otherFields: String*): DataSet[T] = {
 wrap(new DistinctOperator[T](
   javaSet,
-  new Keys.ExpressionKeys[T](firstField +: otherFields.toArray, 
javaSet.getType),
+  new ExpressionKeys[T](firstField +: otherFields.toArray, 
javaSet.getType),
   getCallLocationName()))
   }
 
@@ -911,7 +911,7 @@ class DataSet[T: ClassTag](set: JavaDataSet[T]) {
   def getKey(in: T) = cleanFun(in)
 }
 new GroupedDataSet[T](this,
-  new Keys.SelectorFunctionKeys[T, K](keyExtractor, javaSet.getType, 
keyType))
+  new SelectorFunctionKeys[T, K](keyExtractor, javaSet.getType, keyType))
   }
 
   /**
@@ -926,7 +926,7 @@ class DataSet[T: ClassTag](set: JavaDataSet[T]) {
   def groupBy(fields: Int*): GroupedDataSet[T] = {
 new GroupedDataSet[T](
   this,
-  new Keys.ExpressionKeys[T](fields.toArray, javaSet.getType))
+  new ExpressionKeys[T](fields.toArray, javaSet.getType))
   }
 
   /**
@@ -940,11 +940,11 @@ class DataSet[T: ClassTag](set: JavaDataSet[T]) {
   def groupBy(firstField: String, 

[jira] [Commented] (FLINK-10133) finished job's jobgraph never been cleaned up in zookeeper for standalone clusters (HA mode with multiple masters)

2018-08-13 Thread JIRA


[ 
https://issues.apache.org/jira/browse/FLINK-10133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578403#comment-16578403
 ] 

Dominik Wosiński commented on FLINK-10133:
--

I have faced this issue when I was trying to deploy multiple JobManagers. I was 
not able to solve it. 
But it's weird that it still affects 1.5.2 and 1.6.0, since my idea was that the 
jobs were failing because the removal of blobs is independent of the removal of 
the job graph, and that was fixed in FLINK-9575. Could you post some logs here?

> finished job's jobgraph never been cleaned up in zookeeper for standalone 
> clusters (HA mode with multiple masters)
> --
>
> Key: FLINK-10133
> URL: https://issues.apache.org/jira/browse/FLINK-10133
> Project: Flink
>  Issue Type: Bug
>  Components: JobManager
>Affects Versions: 1.5.0, 1.5.2, 1.6.0
>Reporter: Xiangyu Zhu
>Priority: Major
>
> Hi,
> We have 3 servers in our test environment, noted as node1-3. Setup is as 
> following:
>  * hadoop hdfs: node1 as namenode, node2,3 as datanode
>  * zookeeper: node1-3 as a quorum (but also tried node1 alone)
>  * flink: node1,2 as masters, node2,3 as slaves
> My understanding is that when a job finishes, the corresponding job's blob data is 
> expected to be deleted from the hdfs path, and the node under zookeeper's path `/\{zk 
> path root}/\{cluster-id}/jobgraphs/\{job id}` should be deleted after that. 
> However we observe that whenever we submitted a job and it finished (via 
> `bin/flink run WordCount.jar`), the blob data is gone whereas job id node 
> under zookeeper is still there, with a uuid style lock node inside it. From 
> the debug node in zookeeper we observed something like "cannot be deleted 
> because non empty". Because of this, as long as a job has finished and the 
> jobgraph node persists, if we restart the clusters or kill one manager (to test 
> HA mode), it tries to recover a finished job, cannot find the blob data 
> under hdfs, and the whole cluster is down.
> If we use only node1 as master and node2,3 as slaves, the jobgraphs node can 
> be deleted successfully. If the jobgraphs node is clean, killing one job manager 
> makes another stand-by JM become the leader, so it is only this jobgraphs 
> issue preventing HA from working.
> I'm not sure if there's something wrong with our configs because this happens 
> every time for finished jobs (we only tested with wordcount.jar though). I'm 
> aware of FLINK-10011 and FLINK-10029, but unlike FLINK-10011 this happens 
> every time, rendering HA mode unusable for us.
> Any idea what might cause this?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (FLINK-9400) change import statement for flink-scala

2018-08-13 Thread Chesnay Schepler (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chesnay Schepler closed FLINK-9400.
---
   Resolution: Won't Fix
Fix Version/s: (was: 1.7.0)

> change import statement for flink-scala
> ---
>
> Key: FLINK-9400
> URL: https://issues.apache.org/jira/browse/FLINK-9400
> Project: Flink
>  Issue Type: Improvement
>  Components: Scala API
>Affects Versions: 1.4.1
>Reporter: thinkerou
>Priority: Trivial
>  Labels: easyfix, pull-request-available
>
> The `flink-scala` project has the following import statements:
>  
> ```scala
> import org.apache.flink.api.common.operators.Keys
> import Keys.ExpressionKeys
> ```
> So I want to submit a pull request to fix it.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9400) change import statement for flink-scala

2018-08-13 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-9400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-9400:
--
Labels: easyfix pull-request-available  (was: easyfix)

> change import statement for flink-scala
> ---
>
> Key: FLINK-9400
> URL: https://issues.apache.org/jira/browse/FLINK-9400
> Project: Flink
>  Issue Type: Improvement
>  Components: Scala API
>Affects Versions: 1.4.1
>Reporter: thinkerou
>Priority: Trivial
>  Labels: easyfix, pull-request-available
>
> The `flink-scala` project has the following import statements:
>  
> ```scala
> import org.apache.flink.api.common.operators.Keys
> import Keys.ExpressionKeys
> ```
> So I want to submit a pull request to fix it.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[GitHub] zentol closed pull request #6042: [FLINK-9400] normalize import statement style for flink-scala

2018-08-13 Thread GitBox
zentol closed pull request #6042: [FLINK-9400] normalize import statement style 
for flink-scala
URL: https://github.com/apache/flink/pull/6042
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git 
a/flink-scala/src/main/scala/org/apache/flink/api/scala/CoGroupDataSet.scala 
b/flink-scala/src/main/scala/org/apache/flink/api/scala/CoGroupDataSet.scala
index aa6b47b9649..4510e52e6fd 100644
--- a/flink-scala/src/main/scala/org/apache/flink/api/scala/CoGroupDataSet.scala
+++ b/flink-scala/src/main/scala/org/apache/flink/api/scala/CoGroupDataSet.scala
@@ -23,9 +23,9 @@ import org.apache.flink.annotation.{Internal, Public}
 import org.apache.flink.api.common.InvalidProgramException
 import org.apache.flink.api.common.functions.{CoGroupFunction, Partitioner, 
RichCoGroupFunction}
 import org.apache.flink.api.common.operators.{Keys, Order}
+import org.apache.flink.api.common.operators.Keys.ExpressionKeys
 import org.apache.flink.api.common.typeinfo.TypeInformation
 import org.apache.flink.api.common.typeutils.CompositeType
-import Keys.ExpressionKeys
 import org.apache.flink.api.java.operators._
 import org.apache.flink.util.Collector
 
diff --git 
a/flink-scala/src/main/scala/org/apache/flink/api/scala/DataSet.scala 
b/flink-scala/src/main/scala/org/apache/flink/api/scala/DataSet.scala
index 71037c3ae1d..6ba6455c17f 100644
--- a/flink-scala/src/main/scala/org/apache/flink/api/scala/DataSet.scala
+++ b/flink-scala/src/main/scala/org/apache/flink/api/scala/DataSet.scala
@@ -23,7 +23,8 @@ import 
org.apache.flink.api.common.accumulators.SerializedListAccumulator
 import org.apache.flink.api.common.aggregators.Aggregator
 import org.apache.flink.api.common.functions._
 import org.apache.flink.api.common.io.{FileOutputFormat, OutputFormat}
-import org.apache.flink.api.common.operators.{Keys, Order, ResourceSpec}
+import org.apache.flink.api.common.operators.{Order, ResourceSpec}
+import org.apache.flink.api.common.operators.Keys.{ExpressionKeys, 
SelectorFunctionKeys}
 import org.apache.flink.api.common.operators.base.JoinOperatorBase.JoinHint
 import org.apache.flink.api.common.operators.base.CrossOperatorBase.CrossHint
 import 
org.apache.flink.api.common.operators.base.PartitionOperatorBase.PartitionMethod
@@ -32,7 +33,6 @@ import org.apache.flink.api.java.Utils.CountHelper
 import org.apache.flink.api.java.aggregation.Aggregations
 import org.apache.flink.api.java.functions.{FirstReducer, KeySelector}
 import org.apache.flink.api.java.io.{PrintingOutputFormat, TextOutputFormat}
-import Keys.ExpressionKeys
 import org.apache.flink.api.java.operators._
 import org.apache.flink.api.java.operators.join.JoinType
 import org.apache.flink.api.java.typeutils.TupleTypeInfoBase
@@ -837,7 +837,7 @@ class DataSet[T: ClassTag](set: JavaDataSet[T]) {
 }
 wrap(new DistinctOperator[T](
   javaSet,
-  new Keys.SelectorFunctionKeys[T, K](
+  new SelectorFunctionKeys[T, K](
 keyExtractor, javaSet.getType, implicitly[TypeInformation[K]]),
 getCallLocationName()))
   }
@@ -865,7 +865,7 @@ class DataSet[T: ClassTag](set: JavaDataSet[T]) {
   def distinct(fields: Int*): DataSet[T] = {
 wrap(new DistinctOperator[T](
   javaSet,
-  new Keys.ExpressionKeys[T](fields.toArray, javaSet.getType),
+  new ExpressionKeys[T](fields.toArray, javaSet.getType),
   getCallLocationName()))
   }
 
@@ -888,7 +888,7 @@ class DataSet[T: ClassTag](set: JavaDataSet[T]) {
   def distinct(firstField: String, otherFields: String*): DataSet[T] = {
 wrap(new DistinctOperator[T](
   javaSet,
-  new Keys.ExpressionKeys[T](firstField +: otherFields.toArray, 
javaSet.getType),
+  new ExpressionKeys[T](firstField +: otherFields.toArray, 
javaSet.getType),
   getCallLocationName()))
   }
 
@@ -911,7 +911,7 @@ class DataSet[T: ClassTag](set: JavaDataSet[T]) {
   def getKey(in: T) = cleanFun(in)
 }
 new GroupedDataSet[T](this,
-  new Keys.SelectorFunctionKeys[T, K](keyExtractor, javaSet.getType, 
keyType))
+  new SelectorFunctionKeys[T, K](keyExtractor, javaSet.getType, keyType))
   }
 
   /**
@@ -926,7 +926,7 @@ class DataSet[T: ClassTag](set: JavaDataSet[T]) {
   def groupBy(fields: Int*): GroupedDataSet[T] = {
 new GroupedDataSet[T](
   this,
-  new Keys.ExpressionKeys[T](fields.toArray, javaSet.getType))
+  new ExpressionKeys[T](fields.toArray, javaSet.getType))
   }
 
   /**
@@ -940,11 +940,11 @@ class DataSet[T: ClassTag](set: JavaDataSet[T]) {
   def groupBy(firstField: String, otherFields: String*): GroupedDataSet[T] = {
 new GroupedDataSet[T](
   this,
-  new Keys.ExpressionKeys[T](firstField +: otherFields.toArray, 
javaSet.getType))
+  new ExpressionKeys[T](firstField +: otherFields.toArray, 

[jira] [Commented] (FLINK-9964) Add a CSV table format factory

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578393#comment-16578393
 ] 

ASF GitHub Bot commented on FLINK-9964:
---

twalthr commented on a change in pull request #6541: [FLINK-9964] [table] Add a 
CSV table format factory
URL: https://github.com/apache/flink/pull/6541#discussion_r209615214
 
 

 ##
 File path: 
flink-formats/flink-csv/src/main/java/org/apache/flink/formats/csv/CsvRowSchemaConverter.java
 ##
 @@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.formats.csv;
+
+import org.apache.flink.api.common.typeinfo.BasicArrayTypeInfo;
+import org.apache.flink.api.common.typeinfo.PrimitiveArrayTypeInfo;
+import org.apache.flink.api.common.typeinfo.TypeInformation;
+import org.apache.flink.api.common.typeinfo.Types;
+import org.apache.flink.api.java.typeutils.ObjectArrayTypeInfo;
+import org.apache.flink.api.java.typeutils.RowTypeInfo;
+
+import com.fasterxml.jackson.dataformat.csv.CsvSchema;
+import com.fasterxml.jackson.dataformat.csv.CsvSchema.Builder;
+import com.fasterxml.jackson.dataformat.csv.CsvSchema.Column;
+
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * Converting functions that related to {@link CsvSchema}.
+ * In {@link CsvSchema}, there are four types(string,number,boolean
+ * and array), in order to satisfy various flink types, this class
+ * sorts out instances of {@link TypeInformation} and convert them to
+ * one of CsvSchema's types.
+ */
+public class CsvRowSchemaConverter {
+
+   /**
+* Types that can be converted to ColumnType.NUMBER.
+*/
+   private static final List<TypeInformation<?>> NUMBER_TYPES =
+   Arrays.asList(Types.LONG, Types.INT, Types.DOUBLE, Types.FLOAT,
 
 Review comment:
   Use a `HashSet` instead.
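
   For example, the suggestion could look roughly like this (a sketch, not the 
code from the PR):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeinfo.Types;

// Sketch: constant-time membership checks for the NUMBER types.
public class NumberTypes {
    private static final Set<TypeInformation<?>> NUMBER_TYPES = new HashSet<>(
        Arrays.asList(Types.LONG, Types.INT, Types.DOUBLE, Types.FLOAT,
            Types.BIG_DEC, Types.BIG_INT));

    static boolean isNumberType(TypeInformation<?> type) {
        return NUMBER_TYPES.contains(type);
    }
}
```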


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add a CSV table format factory
> --
>
> Key: FLINK-9964
> URL: https://issues.apache.org/jira/browse/FLINK-9964
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table API & SQL
>Reporter: Timo Walther
>Assignee: buptljy
>Priority: Major
>  Labels: pull-request-available
>
> We should add a RFC 4180 compliant CSV table format factory to read and write 
> data into Kafka and other connectors. This requires a 
> {{SerializationSchemaFactory}} and {{DeserializationSchemaFactory}}. How we 
> want to represent all data types and nested types is still up for discussion. 
> For example, we could flatten and deflatten nested types as it is done 
> [here|http://support.gnip.com/articles/json2csv.html]. We can also have a 
> look how tools such as the Avro to CSV tool perform the conversion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9964) Add a CSV table format factory

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578397#comment-16578397
 ] 

ASF GitHub Bot commented on FLINK-9964:
---

twalthr commented on a change in pull request #6541: [FLINK-9964] [table] Add a 
CSV table format factory
URL: https://github.com/apache/flink/pull/6541#discussion_r209619878
 
 

 ##
 File path: 
flink-formats/flink-csv/src/main/java/org/apache/flink/formats/csv/CsvRowSchemaConverter.java
 ##
 @@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.formats.csv;
+
+import org.apache.flink.api.common.typeinfo.BasicArrayTypeInfo;
+import org.apache.flink.api.common.typeinfo.PrimitiveArrayTypeInfo;
+import org.apache.flink.api.common.typeinfo.TypeInformation;
+import org.apache.flink.api.common.typeinfo.Types;
+import org.apache.flink.api.java.typeutils.ObjectArrayTypeInfo;
+import org.apache.flink.api.java.typeutils.RowTypeInfo;
+
+import com.fasterxml.jackson.dataformat.csv.CsvSchema;
+import com.fasterxml.jackson.dataformat.csv.CsvSchema.Builder;
+import com.fasterxml.jackson.dataformat.csv.CsvSchema.Column;
+
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * Converting functions that related to {@link CsvSchema}.
+ * In {@link CsvSchema}, there are four types(string,number,boolean
+ * and array), in order to satisfy various flink types, this class
+ * sorts out instances of {@link TypeInformation} and convert them to
+ * one of CsvSchema's types.
+ */
+public class CsvRowSchemaConverter {
+
+   /**
+* Types that can be converted to ColumnType.NUMBER.
+*/
+   private static final List<TypeInformation<?>> NUMBER_TYPES =
+   Arrays.asList(Types.LONG, Types.INT, Types.DOUBLE, Types.FLOAT,
+   Types.BIG_DEC, Types.BIG_INT);
+
+   /**
+* Types that can be converted to ColumnType.STRING.
+*/
+   private static final List<TypeInformation<?>> STRING_TYPES =
+   Arrays.asList(Types.STRING, Types.SQL_DATE, Types.SQL_TIME, 
Types.SQL_TIMESTAMP);
+
+   /**
+* Types that can be converted to ColumnType.BOOLEAN.
+*/
+   private static final List<TypeInformation<?>> BOOLEAN_TYPES =
+   Collections.singletonList(Types.BOOLEAN);
+
+   /**
+* Convert {@link RowTypeInfo} to {@link CsvSchema}.
+* @param rowType
+* @return {@link CsvSchema}
+*/
+   public static CsvSchema rowTypeToCsvSchema(RowTypeInfo rowType) {
+   Builder builder = new CsvSchema.Builder();
+   String[] fields = rowType.getFieldNames();
+   TypeInformation[] infos = rowType.getFieldTypes();
+   for (int i = 0; i < rowType.getArity(); i++) {
+   builder.addColumn(new Column(i, fields[i], 
convertType(infos[i])));
 
 Review comment:
   Is the converter considering the global properties for array separation? I 
guess yes, right?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add a CSV table format factory
> --
>
> Key: FLINK-9964
> URL: https://issues.apache.org/jira/browse/FLINK-9964
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table API & SQL
>Reporter: Timo Walther
>Assignee: buptljy
>Priority: Major
>  Labels: pull-request-available
>
> We should add a RFC 4180 compliant CSV table format factory to read and write 
> data into Kafka and other connectors. This requires a 
> {{SerializationSchemaFactory}} and {{DeserializationSchemaFactory}}. How we 
> want to represent all data types and nested types is still up for discussion. 
> For example, we could flatten and deflatten nested types as it is done 
> [here|http://support.gnip.com/articles/json2csv.html]. We can also have a 
> look 

[jira] [Commented] (FLINK-9964) Add a CSV table format factory

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578389#comment-16578389
 ] 

ASF GitHub Bot commented on FLINK-9964:
---

twalthr commented on a change in pull request #6541: [FLINK-9964] [table] Add a 
CSV table format factory
URL: https://github.com/apache/flink/pull/6541#discussion_r209612567
 
 

 ##
 File path: 
flink-formats/flink-csv/src/main/java/org/apache/flink/formats/csv/CsvRowFormatFactory.java
 ##
 @@ -0,0 +1,131 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.formats.csv;
+
+import org.apache.flink.api.common.serialization.DeserializationSchema;
+import org.apache.flink.api.common.serialization.SerializationSchema;
+import org.apache.flink.api.common.typeinfo.TypeInformation;
+import org.apache.flink.table.descriptors.CsvValidator;
+import org.apache.flink.table.descriptors.DescriptorProperties;
+import org.apache.flink.table.descriptors.FormatDescriptorValidator;
+import org.apache.flink.table.factories.DeserializationSchemaFactory;
+import org.apache.flink.table.factories.SerializationSchemaFactory;
+import org.apache.flink.types.Row;
+
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * Table format for providing configured instances of CSV-to-row {@link 
SerializationSchema}
+ * and {@link DeserializationSchema}.
+ */
+public class CsvRowFormatFactory implements SerializationSchemaFactory<Row>,
+   DeserializationSchemaFactory<Row> {
+
+   @Override
+   public Map<String, String> requiredContext() {
+   final Map<String, String> context = new HashMap<>();
+   context.put(FormatDescriptorValidator.FORMAT_TYPE(), 
CsvValidator.FORMAT_TYPE_VALUE());
+   
context.put(FormatDescriptorValidator.FORMAT_PROPERTY_VERSION(), "1");
+   return context;
+   }
+
+   @Override
+   public boolean supportsSchemaDerivation() {
+   return false;
 
 Review comment:
   Could we add support for schema derivation as well? It gets more and more 
complicated in the future if each format supports different features. We should 
add all features in one PR.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add a CSV table format factory
> --
>
> Key: FLINK-9964
> URL: https://issues.apache.org/jira/browse/FLINK-9964
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table API  SQL
>Reporter: Timo Walther
>Assignee: buptljy
>Priority: Major
>  Labels: pull-request-available
>
> We should add a RFC 4180 compliant CSV table format factory to read and write 
> data into Kafka and other connectors. This requires a 
> {{SerializationSchemaFactory}} and {{DeserializationSchemaFactory}}. How we 
> want to represent all data types and nested types is still up for discussion. 
> For example, we could flatten and deflatten nested types as it is done 
> [here|http://support.gnip.com/articles/json2csv.html]. We can also have a 
> look how tools such as the Avro to CSV tool perform the conversion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9964) Add a CSV table format factory

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578383#comment-16578383
 ] 

ASF GitHub Bot commented on FLINK-9964:
---

twalthr commented on a change in pull request #6541: [FLINK-9964] [table] Add a 
CSV table format factory
URL: https://github.com/apache/flink/pull/6541#discussion_r209605365
 
 

 ##
 File path: 
flink-formats/flink-csv/src/main/java/org/apache/flink/formats/csv/CsvRowDeserializationSchema.java
 ##
 @@ -0,0 +1,228 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.formats.csv;
+
+import org.apache.flink.annotation.PublicEvolving;
+import org.apache.flink.api.common.serialization.DeserializationSchema;
+import org.apache.flink.api.common.typeinfo.BasicArrayTypeInfo;
+import org.apache.flink.api.common.typeinfo.PrimitiveArrayTypeInfo;
+import org.apache.flink.api.common.typeinfo.TypeInformation;
+import org.apache.flink.api.common.typeinfo.Types;
+import org.apache.flink.api.java.typeutils.RowTypeInfo;
+import org.apache.flink.types.Row;
+import org.apache.flink.util.Preconditions;
+
+import com.fasterxml.jackson.databind.JsonNode;
+import com.fasterxml.jackson.databind.node.ArrayNode;
+import com.fasterxml.jackson.databind.node.TextNode;
+import com.fasterxml.jackson.dataformat.csv.CsvMapper;
+import com.fasterxml.jackson.dataformat.csv.CsvSchema;
+
+import java.io.IOException;
+import java.io.UnsupportedEncodingException;
+import java.lang.reflect.Array;
+import java.math.BigDecimal;
+import java.math.BigInteger;
+import java.sql.Date;
+import java.sql.Time;
+import java.sql.Timestamp;
+
+/**
+ * Deserialization schema from CSV to Flink types.
+ *
+ * Deserializes a byte[] message as a {@link JsonNode} and
+ * convert it to {@link Row}.
+ *
+ * Failure during deserialization are forwarded as wrapped IOExceptions.
+ */
+@PublicEvolving
+public class CsvRowDeserializationSchema implements DeserializationSchema<Row> {
+
+   /** Schema describing the input csv data. */
+   private CsvSchema csvSchema;
+
+   /** Type information describing the input csv data. */
+   private TypeInformation<Row> rowTypeInfo;
+
+   /** CsvMapper used to write {@link JsonNode} into bytes. */
+   private CsvMapper csvMapper = new CsvMapper();
+
+   /** Charset for byte[]. */
+   private String charset = "UTF-8";
+
+
+   /**
+* Create a csv row DeserializationSchema with given {@link 
TypeInformation}.
+*/
+   CsvRowDeserializationSchema(TypeInformation<Row> rowTypeInfo) {
+   Preconditions.checkNotNull(rowTypeInfo, "rowTypeInfo must not 
be null !");
+   this.rowTypeInfo = rowTypeInfo;
+   this.csvSchema = 
CsvRowSchemaConverter.rowTypeToCsvSchema((RowTypeInfo) rowTypeInfo);
+   }
+
+   @Override
+   public Row deserialize(byte[] message) throws IOException {
+   JsonNode root = csvMapper.readerFor(JsonNode.class)
+   .with(csvSchema).readValue(message);
+   return convertRow(root, (RowTypeInfo) rowTypeInfo);
+   }
+
+   @Override
+   public boolean isEndOfStream(Row nextElement) {
+   return false;
+   }
+
+   @Override
+   public TypeInformation<Row> getProducedType() {
+   return rowTypeInfo;
+   }
+
+   /**
+*
+* @param root json node that contains a row's data.
+* @param rowTypeInfo type information for root.
+* @return result row
+*/
+   private Row convertRow(JsonNode root, RowTypeInfo rowTypeInfo) {
+   String[] fields = rowTypeInfo.getFieldNames();
+   TypeInformation[] types = rowTypeInfo.getFieldTypes();
+   Row row = new Row(fields.length);
+
+   for (int i = 0; i < fields.length; i++) {
+   String columnName = fields[i];
+   JsonNode node = root.get(columnName);
+   row.setField(i, convert(node, types[i]));
+   }
+   return row;
+   }
+
+   /**
+*
+* @param node array node that contains a row's data.

[jira] [Commented] (FLINK-9964) Add a CSV table format factory

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578400#comment-16578400
 ] 

ASF GitHub Bot commented on FLINK-9964:
---

twalthr commented on a change in pull request #6541: [FLINK-9964] [table] Add a 
CSV table format factory
URL: https://github.com/apache/flink/pull/6541#discussion_r209625213
 
 

 ##
 File path: 
flink-formats/flink-csv/src/main/java/org/apache/flink/formats/csv/CsvRowDeserializationSchema.java
 ##
 @@ -0,0 +1,228 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.formats.csv;
+
+import org.apache.flink.annotation.PublicEvolving;
+import org.apache.flink.api.common.serialization.DeserializationSchema;
+import org.apache.flink.api.common.typeinfo.BasicArrayTypeInfo;
+import org.apache.flink.api.common.typeinfo.PrimitiveArrayTypeInfo;
+import org.apache.flink.api.common.typeinfo.TypeInformation;
+import org.apache.flink.api.common.typeinfo.Types;
+import org.apache.flink.api.java.typeutils.RowTypeInfo;
+import org.apache.flink.types.Row;
+import org.apache.flink.util.Preconditions;
+
+import com.fasterxml.jackson.databind.JsonNode;
+import com.fasterxml.jackson.databind.node.ArrayNode;
+import com.fasterxml.jackson.databind.node.TextNode;
+import com.fasterxml.jackson.dataformat.csv.CsvMapper;
+import com.fasterxml.jackson.dataformat.csv.CsvSchema;
+
+import java.io.IOException;
+import java.io.UnsupportedEncodingException;
+import java.lang.reflect.Array;
+import java.math.BigDecimal;
+import java.math.BigInteger;
+import java.sql.Date;
+import java.sql.Time;
+import java.sql.Timestamp;
+
+/**
+ * Deserialization schema from CSV to Flink types.
+ *
+ * Deserializes a byte[] message as a {@link JsonNode} and
+ * convert it to {@link Row}.
+ *
+ * Failure during deserialization are forwarded as wrapped IOExceptions.
+ */
+@PublicEvolving
+public class CsvRowDeserializationSchema implements DeserializationSchema<Row> {
+
+   /** Schema describing the input csv data. */
+   private CsvSchema csvSchema;
+
+   /** Type information describing the input csv data. */
+   private TypeInformation<Row> rowTypeInfo;
+
+   /** CsvMapper used to write {@link JsonNode} into bytes. */
+   private CsvMapper csvMapper = new CsvMapper();
+
+   /** Charset for byte[]. */
+   private String charset = "UTF-8";
+
+
+   /**
+* Create a csv row DeserializationSchema with given {@link 
TypeInformation}.
+*/
+   CsvRowDeserializationSchema(TypeInformation<Row> rowTypeInfo) {
+   Preconditions.checkNotNull(rowTypeInfo, "rowTypeInfo must not 
be null !");
+   this.rowTypeInfo = rowTypeInfo;
+   this.csvSchema = 
CsvRowSchemaConverter.rowTypeToCsvSchema((RowTypeInfo) rowTypeInfo);
+   }
+
+   @Override
+   public Row deserialize(byte[] message) throws IOException {
+   JsonNode root = csvMapper.readerFor(JsonNode.class)
 
 Review comment:
   The method calls here are expensive for every record. Can we initialize the 
mapper before and only call `readValue` here?
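
A minimal sketch of that suggestion, reusing the existing csvMapper/csvSchema fields and building the reader only once; the `objectReader` field is an assumption added for illustration, not part of the PR:

   import com.fasterxml.jackson.databind.ObjectReader;

   /** Preconfigured reader, created once and reused for every record (assumed new field). */
   private transient ObjectReader objectReader;

   @Override
   public Row deserialize(byte[] message) throws IOException {
           if (objectReader == null) {
                   // Built lazily so the schema object does not have to be serializable with the class.
                   objectReader = csvMapper.readerFor(JsonNode.class).with(csvSchema);
           }
           // readValue is now the only per-record call.
           JsonNode root = objectReader.readValue(message);
           return convertRow(root, (RowTypeInfo) rowTypeInfo);
   }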


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add a CSV table format factory
> --
>
> Key: FLINK-9964
> URL: https://issues.apache.org/jira/browse/FLINK-9964
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table API  SQL
>Reporter: Timo Walther
>Assignee: buptljy
>Priority: Major
>  Labels: pull-request-available
>
> We should add a RFC 4180 compliant CSV table format factory to read and write 
> data into Kafka and other connectors. This requires a 
> {{SerializationSchemaFactory}} and {{DeserializationSchemaFactory}}. How we 
> want to represent all data types and nested types is still up for discussion. 
> For example, we 

[jira] [Commented] (FLINK-9964) Add a CSV table format factory

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578398#comment-16578398
 ] 

ASF GitHub Bot commented on FLINK-9964:
---

twalthr commented on a change in pull request #6541: [FLINK-9964] [table] Add a 
CSV table format factory
URL: https://github.com/apache/flink/pull/6541#discussion_r209620482
 
 

 ##
 File path: 
flink-formats/flink-csv/src/main/java/org/apache/flink/formats/csv/CsvRowSchemaConverter.java
 ##
 @@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.formats.csv;
+
+import org.apache.flink.api.common.typeinfo.BasicArrayTypeInfo;
+import org.apache.flink.api.common.typeinfo.PrimitiveArrayTypeInfo;
+import org.apache.flink.api.common.typeinfo.TypeInformation;
+import org.apache.flink.api.common.typeinfo.Types;
+import org.apache.flink.api.java.typeutils.ObjectArrayTypeInfo;
+import org.apache.flink.api.java.typeutils.RowTypeInfo;
+
+import com.fasterxml.jackson.dataformat.csv.CsvSchema;
+import com.fasterxml.jackson.dataformat.csv.CsvSchema.Builder;
+import com.fasterxml.jackson.dataformat.csv.CsvSchema.Column;
+
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * Conversion functions related to {@link CsvSchema}.
+ * A {@link CsvSchema} only knows four column types (string, number, boolean
+ * and array), so to cover the various Flink types this class maps
+ * instances of {@link TypeInformation} to one of these column types.
+ */
+public class CsvRowSchemaConverter {
+
+   /**
+* Types that can be converted to ColumnType.NUMBER.
+*/
+   private static final List<TypeInformation<?>> NUMBER_TYPES =
+   Arrays.asList(Types.LONG, Types.INT, Types.DOUBLE, Types.FLOAT,
+   Types.BIG_DEC, Types.BIG_INT);
+
+   /**
+* Types that can be converted to ColumnType.STRING.
+*/
+   private static final List<TypeInformation<?>> STRING_TYPES =
+   Arrays.asList(Types.STRING, Types.SQL_DATE, Types.SQL_TIME, 
Types.SQL_TIMESTAMP);
+
+   /**
+* Types that can be converted to ColumnType.BOOLEAN.
+*/
+   private static final List<TypeInformation<?>> BOOLEAN_TYPES =
+   Collections.singletonList(Types.BOOLEAN);
+
+   /**
+* Convert {@link RowTypeInfo} to {@link CsvSchema}.
+* @param rowType
+* @return {@link CsvSchema}
+*/
+   public static CsvSchema rowTypeToCsvSchema(RowTypeInfo rowType) {
+   Builder builder = new CsvSchema.Builder();
+   String[] fields = rowType.getFieldNames();
+   TypeInformation[] infos = rowType.getFieldTypes();
+   for (int i = 0; i < rowType.getArity(); i++) {
+   builder.addColumn(new Column(i, fields[i], 
convertType(infos[i])));
+   }
+   return builder.build();
+   }
+
+   /**
+* Convert {@link TypeInformation} to {@link CsvSchema.ColumnType}
+* based on their catogories.
+* @param info
+* @return {@link CsvSchema.ColumnType}
+*/
+   private static CsvSchema.ColumnType convertType(TypeInformation<?> info) {
+   if (STRING_TYPES.contains(info)) {
+   return CsvSchema.ColumnType.STRING;
+   } else if (NUMBER_TYPES.contains(info)) {
+   return CsvSchema.ColumnType.NUMBER;
+   } else if (BOOLEAN_TYPES.contains(info)) {
+   return CsvSchema.ColumnType.BOOLEAN;
+   } else if (info instanceof ObjectArrayTypeInfo
+   || info instanceof BasicArrayTypeInfo
+   || info instanceof RowTypeInfo) {
+   return CsvSchema.ColumnType.ARRAY;
+   } else if (info instanceof PrimitiveArrayTypeInfo &&
+   ((PrimitiveArrayTypeInfo) info).getComponentType() == 
Types.BYTE) {
+   return CsvSchema.ColumnType.STRING;
+   } else {
+   throw new RuntimeException("Unable to support " 
+ info.toString()
 
 Review 

[jira] [Commented] (FLINK-9964) Add a CSV table format factory

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578399#comment-16578399
 ] 

ASF GitHub Bot commented on FLINK-9964:
---

twalthr commented on a change in pull request #6541: [FLINK-9964] [table] Add a 
CSV table format factory
URL: https://github.com/apache/flink/pull/6541#discussion_r209632669
 
 

 ##
 File path: 
flink-formats/flink-csv/src/main/java/org/apache/flink/formats/csv/CsvRowSerializationSchema.java
 ##
 @@ -0,0 +1,234 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.formats.csv;
+
+import org.apache.flink.annotation.PublicEvolving;
+import org.apache.flink.api.common.serialization.SerializationSchema;
+import org.apache.flink.api.common.typeinfo.BasicArrayTypeInfo;
+import org.apache.flink.api.common.typeinfo.PrimitiveArrayTypeInfo;
+import org.apache.flink.api.common.typeinfo.TypeInformation;
+import org.apache.flink.api.common.typeinfo.Types;
+import org.apache.flink.api.java.typeutils.RowTypeInfo;
+import org.apache.flink.types.Row;
+import org.apache.flink.util.Preconditions;
+
+import com.fasterxml.jackson.core.JsonProcessingException;
+import com.fasterxml.jackson.databind.JsonNode;
+import com.fasterxml.jackson.databind.node.ArrayNode;
+import com.fasterxml.jackson.databind.node.ContainerNode;
+import com.fasterxml.jackson.databind.node.ObjectNode;
+import com.fasterxml.jackson.dataformat.csv.CsvMapper;
+import com.fasterxml.jackson.dataformat.csv.CsvSchema;
+
+import java.io.UnsupportedEncodingException;
+import java.math.BigDecimal;
+import java.math.BigInteger;
+import java.sql.Date;
+import java.sql.Time;
+import java.sql.Timestamp;
+
+/**
+ * Serialization schema that serializes an object of Flink types into a CSV 
bytes.
+ *
+ * Serializes the input row into a {@link ObjectNode} and
+ * converts it into byte[].
+ *
+ * Result byte[] messages can be deserialized using {@link 
CsvRowDeserializationSchema}.
+ */
+@PublicEvolving
+public class CsvRowSerializationSchema implements SerializationSchema<Row> {
+
+   /** Schema describing the input csv data. */
+   private CsvSchema csvSchema;
+
+   /** Type information describing the input csv data. */
+   private TypeInformation<Row> rowTypeInfo;
+
+   /** CsvMapper used to write {@link JsonNode} into bytes. */
+   private CsvMapper csvMapper = new CsvMapper();
+
+   /** Reusable object node. */
+   private ObjectNode root;
+
+   /** Charset for byte[]. */
+   private String charset = "UTF-8";
+
+   /**
+* Create a {@link CsvRowSerializationSchema} with given {@link 
TypeInformation}.
+* @param rowTypeInfo type information used to create the schema.
+*/
+   CsvRowSerializationSchema(TypeInformation<Row> rowTypeInfo) {
+   Preconditions.checkNotNull(rowTypeInfo, "rowTypeInfo must not 
be null !");
+   this.rowTypeInfo = rowTypeInfo;
+   this.csvSchema = 
CsvRowSchemaConverter.rowTypeToCsvSchema((RowTypeInfo) rowTypeInfo);
+   }
+
+   @Override
+   public byte[] serialize(Row row) {
+   if (root == null) {
+   root = csvMapper.createObjectNode();
+   }
+   try {
+   convertRow(root, row, (RowTypeInfo) rowTypeInfo);
+   return 
csvMapper.writer(csvSchema).writeValueAsBytes(root);
+   } catch (JsonProcessingException e) {
+   throw new RuntimeException("Could not serialize row '" 
+ row + "'. " +
+   "Make sure that the schema matches the input.", 
e);
+   }
+   }
+
+   private void convertRow(ObjectNode reuse, Row row, RowTypeInfo 
rowTypeInfo) {
+   if (reuse == null) {
+   reuse = csvMapper.createObjectNode();
+   }
+   if (row.getArity() != rowTypeInfo.getFieldNames().length) {
+   throw new IllegalStateException(String.format(
+   "Number of elements in the row '%s' is 
different from number of field names: %d",
+

[jira] [Commented] (FLINK-9964) Add a CSV table format factory

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578391#comment-16578391
 ] 

ASF GitHub Bot commented on FLINK-9964:
---

twalthr commented on a change in pull request #6541: [FLINK-9964] [table] Add a 
CSV table format factory
URL: https://github.com/apache/flink/pull/6541#discussion_r209610788
 
 

 ##
 File path: 
flink-formats/flink-csv/src/main/java/org/apache/flink/formats/csv/CsvRowDeserializationSchema.java
 ##
 @@ -0,0 +1,228 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.formats.csv;
+
+import org.apache.flink.annotation.PublicEvolving;
+import org.apache.flink.api.common.serialization.DeserializationSchema;
+import org.apache.flink.api.common.typeinfo.BasicArrayTypeInfo;
+import org.apache.flink.api.common.typeinfo.PrimitiveArrayTypeInfo;
+import org.apache.flink.api.common.typeinfo.TypeInformation;
+import org.apache.flink.api.common.typeinfo.Types;
+import org.apache.flink.api.java.typeutils.RowTypeInfo;
+import org.apache.flink.types.Row;
+import org.apache.flink.util.Preconditions;
+
+import com.fasterxml.jackson.databind.JsonNode;
+import com.fasterxml.jackson.databind.node.ArrayNode;
+import com.fasterxml.jackson.databind.node.TextNode;
+import com.fasterxml.jackson.dataformat.csv.CsvMapper;
+import com.fasterxml.jackson.dataformat.csv.CsvSchema;
+
+import java.io.IOException;
+import java.io.UnsupportedEncodingException;
+import java.lang.reflect.Array;
+import java.math.BigDecimal;
+import java.math.BigInteger;
+import java.sql.Date;
+import java.sql.Time;
+import java.sql.Timestamp;
+
+/**
+ * Deserialization schema from CSV to Flink types.
+ *
+ * Deserializes a byte[] message as a {@link JsonNode} and
+ * convert it to {@link Row}.
+ *
+ * Failure during deserialization are forwarded as wrapped IOExceptions.
+ */
+@PublicEvolving
+public class CsvRowDeserializationSchema implements DeserializationSchema<Row> {
+
+   /** Schema describing the input csv data. */
+   private CsvSchema csvSchema;
+
+   /** Type information describing the input csv data. */
+   private TypeInformation<Row> rowTypeInfo;
+
+   /** CsvMapper used to write {@link JsonNode} into bytes. */
+   private CsvMapper csvMapper = new CsvMapper();
+
+   /** Charset for byte[]. */
+   private String charset = "UTF-8";
+
+
+   /**
+* Create a csv row DeserializationSchema with given {@link 
TypeInformation}.
+*/
+   CsvRowDeserializationSchema(TypeInformation<Row> rowTypeInfo) {
+   Preconditions.checkNotNull(rowTypeInfo, "rowTypeInfo must not 
be null !");
+   this.rowTypeInfo = rowTypeInfo;
+   this.csvSchema = 
CsvRowSchemaConverter.rowTypeToCsvSchema((RowTypeInfo) rowTypeInfo);
+   }
+
+   @Override
+   public Row deserialize(byte[] message) throws IOException {
+   JsonNode root = csvMapper.readerFor(JsonNode.class)
+   .with(csvSchema).readValue(message);
+   return convertRow(root, (RowTypeInfo) rowTypeInfo);
+   }
+
+   @Override
+   public boolean isEndOfStream(Row nextElement) {
+   return false;
+   }
+
+   @Override
+   public TypeInformation<Row> getProducedType() {
+   return rowTypeInfo;
+   }
+
+   /**
+*
+* @param root json node that contains a row's data.
+* @param rowTypeInfo type information for root.
+* @return result row
+*/
+   private Row convertRow(JsonNode root, RowTypeInfo rowTypeInfo) {
+   String[] fields = rowTypeInfo.getFieldNames();
+   TypeInformation[] types = rowTypeInfo.getFieldTypes();
+   Row row = new Row(fields.length);
+
+   for (int i = 0; i < fields.length; i++) {
+   String columnName = fields[i];
+   JsonNode node = root.get(columnName);
+   row.setField(i, convert(node, types[i]));
+   }
+   return row;
+   }
+
+   /**
+*
+* @param node array node that contains a row's data.

[jira] [Commented] (FLINK-9964) Add a CSV table format factory

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578379#comment-16578379
 ] 

ASF GitHub Bot commented on FLINK-9964:
---

twalthr commented on a change in pull request #6541: [FLINK-9964] [table] Add a 
CSV table format factory
URL: https://github.com/apache/flink/pull/6541#discussion_r209602000
 
 

 ##
 File path: 
flink-formats/flink-csv/src/main/java/org/apache/flink/formats/csv/CsvRowDeserializationSchema.java
 ##
 @@ -0,0 +1,228 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.formats.csv;
+
+import org.apache.flink.annotation.PublicEvolving;
+import org.apache.flink.api.common.serialization.DeserializationSchema;
+import org.apache.flink.api.common.typeinfo.BasicArrayTypeInfo;
+import org.apache.flink.api.common.typeinfo.PrimitiveArrayTypeInfo;
+import org.apache.flink.api.common.typeinfo.TypeInformation;
+import org.apache.flink.api.common.typeinfo.Types;
+import org.apache.flink.api.java.typeutils.RowTypeInfo;
+import org.apache.flink.types.Row;
+import org.apache.flink.util.Preconditions;
+
+import com.fasterxml.jackson.databind.JsonNode;
+import com.fasterxml.jackson.databind.node.ArrayNode;
+import com.fasterxml.jackson.databind.node.TextNode;
+import com.fasterxml.jackson.dataformat.csv.CsvMapper;
+import com.fasterxml.jackson.dataformat.csv.CsvSchema;
+
+import java.io.IOException;
+import java.io.UnsupportedEncodingException;
+import java.lang.reflect.Array;
+import java.math.BigDecimal;
+import java.math.BigInteger;
+import java.sql.Date;
+import java.sql.Time;
+import java.sql.Timestamp;
+
+/**
+ * Deserialization schema from CSV to Flink types.
+ *
+ * Deserializes a byte[] message as a {@link JsonNode} and
+ * convert it to {@link Row}.
+ *
+ * Failure during deserialization are forwarded as wrapped IOExceptions.
+ */
+@PublicEvolving
+public class CsvRowDeserializationSchema implements DeserializationSchema<Row> {
+
+   /** Schema describing the input csv data. */
+   private CsvSchema csvSchema;
+
+   /** Type information describing the input csv data. */
+   private TypeInformation<Row> rowTypeInfo;
+
+   /** CsvMapper used to write {@link JsonNode} into bytes. */
+   private CsvMapper csvMapper = new CsvMapper();
+
+   /** Charset for byte[]. */
+   private String charset = "UTF-8";
+
+
+   /**
+* Create a csv row DeserializationSchema with given {@link 
TypeInformation}.
+*/
+   CsvRowDeserializationSchema(TypeInformation<Row> rowTypeInfo) {
 
 Review comment:
   Make constructor public?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add a CSV table format factory
> --
>
> Key: FLINK-9964
> URL: https://issues.apache.org/jira/browse/FLINK-9964
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table API  SQL
>Reporter: Timo Walther
>Assignee: buptljy
>Priority: Major
>  Labels: pull-request-available
>
> We should add a RFC 4180 compliant CSV table format factory to read and write 
> data into Kafka and other connectors. This requires a 
> {{SerializationSchemaFactory}} and {{DeserializationSchemaFactory}}. How we 
> want to represent all data types and nested types is still up for discussion. 
> For example, we could flatten and deflatten nested types as it is done 
> [here|http://support.gnip.com/articles/json2csv.html]. We can also have a 
> look how tools such as the Avro to CSV tool perform the conversion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9964) Add a CSV table format factory

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578388#comment-16578388
 ] 

ASF GitHub Bot commented on FLINK-9964:
---

twalthr commented on a change in pull request #6541: [FLINK-9964] [table] Add a 
CSV table format factory
URL: https://github.com/apache/flink/pull/6541#discussion_r209607038
 
 

 ##
 File path: 
flink-formats/flink-csv/src/main/java/org/apache/flink/formats/csv/CsvRowDeserializationSchema.java
 ##
 @@ -0,0 +1,228 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.formats.csv;
+
+import org.apache.flink.annotation.PublicEvolving;
+import org.apache.flink.api.common.serialization.DeserializationSchema;
+import org.apache.flink.api.common.typeinfo.BasicArrayTypeInfo;
+import org.apache.flink.api.common.typeinfo.PrimitiveArrayTypeInfo;
+import org.apache.flink.api.common.typeinfo.TypeInformation;
+import org.apache.flink.api.common.typeinfo.Types;
+import org.apache.flink.api.java.typeutils.RowTypeInfo;
+import org.apache.flink.types.Row;
+import org.apache.flink.util.Preconditions;
+
+import com.fasterxml.jackson.databind.JsonNode;
+import com.fasterxml.jackson.databind.node.ArrayNode;
+import com.fasterxml.jackson.databind.node.TextNode;
+import com.fasterxml.jackson.dataformat.csv.CsvMapper;
+import com.fasterxml.jackson.dataformat.csv.CsvSchema;
+
+import java.io.IOException;
+import java.io.UnsupportedEncodingException;
+import java.lang.reflect.Array;
+import java.math.BigDecimal;
+import java.math.BigInteger;
+import java.sql.Date;
+import java.sql.Time;
+import java.sql.Timestamp;
+
+/**
+ * Deserialization schema from CSV to Flink types.
+ *
+ * Deserializes a byte[] message as a {@link JsonNode} and
+ * convert it to {@link Row}.
+ *
+ * Failure during deserialization are forwarded as wrapped IOExceptions.
+ */
+@PublicEvolving
+public class CsvRowDeserializationSchema implements DeserializationSchema<Row> {
+
+   /** Schema describing the input csv data. */
+   private CsvSchema csvSchema;
 
 Review comment:
   I'm a big fan of immutability. I think all members of this class that are 
not runtime specific should be added to the constructor with `final` modifier.
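
For illustration, a hedged sketch of how the non-runtime members could be made immutable (field names follow the PR; choosing exactly these fields as final is an assumption):

   private final TypeInformation<Row> rowTypeInfo;
   private final CsvSchema csvSchema;
   private final CsvMapper csvMapper = new CsvMapper();

   CsvRowDeserializationSchema(TypeInformation<Row> rowTypeInfo) {
           // checkNotNull returns its argument, so the final fields can be assigned directly.
           this.rowTypeInfo = Preconditions.checkNotNull(rowTypeInfo, "rowTypeInfo must not be null!");
           this.csvSchema = CsvRowSchemaConverter.rowTypeToCsvSchema((RowTypeInfo) rowTypeInfo);
   }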


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add a CSV table format factory
> --
>
> Key: FLINK-9964
> URL: https://issues.apache.org/jira/browse/FLINK-9964
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table API  SQL
>Reporter: Timo Walther
>Assignee: buptljy
>Priority: Major
>  Labels: pull-request-available
>
> We should add a RFC 4180 compliant CSV table format factory to read and write 
> data into Kafka and other connectors. This requires a 
> {{SerializationSchemaFactory}} and {{DeserializationSchemaFactory}}. How we 
> want to represent all data types and nested types is still up for discussion. 
> For example, we could flatten and deflatten nested types as it is done 
> [here|http://support.gnip.com/articles/json2csv.html]. We can also have a 
> look how tools such as the Avro to CSV tool perform the conversion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9964) Add a CSV table format factory

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578387#comment-16578387
 ] 

ASF GitHub Bot commented on FLINK-9964:
---

twalthr commented on a change in pull request #6541: [FLINK-9964] [table] Add a 
CSV table format factory
URL: https://github.com/apache/flink/pull/6541#discussion_r209610191
 
 

 ##
 File path: 
flink-formats/flink-csv/src/main/java/org/apache/flink/formats/csv/CsvRowDeserializationSchema.java
 ##
 @@ -0,0 +1,228 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.formats.csv;
+
+import org.apache.flink.annotation.PublicEvolving;
+import org.apache.flink.api.common.serialization.DeserializationSchema;
+import org.apache.flink.api.common.typeinfo.BasicArrayTypeInfo;
+import org.apache.flink.api.common.typeinfo.PrimitiveArrayTypeInfo;
+import org.apache.flink.api.common.typeinfo.TypeInformation;
+import org.apache.flink.api.common.typeinfo.Types;
+import org.apache.flink.api.java.typeutils.RowTypeInfo;
+import org.apache.flink.types.Row;
+import org.apache.flink.util.Preconditions;
+
+import com.fasterxml.jackson.databind.JsonNode;
+import com.fasterxml.jackson.databind.node.ArrayNode;
+import com.fasterxml.jackson.databind.node.TextNode;
+import com.fasterxml.jackson.dataformat.csv.CsvMapper;
+import com.fasterxml.jackson.dataformat.csv.CsvSchema;
+
+import java.io.IOException;
+import java.io.UnsupportedEncodingException;
+import java.lang.reflect.Array;
+import java.math.BigDecimal;
+import java.math.BigInteger;
+import java.sql.Date;
+import java.sql.Time;
+import java.sql.Timestamp;
+
+/**
+ * Deserialization schema from CSV to Flink types.
+ *
+ * Deserializes a byte[] message as a {@link JsonNode} and
+ * convert it to {@link Row}.
+ *
+ * Failure during deserialization are forwarded as wrapped IOExceptions.
+ */
+@PublicEvolving
+public class CsvRowDeserializationSchema implements DeserializationSchema<Row> {
+
+   /** Schema describing the input csv data. */
+   private CsvSchema csvSchema;
+
+   /** Type information describing the input csv data. */
+   private TypeInformation<Row> rowTypeInfo;
+
+   /** CsvMapper used to write {@link JsonNode} into bytes. */
+   private CsvMapper csvMapper = new CsvMapper();
+
+   /** Charset for byte[]. */
+   private String charset = "UTF-8";
 
 Review comment:
   Make this `static final`


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add a CSV table format factory
> --
>
> Key: FLINK-9964
> URL: https://issues.apache.org/jira/browse/FLINK-9964
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table API  SQL
>Reporter: Timo Walther
>Assignee: buptljy
>Priority: Major
>  Labels: pull-request-available
>
> We should add a RFC 4180 compliant CSV table format factory to read and write 
> data into Kafka and other connectors. This requires a 
> {{SerializationSchemaFactory}} and {{DeserializationSchemaFactory}}. How we 
> want to represent all data types and nested types is still up for discussion. 
> For example, we could flatten and deflatten nested types as it is done 
> [here|http://support.gnip.com/articles/json2csv.html]. We can also have a 
> look how tools such as the Avro to CSV tool perform the conversion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9964) Add a CSV table format factory

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578382#comment-16578382
 ] 

ASF GitHub Bot commented on FLINK-9964:
---

twalthr commented on a change in pull request #6541: [FLINK-9964] [table] Add a 
CSV table format factory
URL: https://github.com/apache/flink/pull/6541#discussion_r209602349
 
 

 ##
 File path: 
flink-formats/flink-csv/src/main/java/org/apache/flink/formats/csv/CsvRowDeserializationSchema.java
 ##
 @@ -0,0 +1,228 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.formats.csv;
+
+import org.apache.flink.annotation.PublicEvolving;
+import org.apache.flink.api.common.serialization.DeserializationSchema;
+import org.apache.flink.api.common.typeinfo.BasicArrayTypeInfo;
+import org.apache.flink.api.common.typeinfo.PrimitiveArrayTypeInfo;
+import org.apache.flink.api.common.typeinfo.TypeInformation;
+import org.apache.flink.api.common.typeinfo.Types;
+import org.apache.flink.api.java.typeutils.RowTypeInfo;
+import org.apache.flink.types.Row;
+import org.apache.flink.util.Preconditions;
+
+import com.fasterxml.jackson.databind.JsonNode;
+import com.fasterxml.jackson.databind.node.ArrayNode;
+import com.fasterxml.jackson.databind.node.TextNode;
+import com.fasterxml.jackson.dataformat.csv.CsvMapper;
+import com.fasterxml.jackson.dataformat.csv.CsvSchema;
+
+import java.io.IOException;
+import java.io.UnsupportedEncodingException;
+import java.lang.reflect.Array;
+import java.math.BigDecimal;
+import java.math.BigInteger;
+import java.sql.Date;
+import java.sql.Time;
+import java.sql.Timestamp;
+
+/**
+ * Deserialization schema from CSV to Flink types.
+ *
+ * Deserializes a byte[] message as a {@link JsonNode} and
+ * convert it to {@link Row}.
+ *
+ * Failure during deserialization are forwarded as wrapped IOExceptions.
+ */
+@PublicEvolving
+public class CsvRowDeserializationSchema implements DeserializationSchema<Row> {
+
+   /** Schema describing the input csv data. */
+   private CsvSchema csvSchema;
+
+   /** Type information describing the input csv data. */
+   private TypeInformation<Row> rowTypeInfo;
+
+   /** CsvMapper used to write {@link JsonNode} into bytes. */
+   private CsvMapper csvMapper = new CsvMapper();
+
+   /** Charset for byte[]. */
+   private String charset = "UTF-8";
+
 
 Review comment:
   Remove double empty line.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add a CSV table format factory
> --
>
> Key: FLINK-9964
> URL: https://issues.apache.org/jira/browse/FLINK-9964
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table API  SQL
>Reporter: Timo Walther
>Assignee: buptljy
>Priority: Major
>  Labels: pull-request-available
>
> We should add a RFC 4180 compliant CSV table format factory to read and write 
> data into Kafka and other connectors. This requires a 
> {{SerializationSchemaFactory}} and {{DeserializationSchemaFactory}}. How we 
> want to represent all data types and nested types is still up for discussion. 
> For example, we could flatten and deflatten nested types as it is done 
> [here|http://support.gnip.com/articles/json2csv.html]. We can also have a 
> look how tools such as the Avro to CSV tool perform the conversion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9964) Add a CSV table format factory

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578396#comment-16578396
 ] 

ASF GitHub Bot commented on FLINK-9964:
---

twalthr commented on a change in pull request #6541: [FLINK-9964] [table] Add a 
CSV table format factory
URL: https://github.com/apache/flink/pull/6541#discussion_r209630166
 
 

 ##
 File path: 
flink-formats/flink-csv/src/main/java/org/apache/flink/formats/csv/CsvRowSerializationSchema.java
 ##
 @@ -0,0 +1,234 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.formats.csv;
+
+import org.apache.flink.annotation.PublicEvolving;
+import org.apache.flink.api.common.serialization.SerializationSchema;
+import org.apache.flink.api.common.typeinfo.BasicArrayTypeInfo;
+import org.apache.flink.api.common.typeinfo.PrimitiveArrayTypeInfo;
+import org.apache.flink.api.common.typeinfo.TypeInformation;
+import org.apache.flink.api.common.typeinfo.Types;
+import org.apache.flink.api.java.typeutils.RowTypeInfo;
+import org.apache.flink.types.Row;
+import org.apache.flink.util.Preconditions;
+
+import com.fasterxml.jackson.core.JsonProcessingException;
+import com.fasterxml.jackson.databind.JsonNode;
+import com.fasterxml.jackson.databind.node.ArrayNode;
+import com.fasterxml.jackson.databind.node.ContainerNode;
+import com.fasterxml.jackson.databind.node.ObjectNode;
+import com.fasterxml.jackson.dataformat.csv.CsvMapper;
+import com.fasterxml.jackson.dataformat.csv.CsvSchema;
+
+import java.io.UnsupportedEncodingException;
+import java.math.BigDecimal;
+import java.math.BigInteger;
+import java.sql.Date;
+import java.sql.Time;
+import java.sql.Timestamp;
+
+/**
+ * Serialization schema that serializes an object of Flink types into a CSV 
bytes.
+ *
+ * Serializes the input row into a {@link ObjectNode} and
+ * converts it into byte[].
+ *
+ * Result byte[] messages can be deserialized using {@link 
CsvRowDeserializationSchema}.
+ */
+@PublicEvolving
+public class CsvRowSerializationSchema implements SerializationSchema<Row> {
+
+   /** Schema describing the input csv data. */
+   private CsvSchema csvSchema;
+
+   /** Type information describing the input csv data. */
+   private TypeInformation<Row> rowTypeInfo;
+
+   /** CsvMapper used to write {@link JsonNode} into bytes. */
+   private CsvMapper csvMapper = new CsvMapper();
+
+   /** Reusable object node. */
+   private ObjectNode root;
+
+   /** Charset for byte[]. */
+   private String charset = "UTF-8";
+
+   /**
+* Create a {@link CsvRowSerializationSchema} with given {@link 
TypeInformation}.
+* @param rowTypeInfo type information used to create the schema.
+*/
+   CsvRowSerializationSchema(TypeInformation<Row> rowTypeInfo) {
+   Preconditions.checkNotNull(rowTypeInfo, "rowTypeInfo must not 
be null !");
+   this.rowTypeInfo = rowTypeInfo;
+   this.csvSchema = 
CsvRowSchemaConverter.rowTypeToCsvSchema((RowTypeInfo) rowTypeInfo);
+   }
+
+   @Override
+   public byte[] serialize(Row row) {
+   if (root == null) {
+   root = csvMapper.createObjectNode();
+   }
+   try {
+   convertRow(root, row, (RowTypeInfo) rowTypeInfo);
+   return 
csvMapper.writer(csvSchema).writeValueAsBytes(root);
+   } catch (JsonProcessingException e) {
+   throw new RuntimeException("Could not serialize row '" 
+ row + "'. " +
+   "Make sure that the schema matches the input.", 
e);
+   }
+   }
+
+   private void convertRow(ObjectNode reuse, Row row, RowTypeInfo 
rowTypeInfo) {
+   if (reuse == null) {
+   reuse = csvMapper.createObjectNode();
+   }
+   if (row.getArity() != rowTypeInfo.getFieldNames().length) {
+   throw new IllegalStateException(String.format(
+   "Number of elements in the row '%s' is 
different from number of field names: %d",
+

[jira] [Commented] (FLINK-9964) Add a CSV table format factory

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578380#comment-16578380
 ] 

ASF GitHub Bot commented on FLINK-9964:
---

twalthr commented on a change in pull request #6541: [FLINK-9964] [table] Add a 
CSV table format factory
URL: https://github.com/apache/flink/pull/6541#discussion_r209598354
 
 

 ##
 File path: flink-formats/flink-csv/pom.xml
 ##
 @@ -0,0 +1,88 @@
+
+
+http://maven.apache.org/POM/4.0.0; 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance;
+xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 
http://maven.apache.org/maven-v4_0_0.xsd;>
+
+   4.0.0
+
+   
+   org.apache.flink
+   flink-formats
+   1.7-SNAPSHOT
+   ..
+   
+
+   flink-csv
+   flink-csv
+
+   jar
+
+   
+
+   
+
+   
+   org.apache.flink
+   flink-core
+   ${project.version}
+   provided
+   
+
+   
+   org.apache.flink
+   
+   flink-table_2.11
+   ${project.version}
+   provided
+   
+   true
+   
+
+   
+   
+   com.fasterxml.jackson.dataformat
 
 Review comment:
   @zentol I guess we need to add this to flink-shaded similar as for YAML 
right?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add a CSV table format factory
> --
>
> Key: FLINK-9964
> URL: https://issues.apache.org/jira/browse/FLINK-9964
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table API  SQL
>Reporter: Timo Walther
>Assignee: buptljy
>Priority: Major
>  Labels: pull-request-available
>
> We should add a RFC 4180 compliant CSV table format factory to read and write 
> data into Kafka and other connectors. This requires a 
> {{SerializationSchemaFactory}} and {{DeserializationSchemaFactory}}. How we 
> want to represent all data types and nested types is still up for discussion. 
> For example, we could flatten and deflatten nested types as it is done 
> [here|http://support.gnip.com/articles/json2csv.html]. We can also have a 
> look how tools such as the Avro to CSV tool perform the conversion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9964) Add a CSV table format factory

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-9964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578392#comment-16578392
 ] 

ASF GitHub Bot commented on FLINK-9964:
---

twalthr commented on a change in pull request #6541: [FLINK-9964] [table] Add a 
CSV table format factory
URL: https://github.com/apache/flink/pull/6541#discussion_r209616703
 
 

 ##
 File path: 
flink-formats/flink-csv/src/main/java/org/apache/flink/formats/csv/CsvRowSchemaConverter.java
 ##
 @@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.formats.csv;
+
+import org.apache.flink.api.common.typeinfo.BasicArrayTypeInfo;
+import org.apache.flink.api.common.typeinfo.PrimitiveArrayTypeInfo;
+import org.apache.flink.api.common.typeinfo.TypeInformation;
+import org.apache.flink.api.common.typeinfo.Types;
+import org.apache.flink.api.java.typeutils.ObjectArrayTypeInfo;
+import org.apache.flink.api.java.typeutils.RowTypeInfo;
+
+import com.fasterxml.jackson.dataformat.csv.CsvSchema;
+import com.fasterxml.jackson.dataformat.csv.CsvSchema.Builder;
+import com.fasterxml.jackson.dataformat.csv.CsvSchema.Column;
+
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * Conversion functions related to {@link CsvSchema}.
+ * A {@link CsvSchema} only knows four column types (string, number, boolean
+ * and array), so to cover the various Flink types this class maps
+ * instances of {@link TypeInformation} to one of these column types.
 
 Review comment:
   We should also document internals of Jackson. In particular when trimmed 
(leading/trailing white space) and the special meaning of literals "null", 
"true" and "false".


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add a CSV table format factory
> --
>
> Key: FLINK-9964
> URL: https://issues.apache.org/jira/browse/FLINK-9964
> Project: Flink
>  Issue Type: Sub-task
>  Components: Table API  SQL
>Reporter: Timo Walther
>Assignee: buptljy
>Priority: Major
>  Labels: pull-request-available
>
> We should add a RFC 4180 compliant CSV table format factory to read and write 
> data into Kafka and other connectors. This requires a 
> {{SerializationSchemaFactory}} and {{DeserializationSchemaFactory}}. How we 
> want to represent all data types and nested types is still up for discussion. 
> For example, we could flatten and deflatten nested types as it is done 
> [here|http://support.gnip.com/articles/json2csv.html]. We can also have a 
> look how tools such as the Avro to CSV tool perform the conversion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

