[jira] [Commented] (FLINK-8714) Suggest new users to use env.readTextFile method with 2 arguments (using the charset), not to rely on system charset (which varies across environments)
[ https://issues.apache.org/jira/browse/FLINK-8714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16691372#comment-16691372 ]

ASF GitHub Bot commented on FLINK-8714:
---

zentol closed pull request #5536: [FLINK-8714][Documentation] Added either charsetName) or "utf-8" value in examples of readTextFile
URL: https://github.com/apache/flink/pull/5536

This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

diff --git a/docs/dev/api_concepts.md b/docs/dev/api_concepts.md
index c4215074683..517d01290fc 100644
--- a/docs/dev/api_concepts.md
+++ b/docs/dev/api_concepts.md
@@ -112,7 +112,7 @@ a text file as a sequence of lines, you can use:
 {% highlight java %}
 final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
-DataStream<String> text = env.readTextFile("file:///path/to/file");
+DataStream<String> text = env.readTextFile("file:///path/to/file", StandardCharsets.UTF_8.name());
 {% endhighlight %}
 This will give you a DataStream on which you can then apply transformations to create new
@@ -181,7 +181,7 @@ a text file as a sequence of lines, you can use:
 {% highlight scala %}
 val env = StreamExecutionEnvironment.getExecutionEnvironment()
-val text: DataStream[String] = env.readTextFile("file:///path/to/file")
+val text: DataStream[String] = env.readTextFile("file:///path/to/file", StandardCharsets.UTF_8.name())
 {% endhighlight %}
 This will give you a DataStream on which you can then apply transformations to create new
diff --git a/docs/dev/batch/examples.md b/docs/dev/batch/examples.md
index a4b282688ee..bbaacb34946 100644
--- a/docs/dev/batch/examples.md
+++ b/docs/dev/batch/examples.md
@@ -68,7 +68,7 @@ WordCount is the "Hello World" of Big Data processing systems. It computes the f
 ~~~java
 ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
-DataSet<String> text = env.readTextFile("/path/to/file");
+DataSet<String> text = env.readTextFile("/path/to/file", StandardCharsets.UTF_8.name());
 DataSet<Tuple2<String, Integer>> counts =
         // split up the lines in pairs (2-tuples) containing: (word,1)
@@ -106,7 +106,7 @@ The {% gh_link /flink-examples/flink-examples-batch/src/main/java/org/apache/fli
 val env = ExecutionEnvironment.getExecutionEnvironment
 // get input data
-val text = env.readTextFile("/path/to/file")
+val text = env.readTextFile("/path/to/file", StandardCharsets.UTF_8.name())
 val counts = text.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
   .map { (_, 1) }
diff --git a/docs/dev/batch/index.md b/docs/dev/batch/index.md
index f0fab8bab6c..1b513e14415 100644
--- a/docs/dev/batch/index.md
+++ b/docs/dev/batch/index.md
@@ -809,9 +809,9 @@ shortcut methods on the *ExecutionEnvironment*.
 File-based:
-- `readTextFile(path)` / `TextInputFormat` - Reads files line wise and returns them as Strings.
+- `readTextFile(path, charsetName)` / `TextInputFormat` - Reads files line wise and returns them as Strings.
-- `readTextFileWithValue(path)` / `TextValueInputFormat` - Reads files line wise and returns them as
+- `readTextFileWithValue(path, charsetName)` / `TextValueInputFormat` - Reads files line wise and returns them as
   StringValues. StringValues are mutable strings.
 - `readCsvFile(path)` / `CsvInputFormat` - Parses files of comma (or another char) delimited fields.
@@ -860,10 +860,10 @@ Generic:
 ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
 // read text file from local files system
-DataSet<String> localLines = env.readTextFile("file:///path/to/my/textfile");
+DataSet<String> localLines = env.readTextFile("file:///path/to/my/textfile", StandardCharsets.UTF_8.name());
 // read text file from a HDFS running at nnHost:nnPort
-DataSet<String> hdfsLines = env.readTextFile("hdfs://nnHost:nnPort/path/to/my/textfile");
+DataSet<String> hdfsLines = env.readTextFile("hdfs://nnHost:nnPort/path/to/my/textfile", StandardCharsets.UTF_8.name());
 // read a CSV file with three fields
 DataSet<Tuple3<Integer, String, Double>> csvInput = env.readCsvFile("hdfs:///the/CSV/file")
@@ -946,7 +946,7 @@
 Configuration parameters = new Configuration();
 parameters.setBoolean("recursive.file.enumeration", true);
 // pass the configuration to the data source
-DataSet<String> logs = env.readTextFile("file:///path/with.nested/files")
+DataSet<String> logs = env.readTextFile("file:///path/with.nested/files", StandardCharsets.UTF_8.name())
   .withParameters(parameters);
 {% endhighlight %}
@@ -962,9 +962,9 @@ shortcut methods on the *ExecutionEnvironment*.
 File-based:
-- `readTextFile(path)` / `TextInputFormat` - Reads files line wise and returns them as Strings.
+- `readTextFile(path, charsetName)` / `TextInputFormat` - Reads files line wise and returns them as Strings.
--
[ https://issues.apache.org/jira/browse/FLINK-8714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16691373#comment-16691373 ]

ASF GitHub Bot commented on FLINK-8714:
---

zentol commented on issue #5536: [FLINK-8714][Documentation] Added either charsetName) or "utf-8" value in examples of readTextFile
URL: https://github.com/apache/flink/pull/5536#issuecomment-439807069

I agree with Stephan; will close this PR and update the JIRA accordingly.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Suggest new users to use env.readTextFile method with 2 arguments (using the
> charset), not to rely on system charset (which varies across environments)
> ---
>
> Key: FLINK-8714
> URL: https://issues.apache.org/jira/browse/FLINK-8714
> Project: Flink
> Issue Type: Improvement
> Components: Documentation
> Affects Versions: 1.4.0
> Reporter: Michal Klempa
> Priority: Trivial
> Labels: easyfix, newbie, patch-available, pull-request-available
>
> When a newcomer (like me) goes through the docs, there are several places
> where examples encourage reading the input data using the
> {{env.readTextFile()}} method.
>
> This method variant does not take a second argument - the character set (see
> https://ci.apache.org/projects/flink/flink-docs-release-1.4/api/java/org/apache/flink/streaming/api/environment/StreamExecutionEnvironment.html#readTextFile-java.lang.String-).
> This version relies (according to the Javadoc) on "The file will be read with
> the system's default character set."
>
> This behavior is also the default in Java, as in the
> {{java.util.String.getBytes()}} method, where not supplying the charset means
> using the system locale, or the one the JVM was started with (see
> https://stackoverflow.com/questions/64038/setting-java-locale-settings).
> There are two ways to set the locale prior to JVM start (-D arguments or the
> LC_ALL variable).
>
> Given that this is something a new Flink user may not know about, nor want to
> spend hours tracking down an environment-related bug (it works on localhost,
> but in production the locale is different), I would kindly suggest a change
> in the documentation: let's migrate the examples to use the two-argument
> version of {{readTextFile(filePath, charsetName)}}.
>
> I am open to criticism and suggestions. The listing of {{readTextFile}} I was
> able to grep in the docs is:
> {code:java}
> ./dev/datastream_api.md:- `readTextFile(path)` - Reads text files, i.e. files that respect the `TextInputFormat` specification, line-by-line and returns them as Strings.
> ./dev/datastream_api.md:- `readTextFile(path)` - Reads text files, i.e. files that respect the `TextInputFormat` specification, line-by-line and returns them as Strings.
> ./dev/libs/storm_compatibility.md:DataStream text = env.readTextFile(localFilePath);
> ./dev/cluster_execution.md: DataSet data = env.readTextFile("hdfs://path/to/file");
> ./dev/batch/index.md:- `readTextFile(path)` / `TextInputFormat` - Reads files line wise and returns them as Strings.
> ./dev/batch/index.md:- `readTextFileWithValue(path)` / `TextValueInputFormat` - Reads files line wise and returns them as
> ./dev/batch/index.md:DataSet localLines = env.readTextFile("file:///path/to/my/textfile");
> ./dev/batch/index.md:DataSet hdfsLines = env.readTextFile("hdfs://nnHost:nnPort/path/to/my/textfile");
> ./dev/batch/index.md:DataSet logs = env.readTextFile("file:///path/with.nested/files")
> ./dev/batch/index.md:- `readTextFile(path)` / `TextInputFormat` - Reads files line wise and returns them as Strings.
> ./dev/batch/index.md:- `readTextFileWithValue(path)` / `TextValueInputFormat` - Reads files line wise and returns them as
> ./dev/batch/index.md:val localLines = env.readTextFile("file:///path/to/my/textfile")
> ./dev/batch/index.md:val hdfsLines = env.readTextFile("hdfs://nnHost:nnPort/path/to/my/textfile")
> ./dev/batch/index.md:env.readTextFile("file:///path/with.nested/files").withParameters(parameters)
> ./dev/batch/index.md:DataSet lines = env.readTextFile(pathToTextFile);
> ./dev/batch/index.md:val lines = env.readTextFile(pathToTextFile)
> ./dev/batch/examples.md:DataSet text = env.readTextFile("/path/to/file");
> ./dev/batch/examples.md:val text = env.readTextFile("/path/to/file")
> ./dev/api_concepts.md:DataStream text = env.readTextFile("file:///path/to/file");
> ./dev/api_concepts.md:val text: DataStream[String] = env.readTextFile("file:///path/to/file")
> ./dev/local_execution.md: DataSet data = env.readTextFile("file:///path/to/file");
> ./ops/deployment/aws.md:env.readTextFile("s3:///");
> {code}
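The default-charset pitfall the reporter describes can be demonstrated without Flink at all, using plain JDK calls. The sketch below (class name hypothetical) contrasts the no-argument `String.getBytes()` with an explicit `StandardCharsets.UTF_8`; note that, per Stephan Ewen's comment further down the thread, Flink's own read methods actually default to UTF-8, so this illustrates the general JDK behavior the issue cites rather than Flink's:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetPitfall {
    public static void main(String[] args) {
        String word = "h\u00e9llo"; // "héllo" - the é is not a single byte in every charset

        // No-argument getBytes() encodes with the JVM default charset, which is
        // derived from the environment (e.g. LC_ALL, or the -Dfile.encoding JVM
        // argument), so the byte count can differ between hosts.
        byte[] platformBytes = word.getBytes();

        // Passing the charset explicitly is deterministic everywhere: in UTF-8
        // the é takes two bytes, so this is always 6 bytes for 5 characters.
        byte[] utf8Bytes = word.getBytes(StandardCharsets.UTF_8);

        System.out.println("default charset:     " + Charset.defaultCharset());
        System.out.println("platform byte count: " + platformBytes.length);
        System.out.println("utf-8 byte count:    " + utf8Bytes.length); // 6
    }
}
```

The same reasoning is what motivates writing `readTextFile(path, charsetName)` in the docs: the decoding behavior is pinned in the source code instead of inherited from the environment.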
[ https://issues.apache.org/jira/browse/FLINK-8714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392912#comment-16392912 ]

ASF GitHub Bot commented on FLINK-8714:
---

Github user StephanEwen commented on the issue:
https://github.com/apache/flink/pull/5536

Sorry, I think the JavaDoc comment that triggered this change was actually incorrect in the first place. By default, the read methods always use "UTF-8" rather than the system default charset, so the behavior is in fact deterministic.

I would personally vote to fix the Javadoc and the other docs that incorrectly claim this uses the system-dependent charset, and leave the remaining docs as they are (not explicitly passing the charset name that is passed anyway keeps them simpler).
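Whichever default ultimately applies, passing the charset at the call site makes the intent visible, which is the pattern the PR advocated. A Flink-free sketch of the same two-argument idea, using only `java.nio.file` (the temp-file name is hypothetical, standing in for the docs' "file:///path/to/file"):

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class ExplicitCharsetRead {
    public static void main(String[] args) throws Exception {
        // Write known UTF-8 bytes to a temp file.
        Path file = Files.createTempFile("flink-8714-demo", ".txt");
        Files.write(file, "na\u00efve caf\u00e9".getBytes(StandardCharsets.UTF_8));

        // Like readTextFile(path, charsetName), the charset is part of the call,
        // so the decoded result does not depend on the JVM's default charset.
        List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);

        if (!"na\u00efve caf\u00e9".equals(lines.get(0))) {
            throw new AssertionError("round trip failed");
        }
        System.out.println("decoded " + lines.get(0).length() + " chars"); // 10 chars
        Files.delete(file);
    }
}
```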
[ https://issues.apache.org/jira/browse/FLINK-8714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16374341#comment-16374341 ]

ASF GitHub Bot commented on FLINK-8714:
---

Github user michalklempa commented on the issue:
https://github.com/apache/flink/pull/5536

@zentol Thanks, done.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[ https://issues.apache.org/jira/browse/FLINK-8714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371074#comment-16371074 ]

ASF GitHub Bot commented on FLINK-8714:
---

Github user michalklempa commented on the issue:
https://github.com/apache/flink/pull/5536

Unrelated test https://travis-ci.org/apache/flink/jobs/343906340#L5636 failing.
[ https://issues.apache.org/jira/browse/FLINK-8714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370075#comment-16370075 ]

ASF GitHub Bot commented on FLINK-8714:
---

GitHub user michalklempa opened a pull request:
https://github.com/apache/flink/pull/5536 [FLINK-8714][Documentation] Added either charsetName) or "utf-8" value in examples of readTextFile

## What is the purpose of the change

When a newcomer (like me) goes through the docs, there are several places where examples encourage reading the input data using the env.readTextFile() method. This method variant does not take a second argument - the character set (see https://ci.apache.org/projects/flink/flink-docs-release-1.4/api/java/org/apache/flink/streaming/api/environment/StreamExecutionEnvironment.html#readTextFile-java.lang.String-). This version relies (according to the Javadoc) on "The file will be read with the system's default character set."

This is fixed in the documentation by providing charsetName in the examples where the API is described, and "utf-8" as the second argument in the programming examples. This should help others not forget to specify a charset programmatically if they want to avoid non-deterministic, environment-dependent behavior.

## Brief change log

## Verifying this change

This change is a trivial rework of documentation without any test coverage.

## Does this pull request potentially affect one of the following parts:

- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no
- The serializers: no
- The runtime per-record code paths (performance sensitive): no
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: no
- The S3 file system connector: no

## Documentation

- Does this pull request introduce a new feature? no
- If yes, how is the feature documented? not applicable

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/michalklempa/flink FLINK-8714_readTextFile_charset_version

Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/5536.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5536

commit 221684da5b564b21c1e0cc99e823c18939c0ca91
Author: Michal Klempa
Date: 2018-02-20T13:50:30Z

    FLINK-8714 added either env.readTextFile(pathToFile, charsetName) where the API is described or readTextFile(path/to/file, utf-8) where API is shown as example