[jira] [Commented] (FLINK-8714) Suggest new users to use env.readTextFile method with 2 arguments (using the charset), not to rely on system charset (which varies across environments)

2018-11-19 Thread ASF GitHub Bot (JIRA)


[ https://issues.apache.org/jira/browse/FLINK-8714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16691372#comment-16691372 ]

ASF GitHub Bot commented on FLINK-8714:
---

zentol closed pull request #5536: [FLINK-8714][Documentation] Added either charsetName) or "utf-8" value in examples of readTextFile
URL: https://github.com/apache/flink/pull/5536
 
 
   

This is a PR merged from a forked repository. As GitHub hides the original
diff on merge, the diff is displayed below for the sake of provenance:

diff --git a/docs/dev/api_concepts.md b/docs/dev/api_concepts.md
index c4215074683..517d01290fc 100644
--- a/docs/dev/api_concepts.md
+++ b/docs/dev/api_concepts.md
@@ -112,7 +112,7 @@ a text file as a sequence of lines, you can use:
 {% highlight java %}
 final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
 
-DataStream<String> text = env.readTextFile("file:///path/to/file");
+DataStream<String> text = env.readTextFile("file:///path/to/file", StandardCharsets.UTF_8.name());
 {% endhighlight %}
 
 This will give you a DataStream on which you can then apply transformations to create new
@@ -181,7 +181,7 @@ a text file as a sequence of lines, you can use:
 {% highlight scala %}
 val env = StreamExecutionEnvironment.getExecutionEnvironment()
 
-val text: DataStream[String] = env.readTextFile("file:///path/to/file")
+val text: DataStream[String] = env.readTextFile("file:///path/to/file", StandardCharsets.UTF_8.name())
 {% endhighlight %}
 
 This will give you a DataStream on which you can then apply transformations to create new
diff --git a/docs/dev/batch/examples.md b/docs/dev/batch/examples.md
index a4b282688ee..bbaacb34946 100644
--- a/docs/dev/batch/examples.md
+++ b/docs/dev/batch/examples.md
@@ -68,7 +68,7 @@ WordCount is the "Hello World" of Big Data processing systems. It computes the f
 ~~~java
 ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
 
-DataSet<String> text = env.readTextFile("/path/to/file");
+DataSet<String> text = env.readTextFile("/path/to/file", StandardCharsets.UTF_8.name());
 
 DataSet<Tuple2<String, Integer>> counts =
 // split up the lines in pairs (2-tuples) containing: (word,1)
@@ -106,7 +106,7 @@ The {% gh_link /flink-examples/flink-examples-batch/src/main/java/org/apache/fli
 val env = ExecutionEnvironment.getExecutionEnvironment
 
 // get input data
-val text = env.readTextFile("/path/to/file")
+val text = env.readTextFile("/path/to/file", StandardCharsets.UTF_8.name())
 
 val counts = text.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
   .map { (_, 1) }
diff --git a/docs/dev/batch/index.md b/docs/dev/batch/index.md
index f0fab8bab6c..1b513e14415 100644
--- a/docs/dev/batch/index.md
+++ b/docs/dev/batch/index.md
@@ -809,9 +809,9 @@ shortcut methods on the *ExecutionEnvironment*.
 
 File-based:
 
-- `readTextFile(path)` / `TextInputFormat` - Reads files line wise and returns them as Strings.
+- `readTextFile(path, charsetName)` / `TextInputFormat` - Reads files line wise and returns them as Strings.
 
-- `readTextFileWithValue(path)` / `TextValueInputFormat` - Reads files line wise and returns them as
+- `readTextFileWithValue(path, charsetName)` / `TextValueInputFormat` - Reads files line wise and returns them as
   StringValues. StringValues are mutable strings.
 
 - `readCsvFile(path)` / `CsvInputFormat` - Parses files of comma (or another char) delimited fields.
@@ -860,10 +860,10 @@ Generic:
 ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
 
 // read text file from local files system
-DataSet<String> localLines = env.readTextFile("file:///path/to/my/textfile");
+DataSet<String> localLines = env.readTextFile("file:///path/to/my/textfile", StandardCharsets.UTF_8.name());
 
 // read text file from a HDFS running at nnHost:nnPort
-DataSet<String> hdfsLines = env.readTextFile("hdfs://nnHost:nnPort/path/to/my/textfile");
+DataSet<String> hdfsLines = env.readTextFile("hdfs://nnHost:nnPort/path/to/my/textfile", StandardCharsets.UTF_8.name());
 
 // read a CSV file with three fields
 DataSet<Tuple3<Integer, String, Double>> csvInput = env.readCsvFile("hdfs:///the/CSV/file")
@@ -946,7 +946,7 @@ Configuration parameters = new Configuration();
 parameters.setBoolean("recursive.file.enumeration", true);
 
 // pass the configuration to the data source
-DataSet<String> logs = env.readTextFile("file:///path/with.nested/files")
+DataSet<String> logs = env.readTextFile("file:///path/with.nested/files", StandardCharsets.UTF_8.name())
  .withParameters(parameters);
 {% endhighlight %}
 
@@ -962,9 +962,9 @@ shortcut methods on the *ExecutionEnvironment*.
 
 File-based:
 
-- `readTextFile(path)` / `TextInputFormat` - Reads files line wise and returns them as Strings.
+- `readTextFile(path, charsetName)` / `TextInputFormat` - Reads files line wise and returns them as Strings.
 
-- 

[jira] [Commented] (FLINK-8714) Suggest new users to use env.readTextFile method with 2 arguments (using the charset), not to rely on system charset (which varies across environments)

2018-11-19 Thread ASF GitHub Bot (JIRA)


[ https://issues.apache.org/jira/browse/FLINK-8714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16691373#comment-16691373 ]

ASF GitHub Bot commented on FLINK-8714:
---

zentol commented on issue #5536: [FLINK-8714][Documentation] Added either charsetName) or "utf-8" value in examples of readTextFile
URL: https://github.com/apache/flink/pull/5536#issuecomment-439807069
 
 
   I agree with Stephan; will close this PR and update the JIRA accordingly.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Suggest new users to use env.readTextFile method with 2 arguments (using the 
> charset), not to rely on system charset (which varies across environments)
> ---
>
> Key: FLINK-8714
> URL: https://issues.apache.org/jira/browse/FLINK-8714
> Project: Flink
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.4.0
>Reporter: Michal Klempa
>Priority: Trivial
>  Labels: easyfix, newbie, patch-available, pull-request-available
>
> When a newcomer (like me) goes through the docs, there are several places 
> where examples encourage reading the input data using the 
> {{env.readTextFile()}} method.
>
> This method variant does not take a second argument - the character set (see 
> [https://ci.apache.org/projects/flink/flink-docs-release-1.4/api/java/org/apache/flink/streaming/api/environment/StreamExecutionEnvironment.html#readTextFile-java.lang.String-]).
> This version relies (according to the Javadoc) on the following: "The file 
> will be read with the system's default character set."
>
> This is also the default behavior in Java, as in the 
> {{java.util.String.getBytes()}} method, where not supplying the charset 
> means: use the system locale, or the one the JVM was started with (see 
> [https://stackoverflow.com/questions/64038/setting-java-locale-settings]). 
> There are two ways to set the locale prior to JVM start (-D arguments, or 
> the LC_ALL environment variable).
>
> Given that this is something a new Flink user may not know about, nor want 
> to spend hours on while hunting an environment-related bug (it works on 
> localhost, but in production the locale is different), I would kindly suggest 
> a change in the documentation: let's migrate the examples to the two-argument 
> version, {{readTextFile(filePath, charsetName)}}.
>
> I am open to criticism and suggestions. The {{readTextFile}} occurrences I 
> was able to grep in the docs are:
> {code:java}
> ./dev/datastream_api.md:- `readTextFile(path)` - Reads text files, i.e. files that respect the `TextInputFormat` specification, line-by-line and returns them as Strings.
> ./dev/datastream_api.md:- `readTextFile(path)` - Reads text files, i.e. files that respect the `TextInputFormat` specification, line-by-line and returns them as Strings.
> ./dev/libs/storm_compatibility.md:DataStream<String> text = env.readTextFile(localFilePath);
> ./dev/cluster_execution.md:    DataSet<String> data = env.readTextFile("hdfs://path/to/file");
> ./dev/batch/index.md:- `readTextFile(path)` / `TextInputFormat` - Reads files line wise and returns them as Strings.
> ./dev/batch/index.md:- `readTextFileWithValue(path)` / `TextValueInputFormat` - Reads files line wise and returns them as
> ./dev/batch/index.md:DataSet<String> localLines = env.readTextFile("file:///path/to/my/textfile");
> ./dev/batch/index.md:DataSet<String> hdfsLines = env.readTextFile("hdfs://nnHost:nnPort/path/to/my/textfile");
> ./dev/batch/index.md:DataSet<String> logs = env.readTextFile("file:///path/with.nested/files")
> ./dev/batch/index.md:- `readTextFile(path)` / `TextInputFormat` - Reads files line wise and returns them as Strings.
> ./dev/batch/index.md:- `readTextFileWithValue(path)` / `TextValueInputFormat` - Reads files line wise and returns them as
> ./dev/batch/index.md:val localLines = env.readTextFile("file:///path/to/my/textfile")
> ./dev/batch/index.md:val hdfsLines = env.readTextFile("hdfs://nnHost:nnPort/path/to/my/textfile")
> ./dev/batch/index.md:env.readTextFile("file:///path/with.nested/files").withParameters(parameters)
> ./dev/batch/index.md:DataSet<String> lines = env.readTextFile(pathToTextFile);
> ./dev/batch/index.md:val lines = env.readTextFile(pathToTextFile)
> ./dev/batch/examples.md:DataSet<String> text = env.readTextFile("/path/to/file");
> ./dev/batch/examples.md:val text = env.readTextFile("/path/to/file")
> ./dev/api_concepts.md:DataStream<String> text = env.readTextFile("file:///path/to/file");
> ./dev/api_concepts.md:val text: DataStream[String] = env.readTextFile("file:///path/to/file")
> ./dev/local_execution.md:    DataSet<String> data = env.readTextFile("file:///path/to/file");
> ./ops/deployment/aws.md:env.readTextFile("s3:///");{code}
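To make the described pitfall concrete, here is a small plain-JDK sketch (not the Flink API; the sample string is made up for illustration) showing how the same UTF-8 bytes come out differently once a different default charset is in play, and why an explicit charset is deterministic:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetPitfall {
    public static void main(String[] args) {
        // The JVM default depends on the environment (LC_ALL / -Dfile.encoding):
        System.out.println("JVM default charset: " + Charset.defaultCharset().name());

        byte[] utf8Bytes = "héllo".getBytes(StandardCharsets.UTF_8);

        // Decoding with an explicit charset gives the same result on every machine:
        String correct = new String(utf8Bytes, StandardCharsets.UTF_8);      // "héllo"
        // Decoding the same bytes under a Latin-1 default silently mangles them:
        String mangled = new String(utf8Bytes, StandardCharsets.ISO_8859_1); // "hÃ©llo"

        System.out.println(correct);
        System.out.println(mangled);
    }
}
```

Running this under `LC_ALL=C` versus a UTF-8 locale changes `Charset.defaultCharset()` (on JVMs that derive the default from the locale), which is exactly the environment-dependence the issue describes.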

[jira] [Commented] (FLINK-8714) Suggest new users to use env.readTextFile method with 2 arguments (using the charset), not to rely on system charset (which varies across environments)

2018-03-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392912#comment-16392912
 ] 

ASF GitHub Bot commented on FLINK-8714:
---

Github user StephanEwen commented on the issue:

https://github.com/apache/flink/pull/5536
  
Sorry, I think the JavaDoc comment that triggered this change was actually 
incorrect in the first place.

By default, the read methods always use "UTF-8" rather than the system 
default charset, so it is actually not non-deterministic.

I would personally vote to fix the Javadoc and the other docs that incorrectly 
claim this uses the system-dependent charset, and leave the remaining docs as 
they are (not explicitly passing the same charset name that is passed anyway 
keeps things simpler).
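Stephan's point can be sketched in plain JDK terms (the `decode` helpers below are hypothetical, modeling an API whose default charset is hard-coded rather than taken from the system): with a fixed UTF-8 default, the one-argument form is equivalent to passing the charset explicitly, so it is already deterministic:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class FixedDefaultCharset {
    // Hypothetical helpers modeling an API with a hard-coded UTF-8 default.
    static String decode(byte[] bytes) {
        // Fixed default, deliberately NOT Charset.defaultCharset():
        return decode(bytes, StandardCharsets.UTF_8.name());
    }

    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] data = "déterministe".getBytes(StandardCharsets.UTF_8);
        // Both forms agree on every machine, regardless of the system locale:
        System.out.println(decode(data).equals(decode(data, "UTF-8"))); // true
    }
}
```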



[jira] [Commented] (FLINK-8714) Suggest new users to use env.readTextFile method with 2 arguments (using the charset), not to rely on system charset (which varies across environments)

2018-02-23 Thread ASF GitHub Bot (JIRA)

[ https://issues.apache.org/jira/browse/FLINK-8714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16374341#comment-16374341 ]

ASF GitHub Bot commented on FLINK-8714:
---

Github user michalklempa commented on the issue:

https://github.com/apache/flink/pull/5536
  
@zentol Thanks, done.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-8714) Suggest new users to use env.readTextFile method with 2 arguments (using the charset), not to rely on system charset (which varies across environments)

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ https://issues.apache.org/jira/browse/FLINK-8714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16371074#comment-16371074 ]

ASF GitHub Bot commented on FLINK-8714:
---

Github user michalklempa commented on the issue:

https://github.com/apache/flink/pull/5536
  
Unrelated test https://travis-ci.org/apache/flink/jobs/343906340#L5636 failing.




[jira] [Commented] (FLINK-8714) Suggest new users to use env.readTextFile method with 2 arguments (using the charset), not to rely on system charset (which varies across environments)

2018-02-20 Thread ASF GitHub Bot (JIRA)

[ https://issues.apache.org/jira/browse/FLINK-8714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16370075#comment-16370075 ]

ASF GitHub Bot commented on FLINK-8714:
---

GitHub user michalklempa opened a pull request:

https://github.com/apache/flink/pull/5536

[FLINK-8714][Documentation] Added either charsetName) or "utf-8" value in examples of readTextFile

## What is the purpose of the change

When a newcomer (like me) goes through the docs, there are several places 
where examples encourage reading the input data using the env.readTextFile() 
method.

This method variant does not take a second argument - the character set (see 
https://ci.apache.org/projects/flink/flink-docs-release-1.4/api/java/org/apache/flink/streaming/api/environment/StreamExecutionEnvironment.html#readTextFile-java.lang.String-).
This version relies (according to the Javadoc) on the following: "The file 
will be read with the system's default character set."

This is fixed in the documentation by providing charsetName in the examples 
where the API is described, and "utf-8" as the second argument in the 
programming examples. This should help others not forget to specify a charset 
programmatically if they want to avoid non-deterministic behavior that 
depends on the environment.
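As a plain-JDK analogue of the documented recommendation (not the Flink API; the temporary file and its contents are illustrative only), passing the charset explicitly keeps line reading deterministic regardless of the system locale:

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class ExplicitCharsetRead {
    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("flink-demo", ".txt");
        Files.write(file, "first liné\nsecond line\n".getBytes(StandardCharsets.UTF_8));

        // Explicit charset: the result does not depend on LC_ALL or -Dfile.encoding.
        List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
        System.out.println(lines.size() + " lines, first = " + lines.get(0));

        Files.delete(file);
    }
}
```

This mirrors the shape of the two-argument `readTextFile(filePath, charsetName)` call the PR suggests in the docs, using only JDK classes so it can be run standalone.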

## Brief change log

## Verifying this change

This change is a trivial rework of documentation without any test coverage.

## Does this pull request potentially affect one of the following parts:

  - Dependencies (does it add or upgrade a dependency): no
  - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no
  - The serializers: no
  - The runtime per-record code paths (performance sensitive): no
  - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: no
  - The S3 file system connector: no

## Documentation

  - Does this pull request introduce a new feature? no
  - If yes, how is the feature documented? not applicable


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/michalklempa/flink FLINK-8714_readTextFile_charset_version

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/5536.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #5536


commit 221684da5b564b21c1e0cc99e823c18939c0ca91
Author: Michal Klempa 
Date:   2018-02-20T13:50:30Z

FLINK-8714 added either env.readTextFile(pathToFile, charsetName) where the 
API is described, or readTextFile(path/to/file, utf-8) where the API is shown 
in an example



