Re: Draft JEP: To use UTF-8 as the default charset for the Java virtual machine.
On 21/02/2018 20:50, Uwe Schindler wrote:
> :
> Thanks for clarifying! I just wanted to mention this, because those methods are different, so you should at least think about it

These methods were deliberately specified to use UTF-8 and I don't think we should change them (changing them for a release or two would cause needless breakage of course).

-Alan
RE: Draft JEP: To use UTF-8 as the default charset for the Java virtual machine.
Hi Alan,

> > The Java 7+ methods in java.nio.file.Files already ignore the default charset and always use UTF-8. How to proceed with those? Should they be changed to follow the new mechanisms? I'd suggest not doing this, as it's part of the spec (to use UTF-8) and should not rely on external forces, but I wanted to bring this in.
>
> There is no proposal to change these methods.

Thanks for clarifying! I just wanted to mention this, because those methods are different, so you should at least think about it.

Uwe

-----
Uwe Schindler
uschind...@apache.org
ASF Member, Apache Lucene PMC / Committer
Bremen, Germany
http://lucene.apache.org/
Re: Draft JEP: To use UTF-8 as the default charset for the Java virtual machine.
On Wed, Feb 21, 2018 at 1:16 PM, Xueming Shen wrote:
> Hi Robert,
>
> Understood, a silent replacement might not be the desired behavior in some use scenarios. Any more details regarding what "most apps want" when the input is malformed/unmappable? It appears the best the underlying de/encoder can do here is to throw an IOException. Given that the caller of the Reader/Writer does not have access to the bytes of the underlying stream src (reader)/dst (writer), it is in theory impossible to do anything to recover and continue without risking data loss. The assumption here is that if you want fine-grained control of the de/encoding, you might want to work with the Input/OutputStream/Channel + CharsetDe/Encoder instead of Reader/Writer.
>
> No, I'm not saying we can't do Reader(CharsetDecoder)/Writer(CharsetEncoder), just wanted to know what's the real use scenario and what's the better/best choice here.

I think the exception is the best default? This is the default behavior of Python, for example, unless you specifically ask for "replace" or "ignore".

    >>> b'\xFFabc'.decode("utf-8")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

It's also the default behavior of the 'iconv' command-line tool used for converting charsets, unless you pass additional options.

    $ iconv -f utf-8 -t utf-8 test2.mp4
    ftypisomisomiso2avc1mp41e
    iconv: test2.mp4:1:26: cannot convert

Unfortunately in Java, the constructors that take Charset or String parameters silently replace bad input with \uFFFD, etc. It's necessary to pass a CharsetDecoder to get an exception saying that something went wrong. The current situation is especially confusing as there is nothing in the javadocs to indicate that the behavior of InputStreamReader(x, Charset) and InputStreamReader(x, String) differs substantially from InputStreamReader(x, CharsetDecoder)!

I think the Charset and String parameters should default to REPORT, so the behavior of all constructors is consistent. If you want to replace, you should have to ask for it. I think replacement has use-cases but they are more "expert", e.g. web-crawling and so on. In general, wrong bytes indicate a problem and it can be very difficult to debug these issues when Java hides these problems by default...
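To make the difference between the constructors concrete, here is a minimal sketch in Java. The class and method names (DecoderStrictness, lenient, strictReports) are made up for illustration; only the InputStreamReader constructors and the charset classes are real JDK API. It feeds the same malformed bytes used in the Python example above to both constructor forms.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public class DecoderStrictness {
    // 0xFF can never occur in well-formed UTF-8 (same input as the Python example).
    static final byte[] BAD = {(byte) 0xFF, 'a', 'b', 'c'};

    // Drains a reader into a String, closing it afterwards.
    private static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader r = reader) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    // Charset overload: the malformed byte is silently replaced by U+FFFD.
    static String lenient() {
        try {
            return readAll(new InputStreamReader(
                    new ByteArrayInputStream(BAD), StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // CharsetDecoder overload: a fresh decoder reports errors, so reading fails.
    // Returns true if the malformed input was reported as an exception.
    static boolean strictReports() {
        try {
            readAll(new InputStreamReader(
                    new ByteArrayInputStream(BAD), StandardCharsets.UTF_8.newDecoder()));
            return false;
        } catch (CharacterCodingException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("lenient replaced bad byte: " + lenient().equals("\uFFFDabc"));
        System.out.println("strict reported bad byte:  " + strictReports());
    }
}
```

The only difference between the two calls is the second constructor argument, which is why the behavior gap is so easy to miss.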
Re: Draft JEP: To use UTF-8 as the default charset for the Java virtual machine.
On 2/21/18, 6:26 AM, Robert Muir wrote:
> On Wed, Feb 21, 2018 at 8:55 AM, Alan Bateman wrote:
>> Good progress was made via JDK-8183743 [1] in Java SE 10 to add constructors and methods that take a Charset and eliminate the historical inconsistencies. The issue of legacy FileReader/FileWriter is linked from that JIRA issue.
>
> Can we ensure we have CharsetDecoder/Encoder params too? There is unfortunately a huge difference between InputStreamReader(x, StandardCharsets.UTF_8) and InputStreamReader(x, StandardCharsets.UTF_8.newDecoder()). And the silent replacement of the "easier" one is probably not what most apps want.

Hi Robert,

Understood, a silent replacement might not be the desired behavior in some use scenarios. Any more details regarding what "most apps want" when the input is malformed/unmappable? It appears the best the underlying de/encoder can do here is to throw an IOException. Given that the caller of the Reader/Writer does not have access to the bytes of the underlying stream src (reader)/dst (writer), it is in theory impossible to do anything to recover and continue without risking data loss. The assumption here is that if you want fine-grained control of the de/encoding, you might want to work with the Input/OutputStream/Channel + CharsetDe/Encoder instead of Reader/Writer.

No, I'm not saying we can't do Reader(CharsetDecoder)/Writer(CharsetEncoder), just wanted to know what's the real use scenario and what's the better/best choice here.

-Sherman
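For readers unfamiliar with the "fine-grained control" route Sherman mentions, the following is a rough sketch of driving a CharsetDecoder directly on raw bytes. The class name FineGrainedDecode and its helper method are hypothetical; the sketch assumes the whole input fits in one ByteBuffer and simply shows how the three CodingErrorAction choices differ.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class FineGrainedDecode {
    // One invalid byte followed by three valid ASCII bytes.
    static final byte[] BAD = {(byte) 0xFF, 'a', 'b', 'c'};

    // Decodes the bytes as UTF-8, applying the given action to malformed
    // or unmappable input. REPORT surfaces as an unchecked exception here.
    static String decode(byte[] bytes, CodingErrorAction action) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("input is not valid UTF-8", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("IGNORE:  " + decode(BAD, CodingErrorAction.IGNORE));
        System.out.println("REPLACE: " + decode(BAD, CodingErrorAction.REPLACE));
        try {
            decode(BAD, CodingErrorAction.REPORT);
        } catch (IllegalArgumentException e) {
            System.out.println("REPORT:  " + e.getCause());
        }
    }
}
```

Because the caller holds the bytes, it can choose to skip, substitute, or reject bad input, which is exactly what a Reader wrapping the stream cannot offer after the fact.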
Re: Draft JEP: To use UTF-8 as the default charset for the Java virtual machine.
Hi Volker,

Yes, the handling of sun.jnu.encoding will not be changed. It will remain a read-only/informative system property. sun.jnu.encoding is really an implementation detail (as is file.encoding, though in this JEP file.encoding might be used to provide a mechanism to fall back to the current/old/existing behavior, so it might become a public/official interface/system property). From the API perspective, Charset.defaultCharset() is the only place to obtain the "Java virtual machine's default charset".

As Alan said in a previous comment, clarifications will be included in the final version based on feedback/suggestions.

-Sherman

On 2/21/18, 8:11 AM, Volker Simonis wrote:
> Hi Sherman,
>
> the tricky part is really "sun.jnu.encoding" and how the VM interacts with the underlying OS. You may remember that we had an interesting discussion about this topic some time ago [1].
>
> As far as I understand, the JEP doesn't plan to change the handling of "sun.jnu.encoding". So does this mean that the VM will still correctly start and work on a system with a platform encoding different from UTF-8? I.e. will starting the VM from a path which contains characters in that special platform encoding, or classpath/argument settings with characters in that special character encoding, still work?
>
> If the answer will be yes (which I expect), maybe you could explain that in a little more detail in the JEP. I.e. the JEP should say that it changes the default encoding for the Java API classes and not the default encoding for natively accessing system resources. Maybe the JEP should also mention that "sun.jnu.encoding" is more or less a "read-only" property which cannot be reliably set by the user on the command line (it's a chicken-and-egg problem: for the parsing of the command line we need the correct encoding, so it cannot be reliably set on the command line).
>
> For these reasons the Summary "Use UTF-8 as the Java virtual machine's default charset ..." is a little misleading. Maybe you could rephrase to something like "Use UTF-8 as the default charset so that Java APIs that depend on the default charset behave consistently across all platforms."
>
> Thank you and best regards,
> Volker
>
> [1] http://mail.openjdk.java.net/pipermail/core-libs-dev/2015-December/thread.html#37516
>
> On Wed, Feb 21, 2018 at 7:31 AM, Xueming Shen wrote:
>> This draft JEP contains a proposal to use UTF-8 as the default charset for the JVM, so that APIs that depend on the default charset behave consistently across all platforms.
>>
>> For more details, please see:
>> https://bugs.openjdk.java.net/browse/JDK-8187041
>>
>> Sherman
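As a small aid for following this exchange, the sketch below prints the three values being discussed side by side. The class name ShowDefaultCharset is made up for illustration. Note the assumption: sun.jnu.encoding is an implementation-specific property, so it may be absent (null) on some runtimes, whereas Charset.defaultCharset() is the supported API.

```java
import java.nio.charset.Charset;

public class ShowDefaultCharset {
    // Builds a one-line summary of the default charset as seen through the
    // public API and through the two system properties mentioned in the thread.
    static String describe() {
        return "Charset.defaultCharset() = " + Charset.defaultCharset()
                + ", file.encoding = " + System.getProperty("file.encoding")
                + ", sun.jnu.encoding = " + System.getProperty("sun.jnu.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it on machines with different locale settings is a quick way to see which of the values actually vary across platforms.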
Re: Draft JEP: To use UTF-8 as the default charset for the Java virtual machine.
Hi Sherman,

the tricky part is really "sun.jnu.encoding" and how the VM interacts with the underlying OS. You may remember that we had an interesting discussion about this topic some time ago [1].

As far as I understand, the JEP doesn't plan to change the handling of "sun.jnu.encoding". So does this mean that the VM will still correctly start and work on a system with a platform encoding different from UTF-8? I.e. will starting the VM from a path which contains characters in that special platform encoding, or classpath/argument settings with characters in that special character encoding, still work?

If the answer will be yes (which I expect), maybe you could explain that in a little more detail in the JEP. I.e. the JEP should say that it changes the default encoding for the Java API classes and not the default encoding for natively accessing system resources. Maybe the JEP should also mention that "sun.jnu.encoding" is more or less a "read-only" property which cannot be reliably set by the user on the command line (it's a chicken-and-egg problem: for the parsing of the command line we need the correct encoding, so it cannot be reliably set on the command line).

For these reasons the Summary "Use UTF-8 as the Java virtual machine's default charset ..." is a little misleading. Maybe you could rephrase to something like "Use UTF-8 as the default charset so that Java APIs that depend on the default charset behave consistently across all platforms."

Thank you and best regards,
Volker

[1] http://mail.openjdk.java.net/pipermail/core-libs-dev/2015-December/thread.html#37516

On Wed, Feb 21, 2018 at 7:31 AM, Xueming Shen wrote:
> This draft JEP contains a proposal to use UTF-8 as the default charset for the JVM, so that APIs that depend on the default charset behave consistently across all platforms.
>
> For more details, please see:
> https://bugs.openjdk.java.net/browse/JDK-8187041
>
> Sherman
RE: Draft JEP: To use UTF-8 as the default charset for the Java virtual machine.
Hi,

Thanks Alan for the link to this issue about FileReader/Writer!

Uwe

-----
Uwe Schindler
uschind...@apache.org
ASF Member, Apache Lucene PMC / Committer
Bremen, Germany
http://lucene.apache.org/

> -----Original Message-----
> From: core-libs-dev [mailto:core-libs-dev-boun...@openjdk.java.net] On Behalf Of Alan Bateman
> Sent: Wednesday, February 21, 2018 2:55 PM
> To: Stephen Colebourne; core-libs-d...@openjdk.java.net
> Subject: Re: Draft JEP: To use UTF-8 as the default charset for the Java virtual machine.
>
> On 21/02/2018 13:41, Stephen Colebourne wrote:
> > On 21 February 2018 at 13:37, Alan Bateman wrote:
> >> The proposal is to eventually get to the point that the default charset cannot be changed. It will take several releases to get there due to the potential compatibility impact.
> > This seems like a reasonable strategy to solve the problem.
> >
> > I also agree that all locations where a default charset is used need to have a method alongside that takes a Charset, e.g. FileWriter.
>
> Good progress was made via JDK-8183743 [1] in Java SE 10 to add constructors and methods that take a Charset and eliminate the historical inconsistencies. The issue of legacy FileReader/FileWriter is linked from that JIRA issue.
>
> -Alan
>
> [1] https://bugs.openjdk.java.net/browse/JDK-8183743
Re: Draft JEP: To use UTF-8 as the default charset for the Java virtual machine.
On Wed, Feb 21, 2018 at 8:55 AM, Alan Bateman wrote:
> Good progress was made via JDK-8183743 [1] in Java SE 10 to add constructors and methods that take a Charset and eliminate the historical inconsistencies. The issue of legacy FileReader/FileWriter is linked from that JIRA issue.

Can we ensure we have CharsetDecoder/Encoder params too? There is unfortunately a huge difference between InputStreamReader(x, StandardCharsets.UTF_8) and InputStreamReader(x, StandardCharsets.UTF_8.newDecoder()). And the silent replacement of the "easier" one is probably not what most apps want.
Re: Draft JEP: To use UTF-8 as the default charset for the Java virtual machine.
On 21/02/2018 08:53, Uwe Schindler wrote:
> :
> The Java 7+ methods in java.nio.file.Files already ignore the default charset and always use UTF-8. How to proceed with those? Should they be changed to follow the new mechanisms? I'd suggest not doing this, as it's part of the spec (to use UTF-8) and should not rely on external forces, but I wanted to bring this in.

There is no proposal to change these methods.

-Alan
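For context on which methods are meant here, the following sketch shows one of the java.nio.file.Files convenience methods that take no charset argument and are specified to use UTF-8. The class name FilesUtf8 and the temp-file handling are illustrative assumptions, not something from the thread.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class FilesUtf8 {
    // Writes the text as UTF-8 bytes to a temp file, then reads it back with
    // Files.readAllLines(Path), which decodes as UTF-8 whatever the default
    // charset of the running JVM happens to be.
    static List<String> roundTrip(String text) {
        try {
            Path tmp = Files.createTempFile("utf8-demo", ".txt");
            try {
                Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
                return Files.readAllLines(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Non-ASCII text written with \\u escapes so the source file encoding does not matter.
        System.out.println(roundTrip("Gr\u00fc\u00dfe aus Bremen"));
    }
}
```

Starting the JVM with a different -Dfile.encoding value should not change the result, which is the property Uwe and Alan want to preserve.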
Re: Draft JEP: To use UTF-8 as the default charset for the Java virtual machine.
On 21/02/2018 13:41, Stephen Colebourne wrote:
> On 21 February 2018 at 13:37, Alan Bateman wrote:
>> The proposal is to eventually get to the point that the default charset cannot be changed. It will take several releases to get there due to the potential compatibility impact.
> This seems like a reasonable strategy to solve the problem.
>
> I also agree that all locations where a default charset is used need to have a method alongside that takes a Charset, e.g. FileWriter.

Good progress was made via JDK-8183743 [1] in Java SE 10 to add constructors and methods that take a Charset and eliminate the historical inconsistencies. The issue of legacy FileReader/FileWriter is linked from that JIRA issue.

-Alan

[1] https://bugs.openjdk.java.net/browse/JDK-8183743
Re: Draft JEP: To use UTF-8 as the default charset for the Java virtual machine.
On 21 February 2018 at 13:37, Alan Bateman wrote:
> The proposal is to eventually get to the point that the default charset cannot be changed. It will take several releases to get there due to the potential compatibility impact.

This seems like a reasonable strategy to solve the problem.

I also agree that all locations where a default charset is used need to have a method alongside that takes a Charset, e.g. FileWriter.

Stephen
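Until FileReader/FileWriter themselves accept a charset, the usual workaround is to wrap the file streams explicitly. The sketch below shows that pattern; the class name ExplicitCharsetFileIo, its helper method, and the temp-file handling are illustrative only, and it uses nothing beyond long-standing java.io classes.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExplicitCharsetFileIo {
    // Writes text to a temp file and reads it back, naming the charset on both
    // sides instead of relying on the JVM default as FileWriter/FileReader do.
    static String writeThenRead(String text, Charset charset) {
        try {
            Path tmp = Files.createTempFile("charset-demo", ".txt");
            try {
                try (Writer w = new OutputStreamWriter(
                        new FileOutputStream(tmp.toFile()), charset)) {
                    w.write(text);
                }
                StringBuilder sb = new StringBuilder();
                try (Reader r = new InputStreamReader(
                        new FileInputStream(tmp.toFile()), charset)) {
                    int c;
                    while ((c = r.read()) != -1) {
                        sb.append((char) c);
                    }
                }
                return sb.toString();
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(writeThenRead("\u00e4\u00f6\u00fc", StandardCharsets.ISO_8859_1));
        System.out.println(writeThenRead("\u00e4\u00f6\u00fc", StandardCharsets.UTF_8));
    }
}
```

Files.newBufferedReader(Path, Charset) and Files.newBufferedWriter(Path, Charset, ...) are a shorter alternative when a Path is already at hand.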
Re: Draft JEP: To use UTF-8 as the default charset for the Java virtual machine.
On 21/02/2018 13:19, David Lloyd wrote:
> I agree with Uwe and Remi; if the default is still changeable, the problem doesn't go away, it simply becomes slightly more insidious.

The proposal is to eventually get to the point that the default charset cannot be changed. It will take several releases to get there due to the potential compatibility impact. This draft JEP is the first step: switching to UTF-8 by default. A first step has to allow it to be changed in order to keep some existing code/deployments working. Sorry this isn't clear in the JEP yet; there are several clarifications to this JEP that haven't been included yet (on my list, I didn't realize it would be discussed here this week).

-Alan
Re: Draft JEP: To use UTF-8 as the default charset for the Java virtual machine.
I agree with Uwe and Remi; if the default is still changeable, the problem doesn't go away, it simply becomes slightly more insidious.

On Wed, Feb 21, 2018 at 12:31 AM, Xueming Shen wrote:
> This draft JEP contains a proposal to use UTF-8 as the default charset for the JVM, so that APIs that depend on the default charset behave consistently across all platforms.
>
> For more details, please see:
> https://bugs.openjdk.java.net/browse/JDK-8187041
>
> Sherman

--
- DML
RE: Draft JEP: To use UTF-8 as the default charset for the Java virtual machine.
I agree with Uwe, we should deprecate all methods/constructors that rely on the default charset. And we should do that before changing to use UTF-8 by default.

Remi

On February 21, 2018 8:53:54 AM UTC, Uwe Schindler wrote:
>Hi,
>
>> This draft JEP contains a proposal to use UTF-8 as the default charset for the JVM, so that APIs that depend on the default charset behave consistently across all platforms.
>>
>> For more details, please see:
>> https://bugs.openjdk.java.net/browse/JDK-8187041
>
>Thanks for finally adding a JEP like this. Thanks also to Robert Muir for always insisting on fixing this problem! I have a few comments:
>
>The JEP should NOT mean that new APIs which convert between characters and bytes no longer explicitly accept a charset. One example is the proposed ByteBuffer methods taking String. The default ones would work with UTF-8, but it should still be possible for an API user to always add a charset whenever there is a conversion between bytes and chars. This is especially important as the user may still change the default and break your app. Because the rule is still: Only YOU, the developer, know the charset of your stuff when you load a JAR resource file or pass a String to the network in a ByteBuffer!
>
>The biggest offenders here are also given as an example: FileReader and FileWriter. Although both classes subclass InputStreamReader/OutputStreamWriter and just pass the right delegate to the superclass in the ctor, both classes are missing the possibility to specify a charset. Because of this, the use of FileReader and FileWriter is completely forbidden in many Apache projects (Apache Lucene, Solr, Elasticsearch, Apache TIKA,...). So I'd suggest to also fix the API here and just add the missing ctors.
>
>The Java 7+ methods in java.nio.file.Files already ignore the default charset and always use UTF-8. How to proceed with those? Should they be changed to follow the new mechanisms? I'd suggest not doing this, as it's part of the spec (to use UTF-8) and should not rely on external forces, but I wanted to bring this in.
>
>Changing the default would help many users, if they are actually using newer JDKs. For those with older versions (and compiling their code against older versions), you still have to avoid the default charsets. In addition, as you still can change the "default charset", any library developer reading resources from its own JAR file or passing Strings to network protocols cannot rely on the fact that the default charset is really UTF-8! (a user may have changed it to something else). Because of this, Apache libraries will forbid usage of all methods using default charsets (and locales + timezones). The "changeable default" does not affect application developers (because they have in most cases control over the environment), but library developers should always be explicit!
>
>For this to work, I also want to do some "advertisement": All library projects should use the Forbidden-Apis Maven/Gradle/Ant plugin to scan their bytecode for offenders using default charsets, default locales or relying on default timezones. See the blog post about it [1] and the project page [2]. The tool is also useful to replace "jdeps" in projects with Java versions before 8, as it can scan your code for access to internal JDK APIs, too. See the documentation [3] and github wiki pages for useful examples. It may also be a good idea to mention it in the JEP as a "workaround" or "further reading".
>
>Finally: Because one can still change the default, I'd propose to deprecate all methods that use a default charset (unrelated to actually changing the default). Only if you do this would tools like "forbiddenapis" become irrelevant for library developers.
>
>And finally, finally: I'd also propose to change the default Locale to Locale.ROOT (same issues). The String.toLowerCase() in Turkish locales still breaks thousands of apps! But that's a different JEP - but I would strongly support it!
>
>Uwe
>
>[1] http://blog.thetaphi.de/2012/07/default-locales-default-charsets-and.html
>[2] https://github.com/policeman-tools/forbidden-apis
>[3] https://jenkins.thetaphi.de/job/Forbidden-APIs/javadoc/
>
>-----
>Uwe Schindler
>uschind...@apache.org
>ASF Member, Apache Lucene PMC / Committer
>Bremen, Germany
>http://lucene.apache.org/

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
RE: Draft JEP: To use UTF-8 as the default charset for the Java virtual machine.
Hi,

> This draft JEP contains a proposal to use UTF-8 as the default charset for the JVM, so that APIs that depend on the default charset behave consistently across all platforms.
>
> For more details, please see:
> https://bugs.openjdk.java.net/browse/JDK-8187041

Thanks for finally adding a JEP like this. Thanks also to Robert Muir for always insisting on fixing this problem! I have a few comments:

The JEP should NOT mean that new APIs which convert between characters and bytes no longer explicitly accept a charset. One example is the proposed ByteBuffer methods taking String. The default ones would work with UTF-8, but it should still be possible for an API user to always add a charset whenever there is a conversion between bytes and chars. This is especially important as the user may still change the default and break your app. Because the rule is still: Only YOU, the developer, know the charset of your stuff when you load a JAR resource file or pass a String to the network in a ByteBuffer!

The biggest offenders here are also given as an example: FileReader and FileWriter. Although both classes subclass InputStreamReader/OutputStreamWriter and just pass the right delegate to the superclass in the ctor, both classes are missing the possibility to specify a charset. Because of this, the use of FileReader and FileWriter is completely forbidden in many Apache projects (Apache Lucene, Solr, Elasticsearch, Apache TIKA,...). So I'd suggest to also fix the API here and just add the missing ctors.

The Java 7+ methods in java.nio.file.Files already ignore the default charset and always use UTF-8. How to proceed with those? Should they be changed to follow the new mechanisms? I'd suggest not doing this, as it's part of the spec (to use UTF-8) and should not rely on external forces, but I wanted to bring this in.

Changing the default would help many users, if they are actually using newer JDKs. For those with older versions (and compiling their code against older versions), you still have to avoid the default charsets. In addition, as you still can change the "default charset", any library developer reading resources from its own JAR file or passing Strings to network protocols cannot rely on the fact that the default charset is really UTF-8! (a user may have changed it to something else). Because of this, Apache libraries will forbid usage of all methods using default charsets (and locales + timezones). The "changeable default" does not affect application developers (because they have in most cases control over the environment), but library developers should always be explicit!

For this to work, I also want to do some "advertisement": All library projects should use the Forbidden-Apis Maven/Gradle/Ant plugin to scan their bytecode for offenders using default charsets, default locales or relying on default timezones. See the blog post about it [1] and the project page [2]. The tool is also useful to replace "jdeps" in projects with Java versions before 8, as it can scan your code for access to internal JDK APIs, too. See the documentation [3] and github wiki pages for useful examples. It may also be a good idea to mention it in the JEP as a "workaround" or "further reading".

Finally: Because one can still change the default, I'd propose to deprecate all methods that use a default charset (unrelated to actually changing the default). Only if you do this would tools like "forbiddenapis" become irrelevant for library developers.

And finally, finally: I'd also propose to change the default Locale to Locale.ROOT (same issues). The String.toLowerCase() in Turkish locales still breaks thousands of apps! But that's a different JEP - but I would strongly support it!

Uwe

[1] http://blog.thetaphi.de/2012/07/default-locales-default-charsets-and.html
[2] https://github.com/policeman-tools/forbidden-apis
[3] https://jenkins.thetaphi.de/job/Forbidden-APIs/javadoc/

-----
Uwe Schindler
uschind...@apache.org
ASF Member, Apache Lucene PMC / Committer
Bremen, Germany
http://lucene.apache.org/
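The Turkish-locale problem mentioned at the end of this message is easy to reproduce. The sketch below (class and method names are made up for illustration) lower-cases the same string under the Turkish locale and under Locale.ROOT; in Turkish, an uppercase "I" maps to the dotless "ı" (U+0131), so comparisons against ASCII keywords stop matching.

```java
import java.util.Locale;

public class TurkishLowerCase {
    // Lower-cases using Turkish rules, as the no-argument String.toLowerCase()
    // would on a JVM whose default locale is Turkish.
    static String lowerTurkish(String s) {
        return s.toLowerCase(Locale.forLanguageTag("tr"));
    }

    // Lower-cases using locale-neutral rules.
    static String lowerRoot(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println("tr:   " + lowerTurkish("TITLE"));
        System.out.println("ROOT: " + lowerRoot("TITLE"));
        System.out.println("equal to \"title\" under tr? " + lowerTurkish("TITLE").equals("title"));
    }
}
```

Code that normalizes protocol keywords, header names, or enum constants with the no-argument toLowerCase() therefore behaves differently depending on the machine's locale, which is the same class of bug as depending on the default charset.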