Re: Draft JEP: To use UTF-8 as the default charset for the Java virtual machine.

2018-02-22 Thread Alan Bateman

On 21/02/2018 20:50, Uwe Schindler wrote:

:
Thanks for clarifying! I just wanted to mention this, because those methods are 
different, so you should at least think about it 

These methods were deliberately specified to use UTF-8 and I don't think 
we should change them (changing them for a release or two would cause 
needless breakage of course).


-Alan


RE: Draft JEP: To use UTF-8 as the default charset for the Java virtual machine.

2018-02-21 Thread Uwe Schindler
Hi Alan,

> > The Java 7+ methods in java.nio.file.Files already ignore the default
> > charset and always use UTF-8. How to proceed with those? Should they be
> > changed to follow the new mechanism? I'd suggest not doing this, as it's
> > part of the spec (to use UTF-8) and should not rely on external forces,
> > but I wanted to bring this up.
> >
> There is no proposal to change these methods.

Thanks for clarifying! I just wanted to mention this, because those methods are 
different, so you should at least think about it 

Uwe

-
Uwe Schindler
uschind...@apache.org 
ASF Member, Apache Lucene PMC / Committer
Bremen, Germany
http://lucene.apache.org/




Re: Draft JEP: To use UTF-8 as the default charset for the Java virtual machine.

2018-02-21 Thread Robert Muir
On Wed, Feb 21, 2018 at 1:16 PM, Xueming Shen  wrote:

>
> Hi Robert,
>
> Understood, a silent replacement might not be the desired behavior in
> some use scenarios. Any more details regarding what "most apps want"
> when there is malformed/unmappable input? It appears the best the
> underlying de/encoder can do here is to throw an IOException. Given that
> the caller of the Reader/Writer does not have access to the bytes of
> the underlying stream src (reader)/dst (writer), it is in theory
> impossible to do anything to recover and continue without risking data
> loss. The assumption here is that if you want fine-grained control of
> the de/encoding, you might want to work with the
> Input/OutputStream/Channel + CharsetDe/Encoder instead of Reader/Writer.
>
> No, I'm not saying we can't do
> Reader(CharsetDecoder)/Writer(CharsetEncoder),
> just wanted to know what's the real use scenario and what's the
> better/best choice here.
>

I think the exception is the best default? This is the default
behavior of Python, for example, unless you specifically ask for
"replace" or "ignore".

>>> b'\xFFabc'.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

It's also the default behavior of the 'iconv' command-line tool used for
converting charsets, unless you pass additional options.

$ iconv -f utf-8 -t utf-8 test2.mp4
 ftypisomisomiso2avc1mp41e
iconv: test2.mp4:1:26: cannot convert

Unfortunately in Java, when using Charset or String parameters, malformed
input is silently replaced with \uFFFD. It's necessary to pass a
CharsetDecoder to get an exception when something goes wrong.

The current situation is especially confusing as there is nothing in
the javadocs to indicate that the behavior of InputStreamReader(x,
Charset) and InputStreamReader(x, String) differs substantially from
InputStreamReader(x, CharsetDecoder)! I think the Charset and String
parameters should default to REPORT, so that the behavior of all
constructors is consistent. If you want to replace, you should have
to ask for it. I think replacement has use cases, but they are more
"expert", e.g. web crawling and so on. In general, wrong bytes
indicate a problem, and it can be very difficult to debug these issues
when Java hides these problems by default...
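
For illustration, here is a minimal sketch of the difference described above, using the same malformed input as the Python example (a stray 0xFF byte followed by "abc"). The class and method names (DecodeModes, lenient, strict) are made up for this example; IOExceptions are wrapped in UncheckedIOException only to keep the sketch short.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the JDK or of the JEP.
class DecodeModes {
    // Reads the whole reader into a String, wrapping any IOException.
    private static String drain(Reader reader) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = reader) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Charset overload: malformed bytes are silently replaced with U+FFFD.
    static String lenient(byte[] bytes) {
        return drain(new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8));
    }

    // CharsetDecoder overload: a fresh decoder reports malformed input,
    // so reading fails with a CharacterCodingException (an IOException).
    static String strict(byte[] bytes) {
        return drain(new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8.newDecoder()));
    }

    public static void main(String[] args) {
        byte[] bad = {(byte) 0xFF, 'a', 'b', 'c'};
        System.out.println("lenient replaced the byte: "
                + lenient(bad).equals("\uFFFDabc"));
        try {
            strict(bad);
            System.out.println("strict: no exception");
        } catch (UncheckedIOException e) {
            System.out.println("strict failed with: " + e.getCause());
        }
    }
}
```

The only difference between the two calls is the second constructor argument, which is why the differing behavior is easy to miss.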


Re: Draft JEP: To use UTF-8 as the default charset for the Java virtual machine.

2018-02-21 Thread Xueming Shen

On 2/21/18, 6:26 AM, Robert Muir wrote:

On Wed, Feb 21, 2018 at 8:55 AM, Alan Bateman  wrote:


Good progress was made via JDK-8183743 [1] in Java SE 10 to add constructors
and methods that take a Charset and eliminate the historical
inconsistencies. The issue of legacy FileReader/FileWriter is linked from
that JIRA issue.


Can we ensure we have CharsetDecoder/Encoder params too? There is
unfortunately a huge difference between InputStreamReader(x,
StandardCharsets.UTF_8) and InputStreamReader(x,
StandardCharsets.UTF_8.newDecoder()). And the silent replacement of
the "easier" one is probably not what most apps want.

Hi Robert,

Understood, a silent replacement might not be the desired behavior in
some use scenarios. Any more details regarding what "most apps want"
when there is malformed/unmappable input? It appears the best the
underlying de/encoder can do here is to throw an IOException. Given that
the caller of the Reader/Writer does not have access to the bytes of
the underlying stream src (reader)/dst (writer), it is in theory
impossible to do anything to recover and continue without risking data
loss. The assumption here is that if you want fine-grained control of
the de/encoding, you might want to work with the
Input/OutputStream/Channel + CharsetDe/Encoder instead of Reader/Writer.

No, I'm not saying we can't do
Reader(CharsetDecoder)/Writer(CharsetEncoder),
just wanted to know what's the real use scenario and what's the
better/best choice here.

-Sherman




Re: Draft JEP: To use UTF-8 as the default charset for the Java virtual machine.

2018-02-21 Thread Xueming Shen

Hi Volker,

Yes, the handling of sun.jnu.encoding will not be changed. It will remain
a read-only/informative system property.

sun.jnu.encoding is really an implementation detail (as is file.encoding,
though in this JEP file.encoding might be used to provide a mechanism to
fall back to the current/old/existing behavior, so it might become a
public/official interface/system property). From an API perspective,
Charset.defaultCharset() is the only place to obtain the "Java virtual
machine's default charset".
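
As a small sketch of that distinction: the snippet below prints the value from the supported API next to the two system properties. The class name DefaultCharsetInfo is invented for this example, and since the properties are implementation details they may be absent (null) on some runtimes.

```java
import java.nio.charset.Charset;

// Hypothetical demo class, not part of the JDK or of the JEP.
class DefaultCharsetInfo {
    // The supported API for obtaining the default charset.
    static Charset apiDefault() {
        return Charset.defaultCharset();
    }

    // Implementation-specific properties; informative only, may be null.
    static String property(String name) {
        return System.getProperty(name);
    }

    public static void main(String[] args) {
        System.out.println("Charset.defaultCharset() = " + apiDefault());
        System.out.println("file.encoding            = " + property("file.encoding"));
        System.out.println("sun.jnu.encoding         = " + property("sun.jnu.encoding"));
    }
}
```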

As Alan said in a previous comment, clarifications will be included in the
final version based on feedback/suggestions.

-Sherman

On 2/21/18, 8:11 AM, Volker Simonis wrote:

Hi Sherman,

the tricky part is really "sun.jnu.encoding" and how the VM interacts
with the underlying OS. You may remember that we had an interesting
discussion about this topic some time ago [1].

As far as I understand, the JEP doesn't plan to change the handling of
"sun.jnu.encoding". So does this mean that the VM will still correctly
start and work on a system with a platform encoding different from
UTF-8? I.e. will starting the VM from a path which contains characters
in that special platform encoding, or with classpath/argument settings
containing characters in that encoding, still work? If the answer is
yes (which I expect), maybe you could explain that in a little more
detail in the JEP. I.e. the JEP should say that it changes the default
encoding for the Java API classes and not the default encoding for
natively accessing system resources.

Maybe the JEP should also mention that "sun.jnu.encoding" is more or
less a "read-only" property which cannot be reliably set by the user
on the command line (it's a chicken-and-egg problem: parsing the
command line already requires the correct encoding, so the property
cannot be reliably set there).

For these reasons the Summary "Use UTF-8 as the Java virtual machine's
default charset ..." is a little misleading. Maybe you could rephrase
to something like "Use UTF-8 as the default charset so that Java APIs
that depend on the default charset behave consistently across all
platforms."

Thank you and best regards,
Volker

[1] 
http://mail.openjdk.java.net/pipermail/core-libs-dev/2015-December/thread.html#37516

On Wed, Feb 21, 2018 at 7:31 AM, Xueming Shen  wrote:

This draft JEP contains a proposal to use UTF-8 as the default charset for
the JVM, so that
APIs that depend on the default charset behave consistently across all
platforms.

For more details, please see:
https://bugs.openjdk.java.net/browse/JDK-8187041

Sherman




Re: Draft JEP: To use UTF-8 as the default charset for the Java virtual machine.

2018-02-21 Thread Volker Simonis
Hi Sherman,

the tricky part is really "sun.jnu.encoding" and how the VM interacts
with the underlying OS. You may remember that we had an interesting
discussion about this topic some time ago [1].

As far as I understand, the JEP doesn't plan to change the handling of
"sun.jnu.encoding". So does this mean that the VM will still correctly
start and work on a system with a platform encoding different from
UTF-8? I.e. will starting the VM from a path which contains characters
in that special platform encoding, or with classpath/argument settings
containing characters in that encoding, still work? If the answer is
yes (which I expect), maybe you could explain that in a little more
detail in the JEP. I.e. the JEP should say that it changes the default
encoding for the Java API classes and not the default encoding for
natively accessing system resources.

Maybe the JEP should also mention that "sun.jnu.encoding" is more or
less a "read-only" property which cannot be reliably set by the user
on the command line (it's a chicken-and-egg problem: parsing the
command line already requires the correct encoding, so the property
cannot be reliably set there).

For these reasons the Summary "Use UTF-8 as the Java virtual machine's
default charset ..." is a little misleading. Maybe you could rephrase
to something like "Use UTF-8 as the default charset so that Java APIs
that depend on the default charset behave consistently across all
platforms."

Thank you and best regards,
Volker

[1] 
http://mail.openjdk.java.net/pipermail/core-libs-dev/2015-December/thread.html#37516

On Wed, Feb 21, 2018 at 7:31 AM, Xueming Shen  wrote:
> This draft JEP contains a proposal to use UTF-8 as the default charset for
> the JVM, so that
> APIs that depend on the default charset behave consistently across all
> platforms.
>
> For more details, please see:
> https://bugs.openjdk.java.net/browse/JDK-8187041
>
> Sherman


RE: Draft JEP: To use UTF-8 as the default charset for the Java virtual machine.

2018-02-21 Thread Uwe Schindler
Hi,

Thanks Alan for the link to this issue about FileReader/Writer!

Uwe

-
Uwe Schindler
uschind...@apache.org 
ASF Member, Apache Lucene PMC / Committer
Bremen, Germany
http://lucene.apache.org/

> -Original Message-
> From: core-libs-dev [mailto:core-libs-dev-boun...@openjdk.java.net] On
> Behalf Of Alan Bateman
> Sent: Wednesday, February 21, 2018 2:55 PM
> To: Stephen Colebourne ; core-libs-
> d...@openjdk.java.net
> Subject: Re: Draft JEP: To use UTF-8 as the default charset for the Java 
> virtual
> machine.
> 
> On 21/02/2018 13:41, Stephen Colebourne wrote:
> > On 21 February 2018 at 13:37, Alan Bateman 
> wrote:
> >> The proposal is to eventually get to the point that the default charset
> >> cannot be changed. It will take several releases to get there due to the
> >> potential compatibility impact.
> > This seems like a reasonable strategy to solve the problem.
> >
> > I also agree that all locations where a default charset is used need
> > to have a method alongside that takes a Charset, e.g. FileWriter.
> >
> Good progress was made via JDK-8183743 [1] in Java SE 10 to add
> constructors and methods that take a Charset and eliminate the
> historical inconsistencies. The issue of legacy FileReader/FileWriter is
> linked from that JIRA issue.
> 
> -Alan
> 
> [1] https://bugs.openjdk.java.net/browse/JDK-8183743



Re: Draft JEP: To use UTF-8 as the default charset for the Java virtual machine.

2018-02-21 Thread Robert Muir
On Wed, Feb 21, 2018 at 8:55 AM, Alan Bateman  wrote:

> Good progress was made via JDK-8183743 [1] in Java SE 10 to add constructors
> and methods that take a Charset and eliminate the historical
> inconsistencies. The issue of legacy FileReader/FileWriter is linked from
> that JIRA issue.
>

Can we ensure we have CharsetDecoder/Encoder params too? There is
unfortunately a huge difference between InputStreamReader(x,
StandardCharsets.UTF_8) and InputStreamReader(x,
StandardCharsets.UTF_8.newDecoder()). And the silent replacement of
the "easier" one is probably not what most apps want.
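
The same split exists on the encoding side. As a rough sketch (class and method names are invented for this example), writing a character that US-ASCII cannot represent is silently turned into '?' by the Charset overload of OutputStreamWriter, while the CharsetEncoder overload fails:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the JDK or of the JEP.
class EncodeModes {
    // Encodes text as US-ASCII through an OutputStreamWriter.
    // strict == false: Charset overload, unmappable chars become '?'.
    // strict == true:  CharsetEncoder overload, unmappable chars raise an
    //                  IOException (wrapped here to keep the sketch short).
    static byte[] write(String text, boolean strict) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = strict
                ? new OutputStreamWriter(out, StandardCharsets.US_ASCII.newEncoder())
                : new OutputStreamWriter(out, StandardCharsets.US_ASCII)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // the last character is not in US-ASCII
        System.out.println(new String(write(text, false), StandardCharsets.US_ASCII));
        try {
            write(text, true);
            System.out.println("strict: no exception");
        } catch (UncheckedIOException e) {
            System.out.println("strict failed with: " + e.getCause());
        }
    }
}
```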


Re: Draft JEP: To use UTF-8 as the default charset for the Java virtual machine.

2018-02-21 Thread Alan Bateman

On 21/02/2018 08:53, Uwe Schindler wrote:

:
The Java 7+ methods in java.nio.file.Files already ignore the default charset
and always use UTF-8. How to proceed with those? Should they be changed to
follow the new mechanism? I'd suggest not doing this, as it's part of the
spec (to use UTF-8) and should not rely on external forces, but I wanted to
bring this up.


There is no proposal to change these methods.

-Alan


Re: Draft JEP: To use UTF-8 as the default charset for the Java virtual machine.

2018-02-21 Thread Alan Bateman

On 21/02/2018 13:41, Stephen Colebourne wrote:

On 21 February 2018 at 13:37, Alan Bateman  wrote:

The proposal is to eventually get to the point that the default charset
cannot be changed. It will take several releases to get there due to the
potential compatibility impact.

This seems like a reasonable strategy to solve the problem.

I also agree that all locations where a default charset is used need
to have a method alongside that takes a Charset, e.g. FileWriter.

Good progress was made via JDK-8183743 [1] in Java SE 10 to add 
constructors and methods that take a Charset and eliminate the 
historical inconsistencies. The issue of legacy FileReader/FileWriter is 
linked from that JIRA issue.


-Alan

[1] https://bugs.openjdk.java.net/browse/JDK-8183743


Re: Draft JEP: To use UTF-8 as the default charset for the Java virtual machine.

2018-02-21 Thread Stephen Colebourne
On 21 February 2018 at 13:37, Alan Bateman  wrote:
> The proposal is to eventually get to the point that the default charset
> cannot be changed. It will take several releases to get there due to the
> potential compatibility impact.

This seems like a reasonable strategy to solve the problem.

I also agree that all locations where a default charset is used need
to have a method alongside that takes a Charset, e.g. FileWriter.

Stephen


Re: Draft JEP: To use UTF-8 as the default charset for the Java virtual machine.

2018-02-21 Thread Alan Bateman

On 21/02/2018 13:19, David Lloyd wrote:

I agree with Uwe and Remi; if the default is still changeable, the
problem doesn't go away, it simply becomes slightly more insidious.

The proposal is to eventually get to the point that the default charset
cannot be changed. It will take several releases to get there due to the
potential compatibility impact. This draft JEP is the first step to
switch to UTF-8 by default. A first step has to allow it to be changed in
order to keep some existing code/deployments working. Sorry this isn't
clear in the JEP yet; there are several clarifications to this JEP that
haven't been included yet (on my list, I didn't realize it would be
discussed here this week).


-Alan


Re: Draft JEP: To use UTF-8 as the default charset for the Java virtual machine.

2018-02-21 Thread David Lloyd
I agree with Uwe and Remi; if the default is still changeable, the
problem doesn't go away, it simply becomes slightly more insidious.

On Wed, Feb 21, 2018 at 12:31 AM, Xueming Shen  wrote:
> This draft JEP contains a proposal to use UTF-8 as the default charset for
> the JVM, so that
> APIs that depend on the default charset behave consistently across all
> platforms.
>
> For more details, please see:
> https://bugs.openjdk.java.net/browse/JDK-8187041
>
> Sherman



-- 
- DML


RE: Draft JEP: To use UTF-8 as the default charset for the Java virtual machine.

2018-02-21 Thread Remi Forax
I agree with Uwe,
we should deprecate all methods/constructors that rely on the default
charset.

And we should do that before changing to use UTF-8 by default.

Remi


On February 21, 2018 8:53:54 AM UTC, Uwe Schindler  
wrote:
>Hi,
>
>> This draft JEP contains a proposal to use UTF-8 as the default
>charset
>> for the JVM, so that
>> APIs that depend on the default charset behave consistently across all
>> platforms.
>> 
>> For more details, please see:
>> https://bugs.openjdk.java.net/browse/JDK-8187041
>
>Thanks for finally adding a JEP like this. Thanks also to Robert Muir
>for always insisting in fixing this problem! I have a few comments:
>
>The JEP should NOT cause that new APIs, which may convert between
>characters and bytes to no longer explicitly accept a charset. One
>example is the proposed ByteBuffer methods taking String. The default
>ones would work with UTF-8, but it should still be possible to an API
>user to always add a charset whenever there is a conversion between
>bytes and chars. This is especially important as the user may still
>change the default and breaking your app. Because the rule is still:
>Only YOU, the developer, know the charset of your stuff when you load a
>JAR resource file or pass a String to the network in a ByteBuffer!
>
>The biggest offenders on this is also given as an example: FileReader
>and FileWriter. Although both classes subclass
>InputStreamReader/OutputStreamWriter and just pass the right delegate
>to the superclass in the ctor, both classes are missing the possibility
>to specify a charset. Because of this, the use of FileReader and
>FileWriter is completely forbidden in many Apache projects (Apache
>Lucene, Solr, Elasticsearch, Apache TIKA,...). So I'd suggest to also
>fix the API here and just add the missing ctors.
>
>The Java 7+ methods in java.nio.file.Files already ignore the default
>charset and always use UTF-8. How to proceed with those? Should they be
>changed to behave to the new mechanisms? I'd suggest to not do this, as
>its part of the spec (to use UTF-8) and should not rely on external
>forces, but I wanted to bring this in.
>
>Changing the default would help many users, if they are actually using
>newer JDKs. For those with older versions (and compiling their code
>against older versions), you still have to avoid the default charsets.
>In addition, as you still can change the "default charset", any library
>developer reading resources from its own JAR file or passing Strings to
>network protocols cannot rely on the fact, that the default charset is
>really UTF-8! (a user may have changed it to something else). Because
>of this, Apache libraries will forbid usage of all methods using
>default charsets (and locales + timezones). The "changeable default"
>does not affect application developers (because they have in most cases
>control about the environment), but library developers should always be
>explicit!
>
>For this to work, I also want to do some "advertisement": All library
>projects should use the Forbidden-Apis Maven/Gradle/Ant plugin to scan
>their bytecode for offenders using default charsets, default locales or
>relying on default timezones. See the blog post about it [1] and the
>project page [2]. The tool is also useful to replace "jdeps" in
>projects with Java versions before 8, as it can scan your code for
>access to internal JDK APIs, too. See the documentation [3] and github
>wiki pages for useful examples. It may also be a good idea to mention
>it in the JEP as a "workaround" or "further reading".
>
>Finally: Because one can still change the default, I'd propose to
>deprecate all methods that use a default charset (unrelated to actually
>changing the default). Only if you do this, it would make tools like
>"forbiddenapis" irrelevant for library developers.
>
>And finally, finally: I'd also propose to change the default Locale to
>Locale.ROOT (same issues). The String.toLowerCase() in Turkish locales
>still break thousands of apps! But that's a different JEP - but I would
>strongly support it!
>
>Uwe
>
>[1]
>http://blog.thetaphi.de/2012/07/default-locales-default-charsets-and.html
>[2] https://github.com/policeman-tools/forbidden-apis
>[3] https://jenkins.thetaphi.de/job/Forbidden-APIs/javadoc/
>
>-
>Uwe Schindler
>uschind...@apache.org 
>ASF Member, Apache Lucene PMC / Committer
>Bremen, Germany
>http://lucene.apache.org/



RE: Draft JEP: To use UTF-8 as the default charset for the Java virtual machine.

2018-02-21 Thread Uwe Schindler
Hi,

> This draft JEP contains a proposal to use UTF-8 as the default charset
> for the JVM, so that
> APIs that depend on the default charset behave consistently across all
> platforms.
> 
> For more details, please see:
> https://bugs.openjdk.java.net/browse/JDK-8187041

Thanks for finally adding a JEP like this. Thanks also to Robert Muir for
always insisting on fixing this problem! I have a few comments:

The JEP should NOT lead to new APIs that convert between characters and
bytes no longer explicitly accepting a charset. One example is the proposed
ByteBuffer methods taking String. The default ones would work with UTF-8, but
it should still be possible for an API user to always pass a charset whenever
there is a conversion between bytes and chars. This is especially important as
the user may still change the default and break your app. Because the rule
is still: Only YOU, the developer, know the charset of your stuff when you load
a JAR resource file or pass a String to the network in a ByteBuffer!
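
A minimal sketch of that rule, with invented names (ExplicitCharset, toWire, fromWire): the charset is spelled out at every conversion, so the result does not depend on whatever default the JVM happens to be running with.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the JDK or of the JEP.
class ExplicitCharset {
    // Encode with an explicit charset rather than String.getBytes().
    static ByteBuffer toWire(String text) {
        return StandardCharsets.UTF_8.encode(text);
    }

    // Decode with the same explicit charset rather than new String(bytes).
    static String fromWire(ByteBuffer buf) {
        return StandardCharsets.UTF_8.decode(buf).toString();
    }

    public static void main(String[] args) {
        String text = "Gr\u00fc\u00dfe aus Bremen";
        ByteBuffer wire = toWire(text);
        int length = wire.remaining();
        System.out.println(length + " bytes, round trip ok: "
                + fromWire(wire).equals(text));
    }
}
```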

The biggest offenders here are also given as an example: FileReader and
FileWriter. Although both classes subclass InputStreamReader/OutputStreamWriter
and just pass the right delegate to the superclass in the ctor, both classes
are missing the possibility to specify a charset. Because of this, the use of
FileReader and FileWriter is completely forbidden in many Apache projects
(Apache Lucene, Solr, Elasticsearch, Apache TIKA,...). So I'd suggest also
fixing the API here and just adding the missing ctors.
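
Until such ctors exist, the usual workaround is to build the delegate chain by hand. The sketch below shows one way to do that; the class and helper names (ExplicitFileCharset, openWriter, openReader, roundTrip) are invented for this example.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the JDK or of the JEP.
class ExplicitFileCharset {
    // Stand-in for "new FileWriter(file)" with the charset spelled out.
    static Writer openWriter(File file, Charset cs) throws IOException {
        return new OutputStreamWriter(new FileOutputStream(file), cs);
    }

    // Stand-in for "new FileReader(file)" with the charset spelled out.
    static BufferedReader openReader(File file, Charset cs) throws IOException {
        return new BufferedReader(
                new InputStreamReader(new FileInputStream(file), cs));
    }

    // Writes one line to a temporary file and reads it back with the same charset.
    static String roundTrip(String line, Charset cs) {
        try {
            File tmp = File.createTempFile("charset-demo", ".txt");
            try {
                try (Writer w = openWriter(tmp, cs)) {
                    w.write(line);
                }
                try (BufferedReader r = openReader(tmp, cs)) {
                    return r.readLine();
                }
            } finally {
                tmp.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String line = "Gr\u00fc\u00dfe";
        System.out.println(line.equals(roundTrip(line, StandardCharsets.UTF_8)));
        System.out.println(line.equals(roundTrip(line, StandardCharsets.ISO_8859_1)));
    }
}
```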

The Java 7+ methods in java.nio.file.Files already ignore the default charset
and always use UTF-8. How to proceed with those? Should they be changed to
follow the new mechanism? I'd suggest not doing this, as it's part of the
spec (to use UTF-8) and should not rely on external forces, but I wanted to
bring this up.
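
To make the behavior concrete, here is a small sketch using the Files.readAllLines overload that takes no Charset: it decodes as UTF-8 whatever the default charset is, and it rejects malformed input instead of replacing it. The class and method names (FilesUtf8, readBack) are invented for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical demo class, not part of the JDK or of the JEP.
class FilesUtf8 {
    // Writes raw bytes to a temporary file and reads them back with the
    // no-charset overload, which is specified to decode as UTF-8.
    static List<String> readBack(byte[] content) {
        try {
            Path tmp = Files.createTempFile("utf8-demo", ".txt");
            try {
                Files.write(tmp, content);
                return Files.readAllLines(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] valid = "gr\u00fc\u00df dich\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(readBack(valid));

        byte[] malformed = {(byte) 0xFF, 'a', 'b', 'c', '\n'};
        try {
            readBack(malformed);
            System.out.println("malformed input accepted");
        } catch (UncheckedIOException e) {
            System.out.println("malformed input rejected: " + e.getCause());
        }
    }
}
```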

Changing the default would help many users, if they are actually using newer
JDKs. Those on older versions (and compiling their code against older
versions) still have to avoid the default charsets. In addition, as you
can still change the "default charset", any library developer reading resources
from its own JAR file or passing Strings to network protocols cannot rely on
the default charset really being UTF-8 (a user may have changed it
to something else). Because of this, Apache libraries will forbid usage of all
methods using default charsets (and locales + timezones). The "changeable
default" does not affect application developers (because in most cases they
have control over the environment), but library developers should always be
explicit!

For this to work, I also want to do some "advertisement": All library projects 
should use the Forbidden-Apis Maven/Gradle/Ant plugin to scan their bytecode 
for offenders using default charsets, default locales or relying on default 
timezones. See the blog post about it [1] and the project page [2]. The tool is 
also useful to replace "jdeps" in projects with Java versions before 8, as it 
can scan your code for access to internal JDK APIs, too. See the documentation 
[3] and GitHub wiki pages for useful examples. It may also be a good idea to
mention it in the JEP as a "workaround" or "further reading".

Finally: Because one can still change the default, I'd propose to deprecate all
methods that use a default charset (independently of actually changing the
default). Only this would make tools like "forbiddenapis" irrelevant for
library developers.

And finally, finally: I'd also propose to change the default Locale to
Locale.ROOT (same issues). String.toLowerCase() in Turkish locales still
breaks thousands of apps! But that's a different JEP, and one I would strongly
support!
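
For readers who have not run into the Turkish-locale problem: in Turkish, an uppercase 'I' lowercases to a dotless 'ı' (U+0131), so comparisons against ASCII keywords fail. A short sketch, with the class name LocaleLowerCase invented for this example:

```java
import java.util.Locale;

// Hypothetical demo class, not part of the JDK or of the JEP.
class LocaleLowerCase {
    // Lowercases with an explicit locale instead of the JVM default.
    static String lower(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        // Under Turkish rules 'I' becomes the dotless U+0131, so the check fails.
        System.out.println("title".equals(lower("TITLE", turkish)));
        // Locale.ROOT gives the locale-neutral mapping.
        System.out.println("title".equals(lower("TITLE", Locale.ROOT)));
    }
}
```

Calling toLowerCase() without an argument uses the default locale, so the first result is what an unsuspecting application gets on a machine configured for Turkish.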

Uwe

[1] http://blog.thetaphi.de/2012/07/default-locales-default-charsets-and.html
[2] https://github.com/policeman-tools/forbidden-apis
[3] https://jenkins.thetaphi.de/job/Forbidden-APIs/javadoc/

-
Uwe Schindler
uschind...@apache.org 
ASF Member, Apache Lucene PMC / Committer
Bremen, Germany
http://lucene.apache.org/