Re: Draft JEP: To use UTF-8 as the default charset for the Java virtual machine.

2018-02-21 Thread Robert Muir
On Wed, Feb 21, 2018 at 1:16 PM, Xueming Shen wrote:

>
> Hi Robert,
>
> Understood that silent replacement might not be the desired behavior in
> some use scenarios. Any more details regarding what "most apps want"
> when the input is malformed or unmappable? It appears the best the
> underlying de/encoder can do here is to throw an IOException. Given that
> the caller of the Reader/Writer does not have access to the bytes of
> the underlying stream src (reader)/dst (writer), it is in theory
> impossible to do anything to recover and continue without risking data
> loss. The assumption here is that if you want fine-grained control of
> the de/encoding, you might want to work with
> Input/OutputStream/Channel + CharsetDe/Encoder instead of Reader/Writer.
>
> No, I'm not saying we can't do
> Reader(CharsetDecoder)/Writer(CharsetEncoder),
> just wanted to know what the real use scenario is and what the better/
> best choice is here.
>

I think throwing an exception is the best default. This is the default
behavior of Python, for example, unless you specifically ask for
"replace" or "ignore".

>>> b'\xFFabc'.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

It's also the default behavior of the 'iconv' command-line tool used
for converting charsets, unless you pass additional options.

$ iconv -f utf-8 -t utf-8 test2.mp4
 ftypisomisomiso2avc1mp41e
iconv: test2.mp4:1:26: cannot convert

Unfortunately, in Java, the constructors that take Charset or String
parameters silently replace bad input (with \uFFFD, etc.). It's
necessary to pass a CharsetDecoder to get an exception when something
goes wrong.

The current situation is especially confusing because nothing in the
javadocs indicates that the behavior of InputStreamReader(x, Charset)
and InputStreamReader(x, String) differs substantially from
InputStreamReader(x, CharsetDecoder)! I think the Charset and String
parameters should default to REPORT, so that the behavior of all
constructors is consistent. If you want replacement, you should have
to ask for it. I think replacement has use cases, but they are more
"expert", e.g. web crawling and so on. In general, wrong bytes
indicate a problem, and it can be very difficult to debug these issues
when Java hides them by default...
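
To make the difference concrete, here is a minimal sketch that feeds
the same bytes as the Python example above to both kinds of
constructor. The class and method names are made up for illustration.
As far as I can tell, the Charset overload yields U+FFFD and carries
on, while the CharsetDecoder overload fails with a
MalformedInputException, because a freshly created decoder reports
errors unless told otherwise.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the JDK or Lucene.
class ReaderMalformedInput {
    // Same input as the Python example: 0xFF can never start a UTF-8 sequence.
    static final byte[] BAD = {(byte) 0xFF, 'a', 'b', 'c'};

    // Drains a Reader into a String and closes it.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        try (Reader r = reader) {
            for (int n; (n = r.read(buf)) != -1; ) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    // Charset overload: malformed input is replaced, no exception.
    static String lenient(byte[] bytes) throws IOException {
        return readAll(new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8));
    }

    // CharsetDecoder overload: a new decoder reports malformed input.
    static String strict(byte[] bytes) throws IOException {
        return readAll(new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8.newDecoder()));
    }

    public static void main(String[] args) throws IOException {
        // The bad byte has silently become U+FFFD.
        System.out.println(lenient(BAD).equals("\uFFFDabc"));
        try {
            strict(BAD);
        } catch (CharacterCodingException e) {
            System.out.println("strict reader failed: " + e);
        }
    }
}
```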


Re: Draft JEP: To use UTF-8 as the default charset for the Java virtual machine.

2018-02-21 Thread Robert Muir
On Wed, Feb 21, 2018 at 8:55 AM, Alan Bateman wrote:

> Good progress was made via JDK-8183743 [1] in Java SE 10 to add constructors
> and methods that take a Charset and eliminate the historical
> inconsistencies. The issue of legacy FileReader/FileWriter is linked from
> that JIRA issue.
>

Can we ensure we have CharsetDecoder/Encoder params too? There is
unfortunately a huge difference between InputStreamReader(x,
StandardCharsets.UTF_8) and InputStreamReader(x,
StandardCharsets.UTF_8.newDecoder()). And the silent replacement of
the "easier" one is probably not what most apps want.
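
The encoding side shows the same split. Below is a rough sketch with
hypothetical names, using US-ASCII so that 'é' is unmappable. My
understanding is that OutputStreamWriter(x, Charset) writes the
encoder's replacement byte ('?') without complaint, whereas
OutputStreamWriter(x, CharsetEncoder) throws an
UnmappableCharacterException.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the JDK or Lucene.
class WriterUnmappableChar {
    // Writes text through the Writer, closes it, and returns what reached the sink.
    static byte[] writeAll(Writer writer, ByteArrayOutputStream sink, String text)
            throws IOException {
        try (Writer w = writer) {
            w.write(text);
        }
        return sink.toByteArray();
    }

    // Charset overload: unmappable characters are replaced silently.
    static byte[] lenient(String text) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        return writeAll(new OutputStreamWriter(sink, StandardCharsets.US_ASCII), sink, text);
    }

    // CharsetEncoder overload: a new encoder reports unmappable characters.
    static byte[] strict(String text) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        return writeAll(
                new OutputStreamWriter(sink, StandardCharsets.US_ASCII.newEncoder()), sink, text);
    }

    public static void main(String[] args) throws IOException {
        String text = "caf\u00E9"; // ends in e-acute, which US-ASCII cannot represent
        System.out.println(new String(lenient(text), StandardCharsets.US_ASCII)); // caf?
        try {
            strict(text);
        } catch (CharacterCodingException e) {
            System.out.println("strict writer failed: " + e);
        }
    }
}
```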


Re: RFR: JDK-8021560,(str) String constructors that take ByteBuffer

2018-02-17 Thread Robert Muir
On Sat, Feb 17, 2018 at 4:05 AM, Alan Bateman wrote:

> Just to add that the existing low-level / advanced API for this is
> CharsetEncoder. The CoderResult from an encode and the buffer positions
> means you know when there is overflow, the number of characters encoded, and
> how many bytes were added to the buffer. It also gives fine control on how
> encoding errors should be handled and you cache a CharsetEncoder to avoid
> some of the performance anomalies that come up in the Charset vs. charset
> name discussions.

> This is not an API that most developers will ever use
> directly but if the use-case is advanced cases (libraries or frameworks
>

Really? How else are you supposed to convert bytes <-> characters
reliably in Java without using CharsetEncoder/Decoder? It's the only
way to get an exception when something is wrong, instead of silently
masking errors with replacement characters (the JDK seems to think it's
doing people favors there; it's not).
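
For plain byte[] <-> String conversion, a sketch of the kind of helper
this forces people to write might look like the following. The
StrictUtf8 class and its method names are hypothetical. It relies on
the one-shot CharsetDecoder.decode(ByteBuffer) and
CharsetEncoder.encode(CharBuffer) convenience methods, which throw
CharacterCodingException when the error action is REPORT.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the JDK or Lucene.
class StrictUtf8 {
    // Decodes UTF-8 bytes, failing on malformed input instead of emitting U+FFFD.
    static String decode(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Encodes to UTF-8, failing on unpaired surrogates instead of emitting '?'.
    static byte[] encode(String text) throws CharacterCodingException {
        ByteBuffer out = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(text));
        byte[] bytes = new byte[out.remaining()];
        out.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        byte[] bad = {(byte) 0xFF, 'a', 'b', 'c'};
        // The String constructor masks the problem with a replacement character.
        System.out.println(new String(bad, StandardCharsets.UTF_8).equals("\uFFFDabc"));
        try {
            decode(bad);
        } catch (CharacterCodingException e) {
            System.out.println("strict decode failed: " + e);
        }
        try {
            encode("a\uD800b"); // unpaired high surrogate
        } catch (CharacterCodingException e) {
            System.out.println("strict encode failed: " + e);
        }
    }
}
```

Creating a new decoder or encoder per call keeps the sketch simple and
thread-safe; code on a hot path would more likely cache one per thread
and reset() it.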


Re: JDK 9 Build 111 seems to miss some locale data, Lucene tests fail with Farsi and Thai language

2016-03-26 Thread Robert Muir
On Sat, Mar 26, 2016 at 7:56 AM, Uwe Schindler wrote:
>
> (1) Thai's locale does not have working dictionary-based BreakIterator 
> available. The following "check" in Lucene for this fails, because it cannot 
> detect a boundary correctly:

Something sneakier is happening. Months ago, when the first Jigsaw EA
came out, I hit this same problem and, just like now, could not
reproduce it standalone.

Here is what the Lucene test is doing: http://pastebin.com/5YUhjiAa

So Thai works standalone, but sometimes fails in our tests? Maybe it
is something else, like a compiler issue, or it depends on other
things the JVM has done. I played with it this morning in various ways
but cannot make a simple standalone test that fails!


Re: JDK 9 build 109 -> Lucene's Ant build works again; still missing Hotspot patches

2016-03-19 Thread Robert Muir
On Thu, Mar 17, 2016 at 10:25 AM, Uwe Schindler wrote:
>
> My local tests showed that the MethodHandle-bug is solved, the other one is 
> hopefully fixed, too. Robert may have a way to quickly reproduce.
>

JDK-8150280 is fixed too; I just tested it. Thanks!


Re: Suggested fix for JDK-4724038 (Add unmap method to MappedByteBuffer)

2015-09-14 Thread Robert Muir
On Wed, Sep 9, 2015 at 11:46 AM, Peter Levart wrote:
>
> By wanting to truly release the resources you allocated, you are essentially
> wanting to manage the resources yourself. If you are willing to track the
> active mapped byte buffers manually yourself, then what about the following
> idea:
>

As Uwe mentioned, that is probably not truly necessary. If Lucene
cannot delete a file, it retries periodically until the delete
succeeds. So if things were unmapped "soonish", the Lucene case would
be fine, I think.

I do realize other apps may not have that infrastructure/luxury...
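
For anyone without that infrastructure, the retry idea can be sketched
roughly as below. This is not Lucene's actual code; the PendingDeletes
class and its methods are hypothetical, and it only assumes that a
delete which fails now (for example because the file is still mapped
on Windows) may succeed on a later attempt.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration of "retry the delete later"; not taken from Lucene.
class PendingDeletes {
    private final Set<Path> pending = ConcurrentHashMap.newKeySet();

    // Tries to delete now; on failure remembers the path for a later retry.
    // Returns true if the file is gone after this call.
    boolean deleteOrDefer(Path path) {
        try {
            Files.deleteIfExists(path);
            pending.remove(path);
            return true;
        } catch (IOException e) {
            pending.add(path);
            return false;
        }
    }

    // Retries every remembered path; returns how many are still pending.
    int retryPending() {
        for (Path p : pending.toArray(new Path[0])) {
            deleteOrDefer(p);
        }
        return pending.size();
    }
}
```

A caller would invoke retryPending() from whatever periodic hook it
already has, such as a commit or a scheduled task.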