Re: Draft JEP: To use UTF-8 as the default charset for the Java virtual machine.
On Wed, Feb 21, 2018 at 1:16 PM, Xueming Shen wrote:
>
> Hi Robert,
>
> Understood a silent replacement might not be the desired behavior in
> some use scenarios. Anymore details regarding what "most apps want"
> when there is/are malformed/unmappable? It appears the best the
> underneath de/encoder can do here is to throw an IOException. Given
> the caller of the Reader/Writer does not have the access to the bytes of
> the underlying stream src (reader)/dst(writer), there is in theory
> impossible
> to do anything to recover and continue without risking data loss. The
> assumption here is if you want to have a fine-grained control of the de/
> encoding, you might want to work with the Input/OutStream/Channel +
> CharsetDe/Encoder instead of Reader/Writer.
>
> No, I'm not saying we can't do
> Reader(CharsetDecoder)/Writer(CharsetEncoder),
> just wanted to know what's the real use scenario and what's the better/
> best choice here.
>

I think throwing an exception is the best default. This is Python's default behavior, for example, unless you specifically ask for "replace" or "ignore".

    >>> b'\xFFabc'.decode("utf-8")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

It's also the default behavior of the 'iconv' command-line tool used for converting charsets, unless you pass additional options.

    $ iconv -f utf-8 -t utf-8 test2.mp4
    ftypisomisomiso2avc1mp41e
    iconv: test2.mp4:1:26: cannot convert

Unfortunately, in Java, the constructors that take Charset or String parameters silently replace bad input with \uFFFD, etc. It's necessary to pass a CharsetDecoder to get an exception when something went wrong. The current situation is especially confusing because nothing in the javadocs indicates that the behavior of InputStreamReader(x, Charset) and InputStreamReader(x, String) differs substantially from that of InputStreamReader(x, CharsetDecoder)!

I think the Charset and String parameters should default to REPORT, so that the behavior of all constructors is consistent. If you want replacement, you should have to ask for it. I think replacement has use cases, but they are more "expert" ones, e.g. web crawling. In general, wrong bytes indicate a problem, and these issues can be very difficult to debug when Java hides them by default...
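
To make the difference concrete, here is a minimal sketch that feeds the same bytes as the Python example above (an invalid 0xFF followed by "abc") through both InputStreamReader constructors. The class and method names (ReaderDecodeDemo, readWithCharset, readWithDecoder) are made up for illustration; only the JDK calls are real.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public class ReaderDecodeDemo {
    // Same input as the Python example: an invalid 0xFF byte followed by "abc".
    static final byte[] BAD = {(byte) 0xFF, 'a', 'b', 'c'};

    // Drains a Reader into a String, one char at a time.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    // Charset constructor: malformed input is silently replaced with U+FFFD.
    static String readWithCharset(byte[] bytes) throws IOException {
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            return readAll(r);
        }
    }

    // CharsetDecoder constructor: a fresh decoder defaults to REPORT,
    // so malformed input surfaces as a CharacterCodingException.
    static String readWithDecoder(byte[] bytes) throws IOException {
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8.newDecoder())) {
            return readAll(r);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("Charset constructor: " + readWithCharset(BAD));
        try {
            readWithDecoder(BAD);
        } catch (CharacterCodingException e) {
            System.out.println("CharsetDecoder constructor: " + e);
        }
    }
}
```

The only difference between the two methods is the second constructor argument, which is easy to miss in a code review.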
Re: Draft JEP: To use UTF-8 as the default charset for the Java virtual machine.
On Wed, Feb 21, 2018 at 8:55 AM, Alan Bateman wrote:
> Good progress was made via JDK-8183743 [1] in Java SE 10 to add constructors
> and methods that take a Charset and eliminate the historical
> inconsistencies. The issue of legacy FileReader/FileWriter is linked from
> that JIRA issue.
>

Can we ensure we have CharsetDecoder/Encoder params too? There is unfortunately a huge difference between InputStreamReader(x, StandardCharsets.UTF_8) and InputStreamReader(x, StandardCharsets.UTF_8.newDecoder()). And the silent replacement of the "easier" one is probably not what most apps want.
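
The same split exists on the encoder side. As a rough sketch (WriterEncodeDemo and its method names are hypothetical; US-ASCII is chosen only because it makes an unmappable character easy to produce), OutputStreamWriter behaves like this:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class WriterEncodeDemo {
    // Charset constructor: unmappable characters are silently replaced
    // with the encoder's replacement bytes ('?' for US-ASCII).
    static byte[] writeWithCharset(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, StandardCharsets.US_ASCII)) {
            w.write(text);
        }
        return out.toByteArray();
    }

    // CharsetEncoder constructor: a fresh encoder defaults to REPORT,
    // so an unmappable character is raised as a CharacterCodingException.
    static byte[] writeWithEncoder(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, StandardCharsets.US_ASCII.newEncoder())) {
            w.write(text);
        }
        return out.toByteArray();
    }
}
```

With the Charset variant the caller gets bytes back and has no indication that a character was lost on the way out.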
Re: RFR: JDK-8021560,(str) String constructors that take ByteBuffer
On Sat, Feb 17, 2018 at 4:05 AM, Alan Bateman wrote:
> Just to add that the existing low-level / advanced API for this is
> CharsetEncoder. The CoderResult from an encode and the buffer positions
> means you know when there is overflow, the number of characters encoded, and
> how many bytes were added to the buffer. It also gives fine control on how
> encoding errors should be handled and you cache a CharsetEncoder to avoid
> some of the performance anomalies that come up in the Charset vs. charset
> name discussions.
> This is not an API that most developers will ever use
> directly but if the use-case is advanced cases (libraries or frameworks
>

Really? How else are you supposed to convert bytes <-> characters reliably in Java without using CharsetEncoder/Decoder? It's the only way to get an exception when something is wrong, instead of silently masking errors with replacement characters (the JDK seems to think it's doing people a favor there; it's not).
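
For reference, a small sketch of what the quoted description amounts to in practice: one encode call into a fixed-size ByteBuffer, with the CoderResult and buffer positions telling the caller about overflow, characters consumed, and bytes produced. EncoderDemo and encodeInto are illustrative names, and the int[] return value is just a shortcut to keep the example short.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncoderDemo {
    // Encodes as much of `text` as fits into `capacity` bytes of UTF-8.
    // Returns {chars consumed, bytes written, 1 if the output overflowed else 0}.
    // Throws CharacterCodingException on malformed or unmappable input.
    static int[] encodeInto(String text, int capacity) throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        CharBuffer in = CharBuffer.wrap(text);
        ByteBuffer out = ByteBuffer.allocate(capacity);

        CoderResult result = encoder.encode(in, out, true);
        if (result.isError()) {
            result.throwException(); // surface the problem instead of masking it
        }
        return new int[] {in.position(), out.position(), result.isOverflow() ? 1 : 0};
    }

    public static void main(String[] args) throws CharacterCodingException {
        int[] r = encodeInto("h\u00E9llo", 3);
        System.out.println("chars=" + r[0] + " bytes=" + r[1] + " overflow=" + (r[2] == 1));
    }
}
```

A real caller would keep the encoder around and call reset() between uses rather than creating a new one per call.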
Re: JDK 9 Build 111 seems to miss some locale data, Lucene tests fail with Farsi and Thai language
On Sat, Mar 26, 2016 at 7:56 AM, Uwe Schindler wrote:
>
> (1) Thai's locale does not have working dictionary-based BreakIterator
> available. The following "check" in Lucene for this fails, because it cannot
> detect a boundary correctly:

Something sneakier is happening. Months ago, when the first jigsaw EA came out, I hit this same problem and, just like now, could not reproduce it standalone.

Here is what the Lucene test is doing: http://pastebin.com/5YUhjiAa

So Thai works standalone, but sometimes fails in our tests? Maybe it is something else, like a compiler issue, or it depends on other things the JVM has done before. I played with it this morning in various ways but cannot make a simple standalone test that fails!
Re: JDK 9 build 109 -> Lucene's Ant build works again; still missing Hotspot patches
On Thu, Mar 17, 2016 at 10:25 AM, Uwe Schindler wrote:
>
> My local tests showed that the MethodHandle-bug is solved, the other one is
> hopefully fixed, too. Robert may have a way to quickly reproduce.
>

JDK-8150280 is fixed too, I just tested it. Thanks!
Re: Suggested fix for JDK-4724038 (Add unmap method to MappedByteBuffer)
On Wed, Sep 9, 2015 at 11:46 AM, Peter Levart wrote:
>
> By wanting to truly release the resources you allocated, you are essentially
> wanting to manage the resources yourself. If you are willing to track the
> active mapped byte buffers manually yourself, then what about the following
> idea:
>

As Uwe mentioned, that is probably not truly necessary. If Lucene cannot delete a file, it retries periodically until the delete succeeds. So if things were unmapped "soonish", things would be fine for the Lucene case, I think. I do realize other apps may not have that infrastructure/luxury...
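
To illustrate the kind of infrastructure meant here, a rough sketch of a "retry the delete later" helper follows. This is not Lucene's actual code; PendingDeletes and its methods are hypothetical names, and the caller is assumed to invoke retryPending() periodically.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;

public class PendingDeletes {
    // Files whose deletion failed and should be retried later.
    private final Set<Path> pending = new LinkedHashSet<>();

    // Try to delete now; if that fails, remember the path for a later retry.
    public synchronized void delete(Path file) {
        if (!tryDelete(file)) {
            pending.add(file);
        }
    }

    // Retry every remembered path. Returns how many are still pending.
    public synchronized int retryPending() {
        for (Iterator<Path> it = pending.iterator(); it.hasNext();) {
            if (tryDelete(it.next())) {
                it.remove();
            }
        }
        return pending.size();
    }

    private static boolean tryDelete(Path file) {
        try {
            Files.deleteIfExists(file);
            return true;
        } catch (IOException e) {
            // For example, the file may still be mapped on Windows.
            return false;
        }
    }
}
```

With something like this in place, an unmap that happens "soonish" is enough: the file simply disappears on a later retry.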