Thanks for creating FLINK-10134 and adding your suggestions!

Best, Fabian
2018-08-13 23:55 GMT+02:00 David Dreyfus <dddrey...@gmail.com>:

> Hi Fabian,
>
> I've added FLINK-10134
> <https://issues.apache.org/jira/browse/FLINK-10134>. I'm not sure you'd
> consider it a blocker or that I've identified the right component.
> I'm afraid I don't have the bandwidth or knowledge to make the kind of
> pull request you really need. I do hope my suggestions prove a little
> useful.
>
> Thank you,
> David
>
> On Fri, Aug 10, 2018 at 5:41 AM Fabian Hueske <fhue...@gmail.com> wrote:
>
>> Hi David,
>>
>> Thanks for digging into the code! I had a quick look into the classes as
>> well.
>> As far as I can see, your analysis is correct and the BOM handling in
>> DelimitedInputFormat and TextInputFormat (and other text-based IFs such as
>> CsvInputFormat) is broken.
>> In fact, it's obvious that nobody has paid attention to this yet.
>>
>> It would be great if you could open a Jira issue and copy your analysis
>> and solution proposal into it.
>> While we're at it, we could also deprecate the (duplicated)
>> setCharsetName() method in TextInputFormat and redirect it to
>> DelimitedInputFormat.setCharset().
>>
>> Would you also be interested in contributing a fix for this problem?
>>
>> Best, Fabian
>>
>> [1] https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/io/TextInputFormat.java#L95
>>
>> 2018-08-09 14:55 GMT+02:00 David Dreyfus <dddrey...@gmail.com>:
>>
>>> Hi Fabian,
>>>
>>> Thank you for taking my email.
>>> TextInputFormat.setCharsetName("UTF-16") appears to set the private
>>> variable TextInputFormat.charsetName.
>>> It doesn't appear to cause additional behavior that would help interpret
>>> UTF-16 data.
>>>
>>> The method I've tested is calling DelimitedInputFormat.setCharset("UTF-16"),
>>> which then sets TextInputFormat.charsetName and then modifies the
>>> previously set delimiterString to construct the proper byte string encoding
>>> of the delimiter. This same charsetName is also used in
>>> TextInputFormat.readRecord() to interpret the bytes read from the file.
>>>
>>> There are two problems that this implementation would seem to have when
>>> using UTF-16.
>>>
>>> 1. delimiterString.getBytes(getCharset()) in DelimitedInputFormat.java
>>>    will return a Big Endian byte sequence including the Byte Order Mark
>>>    (BOM). The actual text file will not contain a BOM at each line
>>>    ending, so the delimiter will never be read. Moreover, if the actual
>>>    byte encoding of the file is Little Endian, the bytes will be
>>>    interpreted incorrectly.
>>> 2. TextInputFormat.readRecord() will not see a BOM each time it decodes
>>>    a byte sequence with the String(bytes, offset, numBytes, charset)
>>>    call. Therefore, it will assume Big Endian, which may not always be
>>>    correct.
>>>
>>> While there are likely many solutions, I would think that all of them
>>> would have to start by reading the BOM from the file when a Split is opened
>>> and then using that BOM to modify the specified encoding to a BOM-specific
>>> one when the caller doesn't specify one, and to overwrite the caller's
>>> specification if the BOM is in conflict with the caller's specification.
>>> That is, if the BOM indicates Little Endian and the caller indicates
>>> UTF-16BE, Flink should rewrite the charsetName as UTF-16LE.
>>>
>>> I hope this makes sense and that I haven't been testing incorrectly or
>>> misreading the code.
>>>
>>> Thank you,
>>> David
>>>
>>> On Thu, Aug 9, 2018 at 4:04 AM Fabian Hueske <fhue...@gmail.com> wrote:
>>>
>>>> Hi David,
>>>>
>>>> Did you try to set the encoding on the TextInputFormat with
>>>>
>>>> TextInputFormat tif = ...
>>>> tif.setCharsetName("UTF-16");
>>>>
>>>> Best, Fabian
>>>>
>>>> 2018-08-08 17:45 GMT+02:00 David Dreyfus <dddrey...@gmail.com>:
>>>>
>>>>> Hello -
>>>>>
>>>>> It does not appear that Flink supports a charset encoding of "UTF-16".
>>>>> In particular, it doesn't appear that Flink consumes the Byte Order Mark
>>>>> (BOM) to establish whether a UTF-16 file is UTF-16LE or UTF-16BE. Are
>>>>> there any plans to enhance Flink to handle UTF-16 with BOM?
>>>>>
>>>>> Thank you,
>>>>> David
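
For readers who want to reproduce the two problems discussed in this thread, here is a minimal plain-Java sketch. It does not use any Flink classes, and it is not the fix tracked in FLINK-10134. The class name Utf16BomSketch and the helper resolveUtf16() are hypothetical names made up for this illustration. The sketch first shows that encoding a delimiter with the generic "UTF-16" charset prepends a BOM, then shows one possible way to pick a BOM-free charset from the first bytes of a split, letting the BOM win over the caller's setting as proposed above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf16BomSketch {

    /**
     * Hypothetical helper (not part of Flink): inspects the first bytes of a
     * split and returns a BOM-free UTF-16 charset. A BOM in the data overrides
     * the requested charset, as suggested in the thread.
     */
    public static Charset resolveUtf16(byte[] head, Charset requested) {
        if (head.length >= 2) {
            int b0 = head[0] & 0xFF;
            int b1 = head[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                return StandardCharsets.UTF_16BE;
            }
            if (b0 == 0xFF && b1 == 0xFE) {
                return StandardCharsets.UTF_16LE;
            }
        }
        // No BOM found: keep an explicit LE/BE request; for plain "UTF-16",
        // fall back to Big Endian, which is what Java's decoder assumes.
        return requested.equals(StandardCharsets.UTF_16)
                ? StandardCharsets.UTF_16BE
                : requested;
    }

    public static void main(String[] args) {
        // Problem 1: the generic UTF-16 encoder emits a BOM before the delimiter.
        byte[] delimiterWithBom = "\n".getBytes(StandardCharsets.UTF_16);
        System.out.println(Arrays.toString(delimiterWithBom)); // [-2, -1, 0, 10]

        // Problem 2: without a BOM, the generic UTF-16 decoder assumes Big Endian,
        // so Little Endian bytes for "a" decode to a different character.
        byte[] littleEndianA = {0x61, 0x00};
        String misread = new String(littleEndianA, StandardCharsets.UTF_16);
        System.out.println(misread.equals("a")); // false

        // Sketch of the proposal: read the BOM once, then use a BOM-free charset
        // for both the delimiter bytes and the record decoding.
        byte[] file = {(byte) 0xFF, (byte) 0xFE, 0x61, 0x00, 0x0A, 0x00};
        Charset effective = resolveUtf16(file, StandardCharsets.UTF_16);
        byte[] delimiter = "\n".getBytes(effective);
        System.out.println(effective);                          // UTF-16LE
        System.out.println(Arrays.toString(delimiter));         // [10, 0]
        System.out.println(new String(file, 2, 2, effective));  // a
    }
}
```

In an actual input format the BOM would presumably be read once when the split at offset 0 is opened, and splits that start mid-file would need the charset resolved some other way; the sketch leaves that question open.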