Re: UTF-16 support for TextInputFormat

Fabian Hueske Tue, 21 Aug 2018 02:10:13 -0700

Thanks for creating FLINK-10134 and adding your suggestions!

Best, Fabian


2018-08-13 23:55 GMT+02:00 David Dreyfus <dddrey...@gmail.com>:

> Hi Fabian,
>
> I've added FLINK-10134
> <https://issues.apache.org/jira/browse/FLINK-10134>. I'm not sure you'd
> consider it a blocker or that I've identified the right component.
> I'm afraid I don't have the bandwidth or knowledge to make the kind of
> pull request you really need. I do hope my suggestions prove a little
> useful.
>
> Thank you,
> David
>
> On Fri, Aug 10, 2018 at 5:41 AM Fabian Hueske <fhue...@gmail.com> wrote:
>
>> Hi David,
>>
>> Thanks for digging into the code! I had a quick look into the classes as
>> well.
>> As far as I can see, your analysis is correct and the BOM handling in
>> DelimitedInputFormat and TextInputFormat (and other text-based IFs such as
>> CsvInputFormat) is broken.
>> In fact, it's obvious that nobody has paid attention to this yet.
>>
>> It would be great if you could open a Jira issue and copy your analysis
>> and solution proposal into it.
>> While we're at it, we could also deprecate the (duplicated)
>> setCharsetName() method in TextInputFormat and redirect it to
>> DelimitedInputFormat.setCharset().
>>
>> Would you also be interested in contributing a fix for this problem?
>>
>> Best, Fabian
>>
>> [1] https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/io/TextInputFormat.java#L95
>>
>> 2018-08-09 14:55 GMT+02:00 David Dreyfus <dddrey...@gmail.com>:
>>
>>> Hi Fabian,
>>>
>>> Thank you for taking my email.
>>> TextInputFormat.setCharsetName("UTF-16") appears to set the private
>>> variable TextInputFormat.charsetName.
>>> It doesn't appear to cause additional behavior that would help interpret
>>> UTF-16 data.
>>>
>>> The method I've tested is calling DelimitedInputFormat.setCharset("UTF-16"),
>>> which then sets TextInputFormat.charsetName and then modifies the
>>> previously set delimiterString to construct the proper byte string encoding
>>> of the delimiter. This same charsetName is also used in
>>> TextInputFormat.readRecord() to interpret the bytes read from the file.
>>>
>>> There are two problems that this implementation would seem to have when
>>> using UTF-16.
>>>
>>>    1. delimiterString.getBytes(getCharset()) in
>>>    DelimitedInputFormat.java will return a Big Endian byte sequence
>>>    including the Byte Order Mark (BOM). The actual text file will not
>>>    contain a BOM at each line ending, so the delimiter will never be
>>>    read. Moreover, if the actual byte encoding of the file is Little
>>>    Endian, the bytes will be interpreted incorrectly.
>>>    2. TextInputFormat.readRecord() will not see a BOM each time it
>>>    decodes a byte sequence with the String(bytes, offset, numBytes,
>>>    charset) call. Therefore, it will assume Big Endian, which may not
>>>    always be correct.
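
The two problems listed above can be reproduced with plain JDK calls, outside of Flink. The following is a minimal sketch, not Flink code: the class name Utf16BomDemo and the helper hex() are made up for illustration, and "\n" stands in for whatever delimiter is configured. It relies only on the documented behaviour of Java's generic "UTF-16" charset, which writes a Big Endian BOM when encoding and falls back to Big Endian when decoding input that has no BOM.

```java
import java.nio.charset.StandardCharsets;

public class Utf16BomDemo {

    /** Illustrative helper: formats bytes as space-separated hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Decodes bytes the same way the String(bytes, offset, numBytes, charset) call does. */
    static String decodeGenericUtf16(byte[] bytes) {
        return new String(bytes, 0, bytes.length, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        // Problem 1: the generic UTF-16 encoder prepends a Big Endian BOM,
        // so the delimiter bytes never match what is actually in the file.
        System.out.println(hex("\n".getBytes(StandardCharsets.UTF_16)));   // FE FF 00 0A
        System.out.println(hex("\n".getBytes(StandardCharsets.UTF_16LE))); // 0A 00

        // Problem 2: a record sliced out of a Little Endian file carries no BOM,
        // so the generic UTF-16 decoder assumes Big Endian and garbles it.
        byte[] record = "hi".getBytes(StandardCharsets.UTF_16LE);          // 68 00 69 00
        System.out.println(decodeGenericUtf16(record).equals("hi"));       // false
        System.out.println(new String(record, StandardCharsets.UTF_16LE)); // hi
    }
}
```

Encoding the delimiter with an explicit UTF-16LE or UTF-16BE charset avoids the BOM prefix, which is why any fix has to know the real byte order before the delimiter bytes are built.
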
>>>
>>> While there are likely many solutions, I would think that all of them
>>> would have to start by reading the BOM from the file when a Split is opened
>>> and then using that BOM to modify the specified encoding to a BOM specific
>>> one when the caller doesn't specify one, and to overwrite the caller's
>>> specification if the BOM is in conflict with the caller's specification.
>>> That is, if the BOM indicates Little Endian and the caller indicates
>>> UTF-16BE, Flink should rewrite the charsetName as UTF-16LE.
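
The charset-selection rule described above could look roughly like the sketch below. This is only an illustration of the decision logic under stated assumptions: BomCharsetResolver and resolve() are hypothetical names, not existing Flink classes or methods, and the sketch assumes the first bytes of the file are already available. How a split that does not start at offset 0 would obtain those bytes is left open here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomCharsetResolver {

    /**
     * Hypothetical helper: picks the charset to decode with, given the first
     * bytes of the file and the charset the caller configured.
     */
    static Charset resolve(byte[] head, Charset configured) {
        // Only the UTF-16 family is handled in this sketch.
        boolean utf16Family = configured.name().startsWith("UTF-16");
        if (!utf16Family || head.length < 2) {
            return configured;
        }
        int b0 = head[0] & 0xFF;
        int b1 = head[1] & 0xFF;
        // A BOM, when present, overrides the caller's specification.
        if (b0 == 0xFE && b1 == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (b0 == 0xFF && b1 == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        // No BOM: keep an explicit BE/LE choice; map generic UTF-16 to
        // Big Endian, which is what Java's decoder assumes anyway.
        return configured.equals(StandardCharsets.UTF_16)
                ? StandardCharsets.UTF_16BE
                : configured;
    }

    public static void main(String[] args) {
        // The file starts with a Little Endian BOM, but the caller asked for UTF-16BE.
        byte[] head = {(byte) 0xFF, (byte) 0xFE, 0x68, 0x00};
        System.out.println(resolve(head, StandardCharsets.UTF_16BE)); // UTF-16LE
    }
}
```

Returning an explicit BE or LE charset in every UTF-16 case means the delimiter can be encoded and the records decoded with the same BOM-free charset.
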
>>>
>>> I hope this makes sense and that I haven't been testing incorrectly or
>>> misreading the code.
>>>
>>> Thank you,
>>> David
>>>
>>> On Thu, Aug 9, 2018 at 4:04 AM Fabian Hueske <fhue...@gmail.com> wrote:
>>>
>>>> Hi David,
>>>>
>>>> Did you try to set the encoding on the TextInputFormat with
>>>>
>>>> TextInputFormat tif = ...
>>>> tif.setCharsetName("UTF-16");
>>>>
>>>> Best, Fabian
>>>>
>>>> 2018-08-08 17:45 GMT+02:00 David Dreyfus <dddrey...@gmail.com>:
>>>>
>>>>> Hello -
>>>>>
>>>>> It does not appear that Flink supports a charset encoding of "UTF-16".
>>>>> In particular, it doesn't appear that Flink consumes the Byte Order Mark
>>>>> (BOM) to establish whether a UTF-16 file is UTF-16LE or UTF-16BE. Are
>>>>> there any plans to enhance Flink to handle UTF-16 with BOM?
>>>>>
>>>>> Thank you,
>>>>> David
>>>>>
>>>>
>>>>
>>
