When multiLine is not set, we currently support only ASCII-compatible encodings, as far as I know, mainly because of how line separators are handled; this is what I investigated in the comment. When multiLine is set, it appears the encoding option is not considered at all. In the comment I actually meant that encoding does not work at all in this case, but I should have been clearer about that.
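[Editor's note: to make the line-separator constraint above concrete, here is a small plain-JVM sketch (not Spark code; the object name is illustrative) of why byte-level line splitting only works for ASCII-compatible encodings. Hadoop's line reader scans raw bytes for 0x0A (`'\n'`), which is only safe when the encoding leaves ASCII bytes unchanged.]

```scala
// Minimal sketch: byte-level newline scanning assumes an ASCII-compatible
// encoding. This is plain JVM code, not Spark internals.
object LineSeparatorDemo {
  val text = "a,b\nc,d"

  // UTF-8 is ASCII-compatible: '\n' is the single byte 0x0A.
  val utf8: Array[Byte] = text.getBytes("UTF-8")

  // UTF-16LE is not: every code unit takes two bytes, so '\n' is 0x0A 0x00
  // and a naive scan for the byte 0x0A splits the record in the wrong place.
  val utf16: Array[Byte] = text.getBytes("UTF-16LE")

  def main(args: Array[String]): Unit = {
    println(utf8.length)  // 7 bytes: one byte per character, '\n' is 0x0A
    println(utf16.length) // 14 bytes: two bytes per character
    // Splitting utf16 at the 0x0A byte leaves a stray 0x00 on the first
    // "line" and corrupts the first character of the next one.
  }
}
```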
I have been aware of this, but I personally think the encoding option was left incomplete precisely because of non-ASCII-compatible encodings, and extending it adds real complexity. For at least a year now I have been (personally) wondering whether we should keep extending this feature, or whether we should rather deprecate the option. The direction in your diff looks roughly correct, and I can't deny this is a valid issue and a fix for the current behavior. For now, the workaround is to write a custom Hadoop input format, read the file as a text dataset, and parse it with DataFrameReader.csv(csvDataset: Dataset[String]).

2017-08-17 19:42 GMT+09:00 Han-Cheol Cho <prian...@gmail.com>:

> Hi,
>
> Thank you for your response.
> I finally found the cause of this.
>
> When the multiLine option is set, the input file is read by the
> UnivocityParser.parseStream() method. This method, in turn, calls
> convertStream(), which initializes the tokenizer with
> tokenizer.beginParsing(inputStream) and parses records using
> tokenizer.parseNext().
>
> The problem is that the beginParsing() method uses UTF-8 as its default
> character encoding. As a result, the user-provided "encoding" option is
> ignored.
>
> When the multiLine option is NOT set, on the other hand, the input file
> is first read and decoded in the TextInputCSVDataSource.readFile() method
> and then sent to UnivocityParser.parseIterator(), so no problem occurs in
> this case.
>
> To solve this, I removed the tokenizer.beginParsing() call from
> convertStream(), since we cannot access the options.charset variable
> there, and added it in two places: the tokenizeStream() and parseStream()
> methods. In parseStream() in particular, I pass the charset as the second
> parameter of beginParsing().
>
> I attached the git diff content as an attachment file.
> I appreciate any comments on this.
>
> Best wishes,
> Han-Cheol
>
> On Wed, Aug 16, 2017 at 3:09 PM, Takeshi Yamamuro <linguin....@gmail.com>
> wrote:
>
>> Hi,
>>
>> Since the csv source currently supports only ASCII-compatible charsets,
>> I guess Shift-JIS also works well. You could check Hyukjin's comment in
>> https://issues.apache.org/jira/browse/SPARK-21289 for more info.
>>
>> On Wed, Aug 16, 2017 at 2:54 PM, Han-Cheol Cho <prian...@gmail.com>
>> wrote:
>>
>>> My apologies,
>>>
>>> It was a problem with our Hadoop cluster. When we tested the same code
>>> on another cluster (HDP-based), it worked without any problem.
>>>
>>> ```scala
>>> ## make sjis text
>>> cat a.txt
>>> 8月データだけでやってみよう
>>> nkf -W -s a.txt >b.txt
>>> cat b.txt
>>> 87n%G!<%?$@$1$G$d$C$F$_$h$&
>>> nkf -s -w b.txt
>>> 8月データだけでやってみよう
>>> hdfs dfs -put a.txt b.txt
>>>
>>> ## YARN mode test
>>> spark.read.option("encoding", "utf-8").csv("a.txt").show(1)
>>> +--------------+
>>> |           _c0|
>>> +--------------+
>>> |8月データだけでやってみよう|
>>> +--------------+
>>>
>>> spark.read.option("encoding", "sjis").csv("b.txt").show(1)
>>> +--------------+
>>> |           _c0|
>>> +--------------+
>>> |8月データだけでやってみよう|
>>> +--------------+
>>>
>>> spark.read.option("encoding", "utf-8").option("multiLine", true).csv("a.txt").show(1)
>>> +--------------+
>>> |           _c0|
>>> +--------------+
>>> |8月データだけでやってみよう|
>>> +--------------+
>>>
>>> spark.read.option("encoding", "sjis").option("multiLine", true).csv("b.txt").show(1)
>>> +--------------+
>>> |           _c0|
>>> +--------------+
>>> |8月データだけでやってみよう|
>>> +--------------+
>>> ```
>>>
>>> I am still digging into the root cause and will share it later :-)
>>>
>>> Best wishes,
>>> Han-Cheol
>>>
>>> On Wed, Aug 16, 2017 at 1:32 PM, Han-Cheol Cho <prian...@gmail.com>
>>> wrote:
>>>
>>>> Dear Spark ML members,
>>>>
>>>> I ran into trouble using the "multiLine" option to load CSV data
>>>> with Shift-JIS encoding.
>>>> When option("multiLine", true) is specified, option("encoding",
>>>> "encoding-name") just doesn't work anymore.
>>>>
>>>> In the CSVDataSource.scala file, I found that the
>>>> MultiLineCSVDataSource.readFile() method doesn't use
>>>> parser.options.charset at all:
>>>>
>>>> object MultiLineCSVDataSource extends CSVDataSource {
>>>>   override val isSplitable: Boolean = false
>>>>
>>>>   override def readFile(
>>>>       conf: Configuration,
>>>>       file: PartitionedFile,
>>>>       parser: UnivocityParser,
>>>>       schema: StructType): Iterator[InternalRow] = {
>>>>     UnivocityParser.parseStream(
>>>>       CodecStreams.createInputStreamWithCloseResource(conf, file.filePath),
>>>>       parser.options.headerFlag,
>>>>       parser,
>>>>       schema)
>>>>   }
>>>>   ...
>>>>
>>>> On the other hand, the TextInputCSVDataSource.readFile() method does
>>>> use it:
>>>>
>>>> override def readFile(
>>>>     conf: Configuration,
>>>>     file: PartitionedFile,
>>>>     parser: UnivocityParser,
>>>>     schema: StructType): Iterator[InternalRow] = {
>>>>   val lines = {
>>>>     val linesReader = new HadoopFileLinesReader(file, conf)
>>>>     Option(TaskContext.get()).foreach(_.addTaskCompletionListener(_ =>
>>>>       linesReader.close()))
>>>>     linesReader.map { line =>
>>>>       new String(line.getBytes, 0, line.getLength,
>>>>         parser.options.charset) // <---- the charset option is used here.
>>>>     }
>>>>   }
>>>>
>>>>   val shouldDropHeader = parser.options.headerFlag && file.start == 0
>>>>   UnivocityParser.parseIterator(lines, shouldDropHeader, parser, schema)
>>>> }
>>>>
>>>> It seems like a bug.
>>>> Has anyone run into the same problem before?
>>>>
>>>> Best wishes,
>>>> Han-Cheol
>>>>
>>>> --
>>>> ==================================
>>>> Han-Cheol Cho, Ph.D.
>>>> Data scientist, Data Science Team, Data Laboratory
>>>> NHN Techorus Corp.
>>>>
>>>> Homepage: https://sites.google.com/site/priancho/
>>>> ==================================
>>
>> --
>> ---
>> Takeshi Yamamuro
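[Editor's note: the symptom Han-Cheol describes can be reproduced with plain JVM string decoding, no Spark required. The sketch below is illustrative (the object name is made up): decoding Shift_JIS bytes with a hard-coded UTF-8 default garbles the text, while decoding with the charset from the "encoding" option, as TextInputCSVDataSource.readFile() does, recovers it.]

```scala
import java.nio.charset.Charset

// Illustrative sketch: the multiLine path effectively decodes with UTF-8
// regardless of the "encoding" option, while the non-multiLine path decodes
// with parser.options.charset. The observable difference looks like this:
object CharsetDemo {
  val original = "8月データだけでやってみよう" // the sample row from the thread
  val sjisBytes: Array[Byte] = original.getBytes("Shift_JIS")

  // Decoding with a hard-coded UTF-8 default: Shift_JIS bytes are not
  // valid UTF-8, so the text comes back garbled.
  val decodedWrong = new String(sjisBytes, Charset.forName("UTF-8"))

  // Decoding with the user's charset, as the non-multiLine path does:
  val decodedRight = new String(sjisBytes, Charset.forName("Shift_JIS"))

  def main(args: Array[String]): Unit = {
    println(decodedRight == original) // true: round-trips cleanly
    println(decodedWrong == original) // false: garbled by the wrong charset
  }
}
```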