This error message no longer appears now that I have upgraded to 1.6.0.

--
Cheers,
Todd Leo
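P.S. For anyone landing on this thread with the same symptom on 1.5.x, here is a minimal sketch of the setup that triggered the problem and of the temporary workaround (simply not setting the serializer). The programmatic form and the app name are illustrative only; the thread itself only used the equivalent --conf flag on spark-shell / spark-submit.

import org.apache.spark.{SparkConf, SparkContext}

// Setting the serializer explicitly reproduces the setup that showed the
// garbled DataFrame output on 1.5.x. Leaving the .set(...) line out falls
// back to the default JavaSerializer, which was the temporary workaround
// until the upgrade to 1.6.0.
val conf = new SparkConf()
  .setAppName("csv-read-check") // placeholder name, not from the thread
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

val sc = new SparkContext(conf)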
On Tue, Feb 9, 2016 at 9:07 AM SLiZn Liu <sliznmail...@gmail.com> wrote:

> At least it works for me, though; I temporarily disabled the Kryo
> serializer until I upgrade to 1.6.0. Appreciate your update. :)
>
> Luciano Resende <luckbr1...@gmail.com> wrote on Tue, Feb 9, 2016 at 02:37:
>
>> Sorry, same expected results with trunk and the Kryo serializer.
>>
>> On Mon, Feb 8, 2016 at 4:15 AM, SLiZn Liu <sliznmail...@gmail.com> wrote:
>>
>>> I’ve found the trigger of my issue: if I start my spark-shell or submit
>>> with spark-submit with --conf
>>> spark.serializer=org.apache.spark.serializer.KryoSerializer, the
>>> DataFrame content goes wrong, as I described earlier.
>>>
>>> On Mon, Feb 8, 2016 at 5:42 PM SLiZn Liu <sliznmail...@gmail.com> wrote:
>>>
>>>> Thanks Luciano, now it looks like I’m the only one who has this issue.
>>>> My options have narrowed down to upgrading my Spark to 1.6.0, to see
>>>> if this issue is gone.
>>>>
>>>> —
>>>> Cheers,
>>>> Todd Leo
>>>>
>>>> On Mon, Feb 8, 2016 at 2:12 PM Luciano Resende <luckbr1...@gmail.com> wrote:
>>>>
>>>>> I tried 1.5.0, 1.6.0 and 2.0.0 trunk with
>>>>> com.databricks:spark-csv_2.10:1.3.0 and got the expected results,
>>>>> where the columns are read properly.
>>>>>
>>>>> +----------+----------------------+
>>>>> |C0        |C1                    |
>>>>> +----------+----------------------+
>>>>> |1446566430| 2015-11-04<SP>00:00:30|
>>>>> |1446566430| 2015-11-04<SP>00:00:30|
>>>>> |1446566430| 2015-11-04<SP>00:00:30|
>>>>> |1446566430| 2015-11-04<SP>00:00:30|
>>>>> |1446566430| 2015-11-04<SP>00:00:30|
>>>>> |1446566431| 2015-11-04<SP>00:00:31|
>>>>> |1446566431| 2015-11-04<SP>00:00:31|
>>>>> |1446566431| 2015-11-04<SP>00:00:31|
>>>>> |1446566431| 2015-11-04<SP>00:00:31|
>>>>> |1446566431| 2015-11-04<SP>00:00:31|
>>>>> +----------+----------------------+
>>>>>
>>>>> On Sat, Feb 6, 2016 at 11:44 PM, SLiZn Liu <sliznmail...@gmail.com> wrote:
>>>>>
>>>>>> Hi Spark Users Group,
>>>>>>
>>>>>> I have a CSV file to analyze with Spark, but I’m having trouble
>>>>>> importing it as a DataFrame.
>>>>>>
>>>>>> Here’s a minimal reproducible example. Suppose I have a
>>>>>> 10(rows)x2(cols) space-delimited CSV file, shown below:
>>>>>>
>>>>>> 1446566430 2015-11-04<SP>00:00:30
>>>>>> 1446566430 2015-11-04<SP>00:00:30
>>>>>> 1446566430 2015-11-04<SP>00:00:30
>>>>>> 1446566430 2015-11-04<SP>00:00:30
>>>>>> 1446566430 2015-11-04<SP>00:00:30
>>>>>> 1446566431 2015-11-04<SP>00:00:31
>>>>>> 1446566431 2015-11-04<SP>00:00:31
>>>>>> 1446566431 2015-11-04<SP>00:00:31
>>>>>> 1446566431 2015-11-04<SP>00:00:31
>>>>>> 1446566431 2015-11-04<SP>00:00:31
>>>>>>
>>>>>> The <SP> in column 2 represents a sub-delimiter within that column,
>>>>>> and the file is stored on HDFS; let’s say the path is
>>>>>> hdfs:///tmp/1.csv.
>>>>>>
>>>>>> I’m using spark-csv to import this file as a Spark DataFrame:
>>>>>>
>>>>>> sqlContext.read.format("com.databricks.spark.csv")
>>>>>>   .option("header", "false")      // the file has no header line
>>>>>>   .option("inferSchema", "false") // do not infer data types
>>>>>>   .option("delimiter", " ")
>>>>>>   .load("hdfs:///tmp/1.csv")
>>>>>>   .show
>>>>>>
>>>>>> Oddly, the output shows only a part of each column:
>>>>>>
>>>>>> [image: Screenshot from 2016-02-07 15-27-51.png]
>>>>>>
>>>>>> and even the boundary of the table isn’t drawn correctly. I also
>>>>>> tried the other way of reading the CSV file, with
>>>>>> sc.textFile(...).map(_.split(" ")) and sqlContext.createDataFrame,
>>>>>> and the result is the same.
>>>>>>
>>>>>> Can someone point out where I went wrong?
>>>>>>
>>>>>> —
>>>>>> BR,
>>>>>> Todd Leo
>>>>>
>>>>> --
>>>>> Luciano Resende
>>>>> http://people.apache.org/~lresende
>>>>> http://twitter.com/lresende1975
>>>>> http://lresende.blogspot.com/
>>
>> --
>> Luciano Resende
>> http://people.apache.org/~lresende
>> http://twitter.com/lresende1975
>> http://lresende.blogspot.com/
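For completeness, the sc.textFile(...) route that the original message says was also tried could look roughly like the sketch below on the 1.5/1.6 API, assuming the sc and sqlContext provided by spark-shell. The column names C0/C1 mirror the spark-csv defaults shown above, and the limit-2 split is an assumption so that the sub-delimiter inside column 2 is not treated as a field separator.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Two string columns, matching what spark-csv produces when header and
// inferSchema are both disabled.
val schema = StructType(Seq(
  StructField("C0", StringType, nullable = true),
  StructField("C1", StringType, nullable = true)
))

// Split on the first space only, so the sub-delimiter inside column 2
// (shown as <SP> in the sample) stays part of the second field.
val rows = sc.textFile("hdfs:///tmp/1.csv").map { line =>
  val fields = line.split(" ", 2)
  Row(fields(0), fields(1))
}

val df = sqlContext.createDataFrame(rows, schema)
df.show(false) // truncate = false, to see the full column values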