Thanks Luciano, now it looks like I'm the only one who has this issue. My options have narrowed down to upgrading my Spark to 1.6.0, to see if the issue goes away.
—
Cheers,
Todd Leo

On Mon, Feb 8, 2016 at 2:12 PM Luciano Resende <luckbr1...@gmail.com> wrote:

> I tried 1.5.0, 1.6.0 and the 2.0.0 trunk with
> com.databricks:spark-csv_2.10:1.3.0 and got the expected results; the
> columns are read properly.
>
> +----------+----------------------+
> |C0        |C1                    |
> +----------+----------------------+
> |1446566430|2015-11-04<SP>00:00:30|
> |1446566430|2015-11-04<SP>00:00:30|
> |1446566430|2015-11-04<SP>00:00:30|
> |1446566430|2015-11-04<SP>00:00:30|
> |1446566430|2015-11-04<SP>00:00:30|
> |1446566431|2015-11-04<SP>00:00:31|
> |1446566431|2015-11-04<SP>00:00:31|
> |1446566431|2015-11-04<SP>00:00:31|
> |1446566431|2015-11-04<SP>00:00:31|
> |1446566431|2015-11-04<SP>00:00:31|
> +----------+----------------------+
>
> On Sat, Feb 6, 2016 at 11:44 PM, SLiZn Liu <sliznmail...@gmail.com> wrote:
>
>> Hi Spark Users Group,
>>
>> I have a CSV file to analyze with Spark, but I'm having trouble
>> importing it as a DataFrame.
>>
>> Here's a minimal reproducible example. Suppose I have a
>> *10 (rows) x 2 (cols)* *space-delimited CSV* file, shown below:
>>
>> 1446566430 2015-11-04<SP>00:00:30
>> 1446566430 2015-11-04<SP>00:00:30
>> 1446566430 2015-11-04<SP>00:00:30
>> 1446566430 2015-11-04<SP>00:00:30
>> 1446566430 2015-11-04<SP>00:00:30
>> 1446566431 2015-11-04<SP>00:00:31
>> 1446566431 2015-11-04<SP>00:00:31
>> 1446566431 2015-11-04<SP>00:00:31
>> 1446566431 2015-11-04<SP>00:00:31
>> 1446566431 2015-11-04<SP>00:00:31
>>
>> The <SP> in column 2 represents a sub-delimiter within that column. The
>> file is stored on HDFS; let's say the path is hdfs:///tmp/1.csv.
>>
>> I'm using *spark-csv* to import this file as a Spark *DataFrame*:
>>
>> sqlContext.read.format("com.databricks.spark.csv")
>>   .option("header", "false")      // no header row in the file
>>   .option("inferSchema", "false") // do not infer data types
>>   .option("delimiter", " ")       // columns are space-delimited
>>   .load("hdfs:///tmp/1.csv")
>>   .show
>>
>> Oddly, the output shows only a part of each column:
>>
>> [image: Screenshot from 2016-02-07 15-27-51.png]
>>
>> and even the boundary of the table isn't shown correctly. I also tried
>> reading the file the other way, via sc.textFile(...).map(_.split(" "))
>> and sqlContext.createDataFrame, and the result is the same. Can someone
>> point out where I went wrong?
>>
>> —
>> BR,
>> Todd Leo
>>
>
>
> --
> Luciano Resende
> http://people.apache.org/~lresende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/
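
For reference, below is a minimal sketch of the sc.textFile(...) + sqlContext.createDataFrame route mentioned in the question, written against the Spark 1.x shell. It assumes the sc and sqlContext values that spark-shell provides and reuses the example path hdfs:///tmp/1.csv; it only illustrates that approach and is not a confirmed fix for the truncated output.

// Sketch only: split each line on the first space so that column 2 keeps its
// internal <SP> sub-delimiter, then build the DataFrame from an explicit schema.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, StringType}

val schema = StructType(Seq(
  StructField("C0", StringType, nullable = false),
  StructField("C1", StringType, nullable = false)
))

val rows = sc.textFile("hdfs:///tmp/1.csv")
  .map(_.split(" ", 2))                  // at most two fields per line
  .filter(_.length == 2)                 // skip malformed lines
  .map(parts => Row(parts(0), parts(1)))

val df = sqlContext.createDataFrame(rows, schema)
df.show(10, false)                       // truncate = false prints full cell values

Calling show with truncate = false matters here: by default DataFrame.show cuts each cell to 20 characters, which can make correctly parsed columns look cut off even when the data was read properly.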