At least it works for me now: I’ve temporarily disabled the Kryo serializer until I upgrade to 1.6.0. Thanks for your update. :)
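In case it helps anyone else who hits this, here is roughly how I pinned the serializer back to Spark’s default (a minimal sketch, not the exact code I run; app name and the rest of the conf are placeholders, and in spark-shell the equivalent is simply launching without the Kryo --conf):

// Sketch: build a SparkContext that explicitly uses Spark's default
// JavaSerializer instead of Kryo (Spark 1.5.x-era API).
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf()
  .setAppName("csv-import-repro") // placeholder app name
  .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer")

val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)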
On Tue, Feb 9, 2016 at 02:37 Luciano Resende <luckbr1...@gmail.com> wrote:

> Sorry, same expected results with trunk and the Kryo serializer.
>
> On Mon, Feb 8, 2016 at 4:15 AM, SLiZn Liu <sliznmail...@gmail.com> wrote:
>
>> I’ve found the trigger of my issue: if I start spark-shell or submit a
>> job via spark-submit with --conf
>> spark.serializer=org.apache.spark.serializer.KryoSerializer, the
>> DataFrame content goes wrong, as I described earlier.
>>
>> On Mon, Feb 8, 2016 at 5:42 PM SLiZn Liu <sliznmail...@gmail.com> wrote:
>>
>>> Thanks Luciano. It looks like I’m the only one who has this issue, so
>>> my options have narrowed down to upgrading my Spark to 1.6.0 to see if
>>> the issue goes away.
>>>
>>> —
>>> Cheers,
>>> Todd Leo
>>>
>>> On Mon, Feb 8, 2016 at 2:12 PM Luciano Resende <luckbr1...@gmail.com>
>>> wrote:
>>>
>>>> I tried 1.5.0, 1.6.0 and the 2.0.0 trunk with
>>>> com.databricks:spark-csv_2.10:1.3.0 and got the expected results: the
>>>> columns are read properly.
>>>>
>>>> +----------+----------------------+
>>>> |C0        |C1                    |
>>>> +----------+----------------------+
>>>> |1446566430|2015-11-04<SP>00:00:30|
>>>> |1446566430|2015-11-04<SP>00:00:30|
>>>> |1446566430|2015-11-04<SP>00:00:30|
>>>> |1446566430|2015-11-04<SP>00:00:30|
>>>> |1446566430|2015-11-04<SP>00:00:30|
>>>> |1446566431|2015-11-04<SP>00:00:31|
>>>> |1446566431|2015-11-04<SP>00:00:31|
>>>> |1446566431|2015-11-04<SP>00:00:31|
>>>> |1446566431|2015-11-04<SP>00:00:31|
>>>> |1446566431|2015-11-04<SP>00:00:31|
>>>> +----------+----------------------+
>>>>
>>>> On Sat, Feb 6, 2016 at 11:44 PM, SLiZn Liu <sliznmail...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Spark Users Group,
>>>>>
>>>>> I have a CSV file to analyze with Spark, but I’m having trouble
>>>>> importing it as a DataFrame.
>>>>>
>>>>> Here’s a minimal reproducible example. Suppose I have a 10(rows) x
>>>>> 2(cols) space-delimited CSV file, shown below:
>>>>>
>>>>> 1446566430 2015-11-04<SP>00:00:30
>>>>> 1446566430 2015-11-04<SP>00:00:30
>>>>> 1446566430 2015-11-04<SP>00:00:30
>>>>> 1446566430 2015-11-04<SP>00:00:30
>>>>> 1446566430 2015-11-04<SP>00:00:30
>>>>> 1446566431 2015-11-04<SP>00:00:31
>>>>> 1446566431 2015-11-04<SP>00:00:31
>>>>> 1446566431 2015-11-04<SP>00:00:31
>>>>> 1446566431 2015-11-04<SP>00:00:31
>>>>> 1446566431 2015-11-04<SP>00:00:31
>>>>>
>>>>> The <SP> in column 2 represents a sub-delimiter within that column,
>>>>> and the file is stored on HDFS; let’s say the path is
>>>>> hdfs:///tmp/1.csv.
>>>>>
>>>>> I’m using spark-csv to import this file as a Spark DataFrame:
>>>>>
>>>>> sqlContext.read.format("com.databricks.spark.csv")
>>>>>   .option("header", "false")      // no header line in the file
>>>>>   .option("inferSchema", "false") // keep all columns as strings
>>>>>   .option("delimiter", " ")
>>>>>   .load("hdfs:///tmp/1.csv")
>>>>>   .show
>>>>>
>>>>> Oddly, the output shows only a part of each column:
>>>>>
>>>>> [image: Screenshot from 2016-02-07 15-27-51.png]
>>>>>
>>>>> and even the borders of the table aren’t drawn correctly. I also
>>>>> tried the other way to read a CSV file, via
>>>>> sc.textFile(...).map(_.split(" ")) and sqlContext.createDataFrame,
>>>>> and the result is the same. Can someone point out where I went wrong?
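>>>>> For completeness, that alternative read path looks roughly like this
>>>>> (a sketch; I’m assuming the spark-shell bindings for sc and
>>>>> sqlContext, the same HDFS path, both columns kept as plain strings,
>>>>> and splitting only on the first space):
>>>>>
>>>>> import org.apache.spark.sql.Row
>>>>> import org.apache.spark.sql.types.{StringType, StructField, StructType}
>>>>>
>>>>> // Two string columns, named to match the spark-csv defaults.
>>>>> val schema = StructType(Seq(
>>>>>   StructField("C0", StringType, nullable = true),
>>>>>   StructField("C1", StringType, nullable = true)))
>>>>>
>>>>> val rows = sc.textFile("hdfs:///tmp/1.csv")
>>>>>   .map(_.split(" ", 2))        // split on the first space only
>>>>>   .map(a => Row(a(0), a(1)))
>>>>>
>>>>> sqlContext.createDataFrame(rows, schema).show()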
>>>>> —
>>>>> BR,
>>>>> Todd Leo
>>>>
>>>>
>>>> --
>>>> Luciano Resende
>>>> http://people.apache.org/~lresende
>>>> http://twitter.com/lresende1975
>>>> http://lresende.blogspot.com/
>
>
> --
> Luciano Resende
> http://people.apache.org/~lresende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/