In the past I've handled this by filtering out the header line (sketched below), but it seems to me it would be useful to have a way of dealing with files that preserves sequence, so that e.g. you could just do mySequentialRDD.drop(1) to get rid of the header. There are other use cases like this that currently have to be solved outside of Spark, or by writing a custom InputFormat to do the reading, which could perhaps be simplified along these lines.
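For reference, the filtering workaround looks roughly like this — a minimal sketch, assuming the CSV is read with sc.textFile, its first line is the header, and no data row is byte-identical to it (the file name is just a placeholder):

    from pyspark import SparkContext

    sc = SparkContext("local", "drop-header")
    lines = sc.textFile("data.csv")

    # Grab the header, then filter it out of the full dataset.
    header = lines.first()
    rows = lines.filter(lambda line: line != header)

A sequence-aware alternative, closer in spirit to the hypothetical mySequentialRDD.drop(1), is to skip one record in the first partition only, since the header of a single text file always lands in partition 0 (the method is named mapPartitionsWithSplit in some older releases):

    def drop_first(idx, it):
        # Skip one record, but only in partition 0.
        if idx == 0:
            next(it, None)
        return it

    rows = lines.mapPartitionsWithIndex(drop_first)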
Thanks,
Bryn

On Wed, Feb 26, 2014 at 9:28 AM, Chengi Liu <chengi.liu...@gmail.com> wrote:
> Hi,
>   How do we deal with headers in a csv file?
> For example:
> id, counts
> 1,2
> 1,5
> 2,20
> 2,25
> ... and so on
>
> And I want to do a frequency count of counts for each id. So the result will be:
>
> 1,7
> 2,45
>
> and so on..
> My code:
> counts = data.map(lambda x: (x[0], int(x[1]))).reduceByKey(lambda a, b: a + b)
>
> But I see this error:
> ValueError: invalid literal for int() with base 10: 'counts'
>
>   at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131)
>   at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153)
>   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:694)
>   at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:679)
>
> I guess because of the header...
>
> Q1) How do I exclude the header from this?
> Q2) Rather than using pyspark, how do I run Python programs on Spark?
>
> Thanks
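For what it's worth, here is a sketch of the quoted example with the header excluded before the int() conversion — the file name and the comma splitting are assumptions on my part, and the aggregation is the per-id sum of counts, matching the expected output:

    from pyspark import SparkContext

    sc = SparkContext("local", "counts-by-id")
    lines = sc.textFile("data.csv")

    # Q1: drop the header line before parsing.
    header = lines.first()
    data = lines.filter(lambda line: line != header) \
                .map(lambda line: line.split(","))

    counts = data.map(lambda x: (x[0], int(x[1]))) \
                 .reduceByKey(lambda a, b: a + b)

    print(counts.collect())  # e.g. [('1', 7), ('2', 45)]

As for Q2: a standalone script that creates its own SparkContext, like the one above, can be run with bin/pyspark your_script.py on the 0.9.x line (later releases use bin/spark-submit).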