In the past I've handled this by filtering out the header line (sketched below), but it seems to me it would be useful to have a way of dealing with files that preserves sequence, so that e.g. you could just do mySequentialRDD.drop(1) to get rid of the header. There are other use cases like this that currently have to be solved outside of Spark, or by writing a custom InputFormat to do the reading, which could perhaps be simplified along these lines.
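For reference, the filtering workaround looks roughly like this — a minimal sketch, assuming the CSV is read with sc.textFile, its first line is the header, and no data row is byte-identical to it (the file name is just a placeholder):

    from pyspark import SparkContext

    sc = SparkContext("local", "drop-header")
    lines = sc.textFile("data.csv")

    # Grab the header, then filter it out of the full dataset.
    header = lines.first()
    rows = lines.filter(lambda line: line != header)

A sequence-aware alternative, closer in spirit to the hypothetical mySequentialRDD.drop(1), is to skip one record in the first partition only, since the header of a single text file always lands in partition 0 (the method is named mapPartitionsWithSplit in some older releases):

    def drop_first(idx, it):
        # Skip one record, but only in partition 0.
        if idx == 0:
            next(it, None)
        return it

    rows = lines.mapPartitionsWithIndex(drop_first)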
Thanks,
Bryn

On Wed, Feb 26, 2014 at 9:28 AM, Chengi Liu <chengi.liu...@gmail.com> wrote:
> Hi,
>   How do we deal with headers in a csv file?
> For example:
> id, counts
> 1,2
> 1,5
> 2,20
> 2,25
> ... and so on
>
> And I want to do a frequency count of counts for each id. So the result will be:
>
> 1,7
> 2,45
>
> and so on..
> My code:
> counts = data.map(lambda x: (x[0], int(x[1]))).reduceByKey(lambda a, b: a + b)
>
> But I see this error:
> ValueError: invalid literal for int() with base 10: 'counts'
>
>   at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131)
>   at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153)
>   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
>   at org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:694)
>   at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:679)
>
> I guess because of the header...
>
> Q1) How do I exclude the header from this?
> Q2) Rather than using pyspark, how do I run Python programs on Spark?
>
> Thanks
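For what it's worth, here is a sketch of the quoted example with the header excluded before the int() conversion — the file name and the comma splitting are assumptions on my part, and the aggregation is the per-id sum of counts, matching the expected output:

    from pyspark import SparkContext

    sc = SparkContext("local", "counts-by-id")
    lines = sc.textFile("data.csv")

    # Q1: drop the header line before parsing.
    header = lines.first()
    data = lines.filter(lambda line: line != header) \
                .map(lambda line: line.split(","))

    counts = data.map(lambda x: (x[0], int(x[1]))) \
                 .reduceByKey(lambda a, b: a + b)

    print(counts.collect())  # e.g. [('1', 7), ('2', 45)]

As for Q2: a standalone script that creates its own SparkContext, like the one above, can be run with bin/pyspark your_script.py on the 0.9.x line (later releases use bin/spark-submit).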