Re: Dealing with headers in csv file pyspark

2014-02-26 Thread Bryn Keller
In the past I've handled this by filtering out the header line, but it seems to me that it would be useful to have a way of dealing with files that preserves sequence, so that e.g. you could just do mySequentialRDD.drop(1) to get rid of the header. There are other use cases like this that currently …
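
RDDs have no drop() today; a minimal sketch of the same effect with mapPartitionsWithIndex follows (the file path and names are illustrative assumptions, not from the thread):

    from pyspark import SparkContext

    sc = SparkContext(appName="skip-header")
    raw = sc.textFile("data.csv")

    def skip_first_line(part_idx, lines):
        # For a single text file, partition 0 holds the start of the file,
        # so dropping its first element removes the header line.
        it = iter(lines)
        if part_idx == 0:
            next(it, None)
        return it

    no_header = raw.mapPartitionsWithIndex(skip_first_line)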

Re: Dealing with headers in csv file pyspark

2014-02-26 Thread Ewen Cheslack-Postava
You must be parsing each line of the file at some point anyway, so adding a step to filter out the header should work fine. It'll get executed at the same time as your parsing/conversion to ints, so there's no significant overhead aside from the check itself. For standalone programs, there's a …
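
A minimal sketch of that filter fused with the parse step, assuming the header is the literal first line of the file (variable names are illustrative):

    header = raw.first()  # e.g. "id, counts"
    ints = (raw.filter(lambda line: line != header)  # note: also drops any data line identical to the header
               .map(lambda line: line.split(","))
               .map(lambda f: (f[0], int(f[1]))))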

Re: Dealing with headers in csv file pyspark

2014-02-26 Thread Chengi Liu
I am not sure... the suggestion is to open a TB file and remove a line? That doesn't sound that good. I am hacking my way around it by using a filter. Can I put a try/except clause in my lambda function? Maybe I should just try that out. But thanks for the suggestion. Also, can I run scripts against Spark …
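
For what it's worth, a Python lambda cannot contain a try/except (statements aren't allowed in lambdas), so the usual workaround is a named parse function combined with flatMap (a sketch; names are illustrative):

    def parse(line):
        try:
            fields = line.split(",")
            return [(fields[0], int(fields[1]))]
        except (ValueError, IndexError):
            return []  # header or malformed line: emit nothing

    counts = raw.flatMap(parse).reduceByKey(lambda a, b: a + b)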

Re: Dealing with headers in csv file pyspark

2014-02-26 Thread Mayur Rustagi
A bad solution is to run a mapper through the data and null out the header's counts; a good solution is to trim the header beforehand, without Spark. On Feb 26, 2014 9:28 AM, "Chengi Liu" wrote: …
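
One way to trim the header ahead of time without Spark, sketched in plain Python (file names are illustrative; a shell tool like tail -n +2 does the same in one pass):

    with open("data.csv") as src, open("data_noheader.csv", "w") as dst:
        next(src)  # skip the header line
        dst.writelines(src)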

Dealing with headers in csv file pyspark

2014-02-26 Thread Chengi Liu
Hi, how do we deal with headers in a csv file? For example:

    id, counts
    1,2
    1,5
    2,20
    2,25
    ... and so on

And I want to do a frequency count of counts for each id, so the result will be:

    1,7
    2,45
    ... and so on.

My code:

    counts = data.map(lambda x: (x[0], int(x[1]))).reduceByKey(lambda a, b: a + b)

But …
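
For reference, a runnable sketch of the whole job with the header filtered out first (the path and names are illustrative assumptions):

    from pyspark import SparkContext

    sc = SparkContext(appName="freq-count")
    raw = sc.textFile("data.csv")
    header = raw.first()
    counts = (raw.filter(lambda line: line != header)
                 .map(lambda line: line.split(","))
                 .map(lambda x: (x[0], int(x[1])))
                 .reduceByKey(lambda a, b: a + b))
    print(counts.collect())  # e.g. [('1', 7), ('2', 45)]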