Assuming the CSV is well-formed (every row has the same number of columns)
and every column is a number, this is how you can do it. You can adjust so
that you pick just the columns you want, of course, by mapping each row to
a new Array that contains just the column values you want. Just be sure the
logic selects the same columns for every row or your stats might look funny.
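import org.apache.spark.rdd.RDD
import org.apache.spark.util.StatCounter

val rdd: RDD[Array[Double]] = ???
// Build one StatCounter per column in each partition, then merge them pairwise.
val stats: Array[StatCounter] = rdd.mapPartitions { vs =>
  val rows = vs.toArray  // note: this holds the whole partition in memory
  // Skip empty partitions so reduce never sees a zero-length array.
  if (rows.isEmpty) Iterator.empty
  else Iterator(rows.transpose.map(StatCounter(_)))
}.reduce((as, bs) => as.zip(bs).map { case (a, b) => a.merge(b) })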
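
For the column selection mentioned above, something along these lines might work as a starting point. It is only a sketch: selectColumns is a made-up helper name (not a Spark API), the column indices and file name are placeholders, and it assumes every selected field parses as a double.

```scala
import java.util.regex.Pattern

// Hypothetical helper: parse one delimited line and keep only the columns
// listed in `cols` (0-based), converting each selected field to Double.
def selectColumns(line: String, cols: Seq[Int], sep: String = ";"): Array[Double] = {
  // Quote the separator so it is not treated as a regex; limit = -1 keeps
  // trailing empty fields so column indices stay aligned across rows.
  val fields = line.split(Pattern.quote(sep), -1)
  cols.map(i => fields(i).trim.toDouble).toArray
}

// Assumed usage with Spark (path and indices are placeholders):
//   val rdd: RDD[Array[Double]] =
//     sc.textFile("test.csv").map(selectColumns(_, Seq(2, 4)))

// Local check without Spark: pick columns 2 and 4 from a mixed row.
val row = selectColumns("a;b;1.5;c;2.0", Seq(2, 4))
```

Passing the same cols to every row is what keeps the per-column stats lined up.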
On Mon, Aug 25, 2014 at 9:50 AM, Hingorani, Vineet <[email protected]> wrote:
> Hello Victor,
>
>
>
> I want to do it on multiple columns. I was able to do it on one column with
> the help of Sean, using the code below.
>
>
>
> val matData = file.map(_.split(";"))
> val stats = matData.map(_(2).toDouble).stats()
> stats.mean
> stats.max
>
>
>
> Thank you
>
>
>
> Vineet
>
>
>
> *From:* Victor Tso-Guillen [mailto:[email protected]]
> *Sent:* Montag, 25. August 2014 18:34
> *To:* Hingorani, Vineet
> *Cc:* [email protected]
> *Subject:* Re: Manipulating columns in CSV file or Transpose of
> Array[Array[String]] RDD
>
>
>
> Do you want to do this on one column or all numeric columns?
>
>
>
> On Mon, Aug 25, 2014 at 7:09 AM, Hingorani, Vineet <
> [email protected]> wrote:
>
> Hello all,
>
> Could someone help me with the manipulation of CSV file data? I have
> semicolon-separated CSV data including doubles and strings. I want to
> calculate the maximum/average of a column. When I read the file using
> sc.textFile("test.csv").map(_.split(";")), each field is read as a string.
> Could someone help me with the above manipulation and how to do that?
>
> Or maybe there is some way to take the transpose of the data and then
> manipulate the rows in some way?
>
> Thank you in advance. I have been struggling with this for quite some time.
>
> Regards,
> Vineet
>
>
>