Hi all,

I am trying to compute a histogram for every column of a CSV file using
Spark with Scala.
I found that DoubleRDDFunctions supports histogram.
So I wrote the following to get the histogram of every column.

1. Get the column count
2. Create an RDD[Double] for each column and calculate the histogram of each RDD
using DoubleRDDFunctions

      // One histogram (bucket boundaries, counts) per column index, 6 buckets each
      val columnIndices = 0 until rdd.first().length
      val histogramData = columnIndices.map(column => {
         rdd.map(line => line(column)).histogram(6)
         })
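
For reference, the two steps above can be written as a self-contained program. This is only a sketch under some assumptions that are not in my original code: the file name data.csv, the object name ColumnHistograms, and the helper name histograms are made up for illustration; the CSV is assumed to have no header row and only numeric, comma-separated values; and Spark 1.3 or later is assumed so that the implicit conversion to DoubleRDDFunctions is picked up without extra imports. Since histogram(bucketCount) has to scan the data for the minimum and maximum before it counts, every column costs more than one pass, so the sketch caches the parsed rows first.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ColumnHistograms {
  // Returns one (bucket boundaries, bucket counts) pair per column.
  // Every column runs its own Spark jobs, so the input is cached up front.
  def histograms(rows: RDD[Array[Double]], buckets: Int): Seq[(Array[Double], Array[Long])] = {
    rows.cache()
    val columnCount = rows.first().length // step 1: column count
    (0 until columnCount).map { col =>
      // step 2: project one column to RDD[Double] and bucket it evenly
      rows.map(row => row(col)).histogram(buckets)
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("column-histograms").setMaster("local[*]")
    val sc = new SparkContext(conf)
    // Hypothetical input path; assumes no header and purely numeric fields.
    val rows = sc.textFile("data.csv").map(_.split(",").map(_.toDouble))
    histograms(rows, 6).zipWithIndex.foreach { case ((bounds, counts), i) =>
      println(s"column $i: bounds=${bounds.mkString(",")} counts=${counts.mkString(",")}")
    }
    sc.stop()
  }
}
```

The number of Spark jobs still grows with the number of columns, which is the part I am unsure about for wide files.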

Is this a good approach?
Can anyone suggest a better way to tackle this?


Thanks in advance.
