How to get Histogram of all columns in a large CSV / RDD[Array[double]] ?

DEVAN M.S. Tue, 20 Oct 2015 22:09:08 -0700

Hi all,


I am trying to calculate a histogram of every column of a CSV file using
Spark with Scala.
I found that DoubleRDDFunctions supports histograms,
so I wrote the following to get a histogram for each column.

1. Get the column count.
2. Create an RDD[Double] for each column and calculate the histogram of each
RDD using DoubleRDDFunctions.

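      // Column indices 0 until n, taken from the width of the first row
      val columnIndexArray = Array.range(0, rdd.first().length)
      // One (bucket boundaries, counts) pair per column, 6 evenly spaced buckets
      val histogramData = columnIndexArray.map { column =>
         rdd.map(line => line(column)).histogram(6)
      }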
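
For reference, the two steps above can be sketched as a small self-contained program. This is only an illustration under some assumptions: the object name ColumnHistograms, the helper columnHistograms, the local[*] master and the four sample rows are all made up for the example, and the cache() call is an addition, on the assumption that the input is scanned once or more per column and is worth keeping in memory.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical wrapper object, used only for this sketch.
object ColumnHistograms {

  // Returns one (bucket boundaries, counts) pair per column.
  // Assumes every row has the same number of columns as the first row.
  def columnHistograms(rdd: RDD[Array[Double]],
                       buckets: Int): Array[(Array[Double], Array[Long])] = {
    val numColumns = rdd.first().length
    (0 until numColumns).map { column =>
      // RDD[Double] picks up histogram() from DoubleRDDFunctions implicitly
      rdd.map(row => row(column)).histogram(buckets)
    }.toArray
  }

  def main(args: Array[String]): Unit = {
    // Local master and sample data are assumptions for illustration only.
    val conf = new SparkConf().setAppName("column-histograms").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val rows = sc.parallelize(Seq(
        Array(1.0, 10.0),
        Array(2.0, 20.0),
        Array(3.0, 30.0),
        Array(4.0, 40.0)
      )).cache() // assumption: the data is reused for every column, so keep it cached

      columnHistograms(rows, 2).zipWithIndex.foreach { case ((bounds, counts), i) =>
        println(s"column $i: bounds=${bounds.mkString(",")} counts=${counts.mkString(",")}")
      }
    } finally {
      sc.stop()
    }
  }
}
```

The helper still launches separate Spark jobs for every column, so it is the same approach as above, just packaged so it can be run and timed on a small sample.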

Is this a good approach?
Can anyone suggest a better way to tackle this?


Thanks in advance.
