I am writing a data-profiling application that needs to iterate over a large
.gz file (loaded as a Dataset<Row>). For each column I build a HashMap whose
keys are the distinct values in that column and whose values are how often
each value occurs; all of the per-column maps are added to a JSON document
at the end, roughly as sketched below.
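
The final assembly looks roughly like this (a minimal sketch; Jackson's
ObjectMapper and the buildFrequencyMap helper are illustrative assumptions,
not the actual code):

// Sketch only. Assumes com.fasterxml.jackson.databind.ObjectMapper is on
// the classpath (writeValueAsString throws a checked JsonProcessingException)
// and that buildFrequencyMap(df, column) returns the per-column HashMap
// built with the logic shown further down.
Map<String, Map<String, String>> profile = new HashMap<>();
for (String columnName : df.columns()) {
    profile.put(columnName, buildFrequencyMap(df, columnName));
}
String json = new ObjectMapper().writeValueAsString(profile);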

For now, I am using the following logic to generate the hashmap for a
column:

Dataset<Row> freq = df
        .groupBy(columnName)
        .count();

HashMap<String, String> myHashMap = new HashMap<>();

// Walk the grouped counts on the driver and store each value's share of the
// total as a percentage (numOfRows holds the total row count).
Iterator<Row> rowIterator = freq.toLocalIterator();
while (rowIterator.hasNext()) {
    Row currRow = rowIterator.next();
    // Row.toString() yields "[value,count]"; strip the brackets and split.
    String rowString = currRow.toString();
    String[] contents = rowString.substring(1, rowString.length() - 1).split(",");
    Double percent = Long.valueOf(contents[1]) * 100.0 / numOfRows;
    myHashMap.put(contents[0], Double.toString(percent));
}

I have also tried converting the result to an RDD and using collectAsMap()
(roughly as sketched below), but both approaches take a very long time
(about 5 minutes per column, where each column has approx. 30 million rows).
Is there a more efficient way to achieve the same result?
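
The RDD variant was roughly along these lines (sketch only; it assumes the
grouped result has the column value at position 0 and the count at position
1, and reuses the numOfRows total from above):

// Needs scala.Tuple2 and the JavaPairRDD API.
Map<String, String> viaRdd = freq.toJavaRDD()
        .mapToPair(row -> new Tuple2<>(
                String.valueOf(row.get(0)),
                Double.toString(row.getLong(1) * 100.0 / numOfRows)))
        .collectAsMap();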



