I am writing a data-profiling application that needs to iterate over a large .gz file (imported as a Dataset&lt;Row&gt;). Each hashmap maps a column value to the number of times it occurs in that column. There is one hashmap per column, and they are all combined into a JSON document at the end.
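For context, the final "combine into a JSON" step looks roughly like the sketch below. The class and method names (ProfileJson, toJson) are placeholders of my own, and the hand-rolled serializer is only for illustration (it does not escape quotes or other special characters); the real application would presumably use a JSON library such as Jackson or Gson.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ProfileJson {

    // Serialize the per-column frequency maps into one JSON string:
    // { "columnName": { "value": "percent", ... }, ... }
    // NOTE: illustration only -- keys and values are not escaped.
    public static String toJson(Map<String, Map<String, String>> profile) {
        StringBuilder sb = new StringBuilder("{");
        boolean firstCol = true;
        for (Map.Entry<String, Map<String, String>> col : profile.entrySet()) {
            if (!firstCol) sb.append(",");
            firstCol = false;
            sb.append("\"").append(col.getKey()).append("\":{");
            boolean firstVal = true;
            for (Map.Entry<String, String> e : col.getValue().entrySet()) {
                if (!firstVal) sb.append(",");
                firstVal = false;
                sb.append("\"").append(e.getKey()).append("\":\"")
                  .append(e.getValue()).append("\"");
            }
            sb.append("}");
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        // Hypothetical column profile: value -> percentage (as strings)
        Map<String, String> country = new LinkedHashMap<>();
        country.put("US", "60.0");
        country.put("CA", "40.0");
        Map<String, Map<String, String>> profile = new LinkedHashMap<>();
        profile.put("country", country);
        System.out.println(toJson(profile));
    }
}
```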
For now, I am using the following logic to build the hashmap for a column:

    Dataset<Row> freq = df.groupBy(columnName).count();

    HashMap<String, String> myHashMap = new HashMap<>();
    Iterator<Row> rowIterator = freq.toLocalIterator();
    while (rowIterator.hasNext()) {
        Row currRow = rowIterator.next();
        // Read the fields directly instead of parsing currRow.toString(),
        // which breaks when a value contains a comma
        String value = currRow.isNullAt(0) ? "null" : currRow.get(0).toString();
        long count = currRow.getLong(1);
        double percent = count * 100.0 / numOfRows;
        myHashMap.put(value, Double.toString(percent));
    }

I have also tried converting to an RDD and using the collectAsMap() function, but both approaches are taking a very long time (about 5 minutes per column, where each column has approx. 30 million rows). Is there a more efficient way to achieve the same result?