Hi,
As a first attempt, I would try to cache "freq", to make sure the dataset is
not re-loaded at each iteration later on.
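Something along these lines, just as a sketch (reusing "df", "columnName" and
"freq" from your snippet):

    // needs: import org.apache.spark.storage.StorageLevel;
    Dataset<Row> freq = df
            .groupBy(columnName)
            .count()
            .persist(StorageLevel.MEMORY_AND_DISK());

    // ... iterate over freq as you do now ...

    freq.unpersist();  // release the cached data once this column is done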

Btw, what's the original data format you are importing from?

I also suspect that an appropriate case class rather than Row would help,
instead of converting each Row to a String and parsing it "manually".
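Even without a dedicated class, in Java you can read the two fields of the
grouped row directly (column 0 is the grouped value, column 1 is the count).
A minimal sketch, reusing "freq", "myHashMap" and "numOfRows" from your snippet:

    Iterator<Row> rowIterator = freq.toLocalIterator();
    while (rowIterator.hasNext()) {
        Row currRow = rowIterator.next();
        // get(0) is the original column value, getLong(1) the count from count()
        String value = currRow.isNullAt(0) ? null : currRow.get(0).toString();
        long count = currRow.getLong(1);
        double percent = count * 100.0 / numOfRows;
        myHashMap.put(value, Double.toString(percent));
    }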

Hth,
Alessandro

On Fri, 28 Sep 2018 at 01:48, rishmanisation <rish.anant...@gmail.com>
wrote:

> I am writing a data-profiling application that needs to iterate over a
> large
> .gz file (imported as a Dataset<Row>). Each key-value pair in the hashmap
> will be the row value and the number of times it occurs in the column.
> There
> is one hashmap for each column, and they are all added to a JSON at the
> end.
>
> For now, I am using the following logic to generate the hashmap for a
> column:
>
> Dataset<Row> freq = df
>         .groupBy(columnName)
>         .count();
>
> HashMap<String, String> myHashMap = new HashMap<>();
>
> Iterator<Row> rowIterator = freq.toLocalIterator();
> while (rowIterator.hasNext()) {
>     Row currRow = rowIterator.next();
>     String rowString = currRow.toString();
>     String[] contents = rowString.substring(1, rowString.length() - 1).split(",");
>     Double percent = Long.valueOf(contents[1]) * 100.0 / numOfRows;
>     myHashMap.put(contents[0], Double.toString(percent));
> }
>
> I have also tried converting to RDD and using the collectAsMap() function,
> but both of these are taking a very long time (about 5 minutes per column,
> where each column has approx. 30 million rows). Is there a more efficient
> way to achieve the same?
>
