@Enrico Minack <enrico-min...@gmx.de> Thanks for "unpivot", but I am using
version 3.3.0 (you are taking it way too far, as usual :) )

@Sean Owen <sro...@gmail.com> Please then show me how it can be improved
with code.

Also, why doesn't an approach like this (using withColumn()) work:

for (String columnName : df.columns()) {
    df = df.withColumn(columnName, df.select(columnName).distinct().col(columnName));
}
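A likely reason the withColumn() loop above fails: withColumn() expects a
Column that resolves against the same Dataset, but df.select(columnName).distinct()
is a separate Dataset with a different row count, so the expression cannot be
resolved back into df (Spark rejects it with an AnalysisException). A minimal
sketch that stays within a single Dataset, assuming the df from the thread and
reusing the collect_set aggregate that Enrico's answer relies on:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.collect_set;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Each iteration runs one aggregation over df and yields a single-row
// Dataset whose only column is the array of distinct values of that
// column; nothing is pulled to the driver unless collected explicitly.
for (String columnName : df.columns()) {
    Dataset<Row> distinctValues =
        df.agg(collect_set(col(columnName)).as(columnName + "_distinct"));
    distinctValues.show(false);
}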
On Sat, Feb 11, 2023 at 13:11, Enrico Minack <i...@enrico.minack.dev> wrote:

> You could do the entire thing in DataFrame world and write the result
> to disk. All you need is unpivot (to be released in Spark 3.4.0, soon).
>
> Note this is Scala but should be straightforward to translate into Java:
>
> import org.apache.spark.sql.functions.collect_set
>
> val df = Seq((1, 10, 123), (2, 20, 124), (3, 20, 123), (4, 10, 123)).toDF("a", "b", "c")
>
> df.unpivot(Array.empty, "column", "value")
>   .groupBy("column")
>   .agg(collect_set("value").as("distinct_values"))
>
> The unpivot operation turns
>
> +---+---+---+
> |  a|  b|  c|
> +---+---+---+
> |  1| 10|123|
> |  2| 20|124|
> |  3| 20|123|
> |  4| 10|123|
> +---+---+---+
>
> into
>
> +------+-----+
> |column|value|
> +------+-----+
> |     a|    1|
> |     b|   10|
> |     c|  123|
> |     a|    2|
> |     b|   20|
> |     c|  124|
> |     a|    3|
> |     b|   20|
> |     c|  123|
> |     a|    4|
> |     b|   10|
> |     c|  123|
> +------+-----+
>
> The groupBy("column").agg(collect_set("value").as("distinct_values"))
> collects the distinct values per column:
>
> +------+---------------+
> |column|distinct_values|
> +------+---------------+
> |     c|     [123, 124]|
> |     b|       [20, 10]|
> |     a|   [1, 2, 3, 4]|
> +------+---------------+
>
> Note that unpivot only works if all columns have a "common" type; all
> columns are then cast to that common type. If you have incompatible
> types like Integer and String, you would have to cast them all to
> String first:
>
> import org.apache.spark.sql.functions.col
> import org.apache.spark.sql.types.StringType
>
> df.select(df.columns.map(col(_).cast(StringType)): _*).unpivot(...)
>
> If you want to preserve the type of the values and have multiple value
> types, you cannot put everything into a DataFrame with one
> distinct_values column. You could still have multiple DataFrames, one
> per data type, and write those, or collect the DataFrame's values into
> Maps:
>
> import scala.collection.immutable
>
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.functions.{col, collect_set}
>
> // if all your columns have the same type
> def distinctValuesPerColumnOneType(df: DataFrame): immutable.Map[String, immutable.Seq[Any]] = {
>   df.unpivot(Array.empty, "column", "value")
>     .groupBy("column")
>     .agg(collect_set("value").as("distinct_values"))
>     .collect()
>     .map(row => row.getString(0) -> row.getSeq[Any](1).toList)
>     .toMap
> }
>
> // if your columns have different types
> def distinctValuesPerColumn(df: DataFrame): immutable.Map[String, immutable.Seq[Any]] = {
>   df.schema.fields
>     .groupBy(_.dataType)
>     .mapValues(_.map(_.name))
>     .par
>     .map { case (dataType, columns) => df.select(columns.map(col): _*) }
>     .map(distinctValuesPerColumnOneType)
>     .flatten
>     .toList
>     .toMap
> }
>
> val df = Seq((1, 10, "one"), (2, 20, "two"), (3, 20, "one"), (4, 10, "one")).toDF("a", "b", "c")
> distinctValuesPerColumn(df)
>
> The result is (list values keep their original type):
>
> Map(b -> List(20, 10), a -> List(1, 2, 3, 4), c -> List(one, two))
>
> Hope this helps,
> Enrico
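Enrico notes the Scala above translates straightforwardly to Java. A minimal
sketch of that translation for Spark 3.4.0 (assuming the Java side of Dataset
exposes the same unpivot signature; df is the Dataset<Row> from the question):

import static org.apache.spark.sql.functions.collect_set;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// No id columns: every column is melted into (column, value) rows, then
// the distinct values of each original column are gathered into one
// array per column.
Dataset<Row> distinctPerColumn = df
        .unpivot(new Column[0], "column", "value")
        .groupBy("column")
        .agg(collect_set("value").as("distinct_values"));

distinctPerColumn.show(false);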
> On 10.02.23 at 22:56, sam smith wrote:
>
> Hi Apostolos,
> Can you suggest a better approach while keeping values within a
> dataframe?
>
> On Fri, Feb 10, 2023 at 22:47, Apostolos N. Papadopoulos
> <papad...@csd.auth.gr> wrote:
>
>> Dear Sam,
>>
>> you are assuming that the data fits in the memory of your local
>> machine. You are using as a basis a dataframe, which potentially can
>> be very large, and then you are storing the data in local lists. Keep
>> in mind that the number of distinct elements in a column may be very
>> large (depending on the app). I suggest working on a solution that
>> assumes that the number of distinct values is also large. Thus, you
>> should keep your data in dataframes or RDDs, and store them as CSV
>> files, Parquet, etc.
>>
>> a.p.
>>
>> On 10/2/23 23:40, sam smith wrote:
>>
>> I want to get the distinct values of each column in a List (is it
>> good practice to use List here?) that contains the column name as its
>> first element and the column's distinct values as the rest, so that
>> for a dataset we get a list of lists. I do it this way (in my opinion
>> not so fast):
>>
>> List<List<String>> finalList = new ArrayList<List<String>>();
>> Dataset<Row> df = spark.read().format("csv").option("header", "true").load("/pathToCSV");
>> String[] columnNames = df.columns();
>> for (int i = 0; i < columnNames.length; i++) {
>>     List<String> columnList = new ArrayList<String>();
>>     columnList.add(columnNames[i]);
>>     List<Row> columnValues = df
>>         .filter(org.apache.spark.sql.functions.col(columnNames[i]).isNotNull())
>>         .select(columnNames[i]).distinct().collectAsList();
>>     for (int j = 0; j < columnValues.size(); j++)
>>         columnList.add(columnValues.get(j).apply(0).toString());
>>     finalList.add(columnList);
>> }
>>
>> How can I improve this?
>>
>> Also, can I get the results in JSON format?
>>
>> --
>> Apostolos N. Papadopoulos, Associate Professor
>> Department of Informatics
>> Aristotle University of Thessaloniki
>> Thessaloniki, GREECE
>> tel: ++0030312310991918
>> email: papad...@csd.auth.gr
>> twitter: @papadopoulos_ap
>> web: http://datalab.csd.auth.gr/~apostol
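Since sam is on Spark 3.3.0 and unpivot only lands in 3.4.0, a possible
substitute is the stack() SQL expression, which 3.3 already supports and which
melts columns the same way. A sketch of that swap (values are cast to string
because stack() needs one common type across its value expressions; the column
names a, b, c come from Enrico's example):

import static org.apache.spark.sql.functions.collect_set;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// stack(n, label1, value1, ..., labelN, valueN) emits one (column, value)
// row per listed pair, emulating unpivot on Spark 3.3.
Dataset<Row> longFormat = df.selectExpr(
        "stack(3, 'a', cast(a as string), 'b', cast(b as string), 'c', cast(c as string)) as (column, value)");

Dataset<Row> distinctPerColumn = longFormat
        .groupBy("column")
        .agg(collect_set("value").as("distinct_values"));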
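On sam's JSON question: following Apostolos's advice to keep results in
DataFrames and write them to disk, the aggregated frame from either sketch
above can be written as JSON directly (the output path here is a placeholder):

// One JSON object per line, e.g. {"column":"a","distinct_values":["1","2","3","4"]}
distinctPerColumn.write().mode("overwrite").json("/pathToOutput");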