Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-13 Thread sam smith
Alright, this is the working Java version of it: List<Column> listCols = new ArrayList<>(); Arrays.asList(dataset.columns()).forEach(column -> { listCols.add(org.apache.spark.sql.functions.collect_set(column)); }); Column[] arrCols = listCols.toArray(new Column[listCols.size()]); dataset =
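A minimal, self-contained sketch of this approach; the final select()/first() step is an assumption, since the snippet above is truncated:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.spark.sql.Column;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import static org.apache.spark.sql.functions.collect_set;

    public final class DistinctPerColumnAgg {
        /** One collect_set aggregate per column, evaluated in a single Spark job. */
        public static Row distinctValues(Dataset<Row> dataset) {
            List<Column> listCols = new ArrayList<>();
            for (String column : dataset.columns()) {
                listCols.add(collect_set(column).as(column));
            }
            Column[] arrCols = listCols.toArray(new Column[0]);
            // select() with only aggregate expressions performs a global aggregation,
            // so no explicit groupBy() is needed; the single result row holds an
            // array of distinct values for each column.
            return dataset.select(arrCols).first();
        }
    }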

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-12 Thread Enrico Minack
@Sean: This aggregate function does work without an explicit groupBy(): ./spark-3.3.1-bin-hadoop2/bin/spark-shell Spark context Web UI available at http://*:4040 Spark context available as 'sc' (master = local[*], app id = local-1676237726079). Spark session available as 'spark'.

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-12 Thread Sean Owen
It doesn't work because it's an aggregate function. You have to groupBy() (group by nothing) to make that work, but you can't assign that as a column. Folks, those approaches don't make sense semantically in SQL or Spark or anything. They just mean: use threads to collect() distinct values for each
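A small runnable sketch of the distinction Sean describes; the column names and sample data here are made up for illustration:

    import java.util.Arrays;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.StructType;
    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.collect_set;

    public final class AggregateVsWithColumn {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .master("local[*]").appName("distinct-per-column").getOrCreate();
            Dataset<Row> df = spark.createDataFrame(
                    Arrays.asList(
                            RowFactory.create(1, "x"),
                            RowFactory.create(1, "y"),
                            RowFactory.create(2, "x")),
                    new StructType().add("a", "int").add("b", "string"));

            // Not valid: withColumn() expects a per-row expression, so an aggregate
            // such as collect_set() fails there with an AnalysisException:
            //   df = df.withColumn("a", collect_set(col("a")));

            // Valid: evaluate the aggregates as a (global) aggregation instead,
            // then bring the small result back to the driver.
            Row row = df.groupBy().agg(
                    collect_set(col("a")).as("a"),
                    collect_set(col("b")).as("b")).first();
            System.out.println(row);  // two array fields: the distinct values of a and of b

            spark.stop();
        }
    }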

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-12 Thread sam smith
OK, what do you mean by "do your outer for loop in parallel"? Btw this didn't work: for (String columnName : df.columns()) { df = df.withColumn(columnName, collect_set(col(columnName)).as(columnName)); } On Sun, Feb 12, 2023 at 20:36, Enrico Minack wrote: > That is unfortunate, but

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-12 Thread Enrico Minack
That is unfortunate, but 3.4.0 is around the corner, really! Well, then, based on your code, I'd suggest two improvements: cache your dataframe after reading, so that you don't read the entire file for each column, and do your outer for loop in parallel, so that you have N parallel Spark jobs
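A sketch of those two suggestions, assuming the dataframe has already been read; the class and method names are made up:

    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    public final class ParallelDistinct {
        /** Distinct values per column, with one concurrently submitted Spark job per column. */
        public static Map<String, List<Row>> distinctPerColumn(Dataset<Row> df) {
            df.cache();  // read and parse the input once, not once per column
            return Arrays.stream(df.columns())
                    .parallel()  // Spark's scheduler is thread-safe, so jobs can be submitted concurrently
                    .collect(Collectors.toMap(
                            name -> name,
                            name -> df.select(name).distinct().collectAsList()));
        }
    }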

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-12 Thread Mich Talebzadeh
Hi Sam, I am curious to know the business use case for this solution, if any? HTH. View my LinkedIn profile: https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use it at your own risk. Any and all responsibility for any

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-12 Thread sam smith
@Sean Correct. But I was hoping to improve my solution even more. On Sun, Feb 12, 2023 at 18:03, Sean Owen wrote: > That's the answer, except you can never select a result set into a column, right? You just collect() each of those results. Or, what do you want? I'm not clear.

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-12 Thread Sean Owen
That's the answer, except you can never select a result set into a column, right? You just collect() each of those results. Or, what do you want? I'm not clear. On Sun, Feb 12, 2023 at 10:59 AM sam smith wrote: > @Enrico Minack Thanks for "unpivot" but I am using version 3.3.0 (you are

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-12 Thread sam smith
@Enrico Minack Thanks for "unpivot", but I am using version 3.3.0 (you are taking it way too far, as usual :) ). @Sean Owen Please then show me how it can be improved in code. Also, why doesn't such an approach (using withColumn()) work: for (String columnName : df.columns()) { df =

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-11 Thread Enrico Minack
You could do the entire thing in DataFrame world and write the result to disk. All you need is unpivot (to be released in Spark 3.4.0, soon). Note this is Scala but should be straightforward to translate into Java: import org.apache.spark.sql.functions.collect_set val df = Seq((1, 10, 123),
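For completeness, a rough Java translation of the unpivot idea, assuming Spark 3.4.0's Dataset.unpivot() and casting every column to string first so the value column has a single type:

    import java.util.Arrays;

    import org.apache.spark.sql.Column;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.collect_set;

    public final class UnpivotDistinct {
        /** Distinct values per column, computed entirely as a DataFrame (Spark 3.4.0+). */
        public static Dataset<Row> distinctPerColumn(Dataset<Row> df) {
            // unpivot() needs a common type across the value columns, so cast everything to string
            Column[] asStrings = Arrays.stream(df.columns())
                    .map(name -> col(name).cast("string").as(name))
                    .toArray(Column[]::new);
            return df.select(asStrings)
                    .unpivot(new Column[0], "column", "value")  // melt into (column, value) pairs
                    .groupBy("column")
                    .agg(collect_set("value").as("distinct_values"));
            // The result (one row per column, with an array of its distinct values)
            // can then be written to disk instead of collected to the driver.
        }
    }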

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-10 Thread Sean Owen
Why would csv or a temp table change anything here? You don't need windowing for distinct values either. On Fri, Feb 10, 2023, 6:01 PM Mich Talebzadeh wrote: > Off the top of my head, create a dataframe reading a CSV file. This is python: listing_df =

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-10 Thread Mich Talebzadeh
Off the top of my head, create a dataframe reading a CSV file. This is Python: listing_df = spark.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load(csv_file) listing_df.printSchema() listing_df.createOrReplaceTempView("temp") ## do your distinct
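A minimal Java equivalent of that temp-view variant, with placeholder file and view names; as Sean notes above, it does the same per-column work as querying the DataFrame directly:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public final class TempViewDistinct {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .master("local[*]").appName("csv-distinct").getOrCreate();
            Dataset<Row> listingDf = spark.read()
                    .option("inferSchema", "true")
                    .option("header", "true")
                    .csv("listings.csv");  // placeholder path
            listingDf.printSchema();
            listingDf.createOrReplaceTempView("temp");

            // One SELECT DISTINCT query per column against the temp view
            for (String name : listingDf.columns()) {
                spark.sql("SELECT DISTINCT `" + name + "` FROM temp").show();
            }
            spark.stop();
        }
    }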

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-10 Thread sam smith
I am not sure I understand "just need to do the cols one at a time". Plus, I think Apostolos is right: this needs a dataframe approach, not a list approach. On Fri, Feb 10, 2023 at 22:47, Sean Owen wrote: > For each column, select only that col and get distinct values. Similar to what

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-10 Thread sam smith
Hi Apostolos, can you suggest a better approach while keeping values within a dataframe? On Fri, Feb 10, 2023 at 22:47, Apostolos N. Papadopoulos <papad...@csd.auth.gr> wrote: > Dear Sam, you are assuming that the data fits in the memory of your local machine. You are using as a basis a

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-10 Thread sam smith
Hi Sean, "You need to select the distinct values of each col one at a time": how? On Fri, Feb 10, 2023 at 22:40, Sean Owen wrote: > That gives you all distinct tuples of those col values. You need to select the distinct values of each col one at a time. Sure, just collect() the result

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-10 Thread Apostolos N. Papadopoulos
Dear Sam, you are assuming that the data fits in the memory of your local machine. You are using as a basis a dataframe, which can potentially be very large, and then you are storing the data in local lists. Keep in mind that the number of distinct elements in a column may be very large
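One way to act on that warning is to keep the distinct values in DataFrames and write them out rather than collecting them into driver-side lists; a sketch, with a placeholder output path:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;

    public final class DistinctToDisk {
        /** Write the distinct values of each column to disk instead of collecting them. */
        public static void writeDistinctPerColumn(Dataset<Row> df, String outputDir) {
            df.cache();  // avoid re-reading the source for every column
            for (String name : df.columns()) {
                df.select(name).distinct()
                        .write()
                        .mode(SaveMode.Overwrite)
                        .csv(outputDir + "/" + name);  // one output directory per column
            }
        }
    }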

Re: How to improve efficiency of this piece of code (returning distinct column values)

2023-02-10 Thread Sean Owen
That gives you all distinct tuples of those col values. You need to select the distinct values of each col one at a time. Sure, just collect() the result as you do here. On Fri, Feb 10, 2023, 3:34 PM sam smith wrote: > I want to get the distinct values of each column in a List (is it good

How to improve efficiency of this piece of code (returning distinct column values)

2023-02-10 Thread sam smith
I want to get the distinct values of each column in a List (is it good practice to use List here?) that contains as its first element the column name, and as the other elements its distinct values, so that for a dataset we get a list of lists. I do it this way (in my opinion not so fast): List<List<String>> finalList
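For reference, a sketch of that list-of-lists approach (one distinct job per column, with the column name as the first element), which the replies above discuss how to improve:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    public final class DistinctAsLists {
        /** For each column: the column name first, followed by its distinct values. */
        public static List<List<String>> distinctPerColumn(Dataset<Row> df) {
            List<List<String>> finalList = new ArrayList<>();
            for (String name : df.columns()) {
                List<String> values = new ArrayList<>();
                values.add(name);  // column name as the first element
                for (Row row : df.select(name).distinct().collectAsList()) {
                    values.add(String.valueOf(row.get(0)));
                }
                finalList.add(values);
            }
            return finalList;
        }
    }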