Use Spark Aggregator in PySpark

2023-04-23 Thread Thomas Wang
Hi Spark Community, I have implemented a custom Spark Aggregator (a subclass to org.apache.spark.sql.expressions.Aggregator). Now I'm trying to use it in a PySpark application, but for some reason, I'm not able to trigger the function. Here is what I'm doing, could someone help me take a look?

Re: Spark Aggregator with ARRAY input and ARRAY output

2023-04-23 Thread Thomas Wang
should work. > -- > Raghavendra > > > On Sun, Apr 23, 2023 at 9:20 PM Thomas Wang wrote: > >> Hi Spark Community, >> >> I'm trying to implement a custom Spark Aggregator (a subclass to >> org.apache.spark.sql.expressions.Aggregator). Correct me if I'm

Spark Aggregator with ARRAY input and ARRAY output

2023-04-23 Thread Thomas Wang
Hi Spark Community, I'm trying to implement a custom Spark Aggregator (a subclass to org.apache.spark.sql.expressions.Aggregator). Correct me if I'm wrong, but I'm assuming I will be able to use it as an aggregation function like SUM. What I'm trying to do is that I have a column of ARRAY and I

eqNullSafe breaks Sorted Merge Bucket Join?

2023-03-09 Thread Thomas Wang
Hi, I have two tables t1 and t2. Both are bucketed and sorted on user_id into 32 buckets. When I use a regular equal join, Spark triggers the expected Sorted Merge Bucket Join. Please see my code and the physical plan below. from pyspark.sql import SparkSession def