Hi, this is on a Spark 3.1 environment.

For some reason, I can ONLY do this in Spark SQL, not in the Scala or PySpark
environment.

I want to aggregate an array into a Map of element counts (how many times each
element appears in that array), in Spark SQL.
I know that there is an aggregate function available like

aggregate(expr, start, merge [, finish])

But I want to know if this can be done in Spark SQL only, and:

  *   how to represent an empty Map as the "start" element above (see the short sketch right after this list)
  *   how to merge each element (a String) into the Map: increment its count if the element already exists in the Map, or add (element -> 1) as a new entry if it does not
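
For the first point, the closest thing I have found to a typed empty Map literal is casting an empty map() (I have not verified on 3.1 that this cast is accepted):

cast(map() as map<string, int>)   -- assumed: a typed empty map to use as the "start" accumulator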

Like the following example from
https://docs.databricks.com/sql/language-manual/functions/aggregate.html:

SELECT aggregate(array(1, 2, 3, 4),
                 named_struct('sum', 0, 'cnt', 0),
                 (acc, x) -> named_struct('sum', acc.sum + x, 'cnt', acc.cnt + 1),
                 acc -> acc.sum / acc.cnt) AS avg

I wonder about something like:

select
      aggregate(
      array('a', 'b', 'a'),
      map('', 0),
      (acc, x) -> ???,
      acc -> acc) as output

How do I write the logic after "(acc, x) ->" so that the output is a map of the
count of each element in the array?
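
To make the question concrete, this is the direction I am imagining. It is untested; I am assuming element_at returns NULL for a missing key, and that filtering the key out of the accumulator before map_concat avoids the duplicate-key error:

select
  aggregate(
    array('a', 'b', 'a'),
    cast(map() as map<string, int>),               -- typed empty map as "start" (assumed to work)
    (acc, x) -> map_concat(
      map_filter(acc, (k, v) -> k != x),           -- drop any existing entry for x
      map(x, coalesce(element_at(acc, x), 0) + 1)  -- re-add x with its count incremented
    ),
    acc -> acc) as output                          -- hoping for {'a' -> 2, 'b' -> 1}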
I know I could "explode" and then groupBy + count (example below), but since I
have multiple array columns to transform, I would rather do it with a
higher-order function, in pure Spark SQL.
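
For comparison, the per-column approach I am trying to avoid looks roughly like this (my_table, id and arr_col are placeholder names):

select id, element, count(*) as cnt
from my_table
lateral view explode(arr_col) t as element
group by id, element
-- ...and this would have to be repeated (and re-collected into maps) for every array column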

Thanks
