a numPartitions column. What other options can I explore?
--gautham
-Original Message-
From: ZHANG Wei
Sent: Thursday, May 7, 2020 1:34 AM
To: Gautham Acharya
Cc: user@spark.apache.org
Subject: Re: PyArrow Exception in Pandas UDF GROUPEDAGG()
AFAICT, there might be data skew: some partitions may have received too
many rows, which exceeded the memory limit. Trying .groupBy().count()
or .aggregateByKey().count() may help check the data size of each partition.
If there is no data skew, increasing the .groupBy() parameter
`numPartitions` is worth a try.
--
Hi everyone,
I'm running a job that uses a Pandas UDF to GROUP BY a large matrix.
The GROUP BY runs on a wide dataset: the first column contains the string
labels that are grouped on, and the remaining columns hold numeric values
that are aggregated inside the Pandas UDF. The