RE: PyArrow Exception in Pandas UDF GROUPEDAGG()

2020-05-07 Thread Gautham Acharya
a numPartitions column. What other options can I explore? --gautham
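For reference (not from the thread itself): DataFrame groupBy() does not take a numPartitions argument the way the RDD API does; the rough DataFrame-side equivalents are the shuffle-partition config and an explicit repartition(). A minimal sketch, with hypothetical values and column names:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # On the RDD API, groupBy()/aggregateByKey() accept numPartitions directly.
    # For DataFrames, the number of partitions produced by a groupBy shuffle
    # comes from this config instead (default 200).
    spark.conf.set("spark.sql.shuffle.partitions", 2000)

    # An explicit repartition before the aggregation is another option, though
    # rows sharing one key still end up in a single group, so a very hot key
    # stays large no matter how many partitions there are.
    # df = df.repartition(2000, "label")   # "label" is a hypothetical column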

Re: PyArrow Exception in Pandas UDF GROUPEDAGG()

2020-05-07 Thread ZHANG Wei
AFAICT, there might be data skew: some partitions got too many rows, which caused the out-of-memory failure. Trying .groupBy().count() or .aggregateByKey().count() may help check each partition's data size. If there is no data skew, increasing the .groupBy() parameter `numPartitions` is worth a try. --
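A minimal sketch of the skew check suggested above, assuming a DataFrame with a hypothetical string grouping column "label"; the per-key and per-partition counts show whether one group is much larger than the rest:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical stand-in for the wide dataset discussed in the thread:
    # one string label column plus numeric columns.
    df = spark.createDataFrame(
        [("a", 1.0), ("a", 2.0), ("a", 3.0), ("b", 4.0)],
        ["label", "x1"],
    )

    # Per-key row counts: a single key with far more rows than the others
    # means that whole group must fit into one Arrow batch / pandas Series,
    # which is where PyArrow / out-of-memory errors tend to surface.
    df.groupBy("label").count().orderBy(F.desc("count")).show()

    # Row counts per physical partition, for the same check at partition level.
    print(df.rdd.glom().map(len).collect())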

PyArrow Exception in Pandas UDF GROUPEDAGG()

2020-05-05 Thread Gautham Acharya
Hi everyone, I'm running a job that runs a Pandas UDF to GROUP BY a large matrix. The GROUP BY function runs on a wide dataset. The first column of the dataset contains string labels that are GROUPed on. The remaining columns are numeric values that are aggregated in the Pandas UDF. The
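A minimal sketch of the kind of setup being described, assuming Spark 2.4/3.0-era PySpark; the data, column names, and the mean aggregation are placeholders rather than the actual job:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = SparkSession.builder.getOrCreate()

    # Placeholder for the wide dataset: one string label column that is
    # grouped on, followed by numeric columns that the UDF aggregates.
    df = spark.createDataFrame(
        [("a", 1.0, 2.0), ("a", 3.0, 4.0), ("b", 5.0, 6.0)],
        ["label", "x1", "x2"],
    )

    @pandas_udf("double", PandasUDFType.GROUPED_AGG)
    def mean_udf(v):
        # Receives all values of one column for one group as a pandas Series
        # and returns a single scalar, so the whole group must fit in memory.
        return v.mean()

    df.groupBy("label").agg(
        mean_udf(df["x1"]).alias("x1_mean"),
        mean_udf(df["x2"]).alias("x2_mean"),
    ).show()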