[ https://issues.apache.org/jira/browse/SPARK-38833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Enrico Minack updated SPARK-38833:
----------------------------------
    Summary: PySpark applyInPandas should allow to return empty DataFrame without columns  (was: PySpark allows applyInPandas return empty DataFrame without columns)

> PySpark applyInPandas should allow to return empty DataFrame without columns
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-38833
>                 URL: https://issues.apache.org/jira/browse/SPARK-38833
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, SQL
>    Affects Versions: 3.4.0
>            Reporter: Enrico Minack
>            Priority: Major
>
> Currently, returning an empty Pandas DataFrame from {{applyInPandas}} raises
> an error:
> {noformat}
> RuntimeError: Number of columns of the returned pandas.DataFrame doesn't
> match specified schema. Expected: 2 Actual: 0
> {noformat}
> Here is an example:
> {code}
> import pandas as pd
> from pyspark.sql.functions import pandas_udf, ceil
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))
> def mean_func(key, pdf):
>     if key == (1,):
>         return pd.DataFrame([])
>     else:
>         return pd.DataFrame([key + (pdf.v.mean(),)])
> df.groupby('id').applyInPandas(mean_func, schema="id long, v double").show()
> {code}
> Since the schema is defined when calling {{applyInPandas()}}, it looks
> redundant to define the columns when returning an empty {{pd.DataFrame}}.
> Returning a non-empty DataFrame does not require defining columns, so
> returning an empty DataFrame shouldn't require that either.
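
For illustration, here is a minimal pandas-only sketch of why the column count in the error message is 0, and of the workaround the report calls redundant: declaring the schema's columns on the empty DataFrame. The name SCHEMA_COLUMNS and the hand-built group frames are assumptions added for this sketch and are not part of the report. The sketch does not start Spark, so it shows only the shapes of the returned frames, not the behavior of applyInPandas itself.

```python
import pandas as pd

# Assumption: column names copied from the schema string "id long, v double".
SCHEMA_COLUMNS = ["id", "v"]


def mean_func(key, pdf):
    """Variant of the reporter's function that names the columns on the empty result."""
    if key == (1,):
        # Empty result with the schema's columns declared explicitly.
        return pd.DataFrame(columns=SCHEMA_COLUMNS)
    # Non-empty result, unchanged from the report (no column names given).
    return pd.DataFrame([key + (pdf.v.mean(),)])


# Hand-built stand-ins for the per-key groups that Spark would pass in.
group_1 = pd.DataFrame({"id": [1, 1], "v": [1.0, 2.0]})
group_2 = pd.DataFrame({"id": [2, 2, 2], "v": [3.0, 5.0, 10.0]})

bare_empty = pd.DataFrame([])            # what the reported example returns for key 1
empty_result = mean_func((1,), group_1)  # empty, but with two declared columns
mean_result = mean_func((2,), group_2)   # one row holding the key and the mean of v

print("bare empty columns:", len(bare_empty.columns))
print("declared empty columns:", len(empty_result.columns))
print(mean_result)
```

The bare pd.DataFrame([]) carries no columns at all, which matches the "Actual: 0" in the error, while the declared variant carries both schema columns and zero rows.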