Josh created SPARK-12774: ---------------------------- Summary: DataFrame.mapPartitions apply function operates on Pandas DataFrame instead of a generator or rows Key: SPARK-12774 URL: https://issues.apache.org/jira/browse/SPARK-12774 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Josh
Currently DataFrame.mapPatitions is analogous to DataFrame.rdd.mapPatitions in both Spark and pySpark. The function that is applied to each partition _f_ must operate on a list generator. This is however very inefficient in Python. It would be more logical and efficient if the apply function _f_ operated on Pandas DataFrames instead and also returned a DataFrame. This avoids unnecessary iteration in Python which is slow. Currently: {code:python} def apply_function(rows): df = pd.DataFrame(list(rows)) df = df % 100 # Do something on df return df.values.tolist() table = sqlContext.read.parquet("") table = table.mapPatitions(apply_function) {code} New apply function would accept a Pandas DataFrame and return a DataFrame: {code:python} def apply_function(df): df = df % 100 # Do something on df return df {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org