Re: [pyspark] dataframe map_partition

2019-03-08 Thread peng yu
ver, note that it will be
> turned into python dict because pandas itself does not have native struct
> type.
> On Fri, Mar 8, 2019 at 2:55 PM peng yu wrote:
>
>> Yeah, that seems most likely what I wanted. Does the scalar Pandas UDF
>> support StructType input too?
>>

Re: [pyspark] dataframe map_partition

2019-03-08 Thread peng yu
s.apache.org/jira/browse/SPARK-23836. Is
> that the functionality you are looking for?
>
> Bryan
>
> On Thu, Mar 7, 2019 at 1:13 PM peng yu wrote:
>
>> Right now, I'm using the columns-at-a-time mapping
>> https://github.com/yupbank/tf-spark-serving/blob/master/tss/utils.

Re: [pyspark] dataframe map_partition

2019-03-07 Thread peng yu
pandas DataFrame. Is what
> you're doing vectorized? may not help much.
> Just make the pandas Series into a DataFrame if you want? and a single
> col back to Series?
>
> On Thu, Mar 7, 2019 at 2:45 PM peng yu wrote:
> >
> > pandas/arrow is for the memory efficiency, and

Re: [pyspark] dataframe map_partition

2019-03-07 Thread peng yu
2019 at 2:03 PM peng yu wrote:
> >
> > I'm looking for a mapPartition(pandas_udf) for a pyspark.Dataframe.
> >
> > ```
> > @pandas_udf(df.schema, PandasUDFType.MAP)
> > def do_nothing(pandas_df):
> >     return pandas_df
> >
> >
> > new_df

Re: [pyspark] dataframe map_partition

2019-03-07 Thread peng yu
And in this case, I'm actually benefiting from Arrow's columnar support, so that I can pass the whole data block to TensorFlow and obtain the block of predictions all at once.

On Thu, Mar 7, 2019 at 3:45 PM peng yu wrote:
> pandas/arrow is for the memory efficiency, and mapPartitions is o

Re: [pyspark] dataframe map_partition

2019-03-07 Thread peng yu
vailable if you want to transform an iterator of Row to another
> iterator of Row.
>
> On Thu, Mar 7, 2019 at 2:33 PM peng yu wrote:
> >
> > it is very similar to SCALAR, but for SCALAR the output can't be
> struct/row and the input has to be pd.Series, which doesn't support a

Re: [pyspark] dataframe map_partition

2019-03-07 Thread peng yu
, 2019 at 2:57 PM Sean Owen wrote:
> Are you looking for @pandas_udf in Python? Or just mapPartition? Those
> exist already
>
> On Thu, Mar 7, 2019, 1:43 PM peng yu wrote:
>
>> There is a nice map_partition function in R `dapply`. so that user can
>> pass a row to udf.

[pyspark] dataframe map_partition

2019-03-07 Thread peng yu
There is a nice map_partition function in R, `dapply`, so that users can pass a row to a UDF. I'm wondering why we don't have that in Python? I'm trying to get a map_partition function with pandas_udf support. Thanks!
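
For readers following the thread, here is a rough sketch of the partition-at-a-time pattern being asked for. It is written in plain Python with no Spark or pandas dependency. It is only an illustration under stated assumptions, not how Spark implements anything. Rows are assumed to be plain dicts rather than pyspark `Row` objects, and every name in it (`rows_to_block`, `block_to_rows`, `map_partition_in_blocks`, `add_doubled`) is hypothetical. The idea is to pull rows from a partition iterator in chunks, pivot each chunk into a column-oriented block, hand the whole block to one function call (standing in for a batched model prediction), and pivot the result back into rows.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]
Block = Dict[str, List[object]]  # column name -> list of column values


def rows_to_block(rows: List[Row]) -> Block:
    """Pivot a non-empty list of row dicts into a column-oriented block."""
    names = list(rows[0])
    return {name: [row[name] for row in rows] for name in names}


def block_to_rows(block: Block) -> Iterator[Row]:
    """Pivot a column-oriented block back into row dicts."""
    names = list(block)
    for values in zip(*(block[name] for name in names)):
        yield dict(zip(names, values))


def map_partition_in_blocks(
    rows: Iterable[Row],
    block_fn: Callable[[Block], Block],
    block_size: int = 1024,
) -> Iterator[Row]:
    """Apply block_fn to a partition's rows one block at a time.

    Hypothetical helper: it takes an iterator of rows and returns an
    iterator of rows, which is the shape a mapPartitions-style API expects.
    """
    it = iter(rows)
    while True:
        chunk = list(islice(it, block_size))
        if not chunk:
            return
        yield from block_to_rows(block_fn(rows_to_block(chunk)))


def add_doubled(block: Block) -> Block:
    """Hypothetical stand-in for a batched prediction over a whole block."""
    out = dict(block)
    out["doubled"] = [2 * x for x in block["x"]]
    return out


if __name__ == "__main__":
    partition = [{"x": i} for i in range(5)]
    for row in map_partition_in_blocks(partition, add_doubled, block_size=2):
        print(row)
```

In PySpark, a function of this iterator-in, iterator-out shape could in principle be passed to `rdd.mapPartitions`, after converting each `Row` with `row.asDict()`. That wiring is an assumption and is not shown or tested here. Spark 3.0 and later also provide `DataFrame.mapInPandas`, which passes an iterator of pandas DataFrames to the function and is close to what this thread asks for.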