Hi Gourav,

> And also be aware that pandas UDF does not always lead to better performance
> and sometimes even massively slow performance.

This information is not widely known, and it is good to know. In which
circumstances is it worse than a regular UDF?
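
For reference, here is a minimal pair one could benchmark on the actual
data (a hypothetical sketch, not from this thread): a row-at-a-time UDF
versus a scalar pandas UDF computing the same thing, to see whether the
Arrow batching pays off or its conversion overhead dominates:

    from pyspark.sql.functions import udf, pandas_udf
    from pyspark.sql.types import IntegerType

    # regular (row-at-a-time) UDF: one Python call per row
    @udf(IntegerType())
    def plain_len(contents):
        return len(contents) if contents is not None else 0

    # scalar pandas UDF: rows arrive in Arrow batches as a pandas Series,
    # so per-row Python overhead is traded for Arrow (de)serialization
    @pandas_udf(IntegerType())
    def arrow_len(contents):
        return contents.map(lambda b: len(b) if b is not None else 0)

Timing df.select(plain_len("contents")) against df.select(arrow_len("contents"))
on representative data is the only reliable way to tell which one wins.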

> With Grouped Map dont you run into the risk of random memory errors as well?

Indeed, that might happen if the batched binaries are surprisingly
large.
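
As a minimal sketch of the mechanism (hypothetical names, assuming the
<filename, contents> layout discussed below): with GROUPED_MAP, Spark hands
each whole group to one Python worker as a single pandas DataFrame, so
every group has to fit in that worker's memory at once:

    import pandas as pd
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    @pandas_udf("filename string, n_bytes long", PandasUDFType.GROUPED_MAP)
    def group_size(pdf):
        # pdf is the ENTIRE group materialized at once; a group holding
        # many large binaries is exactly what triggers the memory errors
        return pd.DataFrame({"filename": pdf.filename.iloc[:1],
                             "n_bytes": [int(pdf.contents.map(len).sum())]})

    # df.groupby("filename").apply(group_size) would report per-group sizes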

On Sat, May 04, 2019 at 02:25:34AM +0100, Gourav Sengupta wrote:
> And also be aware that pandas UDF does not always lead to better performance
> and sometimes even massively slow performance.
> 
> With Grouped Map dont you run into the risk of random memory errors as well?
> 
> On Thu, May 2, 2019 at 9:32 PM Bryan Cutler <cutl...@gmail.com> wrote:
> 
>     Hi,
> 
>     BinaryType support was not added until Spark 2.4.0, see
>     https://issues.apache.org/jira/browse/SPARK-23555. Also, pyarrow 0.10.0
>     or greater is required, as you saw in the docs.
> 
>     Bryan
> 
>     On Thu, May 2, 2019 at 4:26 AM Nicolas Paris <nicolas.pa...@riseup.net>
>     wrote:
> 
>         Hi all
> 
>         I am using pySpark 2.3.0 and pyArrow 0.10.0
> 
>         I want to apply a pandas UDF on a dataframe with a <String, BinaryType>
>         schema, and I get the error below:
> 
>         > Invalid returnType with grouped map Pandas UDFs:
>         > StructType(List(StructField(filename,StringType,true),
>         > StructField(contents,BinaryType,true))) is not supported
> 
> 
>         Am I missing something? The doc at
>         https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html#supported-sql-types
>         says pyArrow 0.10 is the minimum to handle BinaryType.
> 
>         here is the code:
> 
>         > from pyspark.sql.functions import pandas_udf, PandasUDFType
>         >
>         > df = sql("select filename, contents from test_binary")
>         >
>         > @pandas_udf("filename String, contents binary",
>         >             PandasUDFType.GROUPED_MAP)
>         > def transform_binary(pdf):
>         >     contents = pdf.contents
>         >     return pdf.assign(contents=contents)
>         >
>         > df.groupby("filename").apply(transform_binary).count()
> 
>         Thanks
>         --
>         nicolas
> 

-- 
nicolas

