[jira] [Commented] (SPARK-25461) PySpark Pandas UDF outputs incorrect results when input columns contain None

2018-10-02 Thread Stu (Michael Stewart) (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16635811#comment-16635811
 ] 

Stu (Michael Stewart) commented on SPARK-25461:
---

Thanks all; to me the largest issue with this behavior is the silent failure. 
There is a relatively sane workaround, but failing silently is deeply 
unnerving, especially since the same code runs in pandas proper with no hint 
of a problem. Even raising some runtime error would be a huge win from my 
perspective! 
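
For reference, a rough sketch of the pandas-only comparison I have in mind 
(reusing the data from the repro script below); applied directly to a pandas 
Series, the same logic keeps the null rows as missing values rather than 
silently collapsing them to false:

{code:java}
import random

import pandas as pd

values = [None] * 3 + [1.0] * 17 + [2.0] * 600
random.shuffle(values)
pdf = pd.DataFrame({'A': values})

# Same body as the gt_2 UDF in the repro, applied to the pandas Series directly.
result = (pdf['A'] >= 2).where(pdf['A'].notnull())

# 600 True, 17 False, 3 NaN -- exactly what one would expect.
print(result.value_counts(dropna=False))
{code}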

> PySpark Pandas UDF outputs incorrect results when input columns contain None
> 
>
> Key: SPARK-25461
> URL: https://issues.apache.org/jira/browse/SPARK-25461
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: I reproduced this issue by running pyspark locally on 
> mac:
> Spark version: 2.3.1 pre-built with Hadoop 2.7
> Python library versions: pyarrow==0.10.0, pandas==0.20.2
>Reporter: Chongyuan Xiang
>Priority: Major
>
> The following PySpark script uses a simple pandas UDF to calculate a column 
> given column 'A'. When column 'A' contains None, the results look incorrect.
> Script: 
>  
> {code:java}
> import pandas as pd
> import random
> import pyspark
> from pyspark.sql.functions import col, lit, pandas_udf
>
> values = [None] * 3 + [1.0] * 17 + [2.0] * 600
> random.shuffle(values)
> pdf = pd.DataFrame({'A': values})
> df = spark.createDataFrame(pdf)
>
> @pandas_udf(returnType=pyspark.sql.types.BooleanType())
> def gt_2(column):
>     return (column >= 2).where(column.notnull())
>
> calculated_df = (df.select(['A'])
>                  .withColumn('potential_bad_col', gt_2('A')))
> calculated_df = calculated_df.withColumn(
>     'correct_col', (col("A") >= lit(2)) | (col("A").isNull()))
> calculated_df.show()
> {code}
>  
> Output:
> {code:java}
> +---+-----------------+-----------+
> |  A|potential_bad_col|correct_col|
> +---+-----------------+-----------+
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |1.0|            false|      false|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> +---+-----------------+-----------+
> only showing top 20 rows
> {code}
> This problem disappears when the number of rows is small or when the input 
> column does not contain None.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25461) PySpark Pandas UDF outputs incorrect results when input columns contain None

2018-09-24 Thread Stu (Michael Stewart) (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16626198#comment-16626198
 ] 

Stu (Michael Stewart) commented on SPARK-25461:
---

[~hyukjin.kwon] I have also recently observed this issue; it seems potentially 
rather serious. This is quite different from the behavior of the same pandas 
function applied to a (pandas) df, which works exactly as expected.

> PySpark Pandas UDF outputs incorrect results when input columns contain None
> 
>
> Key: SPARK-25461
> URL: https://issues.apache.org/jira/browse/SPARK-25461
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: I reproduced this issue by running pyspark locally on 
> mac:
> Spark version: 2.3.1 pre-built with Hadoop 2.7
> Python library versions: pyarrow==0.10.0, pandas==0.20.2
>Reporter: Chongyuan Xiang
>Priority: Major
>
> The following PySpark script uses a simple pandas UDF to calculate a column 
> given column 'A'. When column 'A' contains None, the results look incorrect.
> Script: 
>  
> {code:java}
> import pandas as pd
> import random
> import pyspark
> from pyspark.sql.functions import col, lit, pandas_udf
>
> values = [None] * 3 + [1.0] * 17 + [2.0] * 600
> random.shuffle(values)
> pdf = pd.DataFrame({'A': values})
> df = spark.createDataFrame(pdf)
>
> @pandas_udf(returnType=pyspark.sql.types.BooleanType())
> def gt_2(column):
>     return (column >= 2).where(column.notnull())
>
> calculated_df = (df.select(['A'])
>                  .withColumn('potential_bad_col', gt_2('A')))
> calculated_df = calculated_df.withColumn(
>     'correct_col', (col("A") >= lit(2)) | (col("A").isNull()))
> calculated_df.show()
> {code}
>  
> Output:
> {code:java}
> +---+-----------------+-----------+
> |  A|potential_bad_col|correct_col|
> +---+-----------------+-----------+
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |1.0|            false|      false|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> +---+-----------------+-----------+
> only showing top 20 rows
> {code}
> This problem disappears when the number of rows is small or when the input 
> column does not contain None.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24208) Cannot resolve column in self join after applying Pandas UDF

2018-06-27 Thread Stu (Michael Stewart) (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16525363#comment-16525363
 ] 

Stu (Michael Stewart) commented on SPARK-24208:
---

I can confirm this does not work on 2.3.1.

> Cannot resolve column in self join after applying Pandas UDF
> 
>
> Key: SPARK-24208
> URL: https://issues.apache.org/jira/browse/SPARK-24208
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: AWS EMR 5.13.0
> Amazon Hadoop distribution 2.8.3
> Spark 2.3.0
> Pandas 0.22.0
>Reporter: Rafal Ganczarek
>Priority: Minor
>
> I noticed that after applying Pandas UDF function, a self join of resulted 
> DataFrame will fail to resolve columns. The workaround that I found is to 
> recreate DataFrame with its RDD and schema.
> Below you can find a Python code that reproduces the issue.
> {code:java}
> from pyspark import Row
> import pyspark.sql.functions as F
>
> @F.pandas_udf('key long, col string', F.PandasUDFType.GROUPED_MAP)
> def dummy_pandas_udf(df):
>     return df[['key', 'col']]
>
> df = spark.createDataFrame([Row(key=1, col='A'), Row(key=1, col='B'),
>                             Row(key=2, col='C')])
>
> # transformation that causes the issue
> df = df.groupBy('key').apply(dummy_pandas_udf)
>
> # WORKAROUND that fixes the issue
> # df = spark.createDataFrame(df.rdd, df.schema)
>
> df.alias('temp0').join(df.alias('temp1'),
>                        F.col('temp0.key') == F.col('temp1.key')).show()
> {code}
> If workaround line is commented out, then above code fails with the following 
> error:
> {code:java}
> AnalysisExceptionTraceback (most recent call last)
>  in ()
>  12 # df = spark.createDataFrame(df.rdd, df.schema)
>  13 
> ---> 14 df.alias('temp0').join(df.alias('temp1'), F.col('temp0.key') == 
> F.col('temp1.key')).show()
> /usr/lib/spark/python/pyspark/sql/dataframe.py in join(self, other, on, how)
> 929 on = self._jseq([])
> 930 assert isinstance(how, basestring), "how should be 
> basestring"
> --> 931 jdf = self._jdf.join(other._jdf, on, how)
> 932 return DataFrame(jdf, self.sql_ctx)
> 933 
> /usr/lib/spark/python/lib/py4j-src.zip/py4j/java_gateway.py in __call__(self, 
> *args)
>1158 answer = self.gateway_client.send_command(command)
>1159 return_value = get_return_value(
> -> 1160 answer, self.gateway_client, self.target_id, self.name)
>1161 
>1162 for temp_arg in temp_args:
> /usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>  67  
> e.java_exception.getStackTrace()))
>  68 if s.startswith('org.apache.spark.sql.AnalysisException: 
> '):
> ---> 69 raise AnalysisException(s.split(': ', 1)[1], 
> stackTrace)
>  70 if s.startswith('org.apache.spark.sql.catalyst.analysis'):
>  71 raise AnalysisException(s.split(': ', 1)[1], 
> stackTrace)
> AnalysisException: u"cannot resolve '`temp0.key`' given input columns: 
> [temp0.key, temp0.col];;\n'Join Inner, ('temp0.key = 'temp1.key)\n:- 
> AnalysisBarrier\n: +- SubqueryAlias temp0\n:+- 
> FlatMapGroupsInPandas [key#4099L], dummy_pandas_udf(col#4098, key#4099L), 
> [key#4104L, col#4105]\n:   +- Project [key#4099L, col#4098, 
> key#4099L]\n:  +- LogicalRDD [col#4098, key#4099L], false\n+- 
> AnalysisBarrier\n  +- SubqueryAlias temp1\n +- 
> FlatMapGroupsInPandas [key#4099L], dummy_pandas_udf(col#4098, key#4099L), 
> [key#4104L, col#4105]\n+- Project [key#4099L, col#4098, 
> key#4099L]\n   +- LogicalRDD [col#4098, key#4099L], false\n"
> {code}
> The same happens, if instead of DataFrame API I use Spark SQL to do a self 
> join:
> {code:java}
> # df is a DataFrame after applying dummy_pandas_udf
> df.createOrReplaceTempView('df')
> spark.sql('''
> SELECT 
> *
> FROM df temp0
> LEFT JOIN df temp1 ON
> temp0.key == temp1.key
> ''').show()
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24208) Cannot resolve column in self join after applying Pandas UDF

2018-06-26 Thread Stu (Michael Stewart) (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16524341#comment-16524341
 ] 

Stu (Michael Stewart) commented on SPARK-24208:
---

[~hyukjin.kwon] I can confirm I ran into this issue too. As the OP noted, it 
stems from applying a pandas GROUPED_MAP UDF to a DF prior to attempting a 
self-join of that DF. Beyond that I've not investigated.
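
In the meantime, a rough sketch of the workaround noted in the description 
(assuming the same spark session, df, and dummy_pandas_udf as in the repro 
below):

{code:java}
import pyspark.sql.functions as F

# Rebuild the DataFrame from its own RDD and schema after the GROUPED_MAP UDF;
# the self-join then resolves the aliased columns again.
grouped = df.groupBy('key').apply(dummy_pandas_udf)
grouped = spark.createDataFrame(grouped.rdd, grouped.schema)

grouped.alias('temp0').join(
    grouped.alias('temp1'),
    F.col('temp0.key') == F.col('temp1.key')).show()
{code}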

> Cannot resolve column in self join after applying Pandas UDF
> 
>
> Key: SPARK-24208
> URL: https://issues.apache.org/jira/browse/SPARK-24208
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: AWS EMR 5.13.0
> Amazon Hadoop distribution 2.8.3
> Spark 2.3.0
> Pandas 0.22.0
>Reporter: Rafal Ganczarek
>Priority: Minor
>
> I noticed that after applying Pandas UDF function, a self join of resulted 
> DataFrame will fail to resolve columns. The workaround that I found is to 
> recreate DataFrame with its RDD and schema.
> Below you can find a Python code that reproduces the issue.
> {code:java}
> from pyspark import Row
> import pyspark.sql.functions as F
>
> @F.pandas_udf('key long, col string', F.PandasUDFType.GROUPED_MAP)
> def dummy_pandas_udf(df):
>     return df[['key', 'col']]
>
> df = spark.createDataFrame([Row(key=1, col='A'), Row(key=1, col='B'),
>                             Row(key=2, col='C')])
>
> # transformation that causes the issue
> df = df.groupBy('key').apply(dummy_pandas_udf)
>
> # WORKAROUND that fixes the issue
> # df = spark.createDataFrame(df.rdd, df.schema)
>
> df.alias('temp0').join(df.alias('temp1'),
>                        F.col('temp0.key') == F.col('temp1.key')).show()
> {code}
> If workaround line is commented out, then above code fails with the following 
> error:
> {code:java}
> AnalysisExceptionTraceback (most recent call last)
>  in ()
>  12 # df = spark.createDataFrame(df.rdd, df.schema)
>  13 
> ---> 14 df.alias('temp0').join(df.alias('temp1'), F.col('temp0.key') == 
> F.col('temp1.key')).show()
> /usr/lib/spark/python/pyspark/sql/dataframe.py in join(self, other, on, how)
> 929 on = self._jseq([])
> 930 assert isinstance(how, basestring), "how should be 
> basestring"
> --> 931 jdf = self._jdf.join(other._jdf, on, how)
> 932 return DataFrame(jdf, self.sql_ctx)
> 933 
> /usr/lib/spark/python/lib/py4j-src.zip/py4j/java_gateway.py in __call__(self, 
> *args)
>1158 answer = self.gateway_client.send_command(command)
>1159 return_value = get_return_value(
> -> 1160 answer, self.gateway_client, self.target_id, self.name)
>1161 
>1162 for temp_arg in temp_args:
> /usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>  67  
> e.java_exception.getStackTrace()))
>  68 if s.startswith('org.apache.spark.sql.AnalysisException: 
> '):
> ---> 69 raise AnalysisException(s.split(': ', 1)[1], 
> stackTrace)
>  70 if s.startswith('org.apache.spark.sql.catalyst.analysis'):
>  71 raise AnalysisException(s.split(': ', 1)[1], 
> stackTrace)
> AnalysisException: u"cannot resolve '`temp0.key`' given input columns: 
> [temp0.key, temp0.col];;\n'Join Inner, ('temp0.key = 'temp1.key)\n:- 
> AnalysisBarrier\n: +- SubqueryAlias temp0\n:+- 
> FlatMapGroupsInPandas [key#4099L], dummy_pandas_udf(col#4098, key#4099L), 
> [key#4104L, col#4105]\n:   +- Project [key#4099L, col#4098, 
> key#4099L]\n:  +- LogicalRDD [col#4098, key#4099L], false\n+- 
> AnalysisBarrier\n  +- SubqueryAlias temp1\n +- 
> FlatMapGroupsInPandas [key#4099L], dummy_pandas_udf(col#4098, key#4099L), 
> [key#4104L, col#4105]\n+- Project [key#4099L, col#4098, 
> key#4099L]\n   +- LogicalRDD [col#4098, key#4099L], false\n"
> {code}
> The same happens, if instead of DataFrame API I use Spark SQL to do a self 
> join:
> {code:java}
> # df is a DataFrame after applying dummy_pandas_udf
> df.createOrReplaceTempView('df')
> spark.sql('''
> SELECT 
> *
> FROM df temp0
> LEFT JOIN df temp1 ON
> temp0.key == temp1.key
> ''').show()
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23645) pandas_udf can not be called with keyword arguments

2018-03-18 Thread Stu (Michael Stewart) (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16404126#comment-16404126
 ] 

Stu (Michael Stewart) commented on SPARK-23645:
---

{quote}Sounds a good to do if the change is minimal but if the change is big, I 
doubt if this is something we should support. Documenting this might be good 
enough for now.
{quote}
Definitely a nontrivial change after digging all the way down. I've updated the PR.

> pandas_udf can not be called with keyword arguments
> ---
>
> Key: SPARK-23645
> URL: https://issues.apache.org/jira/browse/SPARK-23645
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: python 3.6 | pyspark 2.3.0 | Using Scala version 2.11.8, 
> OpenJDK 64-Bit Server VM, 1.8.0_141
>Reporter: Stu (Michael Stewart)
>Priority: Minor
>
> pandas_udf (all python udfs(?)) do not accept keyword arguments because 
> `pyspark/sql/udf.py` class `UserDefinedFunction` has __call__, and also 
> wrapper utility methods, that only accept args and not kwargs:
> @ line 168:
> {code:java}
> ...
> def __call__(self, *cols):
>     judf = self._judf
>     sc = SparkContext._active_spark_context
>     return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))
>
> # This function is for improving the online help system in the interactive
> # interpreter. For example, the built-in help / pydoc.help. It wraps the UDF
> # with the docstring and argument annotation. (See: SPARK-19161)
> def _wrapped(self):
>     """
>     Wrap this udf with a function and attach docstring from func
>     """
>
>     # It is possible for a callable instance without __name__ attribute or/and
>     # __module__ attribute to be wrapped here. For example, functools.partial.
>     # In this case, we should avoid wrapping the attributes from the wrapped
>     # function to the wrapper function. So, we take out these attribute names
>     # from the default names to set and then manually assign it after being
>     # wrapped.
>     assignments = tuple(
>         a for a in functools.WRAPPER_ASSIGNMENTS if a != '__name__' and a != '__module__')
>
>     @functools.wraps(self.func, assigned=assignments)
>     def wrapper(*args):
>         return self(*args)
> ...{code}
> as seen in:
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit
>
> spark = SparkSession.builder.getOrCreate()
> df = spark.range(12).withColumn('b', col('id') * 2)
>
> def ok(a,b): return a*b
>
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b')).show()  # no problems
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')(a='id',b='b')).show()  # fail with ~no stacktrace thanks to wrapper helper
>
> ---
> TypeError Traceback (most recent call last)
>  in ()
> > 1 df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')(a='id',b='b')).show()
> TypeError: wrapper() got an unexpected keyword argument 'a'{code}
>  
>  
> *discourse*: it isn't difficult to swap back in the kwargs, allowing the UDF 
> to be called as such, but the cols tuple that gets passed in the call method:
> {code:java}
> _to_seq(sc, cols, _to_java_column{code}
>  has to be in the right order based on the functions defined argument inputs, 
> or the function will return incorrect results. so, the challenge here is to:
> (a) make sure to reconstruct the proper order of the full args/kwargs
> --> args first, and then kwargs (not in the order passed but in the order 
> requested by the fn)
> (b) handle python2 and python3 `inspect` module inconsistencies 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23645) pandas_udf can not be called with keyword arguments

2018-03-12 Thread Stu (Michael Stewart) (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395390#comment-16395390
 ] 

Stu (Michael Stewart) edited comment on SPARK-23645 at 3/12/18 3:28 PM:


[~hyukjin.kwon] thanks for the thoughts. It actually turned out to be easier 
than I'd expected to get most of the way there. The issue, as usual, is 
Python 2: I failed the existing unit tests on attempts to call 
`inspect.getargspec` on a callable class and on a partial function. In Python 
these two concepts are oddly differentiated from functions; in Python 3 this is 
handled seamlessly by `inspect.getfullargspec`. Of course our friend getargspec 
has been deprecated since 3.0, but there is really no alternative for py2. 

 

One middle ground that might be acceptable is to raise an error in Python 2 if 
a user passes keyword args to a partial fn object / callable object, but allow 
the usage on plain functions. I suspect the vast majority of UDF use cases in 
Python rely on actual plain-old functions, so this would be a clear 
functionality improvement over the present behavior for quite few lines of 
code.

 

That is:

py2 - raise the error as mentioned above, otherwise handle functions with 
kwargs normally

py3 - everything just works

 

[https://github.com/apache/spark/pull/20798]


was (Author: mstewart141):
[~hyukjin.kwon] thanks for the thoughts. it actually turned out to be easier 
than i'd expected to get most of the way there. the issue, as usual, is 
python2. i failed the existing unit tests on attempts to call 
`inspect.getargspec` on a callable class and on a partial function. in python 
these two concepts are oddly differentiated from functions. in python 3 it is 
handled seamlessly by `inspect.getfullargspec`. of course our friend getargspec 
is deprecated since 3.0 but there is really no alternative for py2. 

 

one middle ground that might be acceptable is to raise an error in python2 if a 
user passed keyword args to a partial fn object/callable object, but allow 
usage on functions. i suspect the vast majority of usecases of UDF in python 
rely on actual plain-old functions. 

 

that is:

py2 - raise error as mentioned above, otherwise handle functions with kwargs 
normally

py3 - everything just works

 

https://github.com/apache/spark/pull/20798

> pandas_udf can not be called with keyword arguments
> ---
>
> Key: SPARK-23645
> URL: https://issues.apache.org/jira/browse/SPARK-23645
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: python 3.6 | pyspark 2.3.0 | Using Scala version 2.11.8, 
> OpenJDK 64-Bit Server VM, 1.8.0_141
>Reporter: Stu (Michael Stewart)
>Priority: Minor
>
> pandas_udf (all python udfs(?)) do not accept keyword arguments because 
> `pyspark/sql/udf.py` class `UserDefinedFunction` has __call__, and also 
> wrapper utility methods, that only accept args and not kwargs:
> @ line 168:
> {code:java}
> ...
> def __call__(self, *cols):
> judf = self._judf
> sc = SparkContext._active_spark_context
> return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))
> # This function is for improving the online help system in the interactive 
> interpreter.
> # For example, the built-in help / pydoc.help. It wraps the UDF with the 
> docstring and
> # argument annotation. (See: SPARK-19161)
> def _wrapped(self):
> """
> Wrap this udf with a function and attach docstring from func
> """
> # It is possible for a callable instance without __name__ attribute or/and
> # __module__ attribute to be wrapped here. For example, 
> functools.partial. In this case,
> # we should avoid wrapping the attributes from the wrapped function to 
> the wrapper
> # function. So, we take out these attribute names from the default names 
> to set and
> # then manually assign it after being wrapped.
> assignments = tuple(
> a for a in functools.WRAPPER_ASSIGNMENTS if a != '__name__' and a != 
> '__module__')
> @functools.wraps(self.func, assigned=assignments)
> def wrapper(*args):
> return self(*args)
> ...{code}
> as seen in:
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit
> spark = SparkSession.builder.getOrCreate()
> df = spark.range(12).withColumn('b', col('id') * 2)
> def ok(a,b): return a*b
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b')).show()  
> # no problems
> df.withColumn('ok', pandas_udf(f=ok, 
> returnType='bigint')(a='id',b='b')).show()  # fail with ~no stacktrace thanks 
> to wrapper helper
> ---
> TypeError Traceback (most recent call last)
>  in ()
> > 1 df.withColumn('ok', 

[jira] [Commented] (SPARK-23645) pandas_udf can not be called with keyword arguments

2018-03-12 Thread Stu (Michael Stewart) (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395390#comment-16395390
 ] 

Stu (Michael Stewart) commented on SPARK-23645:
---

[~hyukjin.kwon] thanks for the thoughts. It actually turned out to be easier 
than I'd expected to get most of the way there. The issue, as usual, is 
Python 2: I failed the existing unit tests on attempts to call 
`inspect.getargspec` on a callable class and on a partial function. In Python 
these two concepts are oddly differentiated from functions; in Python 3 this is 
handled seamlessly by `inspect.getfullargspec`. Of course our friend getargspec 
has been deprecated since 3.0, but there is really no alternative for py2. 

 

One middle ground that might be acceptable is to raise an error in Python 2 if 
a user passes keyword args to a partial fn object / callable object, but allow 
the usage on plain functions. I suspect the vast majority of UDF use cases in 
Python rely on actual plain-old functions. 

 

That is:

py2 - raise the error as mentioned above, otherwise handle functions with 
kwargs normally

py3 - everything just works

 

https://github.com/apache/spark/pull/20798
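
To make the shape of the change concrete, here is a very rough sketch of the 
two pieces involved (not the actual PR; the helper names `_get_arg_names` and 
`_reorder_cols` are made up for illustration):

{code:java}
import functools
import inspect
import sys


def _get_arg_names(f):
    # Python 2's getargspec cannot introspect partials or callable objects,
    # so keyword-argument support there is limited to plain functions.
    if sys.version_info[0] < 3:
        if isinstance(f, functools.partial) or not inspect.isfunction(f):
            raise TypeError(
                "keyword arguments are only supported for plain functions on Python 2")
        return inspect.getargspec(f).args
    # Python 3: getfullargspec handles functions, partials and callable objects.
    return inspect.getfullargspec(f).args


def _reorder_cols(f, args, kwargs):
    # Positional args first, then keyword args slotted into the order the
    # function declares them, not the order the caller passed them.
    names = _get_arg_names(f)
    remaining = names[len(args):]
    return list(args) + [kwargs[name] for name in remaining if name in kwargs]


def ok(a, b):
    return a * b

print(_reorder_cols(ok, ('id',), {'b': 'b'}))        # ['id', 'b']
print(_reorder_cols(ok, (), {'b': 'b', 'a': 'id'}))  # ['id', 'b']
{code}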

> pandas_udf can not be called with keyword arguments
> ---
>
> Key: SPARK-23645
> URL: https://issues.apache.org/jira/browse/SPARK-23645
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: python 3.6 | pyspark 2.3.0 | Using Scala version 2.11.8, 
> OpenJDK 64-Bit Server VM, 1.8.0_141
>Reporter: Stu (Michael Stewart)
>Priority: Minor
>
> pandas_udf (all python udfs(?)) do not accept keyword arguments because 
> `pyspark/sql/udf.py` class `UserDefinedFunction` has __call__, and also 
> wrapper utility methods, that only accept args and not kwargs:
> @ line 168:
> {code:java}
> ...
> def __call__(self, *cols):
>     judf = self._judf
>     sc = SparkContext._active_spark_context
>     return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))
>
> # This function is for improving the online help system in the interactive
> # interpreter. For example, the built-in help / pydoc.help. It wraps the UDF
> # with the docstring and argument annotation. (See: SPARK-19161)
> def _wrapped(self):
>     """
>     Wrap this udf with a function and attach docstring from func
>     """
>
>     # It is possible for a callable instance without __name__ attribute or/and
>     # __module__ attribute to be wrapped here. For example, functools.partial.
>     # In this case, we should avoid wrapping the attributes from the wrapped
>     # function to the wrapper function. So, we take out these attribute names
>     # from the default names to set and then manually assign it after being
>     # wrapped.
>     assignments = tuple(
>         a for a in functools.WRAPPER_ASSIGNMENTS if a != '__name__' and a != '__module__')
>
>     @functools.wraps(self.func, assigned=assignments)
>     def wrapper(*args):
>         return self(*args)
> ...{code}
> as seen in:
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit
>
> spark = SparkSession.builder.getOrCreate()
> df = spark.range(12).withColumn('b', col('id') * 2)
>
> def ok(a,b): return a*b
>
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b')).show()  # no problems
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')(a='id',b='b')).show()  # fail with ~no stacktrace thanks to wrapper helper
>
> ---
> TypeError Traceback (most recent call last)
>  in ()
> > 1 df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')(a='id',b='b')).show()
> TypeError: wrapper() got an unexpected keyword argument 'a'{code}
>  
>  
> *discourse*: it isn't difficult to swap back in the kwargs, allowing the UDF 
> to be called as such, but the cols tuple that gets passed in the call method:
> {code:java}
> _to_seq(sc, cols, _to_java_column{code}
>  has to be in the right order based on the functions defined argument inputs, 
> or the function will return incorrect results. so, the challenge here is to:
> (a) make sure to reconstruct the proper order of the full args/kwargs
> --> args first, and then kwargs (not in the order passed but in the order 
> requested by the fn)
> (b) handle python2 and python3 `inspect` module inconsistencies 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23645) pandas_udf can not be called with keyword arguments

2018-03-10 Thread Stu (Michael Stewart) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stu (Michael Stewart) updated SPARK-23645:
--
Description: 
pandas_udf (all python udfs(?)) do not accept keyword arguments because 
`pyspark/sql/udf.py` class `UserDefinedFunction` has __call__, and also wrapper 
utility methods, that only accept args and not kwargs:

@ line 168:
{code:java}
...

def __call__(self, *cols):
judf = self._judf
sc = SparkContext._active_spark_context
return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))

# This function is for improving the online help system in the interactive 
interpreter.
# For example, the built-in help / pydoc.help. It wraps the UDF with the 
docstring and
# argument annotation. (See: SPARK-19161)
def _wrapped(self):
"""
Wrap this udf with a function and attach docstring from func
"""

# It is possible for a callable instance without __name__ attribute or/and
# __module__ attribute to be wrapped here. For example, functools.partial. 
In this case,
# we should avoid wrapping the attributes from the wrapped function to the 
wrapper
# function. So, we take out these attribute names from the default names to 
set and
# then manually assign it after being wrapped.
assignments = tuple(
a for a in functools.WRAPPER_ASSIGNMENTS if a != '__name__' and a != 
'__module__')

@functools.wraps(self.func, assigned=assignments)
def wrapper(*args):
return self(*args)

...{code}
as seen in:
{code:java}
from pyspark.sql import SparkSession

from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit

spark = SparkSession.builder.getOrCreate()

df = spark.range(12).withColumn('b', col('id') * 2)

def ok(a,b): return a*b

df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b')).show()  # 
no problems
df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')(a='id',b='b')).show() 
 # fail with ~no stacktrace thanks to wrapper helper

---
TypeError Traceback (most recent call last)
 in ()
> 1 df.withColumn('ok', pandas_udf(f=ok, 
returnType='bigint')(a='id',b='b')).show()

TypeError: wrapper() got an unexpected keyword argument 'a'{code}
 

 

*discourse*: it isn't difficult to swap back in the kwargs, allowing the UDF to 
be called as such, but the cols tuple that gets passed in the call method:
{code:java}
_to_seq(sc, cols, _to_java_column{code}
 has to be in the right order based on the functions defined argument inputs, 
or the function will return incorrect results. so, the challenge here is to:

(a) make sure to reconstruct the proper order of the full args/kwargs

--> args first, and then kwargs (not in the order passed but in the order 
requested by the fn)

(b) handle python2 and python3 `inspect` module inconsistencies 

  was:
pandas_udf (all python udfs(?)) do not accept keyword arguments because 
`pyspark/sql/udf.py` class `UserDefinedFunction` has __call__, and also wrapper 
utility methods, that only accept args and not kwargs:

@ line 168:
{code:java}
...

def __call__(self, *cols):
judf = self._judf
sc = SparkContext._active_spark_context
return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))

# This function is for improving the online help system in the interactive 
interpreter.
# For example, the built-in help / pydoc.help. It wraps the UDF with the 
docstring and
# argument annotation. (See: SPARK-19161)
def _wrapped(self):
"""
Wrap this udf with a function and attach docstring from func
"""

# It is possible for a callable instance without __name__ attribute or/and
# __module__ attribute to be wrapped here. For example, functools.partial. 
In this case,
# we should avoid wrapping the attributes from the wrapped function to the 
wrapper
# function. So, we take out these attribute names from the default names to 
set and
# then manually assign it after being wrapped.
assignments = tuple(
a for a in functools.WRAPPER_ASSIGNMENTS if a != '__name__' and a != 
'__module__')

@functools.wraps(self.func, assigned=assignments)
def wrapper(*args):
return self(*args)

...{code}
as seen in:
{code:java}
from pyspark.sql import SparkSession

from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit

spark = SparkSession.builder.getOrCreate()

df = spark.range(12).withColumn('b', col('id') * 2)

def ok(a,b): return a*b

df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b')).show()  # 
no problems
df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')(a='id',b='b')).show() 
 # fail with ~no stacktrace thanks to wrapper helper{code}
 

 

*discourse*: it isn't difficult to swap back in the kwargs, allowing the UDF to 
be called as such, but the cols tuple that gets passed in the call method:
{code:java}
_to_seq(sc, cols, 

[jira] [Updated] (SPARK-23645) pandas_udf can not be called with keyword arguments

2018-03-10 Thread Stu (Michael Stewart) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stu (Michael Stewart) updated SPARK-23645:
--
Description: 
pandas_udf (all python udfs(?)) do not accept keyword arguments because 
`pyspark/sql/udf.py` class `UserDefinedFunction` has __call__, and also wrapper 
utility methods, that only accept args and not kwargs:

@ line 168:
{code:java}
...

def __call__(self, *cols):
judf = self._judf
sc = SparkContext._active_spark_context
return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))

# This function is for improving the online help system in the interactive 
interpreter.
# For example, the built-in help / pydoc.help. It wraps the UDF with the 
docstring and
# argument annotation. (See: SPARK-19161)
def _wrapped(self):
"""
Wrap this udf with a function and attach docstring from func
"""

# It is possible for a callable instance without __name__ attribute or/and
# __module__ attribute to be wrapped here. For example, functools.partial. 
In this case,
# we should avoid wrapping the attributes from the wrapped function to the 
wrapper
# function. So, we take out these attribute names from the default names to 
set and
# then manually assign it after being wrapped.
assignments = tuple(
a for a in functools.WRAPPER_ASSIGNMENTS if a != '__name__' and a != 
'__module__')

@functools.wraps(self.func, assigned=assignments)
def wrapper(*args):
return self(*args)

...{code}
as seen in:
{code:java}
from pyspark.sql import SparkSession

from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit

spark = SparkSession.builder.getOrCreate()

df = spark.range(12).withColumn('b', col('id') * 2)

def ok(a,b): return a*b

df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b')).show()  # 
no problems
df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')(a='id',b='b')).show() 
 # fail with ~no stacktrace thanks to wrapper helper{code}
 

 

*discourse*: it isn't difficult to swap back in the kwargs, allowing the UDF to 
be called as such, but the cols tuple that gets passed in the call method:
{code:java}
_to_seq(sc, cols, _to_java_column{code}
 has to be in the right order based on the functions defined argument inputs, 
or the function will return incorrect results. so, the challenge here is to:

(a) make sure to reconstruct the proper order of the full args/kwargs

--> args first, and then kwargs (not in the order passed but in the order 
requested by the fn)

(b) handle python2 and python3 `inspect` module inconsistencies 

  was:
pandas_udf (all python udfs(?)) do not accept keyword arguments because 
`pyspark/sql/udf.py` class `UserDefinedFunction` has __call__, and also wrapper 
utility methods, that only accept args and not kwargs:

@ line 168:
{code:java}
...

def __call__(self, *cols):
judf = self._judf
sc = SparkContext._active_spark_context
return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))

# This function is for improving the online help system in the interactive 
interpreter.
# For example, the built-in help / pydoc.help. It wraps the UDF with the 
docstring and
# argument annotation. (See: SPARK-19161)
def _wrapped(self):
"""
Wrap this udf with a function and attach docstring from func
"""

# It is possible for a callable instance without __name__ attribute or/and
# __module__ attribute to be wrapped here. For example, functools.partial. 
In this case,
# we should avoid wrapping the attributes from the wrapped function to the 
wrapper
# function. So, we take out these attribute names from the default names to 
set and
# then manually assign it after being wrapped.
assignments = tuple(
a for a in functools.WRAPPER_ASSIGNMENTS if a != '__name__' and a != 
'__module__')

@functools.wraps(self.func, assigned=assignments)
def wrapper(*args):
return self(*args)

...{code}
as seen in:
{code:java}
from pyspark.sql import SparkSession

from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit

spark = SparkSession.builder.getOrCreate()

df = spark.range(12).withColumn('b', col('id') * 2)

def ok(a,b): return a*b

df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b')).show()  # 
no problems
df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')(a='id',b='b')).show() 
 # fail with ~no stacktrace thanks to wrapper helper{code}
discourse: it isn't difficult to swap back in the kwargs, allowing the UDF to 
be called as such, but the cols tuple that gets passed in the call method:
{code:java}
_to_seq(sc, cols, _to_java_column{code}
 has to be in the right order based on the functions defined argument inputs, 
or the function will return incorrect results. 


> pandas_udf can not be called with keyword arguments
> ---
>
> Key: 

[jira] [Updated] (SPARK-23645) pandas_udf can not be called with keyword arguments

2018-03-10 Thread Stu (Michael Stewart) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stu (Michael Stewart) updated SPARK-23645:
--
Description: 
pandas_udf (all python udfs(?)) do not accept keyword arguments because 
`pyspark/sql/udf.py` class `UserDefinedFunction` has __call__, and also wrapper 
utility methods, that only accept args and not kwargs:

@ line 168:
{code:java}
...

def __call__(self, *cols):
judf = self._judf
sc = SparkContext._active_spark_context
return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))

# This function is for improving the online help system in the interactive 
interpreter.
# For example, the built-in help / pydoc.help. It wraps the UDF with the 
docstring and
# argument annotation. (See: SPARK-19161)
def _wrapped(self):
"""
Wrap this udf with a function and attach docstring from func
"""

# It is possible for a callable instance without __name__ attribute or/and
# __module__ attribute to be wrapped here. For example, functools.partial. 
In this case,
# we should avoid wrapping the attributes from the wrapped function to the 
wrapper
# function. So, we take out these attribute names from the default names to 
set and
# then manually assign it after being wrapped.
assignments = tuple(
a for a in functools.WRAPPER_ASSIGNMENTS if a != '__name__' and a != 
'__module__')

@functools.wraps(self.func, assigned=assignments)
def wrapper(*args):
return self(*args)

...{code}
as seen in:
{code:java}
from pyspark.sql import SparkSession

from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit

spark = SparkSession.builder.getOrCreate()

df = spark.range(12).withColumn('b', col('id') * 2)

def ok(a,b): return a*b

df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b')).show()  # 
no problems
df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')(a='id',b='b')).show() 
 # fail with ~no stacktrace thanks to wrapper helper{code}
discourse: it isn't difficult to swap back in the kwargs, allowing the UDF to 
be called as such, but the cols tuple that gets passed in the call method:
{code:java}
_to_seq(sc, cols, _to_java_column{code}
 has to be in the right order based on the functions defined argument inputs, 
or the function will return incorrect results. 

  was:
pandas_udf (all python udfs(?)) do not accept keyword arguments because 
`pyspark/sql/udf.py` class `UserDefinedFunction` has __call__, and also wrapper 
utility methods, that only accept args and not kwargs:

@ line 168:
{code:java}
...

def __call__(self, *cols):
judf = self._judf
sc = SparkContext._active_spark_context
return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))

# This function is for improving the online help system in the interactive 
interpreter.
# For example, the built-in help / pydoc.help. It wraps the UDF with the 
docstring and
# argument annotation. (See: SPARK-19161)
def _wrapped(self):
"""
Wrap this udf with a function and attach docstring from func
"""

# It is possible for a callable instance without __name__ attribute or/and
# __module__ attribute to be wrapped here. For example, functools.partial. 
In this case,
# we should avoid wrapping the attributes from the wrapped function to the 
wrapper
# function. So, we take out these attribute names from the default names to 
set and
# then manually assign it after being wrapped.
assignments = tuple(
a for a in functools.WRAPPER_ASSIGNMENTS if a != '__name__' and a != 
'__module__')

@functools.wraps(self.func, assigned=assignments)
def wrapper(*args):
return self(*args)

...{code}


> pandas_udf can not be called with keyword arguments
> ---
>
> Key: SPARK-23645
> URL: https://issues.apache.org/jira/browse/SPARK-23645
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: python 3.6 | pyspark 2.3.0 | Using Scala version 2.11.8, 
> OpenJDK 64-Bit Server VM, 1.8.0_141
>Reporter: Stu (Michael Stewart)
>Priority: Minor
>
> pandas_udf (all python udfs(?)) do not accept keyword arguments because 
> `pyspark/sql/udf.py` class `UserDefinedFunction` has __call__, and also 
> wrapper utility methods, that only accept args and not kwargs:
> @ line 168:
> {code:java}
> ...
> def __call__(self, *cols):
> judf = self._judf
> sc = SparkContext._active_spark_context
> return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))
> # This function is for improving the online help system in the interactive 
> interpreter.
> # For example, the built-in help / pydoc.help. It wraps the UDF with the 
> docstring and
> # argument annotation. (See: SPARK-19161)
> def _wrapped(self):
> """
> Wrap 

[jira] [Created] (SPARK-23645) pandas_udf can not be called with keyword arguments

2018-03-10 Thread Stu (Michael Stewart) (JIRA)
Stu (Michael Stewart) created SPARK-23645:
-

 Summary: pandas_udf can not be called with keyword arguments
 Key: SPARK-23645
 URL: https://issues.apache.org/jira/browse/SPARK-23645
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.3.0
 Environment: python 3.6 | pyspark 2.3.0 | Using Scala version 2.11.8, 
OpenJDK 64-Bit Server VM, 1.8.0_141
Reporter: Stu (Michael Stewart)


pandas_udf (all python udfs(?)) do not accept keyword arguments because 
`pyspark/sql/udf.py` class `UserDefinedFunction` has __call__, and also wrapper 
utility methods, that only accept args and not kwargs:

@ line 168:
{code:java}
...

def __call__(self, *cols):
judf = self._judf
sc = SparkContext._active_spark_context
return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))

# This function is for improving the online help system in the interactive 
interpreter.
# For example, the built-in help / pydoc.help. It wraps the UDF with the 
docstring and
# argument annotation. (See: SPARK-19161)
def _wrapped(self):
"""
Wrap this udf with a function and attach docstring from func
"""

# It is possible for a callable instance without __name__ attribute or/and
# __module__ attribute to be wrapped here. For example, functools.partial. 
In this case,
# we should avoid wrapping the attributes from the wrapped function to the 
wrapper
# function. So, we take out these attribute names from the default names to 
set and
# then manually assign it after being wrapped.
assignments = tuple(
a for a in functools.WRAPPER_ASSIGNMENTS if a != '__name__' and a != 
'__module__')

@functools.wraps(self.func, assigned=assignments)
def wrapper(*args):
return self(*args)

...{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23569) pandas_udf does not work with type-annotated python functions

2018-03-03 Thread Stu (Michael Stewart) (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16384857#comment-16384857
 ] 

Stu (Michael Stewart) commented on SPARK-23569:
---

Yes, I'll give it a go.
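
Roughly what I have in mind (a sketch only, not the patch): prefer 
`getfullargspec` where it exists so annotated functions introspect cleanly, and 
fall back to `getargspec` otherwise:

{code:java}
import inspect

try:
    # Python 3: handles keyword-only parameters and annotations.
    _getargspec = inspect.getfullargspec
except AttributeError:
    # Python 2 fallback.
    _getargspec = inspect.getargspec

# e.g. in _create_udf, instead of inspect.getargspec(f):
# argspec = _getargspec(f)
{code}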

> pandas_udf does not work with type-annotated python functions
> -
>
> Key: SPARK-23569
> URL: https://issues.apache.org/jira/browse/SPARK-23569
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: python 3.6 | pyspark 2.3.0 | Using Scala version 2.11.8, 
> OpenJDK 64-Bit Server VM, 1.8.0_141 | Revision 
> a0d7949896e70f427e7f3942ff340c9484ff0aab
>Reporter: Stu (Michael Stewart)
>Priority: Major
>
> When invoked against a type annotated function pandas_udf raises:
> `ValueError: Function has keyword-only parameters or annotations, use 
> getfullargspec() API which can support them`
>  
> the deprecated `getargsspec` call occurs in `pyspark/sql/udf.py`
> {code:java}
> def _create_udf(f, returnType, evalType):
>     if evalType in (PythonEvalType.SQL_SCALAR_PANDAS_UDF,
>                     PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF):
>         import inspect
>         from pyspark.sql.utils import require_minimum_pyarrow_version
>
>         require_minimum_pyarrow_version()
>         argspec = inspect.getargspec(f)
> ...{code}
> To reproduce: 
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit
>
> spark = SparkSession.builder.getOrCreate()
> df = spark.range(12).withColumn('b', col('id') * 2)
>
> def ok(a,b): return a*b
>
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b')).show()  # no problems
>
> import pandas as pd
>
> def ok(a: pd.Series, b: pd.Series) -> pd.Series: return a*b
>
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b'))
>  
> ---
> ValueError Traceback (most recent call last)
>  in ()
> > 1 df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b'))
> /opt/miniconda/lib/python3.6/site-packages/pyspark/sql/functions.py in 
> pandas_udf(f, returnType, functionType)
> 2277 return functools.partial(_create_udf, returnType=return_type, 
> evalType=eval_type)
> 2278 else:
> -> 2279 return _create_udf(f=f, returnType=return_type, evalType=eval_type)
> 2280
> 2281
> /opt/miniconda/lib/python3.6/site-packages/pyspark/sql/udf.py in 
> _create_udf(f, returnType, evalType)
> 44
> 45 require_minimum_pyarrow_version()
> ---> 46 argspec = inspect.getargspec(f)
> 47
> 48 if evalType == PythonEvalType.SQL_SCALAR_PANDAS_UDF and len(argspec.args) 
> == 0 and \
> /opt/miniconda/lib/python3.6/inspect.py in getargspec(func)
> 1043 getfullargspec(func)
> 1044 if kwonlyargs or ann:
> -> 1045 raise ValueError("Function has keyword-only parameters or annotations"
> 1046 ", use getfullargspec() API which can support them")
> 1047 return ArgSpec(args, varargs, varkw, defaults)
> ValueError: Function has keyword-only parameters or annotations, use 
> getfullargspec() API which can support them
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23569) pandas_udf does not work with type-annotated python functions

2018-03-02 Thread Stu (Michael Stewart) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stu (Michael Stewart) updated SPARK-23569:
--
Description: 
When invoked against a type annotated function pandas_udf raises:

`ValueError: Function has keyword-only parameters or annotations, use 
getfullargspec() API which can support them`

 

the deprecated `getargsspec` call occurs in `pyspark/sql/udf.py`
{code:java}
def _create_udf(f, returnType, evalType):

if evalType in (PythonEvalType.SQL_SCALAR_PANDAS_UDF,
PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF):
import inspect
from pyspark.sql.utils import require_minimum_pyarrow_version

require_minimum_pyarrow_version()
argspec = inspect.getargspec(f)

...{code}
To reproduce: 
{code:java}
from pyspark.sql import SparkSession

from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit

spark = SparkSession.builder.getOrCreate()

df = spark.range(12).withColumn('b', col('id') * 2)

def ok(a,b): return a*b

df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b')).show()  # 
no problems

import pandas as pd

def ok(a: pd.Series,b: pd.Series) -> pd.Series: return a*b

df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b'))

 

---
ValueError Traceback (most recent call last)
 in ()
> 1 df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b'))

/opt/miniconda/lib/python3.6/site-packages/pyspark/sql/functions.py in 
pandas_udf(f, returnType, functionType)
2277 return functools.partial(_create_udf, returnType=return_type, 
evalType=eval_type)
2278 else:
-> 2279 return _create_udf(f=f, returnType=return_type, evalType=eval_type)
2280
2281

/opt/miniconda/lib/python3.6/site-packages/pyspark/sql/udf.py in _create_udf(f, 
returnType, evalType)
44
45 require_minimum_pyarrow_version()
---> 46 argspec = inspect.getargspec(f)
47
48 if evalType == PythonEvalType.SQL_SCALAR_PANDAS_UDF and len(argspec.args) == 
0 and \

/opt/miniconda/lib/python3.6/inspect.py in getargspec(func)
1043 getfullargspec(func)
1044 if kwonlyargs or ann:
-> 1045 raise ValueError("Function has keyword-only parameters or annotations"
1046 ", use getfullargspec() API which can support them")
1047 return ArgSpec(args, varargs, varkw, defaults)

ValueError: Function has keyword-only parameters or annotations, use 
getfullargspec() API which can support them
{code}

  was:
When invoked against a type annotated function pandas_udf raises:

`ValueError: Function has keyword-only parameters or annotations, use 
getfullargspec() API which can support them`

 

To reproduce: 
{code:java}
from pyspark.sql import SparkSession

from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit

spark = SparkSession.builder.getOrCreate()

df = spark.range(12).withColumn('b', col('id') * 2)

def ok(a,b): return a*b

df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b')).show()  # 
no problems

import pandas as pd

def ok(a: pd.Series,b: pd.Series) -> pd.Series: return a*b

df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b'))

 

---
ValueError Traceback (most recent call last)
 in ()
> 1 df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b'))

/opt/miniconda/lib/python3.6/site-packages/pyspark/sql/functions.py in 
pandas_udf(f, returnType, functionType)
2277 return functools.partial(_create_udf, returnType=return_type, 
evalType=eval_type)
2278 else:
-> 2279 return _create_udf(f=f, returnType=return_type, evalType=eval_type)
2280
2281

/opt/miniconda/lib/python3.6/site-packages/pyspark/sql/udf.py in _create_udf(f, 
returnType, evalType)
44
45 require_minimum_pyarrow_version()
---> 46 argspec = inspect.getargspec(f)
47
48 if evalType == PythonEvalType.SQL_SCALAR_PANDAS_UDF and len(argspec.args) == 
0 and \

/opt/miniconda/lib/python3.6/inspect.py in getargspec(func)
1043 getfullargspec(func)
1044 if kwonlyargs or ann:
-> 1045 raise ValueError("Function has keyword-only parameters or annotations"
1046 ", use getfullargspec() API which can support them")
1047 return ArgSpec(args, varargs, varkw, defaults)

ValueError: Function has keyword-only parameters or annotations, use 
getfullargspec() API which can support them

{code}


> pandas_udf does not work with type-annotated python functions
> -
>
> Key: SPARK-23569
> URL: https://issues.apache.org/jira/browse/SPARK-23569
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: python 3.6 | pyspark 2.3.0 | Using Scala version 2.11.8, 
> OpenJDK 64-Bit Server VM, 1.8.0_141 | Revision 
> 

[jira] [Updated] (SPARK-23569) pandas_udf does not work with type-annotated python functions

2018-03-02 Thread Stu (Michael Stewart) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stu (Michael Stewart) updated SPARK-23569:
--
Description: 
When invoked against a type annotated function pandas_udf raises:

`ValueError: Function has keyword-only parameters or annotations, use 
getfullargspec() API which can support them`

 

To reproduce:

 
{code:java}
from pyspark.sql import SparkSession

from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit

spark = SparkSession.builder.getOrCreate()

df = spark.range(12).withColumn('b', col('id') * 2)

def ok(a,b): return a*b

df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b')).show()  # 
no problems

import pandas as pd

def ok(a: pd.Series,b: pd.Series) -> pd.Series: return a*b

df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b'))

 

---
ValueError Traceback (most recent call last)
 in ()
> 1 df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b'))

/opt/miniconda/lib/python3.6/site-packages/pyspark/sql/functions.py in 
pandas_udf(f, returnType, functionType)
2277 return functools.partial(_create_udf, returnType=return_type, 
evalType=eval_type)
2278 else:
-> 2279 return _create_udf(f=f, returnType=return_type, evalType=eval_type)
2280
2281

/opt/miniconda/lib/python3.6/site-packages/pyspark/sql/udf.py in _create_udf(f, 
returnType, evalType)
44
45 require_minimum_pyarrow_version()
---> 46 argspec = inspect.getargspec(f)
47
48 if evalType == PythonEvalType.SQL_SCALAR_PANDAS_UDF and len(argspec.args) == 
0 and \

/opt/miniconda/lib/python3.6/inspect.py in getargspec(func)
1043 getfullargspec(func)
1044 if kwonlyargs or ann:
-> 1045 raise ValueError("Function has keyword-only parameters or annotations"
1046 ", use getfullargspec() API which can support them")
1047 return ArgSpec(args, varargs, varkw, defaults)

ValueError: Function has keyword-only parameters or annotations, use 
getfullargspec() API which can support them

{code}

  was:
When invoked against a type annotated function pandas_udf raises:

`ValueError: Function has keyword-only parameters or annotations, use 
getfullargspec() API which can support them`

 

To reproduce:

```

from pyspark.sql import SparkSession

from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit

spark = SparkSession.builder.getOrCreate()

df = spark.range(12).withColumn('b', col('id') * 2)

def ok(a,b): return a*b

df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b')).show()  # 
no problems



import pandas as pd

def ok(a: pd.Series,b: pd.Series) -> pd.Series: return a*b

df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b'))

 

---
ValueError Traceback (most recent call last)
 in ()
> 1 df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b'))

/opt/miniconda/lib/python3.6/site-packages/pyspark/sql/functions.py in 
pandas_udf(f, returnType, functionType)
 2277 return functools.partial(_create_udf, returnType=return_type, 
evalType=eval_type)
 2278 else:
-> 2279 return _create_udf(f=f, returnType=return_type, evalType=eval_type)
 2280
 2281

/opt/miniconda/lib/python3.6/site-packages/pyspark/sql/udf.py in _create_udf(f, 
returnType, evalType)
 44
 45 require_minimum_pyarrow_version()
---> 46 argspec = inspect.getargspec(f)
 47
 48 if evalType == PythonEvalType.SQL_SCALAR_PANDAS_UDF and len(argspec.args) 
== 0 and \

/opt/miniconda/lib/python3.6/inspect.py in getargspec(func)
 1043 getfullargspec(func)
 1044 if kwonlyargs or ann:
-> 1045 raise ValueError("Function has keyword-only parameters or annotations"
 1046 ", use getfullargspec() API which can support them")
 1047 return ArgSpec(args, varargs, varkw, defaults)

ValueError: Function has keyword-only parameters or annotations, use 
getfullargspec() API which can support them

```

 


> pandas_udf does not work with type-annotated python functions
> -
>
> Key: SPARK-23569
> URL: https://issues.apache.org/jira/browse/SPARK-23569
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: python 3.6 | pyspark 2.3.0 | Using Scala version 2.11.8, 
> OpenJDK 64-Bit Server VM, 1.8.0_141 | Revision 
> a0d7949896e70f427e7f3942ff340c9484ff0aab
>Reporter: Stu (Michael Stewart)
>Priority: Major
>
> When invoked against a type annotated function pandas_udf raises:
> `ValueError: Function has keyword-only parameters or annotations, use 
> getfullargspec() API which can support them`
>  
> To reproduce:
>  
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit
> spark = 

[jira] [Updated] (SPARK-23569) pandas_udf does not work with type-annotated python functions

2018-03-02 Thread Stu (Michael Stewart) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stu (Michael Stewart) updated SPARK-23569:
--
Description: 
When invoked against a type annotated function pandas_udf raises:

`ValueError: Function has keyword-only parameters or annotations, use 
getfullargspec() API which can support them`

 

To reproduce: 
{code:java}
from pyspark.sql import SparkSession

from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit

spark = SparkSession.builder.getOrCreate()

df = spark.range(12).withColumn('b', col('id') * 2)

def ok(a,b): return a*b

df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b')).show()  # 
no problems

import pandas as pd

def ok(a: pd.Series,b: pd.Series) -> pd.Series: return a*b

df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b'))

 

---
ValueError Traceback (most recent call last)
 in ()
> 1 df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b'))

/opt/miniconda/lib/python3.6/site-packages/pyspark/sql/functions.py in 
pandas_udf(f, returnType, functionType)
2277 return functools.partial(_create_udf, returnType=return_type, 
evalType=eval_type)
2278 else:
-> 2279 return _create_udf(f=f, returnType=return_type, evalType=eval_type)
2280
2281

/opt/miniconda/lib/python3.6/site-packages/pyspark/sql/udf.py in _create_udf(f, 
returnType, evalType)
44
45 require_minimum_pyarrow_version()
---> 46 argspec = inspect.getargspec(f)
47
48 if evalType == PythonEvalType.SQL_SCALAR_PANDAS_UDF and len(argspec.args) == 
0 and \

/opt/miniconda/lib/python3.6/inspect.py in getargspec(func)
1043 getfullargspec(func)
1044 if kwonlyargs or ann:
-> 1045 raise ValueError("Function has keyword-only parameters or annotations"
1046 ", use getfullargspec() API which can support them")
1047 return ArgSpec(args, varargs, varkw, defaults)

ValueError: Function has keyword-only parameters or annotations, use 
getfullargspec() API which can support them

{code}

  was:
When invoked against a type-annotated function, pandas_udf raises:

`ValueError: Function has keyword-only parameters or annotations, use 
getfullargspec() API which can support them`

 

To reproduce:

 
{code:java}
from pyspark.sql import SparkSession

from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit

spark = SparkSession.builder.getOrCreate()

df = spark.range(12).withColumn('b', col('id') * 2)

def ok(a,b): return a*b

df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b')).show()  # 
no problems

import pandas as pd

def ok(a: pd.Series,b: pd.Series) -> pd.Series: return a*b

df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b'))

 

---
ValueError Traceback (most recent call last)
 in ()
> 1 df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b'))

/opt/miniconda/lib/python3.6/site-packages/pyspark/sql/functions.py in 
pandas_udf(f, returnType, functionType)
2277 return functools.partial(_create_udf, returnType=return_type, 
evalType=eval_type)
2278 else:
-> 2279 return _create_udf(f=f, returnType=return_type, evalType=eval_type)
2280
2281

/opt/miniconda/lib/python3.6/site-packages/pyspark/sql/udf.py in _create_udf(f, 
returnType, evalType)
44
45 require_minimum_pyarrow_version()
---> 46 argspec = inspect.getargspec(f)
47
48 if evalType == PythonEvalType.SQL_SCALAR_PANDAS_UDF and len(argspec.args) == 
0 and \

/opt/miniconda/lib/python3.6/inspect.py in getargspec(func)
1043 getfullargspec(func)
1044 if kwonlyargs or ann:
-> 1045 raise ValueError("Function has keyword-only parameters or annotations"
1046 ", use getfullargspec() API which can support them")
1047 return ArgSpec(args, varargs, varkw, defaults)

ValueError: Function has keyword-only parameters or annotations, use 
getfullargspec() API which can support them

{code}


> pandas_udf does not work with type-annotated python functions
> -
>
> Key: SPARK-23569
> URL: https://issues.apache.org/jira/browse/SPARK-23569
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: python 3.6 | pyspark 2.3.0 | Using Scala version 2.11.8, 
> OpenJDK 64-Bit Server VM, 1.8.0_141 | Revision 
> a0d7949896e70f427e7f3942ff340c9484ff0aab
>Reporter: Stu (Michael Stewart)
>Priority: Major
>
> When invoked against a type-annotated function, pandas_udf raises:
> `ValueError: Function has keyword-only parameters or annotations, use 
> getfullargspec() API which can support them`
>  
> To reproduce: 
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit
> spark = 

[jira] [Updated] (SPARK-23569) pandas_udf does not work with type-annotated python functions

2018-03-02 Thread Stu (Michael Stewart) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stu (Michael Stewart) updated SPARK-23569:
--
Environment: python 3.6 | pyspark 2.3.0 | Using Scala version 2.11.8, 
OpenJDK 64-Bit Server VM, 1.8.0_141 | Revision 
a0d7949896e70f427e7f3942ff340c9484ff0aab  (was: pyspark 2.3.0 | Using Scala 
version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_141 | Revision 
a0d7949896e70f427e7f3942ff340c9484ff0aab)

> pandas_udf does not work with type-annotated python functions
> -
>
> Key: SPARK-23569
> URL: https://issues.apache.org/jira/browse/SPARK-23569
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: python 3.6 | pyspark 2.3.0 | Using Scala version 2.11.8, 
> OpenJDK 64-Bit Server VM, 1.8.0_141 | Revision 
> a0d7949896e70f427e7f3942ff340c9484ff0aab
>Reporter: Stu (Michael Stewart)
>Priority: Major
>
> When invoked against a type-annotated function, pandas_udf raises:
> `ValueError: Function has keyword-only parameters or annotations, use 
> getfullargspec() API which can support them`
>  
> To reproduce:
> ```
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit
> spark = SparkSession.builder.getOrCreate()
> df = spark.range(12).withColumn('b', col('id') * 2)
> def ok(a,b): return a*b
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b')).show()  
> # no problems
> import pandas as pd
> def ok(a: pd.Series,b: pd.Series) -> pd.Series: return a*b
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b'))
>  
> ---
> ValueError Traceback (most recent call last)
>  in ()
> > 1 df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b'))
> /opt/miniconda/lib/python3.6/site-packages/pyspark/sql/functions.py in 
> pandas_udf(f, returnType, functionType)
>  2277 return functools.partial(_create_udf, returnType=return_type, 
> evalType=eval_type)
>  2278 else:
> -> 2279 return _create_udf(f=f, returnType=return_type, evalType=eval_type)
>  2280
>  2281
> /opt/miniconda/lib/python3.6/site-packages/pyspark/sql/udf.py in 
> _create_udf(f, returnType, evalType)
>  44
>  45 require_minimum_pyarrow_version()
> ---> 46 argspec = inspect.getargspec(f)
>  47
>  48 if evalType == PythonEvalType.SQL_SCALAR_PANDAS_UDF and len(argspec.args) 
> == 0 and \
> /opt/miniconda/lib/python3.6/inspect.py in getargspec(func)
>  1043 getfullargspec(func)
>  1044 if kwonlyargs or ann:
> -> 1045 raise ValueError("Function has keyword-only parameters or annotations"
>  1046 ", use getfullargspec() API which can support them")
>  1047 return ArgSpec(args, varargs, varkw, defaults)
> ValueError: Function has keyword-only parameters or annotations, use 
> getfullargspec() API which can support them
> ```
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
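
Until the argspec inspection in pyspark/sql/udf.py tolerates annotations, the reproduction above suggests a user-side workaround. A sketch only, reusing the `ok` function and 'bigint' return type from the report, and assuming it is acceptable to present pandas_udf with an annotation-free function object:

{code:java}
import pandas as pd
from pyspark.sql.functions import pandas_udf

def ok(a: pd.Series, b: pd.Series) -> pd.Series:
    return a * b

# Option 1: hand pandas_udf an un-annotated shim, leaving `ok` untouched.
def ok_plain(a, b):
    return ok(a, b)

ok_udf = pandas_udf(f=ok_plain, returnType='bigint')

# Option 2: strip the annotations from the function object itself so that
# inspect.getargspec no longer raises (this mutates `ok` in place).
ok.__annotations__ = {}
ok_udf_2 = pandas_udf(f=ok, returnType='bigint')
{code}

Either UDF can then be applied as in the report, e.g. df.withColumn('ok', ok_udf('id', 'b')).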



[jira] [Updated] (SPARK-23569) pandas_udf does not work with type-annotated python functions

2018-03-02 Thread Stu (Michael Stewart) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stu (Michael Stewart) updated SPARK-23569:
--
Environment: pyspark 2.3.0 | Using Scala version 2.11.8, OpenJDK 64-Bit 
Server VM, 1.8.0_141 | Revision a0d7949896e70f427e7f3942ff340c9484ff0aab  (was: 
pyspark 2.3.0

 

```

pyspark --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/

Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_141
Branch master
Compiled by user sameera on 2018-02-22T19:24:29Z
Revision a0d7949896e70f427e7f3942ff340c9484ff0aab
Url g...@github.com:sameeragarwal/spark.git
Type --help for more information.

```)

> pandas_udf does not work with type-annotated python functions
> -
>
> Key: SPARK-23569
> URL: https://issues.apache.org/jira/browse/SPARK-23569
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: pyspark 2.3.0 | Using Scala version 2.11.8, OpenJDK 
> 64-Bit Server VM, 1.8.0_141 | Revision 
> a0d7949896e70f427e7f3942ff340c9484ff0aab
>Reporter: Stu (Michael Stewart)
>Priority: Major
>
> When invoked against a type-annotated function, pandas_udf raises:
> `ValueError: Function has keyword-only parameters or annotations, use 
> getfullargspec() API which can support them`
>  
> To reproduce:
> ```
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit
> spark = SparkSession.builder.getOrCreate()
> df = spark.range(12).withColumn('b', col('id') * 2)
> def ok(a,b): return a*b
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b')).show()  
> # no problems
> import pandas as pd
> def ok(a: pd.Series,b: pd.Series) -> pd.Series: return a*b
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b'))
>  
> ---
> ValueError Traceback (most recent call last)
>  in ()
> > 1 df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b'))
> /opt/miniconda/lib/python3.6/site-packages/pyspark/sql/functions.py in 
> pandas_udf(f, returnType, functionType)
>  2277 return functools.partial(_create_udf, returnType=return_type, 
> evalType=eval_type)
>  2278 else:
> -> 2279 return _create_udf(f=f, returnType=return_type, evalType=eval_type)
>  2280
>  2281
> /opt/miniconda/lib/python3.6/site-packages/pyspark/sql/udf.py in 
> _create_udf(f, returnType, evalType)
>  44
>  45 require_minimum_pyarrow_version()
> ---> 46 argspec = inspect.getargspec(f)
>  47
>  48 if evalType == PythonEvalType.SQL_SCALAR_PANDAS_UDF and len(argspec.args) 
> == 0 and \
> /opt/miniconda/lib/python3.6/inspect.py in getargspec(func)
>  1043 getfullargspec(func)
>  1044 if kwonlyargs or ann:
> -> 1045 raise ValueError("Function has keyword-only parameters or annotations"
>  1046 ", use getfullargspec() API which can support them")
>  1047 return ArgSpec(args, varargs, varkw, defaults)
> ValueError: Function has keyword-only parameters or annotations, use 
> getfullargspec() API which can support them
> ```
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23569) pandas_udf does not work with type-annotated python functions

2018-03-02 Thread Stu (Michael Stewart) (JIRA)
Stu (Michael Stewart) created SPARK-23569:
-

 Summary: pandas_udf does not work with type-annotated python 
functions
 Key: SPARK-23569
 URL: https://issues.apache.org/jira/browse/SPARK-23569
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.0
 Environment: pyspark 2.3.0

 

```

pyspark --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/

Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_141
Branch master
Compiled by user sameera on 2018-02-22T19:24:29Z
Revision a0d7949896e70f427e7f3942ff340c9484ff0aab
Url g...@github.com:sameeragarwal/spark.git
Type --help for more information.

```
Reporter: Stu (Michael Stewart)


When invoked against a type-annotated function, pandas_udf raises:

`ValueError: Function has keyword-only parameters or annotations, use 
getfullargspec() API which can support them`

 

To reproduce:

```

from pyspark.sql import SparkSession

from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit

spark = SparkSession.builder.getOrCreate()

df = spark.range(12).withColumn('b', col('id') * 2)

def ok(a,b): return a*b

df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b')).show()  # 
no problems



import pandas as pd

def ok(a: pd.Series,b: pd.Series) -> pd.Series: return a*b

df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b'))

 

---
ValueError Traceback (most recent call last)
 in ()
> 1 df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b'))

/opt/miniconda/lib/python3.6/site-packages/pyspark/sql/functions.py in 
pandas_udf(f, returnType, functionType)
 2277 return functools.partial(_create_udf, returnType=return_type, 
evalType=eval_type)
 2278 else:
-> 2279 return _create_udf(f=f, returnType=return_type, evalType=eval_type)
 2280
 2281

/opt/miniconda/lib/python3.6/site-packages/pyspark/sql/udf.py in _create_udf(f, 
returnType, evalType)
 44
 45 require_minimum_pyarrow_version()
---> 46 argspec = inspect.getargspec(f)
 47
 48 if evalType == PythonEvalType.SQL_SCALAR_PANDAS_UDF and len(argspec.args) 
== 0 and \

/opt/miniconda/lib/python3.6/inspect.py in getargspec(func)
 1043 getfullargspec(func)
 1044 if kwonlyargs or ann:
-> 1045 raise ValueError("Function has keyword-only parameters or annotations"
 1046 ", use getfullargspec() API which can support them")
 1047 return ArgSpec(args, varargs, varkw, defaults)

ValueError: Function has keyword-only parameters or annotations, use 
getfullargspec() API which can support them

```

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
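
For completeness: the code at pyspark/sql/udf.py line 46 only consumes argspec.args (and, for some eval types, argspec.varargs), so one possible direction for a library-side fix (a sketch only, assuming Python 3; not necessarily the change that eventually landed in Spark) is to fall back to getfullargspec when getargspec rejects the function:

{code:java}
import inspect

def _get_argspec(f):
    """Return an argspec-like object for f, tolerating annotated functions.

    Sketch only: getargspec raises ValueError for functions with annotations
    or keyword-only parameters, while getfullargspec exposes the same .args
    and .varargs fields that the UDF machinery inspects.
    """
    try:
        return inspect.getargspec(f)
    except ValueError:
        return inspect.getfullargspec(f)
{code}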