[jira] [Assigned] (SPARK-23578) Add multicolumn support for Binarizer

2018-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23578:


Assignee: (was: Apache Spark)

> Add multicolumn support for Binarizer
> -
>
> Key: SPARK-23578
> URL: https://issues.apache.org/jira/browse/SPARK-23578
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Teng Peng
>Priority: Minor
>
> [SPARK-20542] added an API to Bucketizer that can bin multiple columns. 
> Based on this change, multicolumn support is added for Binarizer.






[jira] [Commented] (SPARK-23578) Add multicolumn support for Binarizer

2018-03-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16384980#comment-16384980
 ] 

Apache Spark commented on SPARK-23578:
--

User 'tengpeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/20729

> Add multicolumn support for Binarizer
> -
>
> Key: SPARK-23578
> URL: https://issues.apache.org/jira/browse/SPARK-23578
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Teng Peng
>Priority: Minor
>
> [SPARK-20542] added an API to Bucketizer that can bin multiple columns. 
> Based on this change, multicolumn support is added for Binarizer.






[jira] [Assigned] (SPARK-23578) Add multicolumn support for Binarizer

2018-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23578:


Assignee: Apache Spark

> Add multicolumn support for Binarizer
> -
>
> Key: SPARK-23578
> URL: https://issues.apache.org/jira/browse/SPARK-23578
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Teng Peng
>Assignee: Apache Spark
>Priority: Minor
>
> [SPARK-20542] added an API to Bucketizer that can bin multiple columns. 
> Based on this change, multicolumn support is added for Binarizer.






[jira] [Created] (SPARK-23578) Add multicolumn support for Binarizer

2018-03-03 Thread Teng Peng (JIRA)
Teng Peng created SPARK-23578:
-

 Summary: Add multicolumn support for Binarizer
 Key: SPARK-23578
 URL: https://issues.apache.org/jira/browse/SPARK-23578
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.3.0
Reporter: Teng Peng


[SPARK-20542] added an API to Bucketizer that can bin multiple columns. Based 
on this change, multicolumn support is added for Binarizer.
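
A minimal sketch of what the multicolumn API could look like in PySpark, modeled 
on Bucketizer's multi-column change from SPARK-20542; the inputCols, outputCols, 
and thresholds parameter names are assumptions about the proposed API, not a 
confirmed signature:

{code:python}
# Hypothetical multi-column Binarizer usage, mirroring Bucketizer's
# multi-column API; the parameter names here are assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.feature import Binarizer

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(0.1, 12.0), (0.8, 3.0)], ["f1", "f2"])

binarizer = Binarizer(inputCols=["f1", "f2"],
                      outputCols=["f1_bin", "f2_bin"],
                      thresholds=[0.5, 10.0])  # one threshold per input column
binarizer.transform(df).show()
{code}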






[jira] [Assigned] (SPARK-3159) Check for reducible DecisionTree

2018-03-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-3159:


Assignee: Alessandro Solimando

> Check for reducible DecisionTree
> 
>
> Key: SPARK-3159
> URL: https://issues.apache.org/jira/browse/SPARK-3159
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Assignee: Alessandro Solimando
>Priority: Minor
> Fix For: 2.4.0
>
>
> Improvement: test-time computation
> Currently, pairs of leaf nodes with the same parent can both output the same 
> prediction.  This happens since the splitting criterion (e.g., Gini) is not 
> the same as prediction accuracy/MSE; the splitting criterion can sometimes be 
> improved even when both children would still output the same prediction 
> (e.g., based on the majority label for classification).
> We could check the tree and reduce it if possible after training.
> Note: This happens with scikit-learn as well.
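
As a rough illustration of the reduction described above, a minimal sketch on a 
hypothetical tree node type (not Spark's internal Node class), assuming sibling 
leaves with identical predictions can simply be merged after training:

{code:python}
# Minimal post-pruning sketch on a hypothetical Node type; this only
# illustrates the idea and is not Spark's internal tree representation.
class Node:
    def __init__(self, prediction=None, left=None, right=None):
        self.prediction = prediction  # set on leaves
        self.left = left
        self.right = right

    def is_leaf(self):
        return self.left is None and self.right is None

def prune(node):
    """Collapse subtrees whose two children are leaves with the same prediction."""
    if node is None or node.is_leaf():
        return node
    node.left = prune(node.left)
    node.right = prune(node.right)
    if (node.left.is_leaf() and node.right.is_leaf()
            and node.left.prediction == node.right.prediction):
        return Node(prediction=node.left.prediction)  # reducible split: merge
    return node
{code}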






[jira] [Assigned] (SPARK-23569) pandas_udf does not work with type-annotated python functions

2018-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23569:


Assignee: Apache Spark

> pandas_udf does not work with type-annotated python functions
> -
>
> Key: SPARK-23569
> URL: https://issues.apache.org/jira/browse/SPARK-23569
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: python 3.6 | pyspark 2.3.0 | Using Scala version 2.11.8, 
> OpenJDK 64-Bit Server VM, 1.8.0_141 | Revision 
> a0d7949896e70f427e7f3942ff340c9484ff0aab
>Reporter: Stu (Michael Stewart)
>Assignee: Apache Spark
>Priority: Major
>
> When invoked against a type-annotated function, pandas_udf raises:
> `ValueError: Function has keyword-only parameters or annotations, use 
> getfullargspec() API which can support them`
>  
> The deprecated `getargspec` call occurs in `pyspark/sql/udf.py`:
> {code:java}
> def _create_udf(f, returnType, evalType):
>     if evalType in (PythonEvalType.SQL_SCALAR_PANDAS_UDF,
>                     PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF):
>         import inspect
>         from pyspark.sql.utils import require_minimum_pyarrow_version
>         require_minimum_pyarrow_version()
>         argspec = inspect.getargspec(f)
>         ...{code}
> To reproduce: 
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit
> spark = SparkSession.builder.getOrCreate()
> df = spark.range(12).withColumn('b', col('id') * 2)
> def ok(a,b): return a*b
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b')).show()  
> # no problems
> import pandas as pd
> def ok(a: pd.Series,b: pd.Series) -> pd.Series: return a*b
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b'))
>  
> ---
> ValueError Traceback (most recent call last)
>  in ()
> > 1 df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b'))
> /opt/miniconda/lib/python3.6/site-packages/pyspark/sql/functions.py in 
> pandas_udf(f, returnType, functionType)
> 2277 return functools.partial(_create_udf, returnType=return_type, 
> evalType=eval_type)
> 2278 else:
> -> 2279 return _create_udf(f=f, returnType=return_type, evalType=eval_type)
> 2280
> 2281
> /opt/miniconda/lib/python3.6/site-packages/pyspark/sql/udf.py in 
> _create_udf(f, returnType, evalType)
> 44
> 45 require_minimum_pyarrow_version()
> ---> 46 argspec = inspect.getargspec(f)
> 47
> 48 if evalType == PythonEvalType.SQL_SCALAR_PANDAS_UDF and len(argspec.args) 
> == 0 and \
> /opt/miniconda/lib/python3.6/inspect.py in getargspec(func)
> 1043 getfullargspec(func)
> 1044 if kwonlyargs or ann:
> -> 1045 raise ValueError("Function has keyword-only parameters or annotations"
> 1046 ", use getfullargspec() API which can support them")
> 1047 return ArgSpec(args, varargs, varkw, defaults)
> ValueError: Function has keyword-only parameters or annotations, use 
> getfullargspec() API which can support them
> {code}






[jira] [Assigned] (SPARK-23569) pandas_udf does not work with type-annotated python functions

2018-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23569:


Assignee: (was: Apache Spark)

> pandas_udf does not work with type-annotated python functions
> -
>
> Key: SPARK-23569
> URL: https://issues.apache.org/jira/browse/SPARK-23569
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: python 3.6 | pyspark 2.3.0 | Using Scala version 2.11.8, 
> OpenJDK 64-Bit Server VM, 1.8.0_141 | Revision 
> a0d7949896e70f427e7f3942ff340c9484ff0aab
>Reporter: Stu (Michael Stewart)
>Priority: Major
>
> When invoked against a type-annotated function, pandas_udf raises:
> `ValueError: Function has keyword-only parameters or annotations, use 
> getfullargspec() API which can support them`
>  
> The deprecated `getargspec` call occurs in `pyspark/sql/udf.py`:
> {code:java}
> def _create_udf(f, returnType, evalType):
>     if evalType in (PythonEvalType.SQL_SCALAR_PANDAS_UDF,
>                     PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF):
>         import inspect
>         from pyspark.sql.utils import require_minimum_pyarrow_version
>         require_minimum_pyarrow_version()
>         argspec = inspect.getargspec(f)
>         ...{code}
> To reproduce: 
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit
> spark = SparkSession.builder.getOrCreate()
> df = spark.range(12).withColumn('b', col('id') * 2)
> def ok(a,b): return a*b
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b')).show()  
> # no problems
> import pandas as pd
> def ok(a: pd.Series,b: pd.Series) -> pd.Series: return a*b
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b'))
>  
> ---
> ValueError Traceback (most recent call last)
>  in ()
> > 1 df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b'))
> /opt/miniconda/lib/python3.6/site-packages/pyspark/sql/functions.py in 
> pandas_udf(f, returnType, functionType)
> 2277 return functools.partial(_create_udf, returnType=return_type, 
> evalType=eval_type)
> 2278 else:
> -> 2279 return _create_udf(f=f, returnType=return_type, evalType=eval_type)
> 2280
> 2281
> /opt/miniconda/lib/python3.6/site-packages/pyspark/sql/udf.py in 
> _create_udf(f, returnType, evalType)
> 44
> 45 require_minimum_pyarrow_version()
> ---> 46 argspec = inspect.getargspec(f)
> 47
> 48 if evalType == PythonEvalType.SQL_SCALAR_PANDAS_UDF and len(argspec.args) 
> == 0 and \
> /opt/miniconda/lib/python3.6/inspect.py in getargspec(func)
> 1043 getfullargspec(func)
> 1044 if kwonlyargs or ann:
> -> 1045 raise ValueError("Function has keyword-only parameters or annotations"
> 1046 ", use getfullargspec() API which can support them")
> 1047 return ArgSpec(args, varargs, varkw, defaults)
> ValueError: Function has keyword-only parameters or annotations, use 
> getfullargspec() API which can support them
> {code}






[jira] [Commented] (SPARK-23569) pandas_udf does not work with type-annotated python functions

2018-03-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16384871#comment-16384871
 ] 

Apache Spark commented on SPARK-23569:
--

User 'mstewart141' has created a pull request for this issue:
https://github.com/apache/spark/pull/20728

> pandas_udf does not work with type-annotated python functions
> -
>
> Key: SPARK-23569
> URL: https://issues.apache.org/jira/browse/SPARK-23569
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: python 3.6 | pyspark 2.3.0 | Using Scala version 2.11.8, 
> OpenJDK 64-Bit Server VM, 1.8.0_141 | Revision 
> a0d7949896e70f427e7f3942ff340c9484ff0aab
>Reporter: Stu (Michael Stewart)
>Priority: Major
>
> When invoked against a type-annotated function, pandas_udf raises:
> `ValueError: Function has keyword-only parameters or annotations, use 
> getfullargspec() API which can support them`
>  
> The deprecated `getargspec` call occurs in `pyspark/sql/udf.py`:
> {code:java}
> def _create_udf(f, returnType, evalType):
>     if evalType in (PythonEvalType.SQL_SCALAR_PANDAS_UDF,
>                     PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF):
>         import inspect
>         from pyspark.sql.utils import require_minimum_pyarrow_version
>         require_minimum_pyarrow_version()
>         argspec = inspect.getargspec(f)
>         ...{code}
> To reproduce: 
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit
> spark = SparkSession.builder.getOrCreate()
> df = spark.range(12).withColumn('b', col('id') * 2)
> def ok(a,b): return a*b
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b')).show()  
> # no problems
> import pandas as pd
> def ok(a: pd.Series,b: pd.Series) -> pd.Series: return a*b
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b'))
>  
> ---
> ValueError Traceback (most recent call last)
>  in ()
> > 1 df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b'))
> /opt/miniconda/lib/python3.6/site-packages/pyspark/sql/functions.py in 
> pandas_udf(f, returnType, functionType)
> 2277 return functools.partial(_create_udf, returnType=return_type, 
> evalType=eval_type)
> 2278 else:
> -> 2279 return _create_udf(f=f, returnType=return_type, evalType=eval_type)
> 2280
> 2281
> /opt/miniconda/lib/python3.6/site-packages/pyspark/sql/udf.py in 
> _create_udf(f, returnType, evalType)
> 44
> 45 require_minimum_pyarrow_version()
> ---> 46 argspec = inspect.getargspec(f)
> 47
> 48 if evalType == PythonEvalType.SQL_SCALAR_PANDAS_UDF and len(argspec.args) 
> == 0 and \
> /opt/miniconda/lib/python3.6/inspect.py in getargspec(func)
> 1043 getfullargspec(func)
> 1044 if kwonlyargs or ann:
> -> 1045 raise ValueError("Function has keyword-only parameters or annotations"
> 1046 ", use getfullargspec() API which can support them")
> 1047 return ArgSpec(args, varargs, varkw, defaults)
> ValueError: Function has keyword-only parameters or annotations, use 
> getfullargspec() API which can support them
> {code}






[jira] [Commented] (SPARK-23569) pandas_udf does not work with type-annotated python functions

2018-03-03 Thread Stu (Michael Stewart) (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16384857#comment-16384857
 ] 

Stu (Michael Stewart) commented on SPARK-23569:
---

Yes, I'll give it a go.

> pandas_udf does not work with type-annotated python functions
> -
>
> Key: SPARK-23569
> URL: https://issues.apache.org/jira/browse/SPARK-23569
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: python 3.6 | pyspark 2.3.0 | Using Scala version 2.11.8, 
> OpenJDK 64-Bit Server VM, 1.8.0_141 | Revision 
> a0d7949896e70f427e7f3942ff340c9484ff0aab
>Reporter: Stu (Michael Stewart)
>Priority: Major
>
> When invoked against a type-annotated function, pandas_udf raises:
> `ValueError: Function has keyword-only parameters or annotations, use 
> getfullargspec() API which can support them`
>  
> The deprecated `getargspec` call occurs in `pyspark/sql/udf.py`:
> {code:java}
> def _create_udf(f, returnType, evalType):
>     if evalType in (PythonEvalType.SQL_SCALAR_PANDAS_UDF,
>                     PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF):
>         import inspect
>         from pyspark.sql.utils import require_minimum_pyarrow_version
>         require_minimum_pyarrow_version()
>         argspec = inspect.getargspec(f)
>         ...{code}
> To reproduce: 
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit
> spark = SparkSession.builder.getOrCreate()
> df = spark.range(12).withColumn('b', col('id') * 2)
> def ok(a,b): return a*b
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b')).show()  
> # no problems
> import pandas as pd
> def ok(a: pd.Series,b: pd.Series) -> pd.Series: return a*b
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b'))
>  
> ---
> ValueError Traceback (most recent call last)
>  in ()
> > 1 df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b'))
> /opt/miniconda/lib/python3.6/site-packages/pyspark/sql/functions.py in 
> pandas_udf(f, returnType, functionType)
> 2277 return functools.partial(_create_udf, returnType=return_type, 
> evalType=eval_type)
> 2278 else:
> -> 2279 return _create_udf(f=f, returnType=return_type, evalType=eval_type)
> 2280
> 2281
> /opt/miniconda/lib/python3.6/site-packages/pyspark/sql/udf.py in 
> _create_udf(f, returnType, evalType)
> 44
> 45 require_minimum_pyarrow_version()
> ---> 46 argspec = inspect.getargspec(f)
> 47
> 48 if evalType == PythonEvalType.SQL_SCALAR_PANDAS_UDF and len(argspec.args) 
> == 0 and \
> /opt/miniconda/lib/python3.6/inspect.py in getargspec(func)
> 1043 getfullargspec(func)
> 1044 if kwonlyargs or ann:
> -> 1045 raise ValueError("Function has keyword-only parameters or annotations"
> 1046 ", use getfullargspec() API which can support them")
> 1047 return ArgSpec(args, varargs, varkw, defaults)
> ValueError: Function has keyword-only parameters or annotations, use 
> getfullargspec() API which can support them
> {code}






[jira] [Commented] (SPARK-2243) Support multiple SparkContexts in the same JVM

2018-03-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16384687#comment-16384687
 ] 

ASF GitHub Bot commented on SPARK-2243:
---

Github user asfgit closed the pull request at:

https://github.com/apache/predictionio/pull/441


> Support multiple SparkContexts in the same JVM
> --
>
> Key: SPARK-2243
> URL: https://issues.apache.org/jira/browse/SPARK-2243
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
>Affects Versions: 0.7.0, 1.0.0, 1.1.0
>Reporter: Miguel Angel Fernandez Diaz
>Priority: Major
>
> We're developing a platform where we create several Spark contexts for 
> carrying out different calculations. Is there any restriction when using 
> several Spark contexts? We have two contexts, one for Spark calculations and 
> another for Spark Streaming jobs. The following error arises when we first 
> execute a Spark calculation and then, once it has finished, launch a Spark 
> Streaming job:
> {code}
> 14/06/23 16:40:08 ERROR executor.Executor: Exception in task ID 0
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
>   at 
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
>   at 
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
> 14/06/23 16:40:08 WARN scheduler.TaskSetManager: Loss was due to 
> java.io.FileNotFoundException
> java.io.FileNotFoundException: http://172.19.0.215:47530/broadcast_0
>   at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1624)
>   at 
> org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:156)
>   at 
> org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:56)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at 

[jira] [Assigned] (SPARK-23577) Supports line separator for text datasource

2018-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23577:


Assignee: (was: Apache Spark)

> Supports line separator for text datasource
> ---
>
> Key: SPARK-23577
> URL: https://issues.apache.org/jira/browse/SPARK-23577
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Same as its parent but this JIRA is specific to text datasource.






[jira] [Assigned] (SPARK-23577) Supports line separator for text datasource

2018-03-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23577:


Assignee: Apache Spark

> Supports line separator for text datasource
> ---
>
> Key: SPARK-23577
> URL: https://issues.apache.org/jira/browse/SPARK-23577
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> Same as its parent but this JIRA is specific to text datasource.






[jira] [Commented] (SPARK-23577) Supports line separator for text datasource

2018-03-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16384643#comment-16384643
 ] 

Apache Spark commented on SPARK-23577:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/20727

> Supports line separator for text datasource
> ---
>
> Key: SPARK-23577
> URL: https://issues.apache.org/jira/browse/SPARK-23577
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Same as its parent but this JIRA is specific to text datasource.






[jira] [Updated] (SPARK-23577) Supports line separator for text datasource

2018-03-03 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-23577:
-
Description: Same as its parent but this JIRA is specific to text 
datasource.

> Supports line separator for text datasource
> ---
>
> Key: SPARK-23577
> URL: https://issues.apache.org/jira/browse/SPARK-23577
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Same as its parent but this JIRA is specific to text datasource.






[jira] [Updated] (SPARK-21289) Text based formats do not support custom end-of-line delimiters

2018-03-03 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-21289:
-
Affects Version/s: 2.3.0

> Text based formats do not support custom end-of-line delimiters
> ---
>
> Key: SPARK-21289
> URL: https://issues.apache.org/jira/browse/SPARK-21289
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.1.1, 2.3.0
>Reporter: Yevgen Galchenko
>Priority: Minor
>
> Spark CSV and text readers always use the default CR, LF, or CRLF line 
> terminators, without an option to configure a custom delimiter.
> The option "textinputformat.record.delimiter" is not used to set the delimiter 
> in HadoopFileLinesReader and can only be set for a Hadoop RDD when textFile() 
> is used to read the file.
> A possible solution would be to change HadoopFileLinesReader to create the 
> LineRecordReader with the delimiter specified in the configuration; 
> LineRecordReader already supports passing recordDelimiter in its constructor.
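
Until such an option exists on the readers, a commonly used workaround is to go 
through the Hadoop input format directly, where "textinputformat.record.delimiter" 
is honored; a minimal PySpark sketch, assuming sc is an active SparkContext and 
using an example delimiter value:

{code:python}
# Workaround sketch: split records on a custom delimiter via the Hadoop
# TextInputFormat, which does honor textinputformat.record.delimiter.
conf = {"textinputformat.record.delimiter": "\u0001"}  # example delimiter
rdd = sc.newAPIHadoopFile(
    "/path/to/data",
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf=conf).map(lambda kv: kv[1])  # keep only the record text
df = rdd.map(lambda line: (line,)).toDF(["value"])
{code}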






[jira] [Updated] (SPARK-21289) Text based formats do not support custom end-of-line delimiters

2018-03-03 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-21289:
-
Issue Type: Umbrella  (was: Improvement)

> Text based formats do not support custom end-of-line delimiters
> ---
>
> Key: SPARK-21289
> URL: https://issues.apache.org/jira/browse/SPARK-21289
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Yevgen Galchenko
>Priority: Minor
>
> Spark CSV and text readers always use the default CR, LF, or CRLF line 
> terminators, without an option to configure a custom delimiter.
> The option "textinputformat.record.delimiter" is not used to set the delimiter 
> in HadoopFileLinesReader and can only be set for a Hadoop RDD when textFile() 
> is used to read the file.
> A possible solution would be to change HadoopFileLinesReader to create the 
> LineRecordReader with the delimiter specified in the configuration; 
> LineRecordReader already supports passing recordDelimiter in its constructor.






[jira] [Created] (SPARK-23577) Supports line separator for text datasource

2018-03-03 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-23577:


 Summary: Supports line separator for text datasource
 Key: SPARK-23577
 URL: https://issues.apache.org/jira/browse/SPARK-23577
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Hyukjin Kwon









[jira] [Updated] (SPARK-21289) Text based formats do not support custom end-of-line delimiters

2018-03-03 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-21289:
-
Summary: Text based formats do not support custom end-of-line delimiters  
(was: Text and CSV formats do not support custom end-of-line delimiters)

> Text based formats do not support custom end-of-line delimiters
> ---
>
> Key: SPARK-21289
> URL: https://issues.apache.org/jira/browse/SPARK-21289
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Yevgen Galchenko
>Priority: Minor
>
> Spark CSV and text readers always use the default CR, LF, or CRLF line 
> terminators, without an option to configure a custom delimiter.
> The option "textinputformat.record.delimiter" is not used to set the delimiter 
> in HadoopFileLinesReader and can only be set for a Hadoop RDD when textFile() 
> is used to read the file.
> A possible solution would be to change HadoopFileLinesReader to create the 
> LineRecordReader with the delimiter specified in the configuration; 
> LineRecordReader already supports passing recordDelimiter in its constructor.






[jira] [Commented] (SPARK-21289) Text and CSV formats do not support custom end-of-line delimiters

2018-03-03 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16384633#comment-16384633
 ] 

Hyukjin Kwon commented on SPARK-21289:
--

I am going to make this an umbrella to split the PR up. 

> Text and CSV formats do not support custom end-of-line delimiters
> -
>
> Key: SPARK-21289
> URL: https://issues.apache.org/jira/browse/SPARK-21289
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Yevgen Galchenko
>Priority: Minor
>
> Spark CSV and text readers always use the default CR, LF, or CRLF line 
> terminators, without an option to configure a custom delimiter.
> The option "textinputformat.record.delimiter" is not used to set the delimiter 
> in HadoopFileLinesReader and can only be set for a Hadoop RDD when textFile() 
> is used to read the file.
> A possible solution would be to change HadoopFileLinesReader to create the 
> LineRecordReader with the delimiter specified in the configuration; 
> LineRecordReader already supports passing recordDelimiter in its constructor.






[jira] [Commented] (SPARK-22446) Optimizer causing StringIndexerModel's indexer UDF to throw "Unseen label" exception incorrectly for filtered data.

2018-03-03 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16384618#comment-16384618
 ] 

Liang-Chi Hsieh commented on SPARK-22446:
-

This fix uses a new API, {{asNondeterministic}}, of {{UserDefinedFunction}}. 
{{asNondeterministic}} was added in 2.3.0. If we want to backport this fix, 
we need to backport the API too. It is not hard, but it involves SQL code. 
Should we backport it because of this fix?
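
For reference, a minimal PySpark sketch of the {{asNondeterministic}} API 
mentioned above, applied to a stand-in for the indexer UDF; this illustrates the 
API only, not the actual change made inside StringIndexerModel:

{code:python}
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# Stand-in for StringIndexerModel's internal indexer UDF.
label_to_index = {"London": 0.0, "New York": 1.0}

def index_label(label):
    if label not in label_to_index:
        raise ValueError("Unseen label: %s." % label)
    return label_to_index[label]

# Marking the UDF as nondeterministic keeps the optimizer from re-ordering
# filters around it, which is what triggers the spurious "Unseen label" error.
indexer = udf(index_label, DoubleType()).asNondeterministic()
{code}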

> Optimizer causing StringIndexerModel's indexer UDF to throw "Unseen label" 
> exception incorrectly for filtered data.
> ---
>
> Key: SPARK-22446
> URL: https://issues.apache.org/jira/browse/SPARK-22446
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Optimizer
>Affects Versions: 2.0.2, 2.1.2, 2.2.1
> Environment: spark-shell, local mode, macOS Sierra 10.12.6
>Reporter: Greg Bellchambers
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.3.0
>
>
> In the following, the `indexer` UDF defined inside the 
> `org.apache.spark.ml.feature.StringIndexerModel.transform` method throws an 
> "Unseen label" error, despite the label not being present in the transformed 
> DataFrame.
> Here is the definition of the indexer UDF in the transform method:
> {code:java}
> val indexer = udf { label: String =>
>   if (labelToIndex.contains(label)) {
>     labelToIndex(label)
>   } else {
>     throw new SparkException(s"Unseen label: $label.")
>   }
> }
> {code}
> We can demonstrate the error with a very simple example DataFrame.
> {code:java}
> scala> import org.apache.spark.ml.feature.StringIndexer
> import org.apache.spark.ml.feature.StringIndexer
> scala> // first we create a DataFrame with three cities
> scala> val df = List(
>  | ("A", "London", "StrA"),
>  | ("B", "Bristol", null),
>  | ("C", "New York", "StrC")
>  | ).toDF("ID", "CITY", "CONTENT")
> df: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 1 more 
> field]
> scala> df.show
> +---+--------+-------+
> | ID|    CITY|CONTENT|
> +---+--------+-------+
> |  A|  London|   StrA|
> |  B| Bristol|   null|
> |  C|New York|   StrC|
> +---+--------+-------+
> scala> // then we remove the row with null in CONTENT column, which removes 
> Bristol
> scala> val dfNoBristol = df.filter($"CONTENT".isNotNull)
> dfNoBristol: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [ID: 
> string, CITY: string ... 1 more field]
> scala> dfNoBristol.show
> +---+--------+-------+
> | ID|    CITY|CONTENT|
> +---+--------+-------+
> |  A|  London|   StrA|
> |  C|New York|   StrC|
> +---+--------+-------+
> scala> // now create a StringIndexer for the CITY column and fit to 
> dfNoBristol
> scala> val model = {
>  | new StringIndexer()
>  | .setInputCol("CITY")
>  | .setOutputCol("CITYIndexed")
>  | .fit(dfNoBristol)
>  | }
> model: org.apache.spark.ml.feature.StringIndexerModel = strIdx_f5afa2fb
> scala> // the StringIndexerModel has only two labels: "London" and "New York"
> scala> model.labels foreach println
> London
> New York
> scala> // transform our DataFrame to add an index column
> scala> val dfWithIndex = model.transform(dfNoBristol)
> dfWithIndex: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 2 
> more fields]
> scala> dfWithIndex.show
> +---+--------+-------+-----------+
> | ID|    CITY|CONTENT|CITYIndexed|
> +---+--------+-------+-----------+
> |  A|  London|   StrA|        0.0|
> |  C|New York|   StrC|        1.0|
> +---+--------+-------+-----------+
> {code}
> The unexpected behaviour comes when we filter `dfWithIndex` for `CITYIndexed` 
> equal to 1.0 and perform an action. The `indexer` UDF in the `transform` method 
> throws an exception reporting unseen label "Bristol". This is irrational 
> behaviour as far as the user of the API is concerned, because there is no 
> such value as "Bristol" when we show all rows of `dfWithIndex`:
> {code:java}
> scala> dfWithIndex.filter($"CITYIndexed" === 1.0).count
> 17/11/04 00:33:41 ERROR Executor: Exception in task 1.0 in stage 20.0 (TID 40)
> org.apache.spark.SparkException: Failed to execute user defined 
> function($anonfun$5: (string) => double)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>   at 

[jira] [Commented] (SPARK-23569) pandas_udf does not work with type-annotated python functions

2018-03-03 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16384581#comment-16384581
 ] 

Hyukjin Kwon commented on SPARK-23569:
--

If it's hard to add a test, manual tests will be fine, but please describe them 
in the PR description. Would you be interested in submitting a PR?

> pandas_udf does not work with type-annotated python functions
> -
>
> Key: SPARK-23569
> URL: https://issues.apache.org/jira/browse/SPARK-23569
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: python 3.6 | pyspark 2.3.0 | Using Scala version 2.11.8, 
> OpenJDK 64-Bit Server VM, 1.8.0_141 | Revision 
> a0d7949896e70f427e7f3942ff340c9484ff0aab
>Reporter: Stu (Michael Stewart)
>Priority: Major
>
> When invoked against a type-annotated function, pandas_udf raises:
> `ValueError: Function has keyword-only parameters or annotations, use 
> getfullargspec() API which can support them`
>  
> The deprecated `getargspec` call occurs in `pyspark/sql/udf.py`:
> {code:java}
> def _create_udf(f, returnType, evalType):
>     if evalType in (PythonEvalType.SQL_SCALAR_PANDAS_UDF,
>                     PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF):
>         import inspect
>         from pyspark.sql.utils import require_minimum_pyarrow_version
>         require_minimum_pyarrow_version()
>         argspec = inspect.getargspec(f)
>         ...{code}
> To reproduce: 
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit
> spark = SparkSession.builder.getOrCreate()
> df = spark.range(12).withColumn('b', col('id') * 2)
> def ok(a,b): return a*b
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b')).show()  
> # no problems
> import pandas as pd
> def ok(a: pd.Series,b: pd.Series) -> pd.Series: return a*b
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b'))
>  
> ---
> ValueError Traceback (most recent call last)
>  in ()
> > 1 df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b'))
> /opt/miniconda/lib/python3.6/site-packages/pyspark/sql/functions.py in 
> pandas_udf(f, returnType, functionType)
> 2277 return functools.partial(_create_udf, returnType=return_type, 
> evalType=eval_type)
> 2278 else:
> -> 2279 return _create_udf(f=f, returnType=return_type, evalType=eval_type)
> 2280
> 2281
> /opt/miniconda/lib/python3.6/site-packages/pyspark/sql/udf.py in 
> _create_udf(f, returnType, evalType)
> 44
> 45 require_minimum_pyarrow_version()
> ---> 46 argspec = inspect.getargspec(f)
> 47
> 48 if evalType == PythonEvalType.SQL_SCALAR_PANDAS_UDF and len(argspec.args) 
> == 0 and \
> /opt/miniconda/lib/python3.6/inspect.py in getargspec(func)
> 1043 getfullargspec(func)
> 1044 if kwonlyargs or ann:
> -> 1045 raise ValueError("Function has keyword-only parameters or annotations"
> 1046 ", use getfullargspec() API which can support them")
> 1047 return ArgSpec(args, varargs, varkw, defaults)
> ValueError: Function has keyword-only parameters or annotations, use 
> getfullargspec() API which can support them
> {code}






[jira] [Commented] (SPARK-23569) pandas_udf does not work with type-annotated python functions

2018-03-03 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16384578#comment-16384578
 ] 

Hyukjin Kwon commented on SPARK-23569:
--

I think it's because {{inspect.getargspec}} is deprecated in Python 3. Can we 
replace it with something like {{inspect.signature}} on Python 3?
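
A minimal sketch of that replacement, assuming a small helper next to 
_create_udf; the helper name and placement are assumptions, not the actual patch:

{code:python}
import inspect
import sys

def _get_argspec(f):
    # On Python 3, getfullargspec accepts annotated and keyword-only
    # parameters, which plain getargspec rejects with a ValueError.
    if sys.version_info[0] < 3:
        return inspect.getargspec(f)
    return inspect.getfullargspec(f)

# _create_udf would then call _get_argspec(f) instead of inspect.getargspec(f).
{code}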

> pandas_udf does not work with type-annotated python functions
> -
>
> Key: SPARK-23569
> URL: https://issues.apache.org/jira/browse/SPARK-23569
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: python 3.6 | pyspark 2.3.0 | Using Scala version 2.11.8, 
> OpenJDK 64-Bit Server VM, 1.8.0_141 | Revision 
> a0d7949896e70f427e7f3942ff340c9484ff0aab
>Reporter: Stu (Michael Stewart)
>Priority: Major
>
> When invoked against a type-annotated function, pandas_udf raises:
> `ValueError: Function has keyword-only parameters or annotations, use 
> getfullargspec() API which can support them`
>  
> The deprecated `getargspec` call occurs in `pyspark/sql/udf.py`:
> {code:java}
> def _create_udf(f, returnType, evalType):
>     if evalType in (PythonEvalType.SQL_SCALAR_PANDAS_UDF,
>                     PythonEvalType.SQL_GROUPED_MAP_PANDAS_UDF):
>         import inspect
>         from pyspark.sql.utils import require_minimum_pyarrow_version
>         require_minimum_pyarrow_version()
>         argspec = inspect.getargspec(f)
>         ...{code}
> To reproduce: 
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit
> spark = SparkSession.builder.getOrCreate()
> df = spark.range(12).withColumn('b', col('id') * 2)
> def ok(a,b): return a*b
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b')).show()  
> # no problems
> import pandas as pd
> def ok(a: pd.Series,b: pd.Series) -> pd.Series: return a*b
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b'))
>  
> ---
> ValueError Traceback (most recent call last)
>  in ()
> > 1 df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b'))
> /opt/miniconda/lib/python3.6/site-packages/pyspark/sql/functions.py in 
> pandas_udf(f, returnType, functionType)
> 2277 return functools.partial(_create_udf, returnType=return_type, 
> evalType=eval_type)
> 2278 else:
> -> 2279 return _create_udf(f=f, returnType=return_type, evalType=eval_type)
> 2280
> 2281
> /opt/miniconda/lib/python3.6/site-packages/pyspark/sql/udf.py in 
> _create_udf(f, returnType, evalType)
> 44
> 45 require_minimum_pyarrow_version()
> ---> 46 argspec = inspect.getargspec(f)
> 47
> 48 if evalType == PythonEvalType.SQL_SCALAR_PANDAS_UDF and len(argspec.args) 
> == 0 and \
> /opt/miniconda/lib/python3.6/inspect.py in getargspec(func)
> 1043 getfullargspec(func)
> 1044 if kwonlyargs or ann:
> -> 1045 raise ValueError("Function has keyword-only parameters or annotations"
> 1046 ", use getfullargspec() API which can support them")
> 1047 return ArgSpec(args, varargs, varkw, defaults)
> ValueError: Function has keyword-only parameters or annotations, use 
> getfullargspec() API which can support them
> {code}






[jira] [Resolved] (SPARK-23477) Misleading exception message when union fails due to metadata

2018-03-03 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23477.
--
Resolution: Cannot Reproduce

I am resolving this as it's not reproducible on master. It'd be nicer if we 
could find the JIRA that fixed this and backport it if applicable.

> Misleading exception message when union fails due to metadata 
> --
>
> Key: SPARK-23477
> URL: https://issues.apache.org/jira/browse/SPARK-23477
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Tomasz Bartczak
>Priority: Minor
>
> When I have two DataFrames that differ only in the metadata of fields 
> inside a struct, I cannot union them, but the error message shows that they 
> are the same:
> {code:java}
> from pyspark.sql.functions import col, struct
> df = spark.createDataFrame([{'a': 1}])
> a = df.select(struct('a').alias('x'))
> b = df.select(col('a').alias('a', metadata={'description': 'xxx'})) \
>     .select(struct(col('a')).alias('x'))
> a.union(b).printSchema(){code}
> gives:
> {code:java}
> An error occurred while calling o1076.union.
> : org.apache.spark.sql.AnalysisException: Union can only be performed on 
> tables with the compatible column types. struct<a:bigint> <> struct<a:bigint> 
> at the first column of the second table{code}
> and this part:
> {code:java}
> struct<a:bigint> <> struct<a:bigint>{code}
> does not make any sense because those are the same.
>  
> Since metadata must be the same for a union to succeed, it should be included 
> in the error message.
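
One quick way to see the difference that the current message hides is to compare 
the schema JSON, which does include field metadata; a minimal sketch continuing 
the repro above:

{code:python}
import json

# The simple type strings look identical, but the schema JSON carries the
# per-field metadata and shows where the two sides actually differ.
print(json.dumps(a.schema.jsonValue(), indent=2))
print(json.dumps(b.schema.jsonValue(), indent=2))
{code}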


