[jira] [Commented] (SPARK-23645) pandas_udf can not be called with keyword arguments

2018-03-25 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16413341#comment-16413341
 ] 

Hyukjin Kwon commented on SPARK-23645:
--

This was fixed by simply documenting the limitation, but I think it might be 
worth revisiting and allowing this case.

> pandas_udf can not be called with keyword arguments
> ---
>
> Key: SPARK-23645
> URL: https://issues.apache.org/jira/browse/SPARK-23645
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: python 3.6 | pyspark 2.3.0 | Using Scala version 2.11.8, 
> OpenJDK 64-Bit Server VM, 1.8.0_141
>Reporter: Stu (Michael Stewart)
>Assignee: Stu (Michael Stewart)
>Priority: Minor
> Fix For: 2.3.1, 2.4.0
>
>
> pandas_udf (and possibly all Python UDFs) does not accept keyword arguments 
> because the `UserDefinedFunction` class in `pyspark/sql/udf.py` defines 
> __call__, and the related wrapper utility methods, to accept only positional 
> args and not kwargs:
> @ line 168:
> {code:java}
> ...
> def __call__(self, *cols):
>     judf = self._judf
>     sc = SparkContext._active_spark_context
>     return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))
>
> # This function is for improving the online help system in the interactive
> # interpreter. For example, the built-in help / pydoc.help. It wraps the UDF
> # with the docstring and argument annotation. (See: SPARK-19161)
> def _wrapped(self):
>     """
>     Wrap this udf with a function and attach docstring from func
>     """
>     # It is possible for a callable instance without __name__ attribute or/and
>     # __module__ attribute to be wrapped here. For example, functools.partial.
>     # In this case, we should avoid wrapping the attributes from the wrapped
>     # function to the wrapper function. So, we take out these attribute names
>     # from the default names to set and then manually assign it after being
>     # wrapped.
>     assignments = tuple(
>         a for a in functools.WRAPPER_ASSIGNMENTS
>         if a != '__name__' and a != '__module__')
>
>     @functools.wraps(self.func, assigned=assignments)
>     def wrapper(*args):
>         return self(*args)
> ...{code}
> as seen in:
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit
>
> spark = SparkSession.builder.getOrCreate()
> df = spark.range(12).withColumn('b', col('id') * 2)
>
> def ok(a, b): return a * b
>
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id', 'b')).show()  # no problems
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')(a='id', b='b')).show()  # fail with ~no stacktrace thanks to wrapper helper
> ---------------------------------------------------------------------------
> TypeError                                 Traceback (most recent call last)
> <ipython-input-...> in <module>()
> ----> 1 df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')(a='id', b='b')).show()
>
> TypeError: wrapper() got an unexpected keyword argument 'a'{code}
>  
>  
> *discourse*: it isn't difficult to swap the kwargs back in, allowing the UDF 
> to be called that way, but the cols tuple that gets passed into the call method:
> {code:java}
> _to_seq(sc, cols, _to_java_column){code}
> has to be in the right order based on the function's declared arguments, 
> or the function will return incorrect results. So, the challenge here is to:
> (a) reconstruct the proper order of the full args/kwargs
> --> args first, and then kwargs (not in the order passed but in the order 
> expected by the fn)
> (b) handle python2 and python3 `inspect` module inconsistencies 
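To make the ordering requirement concrete, here is a minimal, self-contained sketch of the reordering idea (illustrative only: the `_accepting_kwargs` helper and the `ok` function below are made up for this example and rely on Python 3's `inspect.signature`, so this is not Spark's actual implementation):
{code:python}
import functools
import inspect


def _accepting_kwargs(func):
    """Wrap func so keyword arguments are rebound into positional order.

    Illustrative only: a real UDF forwards columns to the JVM, but the ordering
    problem is the same -- kwargs must be placed in the order the function
    declares them, not the order in which they were passed.
    """
    sig = inspect.signature(func)  # Python 3; Python 2 would need getargspec/funcsigs

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # bind() validates the keyword names and puts everything into
        # declaration order, so the wrapped function sees only positional args.
        bound = sig.bind(*args, **kwargs)
        return func(*bound.args)

    return wrapper


@_accepting_kwargs
def ok(a, b):
    return a * b


print(ok(b=3, a=2))  # 6 -- kwargs were reordered to (a, b) before the call
{code}
The same bind-then-forward step, applied to __call__ and the `wrapper` helper, addresses point (a) above; point (b) is only about choosing the right introspection API per Python version.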



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23645) pandas_udf can not be called with keyword arguments

2018-03-25 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-23645:


Assignee: Stu (Michael Stewart)

> pandas_udf can not be called with keyword arguments
> ---
>
> Key: SPARK-23645
> URL: https://issues.apache.org/jira/browse/SPARK-23645
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: python 3.6 | pyspark 2.3.0 | Using Scala version 2.11.8, 
> OpenJDK 64-Bit Server VM, 1.8.0_141
>Reporter: Stu (Michael Stewart)
>Assignee: Stu (Michael Stewart)
>Priority: Minor
> Fix For: 2.3.1, 2.4.0
>
>



[jira] [Resolved] (SPARK-23645) pandas_udf can not be called with keyword arguments

2018-03-25 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23645.
--
   Resolution: Fixed
Fix Version/s: 2.3.1
   2.4.0

Issue resolved by pull request 20900
[https://github.com/apache/spark/pull/20900]

> pandas_udf can not be called with keyword arguments
> ---
>
> Key: SPARK-23645
> URL: https://issues.apache.org/jira/browse/SPARK-23645
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: python 3.6 | pyspark 2.3.0 | Using Scala version 2.11.8, 
> OpenJDK 64-Bit Server VM, 1.8.0_141
>Reporter: Stu (Michael Stewart)
>Assignee: Stu (Michael Stewart)
>Priority: Minor
> Fix For: 2.4.0, 2.3.1
>
>



[jira] [Resolved] (SPARK-23700) Cleanup unused imports

2018-03-25 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23700.
--
   Resolution: Fixed
 Assignee: Bryan Cutler
Fix Version/s: 2.4.0

Fixed in https://github.com/apache/spark/pull/20892

> Cleanup unused imports
> --
>
> Key: SPARK-23700
> URL: https://issues.apache.org/jira/browse/SPARK-23700
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 2.4.0
>
>
> I've noticed a fair amount of unused imports in pyspark; I'll take a look 
> through and try to clean them up






[jira] [Commented] (SPARK-23710) Upgrade Hive to 2.3.2

2018-03-25 Thread Darek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16413234#comment-16413234
 ] 

Darek commented on SPARK-23710:
---

It passed all the tests and we need it for Hadoop 3.0; we need to merge the PR 
ASAP.

> Upgrade Hive to 2.3.2
> -
>
> Key: SPARK-23710
> URL: https://issues.apache.org/jira/browse/SPARK-23710
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> h1. Mainly changes
>  * Maven dependency:
>  hive.version from {{1.2.1.spark2}} to {{2.3.2}} and change 
> {{hive.classifier}} to {{core}}
>  calcite.version from {{1.2.0-incubating}} to {{1.10.0}}
>  datanucleus-core.version from {{3.2.10}} to {{4.1.17}}
>  remove {{orc.classifier}}, which means ORC uses {{hive.storage.api}}; see 
> ORC-174
>  add new dependency {{avatica}} and {{hive.storage.api}}
>  * ORC compatibility changes:
>  OrcColumnVector.java, OrcColumnarBatchReader.java, OrcDeserializer.scala, 
> OrcFilters.scala, OrcSerializer.scala, OrcFilterSuite.scala
>  * hive-thriftserver java file update:
>  update {{sql/hive-thriftserver/if/TCLIService.thrift}} to hive 2.3.2
>  update {{sql/hive-thriftserver/src/main/java/org/apache/hive/service/*}} to 
> hive 2.3.2
>  * TestSuite should update:
> ||TestSuite||Reason||
> |StatisticsSuite|HIVE-16098|
> |SessionCatalogSuite|Similar to [VersionsSuite.scala#L427|#L427]|
> |CliSuite, HiveThriftServer2Suites, HiveSparkSubmitSuite, HiveQuerySuite, 
> SQLQuerySuite|Update hive-hcatalog-core-0.13.1.jar to 
> hive-hcatalog-core-2.3.2.jar|
> |SparkExecuteStatementOperationSuite|Interface changed from 
> org.apache.hive.service.cli.Type.NULL_TYPE to 
> org.apache.hadoop.hive.serde2.thrift.Type.NULL_TYPE|
> |ClasspathDependenciesSuite|org.apache.hive.com.esotericsoftware.kryo.Kryo 
> change to com.esotericsoftware.kryo.Kryo|
> |HiveMetastoreCatalogSuite|Result format changed from Seq("1.1\t1", "2.1\t2") 
> to Seq("1.100\t1", "2.100\t2")|
> |HiveOrcFilterSuite|Result format changed|
> |HiveDDLSuite|Remove $ (This change needs to be reconsidered)|
> |HiveExternalCatalogVersionsSuite| java.lang.ClassCastException: 
> org.datanucleus.identity.DatastoreIdImpl cannot be cast to 
> org.datanucleus.identity.OID|
>  * Other changes:
> Close hive schema verification:  
> [HiveClientImpl.scala#L251|https://github.com/wangyum/spark/blob/75e4cc9e80f85517889e87a35da117bc361f2ff3/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L251]
>  and 
> [HiveExternalCatalog.scala#L58|https://github.com/wangyum/spark/blob/75e4cc9e80f85517889e87a35da117bc361f2ff3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L58]
> Update 
> [IsolatedClientLoader.scala#L189-L192|https://github.com/wangyum/spark/blob/75e4cc9e80f85517889e87a35da117bc361f2ff3/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala#L189-L192]
> Because Hive 2.3.2's {{org.apache.hadoop.hive.ql.metadata.Hive}} can't 
> connect to a Hive 1.x metastore, we should use 
> {{HiveMetaStoreClient.getDelegationToken}} instead of 
> {{Hive.getDelegationToken}} and update {{HiveClientImpl.toHiveTable}}.
> All changes can be found at 
> [PR-20659|https://github.com/apache/spark/pull/20659].






[jira] [Resolved] (SPARK-23549) Spark SQL unexpected behavior when comparing timestamp to date

2018-03-25 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-23549.
-
   Resolution: Fixed
 Assignee: Kazuaki Ishizaki
Fix Version/s: 2.4.0

> Spark SQL unexpected behavior when comparing timestamp to date
> --
>
> Key: SPARK-23549
> URL: https://issues.apache.org/jira/browse/SPARK-23549
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Dong Jiang
>Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 2.4.0
>
>
> {code:java}
> scala> spark.version
> res1: String = 2.2.1
>
> scala> spark.sql("select cast('2017-03-01 00:00:00' as timestamp) between cast('2017-02-28' as date) and cast('2017-03-01' as date)").show
> +---+
> |((CAST(CAST(2017-03-01 00:00:00 AS TIMESTAMP) AS STRING) >= CAST(CAST(2017-02-28 AS DATE) AS STRING)) AND (CAST(CAST(2017-03-01 00:00:00 AS TIMESTAMP) AS STRING) <= CAST(CAST(2017-03-01 AS DATE) AS STRING)))|
> +---+
> |false|
> +---+{code}
> As shown above, when a timestamp is compared to a date in Spark SQL, both the 
> timestamp and the date are downcast to string, leading to an unexpected result. 
> If I run the same SQL in Presto/Athena, I get the expected result:
> {code:java}
> select cast('2017-03-01 00:00:00' as timestamp) between cast('2017-02-28' as 
> date) and cast('2017-03-01' as date)
>   _col0
> 1 true
> {code}
> Is this a bug for Spark or a feature?
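For illustration, a small PySpark sketch of the comparison above and of an explicit-cast workaround (hedged: the `false` result reflects the 2.2.x behaviour described in this report; this issue's fix is targeted at 2.4.0):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Timestamp compared against DATE bounds: on 2.2.x both sides are cast to
# string, so '2017-03-01 00:00:00' > '2017-03-01' lexicographically -> false.
spark.sql("""
    SELECT (CAST('2017-03-01 00:00:00' AS TIMESTAMP)
            BETWEEN CAST('2017-02-28' AS DATE) AND CAST('2017-03-01' AS DATE)) AS in_range
""").show()

# Workaround: make the bounds timestamps explicitly so the comparison stays
# chronological instead of falling back to strings -> true.
spark.sql("""
    SELECT (CAST('2017-03-01 00:00:00' AS TIMESTAMP)
            BETWEEN CAST('2017-02-28' AS TIMESTAMP)
                AND CAST('2017-03-01' AS TIMESTAMP)) AS in_range
""").show()
{code}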






[jira] [Commented] (SPARK-23598) WholeStageCodegen can lead to IllegalAccessError calling append for HashAggregateExec

2018-03-25 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16413213#comment-16413213
 ] 

Herman van Hovell commented on SPARK-23598:
---

[~dongjoon] thanks for doing this. As for the failing tests, I have backported 
the hotfix to branch-2.3.

> WholeStageCodegen can lead to IllegalAccessError  calling append for 
> HashAggregateExec
> --
>
> Key: SPARK-23598
> URL: https://issues.apache.org/jira/browse/SPARK-23598
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: David Vogelbacher
>Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> Got the following stacktrace for a large QueryPlan using WholeStageCodeGen:
> {noformat}
> java.lang.IllegalAccessError: tried to access method 
> org.apache.spark.sql.execution.BufferedRowIterator.append(Lorg/apache/spark/sql/catalyst/InternalRow;)V
>  from class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7$agg_NestedClass
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7$agg_NestedClass.agg_doAggregateWithKeysOutput$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage7.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> at org.apache.spark.scheduler.Task.run(Task.scala:109)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345){noformat}
> After disabling codegen, everything works.
> The root cause seems to be that we are trying to call the protected _append_ 
> method of 
> [BufferedRowIterator|https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/BufferedRowIterator.java#L68]
>  from an inner-class of a sub-class that is loaded by a different 
> class-loader (after codegen compilation).
> [https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-5.html#jvms-5.4.4] 
> states that a protected method _R_ can be accessed only if one of the 
> following two conditions is fulfilled:
>  # R is protected and is declared in a class C, and D is either a subclass of 
> C or C itself. Furthermore, if R is not static, then the symbolic reference 
> to R must contain a symbolic reference to a class T, such that T is either a 
> subclass of D, a superclass of D, or D itself.
>  # R is either protected or has default access (that is, neither public nor 
> protected nor private), and is declared by a class in the same run-time 
> package as D.
> 2.) doesn't apply as we have loaded the class with a different class loader 
> (and are in a different package) and 1.) doesn't apply because we are 
> apparently trying to call the method from an inner class of a subclass of 
> _BufferedRowIterator_.
> Looking at the Code path of _WholeStageCodeGen_, the following happens:
>  # In 
> [WholeStageCodeGen|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala#L527],
>  we create the subclass of _BufferedRowIterator_, along with a _processNext_ 
> method for processing the output of the child plan.
>  # In the child, which is a 
> [HashAggregateExec|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L517],
>  we create the method which shows up at the top of the stack trace (called 
> _doAggregateWithKeysOutput_ )
>  # We add this method to the compiled code invoking _addNewFunction_ of 
> [CodeGenerator|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala#L460]
> In the generated function body we call the _append_ method.
> Now, the _addNewFunction_ method states that:
> {noformat}
> If the code for the `OuterClass` grows too large, the function will be 
> inlined into a new private, inner class
> {noformat}
> This indeed seems to happen: the _doAggregateWithKeysOutput_ method is put 
> into a new private inner class. Thus, it doesn't have access to the protected 
> _append_ method anymore but still tries to call it, which results in the 

[jira] [Commented] (SPARK-23740) Add FPGrowth Param for filtering out very common items

2018-03-25 Thread Teng Peng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16413199#comment-16413199
 ] 

Teng Peng commented on SPARK-23740:
---

I suppose `beforehand` means before the itemsets have been generated, correct?

If so, it seems we have two approaches here:
 # Add a new filter condition in `genFreqItems`, but this is in MLlib, not ML.
 # Filter the input dataset before we call mllibFP. Then we would have to implement a 
method similar to `genFreqItems` in MLlib. Does this look good to you?

> Add FPGrowth Param for filtering out very common items
> --
>
> Key: SPARK-23740
> URL: https://issues.apache.org/jira/browse/SPARK-23740
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> It would be handy to have a Param in FPGrowth for filtering out very common 
> items.  This is from a use case where the dataset had items appearing in 
> 99.9%+ of the rows.  These common items were useless, but they caused the 
> algorithm to generate many unnecessary itemsets.  Filtering useless common 
> items beforehand can make the algorithm much faster.
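A rough PySpark sketch of the "filter out very common items beforehand" idea described above (the threshold, the toy transactions, and the UDF-based filtering are illustrative assumptions, not a proposed Param or API):
{code:python}
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import ArrayType, StringType
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(0, ["a", "b", "x"]), (1, ["a", "c", "x"]), (2, ["b", "c", "x"])],
    ["id", "items"])

max_support = 0.999          # items appearing in more than this fraction of rows are dropped
n_rows = df.count()

# Find the items that are "too common" to be interesting.
too_common = {
    r["item"]
    for r in (df.select(F.explode("items").alias("item"))
                .groupBy("item").count()
                .where(F.col("count") > max_support * n_rows)
                .collect())
}

# Remove those items from every transaction before fitting FPGrowth.
drop_common = F.udf(lambda items: [i for i in items if i not in too_common],
                    ArrayType(StringType()))
filtered = df.withColumn("items", drop_common("items"))

model = FPGrowth(itemsCol="items", minSupport=0.2, minConfidence=0.5).fit(filtered)
model.freqItemsets.show()
{code}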






[jira] [Resolved] (SPARK-23167) Update TPCDS queries from v1.4 to v2.7 (latest)

2018-03-25 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-23167.
-
   Resolution: Fixed
 Assignee: Takeshi Yamamuro
Fix Version/s: 2.4.0

> Update TPCDS queries from v1.4 to v2.7 (latest)
> ---
>
> Key: SPARK-23167
> URL: https://issues.apache.org/jira/browse/SPARK-23167
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 2.4.0
>
>
> We currently use TPCDS v1.4 
> ([https://github.com/apache/spark/commits/master/sql/core/src/test/resources/tpcds]), 
> though the latest one is v2.7 
> ([http://www.tpc.org/tpc_documents_current_versions/current_specifications.asp]).
>  I found that some queries differ between v1.4 and v2.7 (e.g., q4, q5, 
> q6, ...) and some new queries appear (e.g., q10a, ...). I think it 
> makes sense to update the queries for a more correct evaluation.
> Raw generated queries from TPCDS v2.7 query templates:
>  [https://github.com/maropu/spark_tpcds_v2.7.0/tree/master/generated]
> Modified TPCDS v2.7 queries to pass TPCDSQuerySuite (e.g., replacing 
> unsupported syntaxes, + 14 days -> interval 14 days):
>  [https://github.com/apache/spark/compare/master...maropu:TPCDSV2_7]
>  






[jira] [Commented] (SPARK-20839) Incorrect Dynamic PageRank calculation

2018-03-25 Thread Ziyan LIANG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16413032#comment-16413032
 ] 

Ziyan LIANG commented on SPARK-20839:
-

I have the same confusion. Is there any document or reference about this design? 
It's quite different from the original theory description. Thanks.

> Incorrect Dynamic PageRank calculation
> --
>
> Key: SPARK-20839
> URL: https://issues.apache.org/jira/browse/SPARK-20839
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.1.1
>Reporter: BahaaEddin AlAila
>Priority: Major
>
> Correct me if I am wrong
> I think there are three places where the pagerank calculation is incorrect
> 1st) in the VertexProgram (line 318 of PageRank.scala in spark 2.1.1)
> val newPR = oldPR + (1.0 - resetProb) * msgSum
> it should be
> val newPR = resetProb + (1.0 - resetProb) * msgSum
> 2nd) in the message sending part (line 336 of the same file)
> Iterator((edge.dstId, edge.srcAttr._2 * edge.attr))
> should be 
> Iterator((edge.dstId, edge.srcAttr._1 * edge.attr))
> as we should be sending the edge weight multiplied by the current pagerank of 
> the source vertex (not the vertex's delta)
> 3rd) the tol check against the abs of the delta (line 335)
>   if (edge.srcAttr._2 > tol) {
> should be
>   if (Math.abs(edge.srcAttr._2) > tol) {
>  
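For reference, the textbook damped PageRank update that the suggested fix corresponds to is shown below (this states only the standard formula, with reset probability \alpha; it is not a claim about what GraphX actually implements):
{noformat}
PR(v) = \alpha + (1 - \alpha) \sum_{u \in \mathrm{in}(v)} w_{uv} \, PR(u)
{noformat}
With \alpha = resetProb and msgSum = \sum_{u} w_{uv} PR(u), this matches the proposed `val newPR = resetProb + (1.0 - resetProb) * msgSum`.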






[jira] [Commented] (SPARK-23437) [ML] Distributed Gaussian Process Regression for MLlib

2018-03-25 Thread Sujith Jay Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16413013#comment-16413013
 ] 

Sujith Jay Nair commented on SPARK-23437:
-

+1 for this initiative. To garner support for it, we need to come up with 
strong reasons why GPs are needed as part of Spark ML. This could be done as 
part of the documentation of your implementation. 
You do mention GPflow as an example of the TensorFlow ecosystem supporting 
linear-time GPs; however, that is still a third-party library. If anything, it 
vouches for the opinion that this functionality should be kept separate from 
core Spark ML. Like Seth Henderson mentions above, it would help tremendously 
to showcase more packages that have this algorithm implemented.

> [ML] Distributed Gaussian Process Regression for MLlib
> --
>
> Key: SPARK-23437
> URL: https://issues.apache.org/jira/browse/SPARK-23437
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Affects Versions: 2.2.1
>Reporter: Valeriy Avanesov
>Assignee: Apache Spark
>Priority: Major
>
> Gaussian Process Regression (GP) is a well-known black-box non-linear 
> regression approach [1]. For years the approach remained inapplicable to 
> large samples due to its cubic computational complexity; however, more recent 
> techniques (Sparse GP) require only linear complexity. The field 
> continues to attract the interest of researchers – several papers devoted to 
> GP were presented at NIPS 2017. 
> Unfortunately, the non-parametric regression techniques shipped with mllib are 
> restricted to tree-based approaches.
> I propose to create and include an implementation (which I am going to work 
> on) of the so-called robust Bayesian Committee Machine proposed and investigated 
> in [2].
> [1] Carl Edward Rasmussen and Christopher K. I. Williams. 2005. _Gaussian 
> Processes for Machine Learning (Adaptive Computation and Machine Learning)_. 
> The MIT Press.
> [2] Marc Peter Deisenroth and Jun Wei Ng. 2015. Distributed Gaussian 
> processes. In _Proceedings of the 32nd International Conference on 
> International Conference on Machine Learning - Volume 37_ (ICML'15), Francis 
> Bach and David Blei (Eds.), Vol. 37. JMLR.org 1481-1490.
>  






[jira] [Comment Edited] (SPARK-23790) proxy-user failed connecting to a kerberos configured metastore

2018-03-25 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412974#comment-16412974
 ] 

Stavros Kontopoulos edited comment on SPARK-23790 at 3/25/18 10:36 AM:
---

[~q79969786] I see the PRs you created to fix the other PR; by the way, 
doAsRealUser does the work:

 
{quote}18/03/23 19:26:18 DEBUG UserGroupInformation: PrivilegedAction 
as:hive@LOCAL (auth:KERBEROS) 
from:org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)

18/03/23 19:26:18 DEBUG TSaslTransport: opening transport 
org.apache.thrift.transport.TSaslClientTransport@64201482
 18/03/23 19:26:18 DEBUG TSaslClientTransport: Sending mechanism name GSSAPI 
and initial response of length 607
 18/03/23 19:26:18 DEBUG TSaslTransport: CLIENT: Writing message with status 
START and payload length 6
 18/03/23 19:26:18 DEBUG TSaslTransport: CLIENT: Writing message with status OK 
and payload length 607
 18/03/23 19:26:18 DEBUG TSaslTransport: CLIENT: Start message handled
 18/03/23 19:26:18 DEBUG TSaslTransport: CLIENT: Received message with status 
OK and payload length 108
 18/03/23 19:26:18 DEBUG TSaslTransport: CLIENT: Writing message with status OK 
and payload length 0
 18/03/23 19:26:18 DEBUG TSaslTransport: CLIENT: Received message with status 
OK and payload length 32
 18/03/23 19:26:18 DEBUG TSaslTransport: CLIENT: Writing message with status 
COMPLETE and payload length 32
 18/03/23 19:26:18 DEBUG TSaslTransport: CLIENT: Main negotiation loop complete
 18/03/23 19:26:18 DEBUG TSaslTransport: CLIENT: SASL Client receiving last 
message
 18/03/23 19:26:18 DEBUG TSaslTransport: CLIENT: Received message with status 
COMPLETE and payload length 0
 18/03/23 19:26:18 INFO metastore: Connected to metastore.
{quote}
The reason is that I use an earlier branch to build for the customer, which 
does not contain the commit that created the regression. Thank you though; 
there is a regression I should know about for the next releases, and I will 
follow the work being done. My problem is that I tried to fetch delegation 
tokens earlier so subsequent operations don't use a TGT all the time, but hit 
this issue with HadoopRDD. I believed I could add the delegation tokens when 
the Mesos scheduler backend starts, as in the YARN case where Client.java does 
something similar.


was (Author: skonto):
[~q79969786] I see the PRs you created to fix the other PR, btw the 
doAsRealUser does the work:

 
{quote}18/03/23 19:26:18 DEBUG UserGroupInformation: PrivilegedAction 
as:hive@LOCAL (auth:KERBEROS) 
from:org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)

18/03/23 19:26:18 DEBUG TSaslTransport: opening transport 
org.apache.thrift.transport.TSaslClientTransport@64201482
18/03/23 19:26:18 DEBUG TSaslClientTransport: Sending mechanism name GSSAPI and 
initial response of length 607
18/03/23 19:26:18 DEBUG TSaslTransport: CLIENT: Writing message with status 
START and payload length 6
18/03/23 19:26:18 DEBUG TSaslTransport: CLIENT: Writing message with status OK 
and payload length 607
18/03/23 19:26:18 DEBUG TSaslTransport: CLIENT: Start message handled
18/03/23 19:26:18 DEBUG TSaslTransport: CLIENT: Received message with status OK 
and payload length 108
18/03/23 19:26:18 DEBUG TSaslTransport: CLIENT: Writing message with status OK 
and payload length 0
18/03/23 19:26:18 DEBUG TSaslTransport: CLIENT: Received message with status OK 
and payload length 32
18/03/23 19:26:18 DEBUG TSaslTransport: CLIENT: Writing message with status 
COMPLETE and payload length 32
18/03/23 19:26:18 DEBUG TSaslTransport: CLIENT: Main negotiation loop complete
18/03/23 19:26:18 DEBUG TSaslTransport: CLIENT: SASL Client receiving last 
message
18/03/23 19:26:18 DEBUG TSaslTransport: CLIENT: Received message with status 
COMPLETE and payload length 0
18/03/23 19:26:18 INFO metastore: Connected to metastore.
{quote}
The reason is that I use an earlier branch to build stuff for the customer 
which does not contain the commit. Thank you though there is a regression I 
should know for the next releases and will follow the work being done. My 
problem is that I tried to fetch delegation tokens earlier so consequent 
operations dont use a TGT all the time but hit this issue with HadoopRDD. I 
believed I could add the delegation tokens when the mesos scheduler backend 
starts like in the case of yarn where Client.java does something similar.

> proxy-user failed connecting to a kerberos configured metastore
> ---
>
> Key: SPARK-23790
> URL: https://issues.apache.org/jira/browse/SPARK-23790
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.3.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> This 

[jira] [Commented] (SPARK-23790) proxy-user failed connecting to a kerberos configured metastore

2018-03-25 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412974#comment-16412974
 ] 

Stavros Kontopoulos commented on SPARK-23790:
-

[~q79969786] I see the PRs you created to fix the other PR, btw the 
doAsRealUser does the work:

 
{quote}18/03/23 19:26:18 DEBUG UserGroupInformation: PrivilegedAction 
as:hive@LOCAL (auth:KERBEROS) 
from:org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)

18/03/23 19:26:18 DEBUG TSaslTransport: opening transport 
org.apache.thrift.transport.TSaslClientTransport@64201482
18/03/23 19:26:18 DEBUG TSaslClientTransport: Sending mechanism name GSSAPI and 
initial response of length 607
18/03/23 19:26:18 DEBUG TSaslTransport: CLIENT: Writing message with status 
START and payload length 6
18/03/23 19:26:18 DEBUG TSaslTransport: CLIENT: Writing message with status OK 
and payload length 607
18/03/23 19:26:18 DEBUG TSaslTransport: CLIENT: Start message handled
18/03/23 19:26:18 DEBUG TSaslTransport: CLIENT: Received message with status OK 
and payload length 108
18/03/23 19:26:18 DEBUG TSaslTransport: CLIENT: Writing message with status OK 
and payload length 0
18/03/23 19:26:18 DEBUG TSaslTransport: CLIENT: Received message with status OK 
and payload length 32
18/03/23 19:26:18 DEBUG TSaslTransport: CLIENT: Writing message with status 
COMPLETE and payload length 32
18/03/23 19:26:18 DEBUG TSaslTransport: CLIENT: Main negotiation loop complete
18/03/23 19:26:18 DEBUG TSaslTransport: CLIENT: SASL Client receiving last 
message
18/03/23 19:26:18 DEBUG TSaslTransport: CLIENT: Received message with status 
COMPLETE and payload length 0
18/03/23 19:26:18 INFO metastore: Connected to metastore.
{quote}
The reason is that I use an earlier branch to build stuff for the customer 
which does not contain the commit. Thank you though there is a regression I 
should know for the next releases and will follow the work being done. My 
problem is that I tried to fetch delegation tokens earlier so consequent 
operations dont use a TGT all the time but hit this issue with HadoopRDD. I 
believed I could add the delegation tokens when the mesos scheduler backend 
starts like in the case of yarn where Client.java does something similar.

> proxy-user failed connecting to a kerberos configured metastore
> ---
>
> Key: SPARK-23790
> URL: https://issues.apache.org/jira/browse/SPARK-23790
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.3.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> This appeared at a customer trying to integrate with a kerberized hdfs 
> cluster.
> This can be easily fixed with the proposed fix 
> [here|https://github.com/apache/spark/pull/17333] and the problem was 
> reported first [here|https://issues.apache.org/jira/browse/SPARK-19995] for 
> yarn.
> The other option is to add the delegation tokens to the current user's UGI as 
> in [here|https://github.com/apache/spark/pull/17335]. The latter fixes the 
> problem but leads to a failure when someone uses a HadoopRDD, because 
> HadoopRDD uses FileInputFormat to get the splits, which calls the local ticket 
> cache via TokenCache.obtainTokensForNamenodes. Eventually this will fail 
> with:
> {quote}Exception in thread "main" 
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
> can be issued only with kerberos or web authentication
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5896)
> {quote}
> This implies that the security mode is SIMPLE and the hadoop libs there are 
> not aware of kerberos.
> Related to this issue, the workaround decided on was to 
> [trick|https://github.com/apache/spark/blob/a33655348c4066d9c1d8ad2055aadfbc892ba7fd/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L795-L804]
>  hadoop.
>  






[jira] [Commented] (SPARK-23790) proxy-user failed connecting to a kerberos configured metastore

2018-03-25 Thread Yuming Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412960#comment-16412960
 ] 

Yuming Wang commented on SPARK-23790:
-

Can you try https://github.com/apache/spark/pull/20898?

> proxy-user failed connecting to a kerberos configured metastore
> ---
>
> Key: SPARK-23790
> URL: https://issues.apache.org/jira/browse/SPARK-23790
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.3.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>



[jira] [Commented] (SPARK-23792) Documentation improvements for datetime functions

2018-03-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412949#comment-16412949
 ] 

Apache Spark commented on SPARK-23792:
--

User 'abradbury' has created a pull request for this issue:
https://github.com/apache/spark/pull/20901

> Documentation improvements for datetime functions
> -
>
> Key: SPARK-23792
> URL: https://issues.apache.org/jira/browse/SPARK-23792
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 2.3.0
>Reporter: A Bradbury
>Priority: Minor
>
> Added details about the supported column input types, the column return type, 
> behaviour on invalid input, supporting examples and clarifications to the 
> datetime functions in `org.apache.spark.sql.functions` for Java/Scala. 
> These changes stemmed from confusion over the behaviour of the `date_add` method. 
> On first use I thought it would add the specified days to the input 
> timestamp, but it also truncated (cast) the input timestamp to a date, 
> losing the time part. 
> Some examples:
>  * Noted that the week definition for the `dayofweek` method starts on a Sunday
>  * Corrected documentation for methods such as `last_day` that only listed 
> one type of input, i.e. "date column" changed to "date, timestamp or string"
>  * Renamed the parameters of the `months_between` method to match those of 
> the `datediff` method and to indicate which parameter is expected to be 
> before the other chronologically
>  * `from_unixtime` documentation referenced the "given format" when there was 
> no format parameter
>  * Documentation for `to_timestamp` methods detailed that a unix timestamp in 
> seconds would be returned (implying 1521926327) when they would actually 
> return the input cast to a timestamp type 
> Some observations:
>  * The first day of the week by the `dayofweek` method is a Sunday, but by 
> the `weekofyear` method it is a Monday
>  * The `datediff` method returns an integer value, even with timestamp input, 
> whereas the `months_between` method returns a double, which seems inconsistent
>  
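A small PySpark illustration of the `date_add` truncation and the Sunday/Monday inconsistency noted above (the sample timestamp and column names are made up for the example):
{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = (spark.createDataFrame([("2018-03-25 10:30:00",)], ["ts_str"])
           .withColumn("ts", F.to_timestamp("ts_str")))

df.select(
    F.date_add("ts", 1).alias("date_add"),    # returns a DATE: the time part is dropped
    F.dayofweek("ts").alias("dayofweek"),     # 1 = Sunday ... 7 = Saturday
    F.weekofyear("ts").alias("weekofyear"),   # ISO week, weeks starting on Monday
).show()
# 2018-03-25 is a Sunday, so dayofweek treats it as the first day of a week,
# while weekofyear still counts Monday-based weeks -- the inconsistency noted above.
{code}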






[jira] [Assigned] (SPARK-23792) Documentation improvements for datetime functions

2018-03-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23792:


Assignee: (was: Apache Spark)

> Documentation improvements for datetime functions
> -
>
> Key: SPARK-23792
> URL: https://issues.apache.org/jira/browse/SPARK-23792
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 2.3.0
>Reporter: A Bradbury
>Priority: Minor
>



[jira] [Assigned] (SPARK-23792) Documentation improvements for datetime functions

2018-03-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23792:


Assignee: Apache Spark

> Documentation improvements for datetime functions
> -
>
> Key: SPARK-23792
> URL: https://issues.apache.org/jira/browse/SPARK-23792
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 2.3.0
>Reporter: A Bradbury
>Assignee: Apache Spark
>Priority: Minor
>



[jira] [Created] (SPARK-23792) Documentation improvements for datetime functions

2018-03-25 Thread A Bradbury (JIRA)
A Bradbury created SPARK-23792:
--

 Summary: Documentation improvements for datetime functions
 Key: SPARK-23792
 URL: https://issues.apache.org/jira/browse/SPARK-23792
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, SQL
Affects Versions: 2.3.0
Reporter: A Bradbury


Added details about the supported column input types, the column return type, 
behaviour on invalid input, supporting examples and clarifications to the 
datetime functions in `org.apache.spark.sql.functions` for Java/Scala. 

These changes stemmed from confusion over the behaviour of the `date_add` method. 
On first use I thought it would add the specified days to the input timestamp, 
but it also truncated (cast) the input timestamp to a date, losing the time 
part. 

Some examples:
 * Noted that the week definition for the `dayofweek` method starts on a Sunday
 * Corrected documentation for methods such as `last_day` that only listed one 
type of input, i.e. "date column" changed to "date, timestamp or string"
 * Renamed the parameters of the `months_between` method to match those of the 
`datediff` method and to indicate which parameter is expected to be before the 
other chronologically
 * `from_unixtime` documentation referenced the "given format" when there was 
no format parameter
 * Documentation for `to_timestamp` methods detailed that a unix timestamp in 
seconds would be returned (implying 1521926327) when they would actually return 
the input cast to a timestamp type 

Some observations:
 * The first day of the week by the `dayofweek` method is a Sunday, but by the 
`weekofyear` method it is a Monday
 * The `datediff` method returns an integer value, even with timestamp input, 
whereas the `months_between` method returns a double, which seems inconsistent

 






[jira] [Commented] (SPARK-23780) Failed to use googleVis library with new SparkR

2018-03-25 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412918#comment-16412918
 ] 

Felix Cheung commented on SPARK-23780:
--

though there are other methods:

[https://www.rforge.net/doc/packages/JSON/toJSON.html]

> Failed to use googleVis library with new SparkR
> ---
>
> Key: SPARK-23780
> URL: https://issues.apache.org/jira/browse/SPARK-23780
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1
>Reporter: Ivan Dzikovsky
>Priority: Major
>
> I've tried to use the googleVis library with Spark 2.2.1 and ran into a problem.
> Steps to reproduce:
> # Install R with googleVis library.
> # Run SparkR:
> {code}
> sparkR --master yarn --deploy-mode client
> {code}
> # Run code that uses googleVis:
> {code}
> library(googleVis)
> df=data.frame(country=c("US", "GB", "BR"), 
>   val1=c(10,13,14), 
>   val2=c(23,12,32))
> Bar <- gvisBarChart(df)
> cat("%html ", Bar$html$chart)
> {code}
> Then I got the following error message:
> {code}
> Error : .onLoad failed in loadNamespace() for 'googleVis', details:
>   call: rematchDefinition(definition, fdef, mnames, fnames, signature)
>   error: methods can add arguments to the generic 'toJSON' only if '...' is 
> an argument to the generic
> Error : package or namespace load failed for 'googleVis'
> {code}
> But the expected result is to get some HTML code output, as it was with Spark 
> 2.1.0.






[jira] [Commented] (SPARK-23710) Upgrade Hive to 2.3.2

2018-03-25 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412917#comment-16412917
 ] 

Xiao Li commented on SPARK-23710:
-

What are the potential behavior changes this JIRA could introduce? 

Based on the PR https://github.com/apache/spark/pull/20659, the risk is pretty 
high. 

> Upgrade Hive to 2.3.2
> -
>
> Key: SPARK-23710
> URL: https://issues.apache.org/jira/browse/SPARK-23710
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>



[jira] [Commented] (SPARK-23255) Add user guide and examples for DataFrame image reading functions

2018-03-25 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412908#comment-16412908
 ] 

Hyukjin Kwon commented on SPARK-23255:
--

Hey [~imatiach], are you interested in this?

> Add user guide and examples for DataFrame image reading functions
> -
>
> Key: SPARK-23255
> URL: https://issues.apache.org/jira/browse/SPARK-23255
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Priority: Minor
>
> SPARK-21866 added built-in support for reading image data into a DataFrame. 
> This new functionality should be documented in the user guide, with example 
> usage.






[jira] [Updated] (SPARK-23784) Cannot use custom Aggregator with groupBy/agg

2018-03-25 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-23784:
-
Component/s: (was: Project Infra)
 SQL

> Cannot use custom Aggregator with groupBy/agg 
> --
>
> Key: SPARK-23784
> URL: https://issues.apache.org/jira/browse/SPARK-23784
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Joshua Howard
>Priority: Major
>
> I have code 
> [here|https://stackoverflow.com/questions/49440766/trouble-getting-spark-aggregators-to-work] 
> where I am trying to use an Aggregator with both the `select` and `agg` 
> functions. I cannot seem to get this to work in Spark 2.3.0. 
> [Here|https://docs.cloud.databricks.com/docs/spark/1.6/examples/Dataset%20Aggregator.html]
>  is a blog post that appears to be using this functionality in Spark 1.6, but 
> it appears to no longer work.






[jira] [Resolved] (SPARK-23761) Dataframe filter(udf) followed by groupby in pyspark throws a casting error

2018-03-25 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23761.
--
Resolution: Cannot Reproduce

> Dataframe filter(udf) followed by groupby in pyspark throws a casting error
> ---
>
> Key: SPARK-23761
> URL: https://issues.apache.org/jira/browse/SPARK-23761
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.6.0
> Environment: pyspark 1.6.0
> Python 2.6.6 (r266:84292, Aug 18 2016, 15:13:37) 
> [GCC 4.4.7 20120313 (Red Hat 4.4.7-17)] on linux2
> CentOS 6.7
>Reporter: Dhaniram Kshirsagar
>Priority: Major
>
> On pyspark with a dataframe, we are getting the following exception when 
> filter (with a UDF) is followed by groupBy:
> # Snippet of error observed in pyspark
> {code:java}
> py4j.protocol.Py4JJavaError: An error occurred while calling o56.filter.
> : java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.plans.logical.Project cannot be cast to 
> org.apache.spark.sql.catalyst.plans.logical.Aggregate{code}
> This one looks like https://issues.apache.org/jira/browse/SPARK-12981, however 
> I'm not sure if it is the same.
>  
> Here is a gist with pyspark steps to reproduce this issue:
> [https://gist.github.com/dhaniram-kshirsagar/d72545620b6a05d145a1a6bece797b6d]
>  
>  
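For reference, a minimal PySpark sketch of the reported pattern, a filter using a Python UDF followed by a groupBy (the toy data is made up; on recent Spark versions this runs without the ClassCastException, consistent with the Cannot Reproduce resolution):
{code:python}
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["k", "v"])

# The filter predicate is a Python UDF, as in the report.
keep = F.udf(lambda v: v > 1, BooleanType())

# filter(udf) followed by groupBy -- the combination reported to fail on 1.6.0.
df.filter(keep("v")).groupBy("k").agg(F.sum("v").alias("total")).show()
{code}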


