[jira] [Assigned] (SPARK-25418) The metadata of DataSource table should not include Hive-generated storage properties.

2018-09-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25418:


Assignee: (was: Apache Spark)

> The metadata of DataSource table should not include Hive-generated storage 
> properties.
> --
>
> Key: SPARK-25418
> URL: https://issues.apache.org/jira/browse/SPARK-25418
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> When Hive support is enabled, the Hive catalog puts extra storage properties into
> the table metadata even for DataSource tables, but we should not have them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25418) The metadata of DataSource table should not include Hive-generated storage properties.

2018-09-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25418:


Assignee: Apache Spark

> The metadata of DataSource table should not include Hive-generated storage 
> properties.
> --
>
> Key: SPARK-25418
> URL: https://issues.apache.org/jira/browse/SPARK-25418
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>Priority: Major
>
> When Hive support is enabled, the Hive catalog puts extra storage properties into
> the table metadata even for DataSource tables, but we should not have them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25418) The metadata of DataSource table should not include Hive-generated storage properties.

2018-09-12 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613040#comment-16613040
 ] 

Apache Spark commented on SPARK-25418:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/22410

> The metadata of DataSource table should not include Hive-generated storage 
> properties.
> --
>
> Key: SPARK-25418
> URL: https://issues.apache.org/jira/browse/SPARK-25418
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> When Hive support is enabled, the Hive catalog puts extra storage properties into
> the table metadata even for DataSource tables, but we should not have them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25418) The metadata of DataSource table should not include Hive-generated storage properties.

2018-09-12 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-25418:
-

 Summary: The metadata of DataSource table should not include 
Hive-generated storage properties.
 Key: SPARK-25418
 URL: https://issues.apache.org/jira/browse/SPARK-25418
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Takuya Ueshin


When Hive support is enabled, the Hive catalog puts extra storage properties into
the table metadata even for DataSource tables, but we should not have them.
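A minimal sketch (not part of the original report) of how one might observe the extra storage properties on a DataSource table in a Hive-enabled session; the table name and the DESCRIBE FORMATTED check are illustrative assumptions:

{code:scala}
// Assumes a Hive-enabled SparkSession (e.g. spark-shell built with Hive support),
// where spark.implicits._ is already in scope.
spark.sql("CREATE TABLE t (id INT) USING parquet")

// The "Storage Properties" row (the exact label may vary by version) is where
// Hive-generated entries such as serialization.format would show up, even though
// t is a DataSource table.
spark.sql("DESCRIBE FORMATTED t")
  .filter($"col_name" === "Storage Properties")
  .show(truncate = false)
{code}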



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25352) Perform ordered global limit when limit number is bigger than topKSortFallbackThreshold

2018-09-12 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613033#comment-16613033
 ] 

Apache Spark commented on SPARK-25352:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/22409

> Perform ordered global limit when limit number is bigger than 
> topKSortFallbackThreshold
> ---
>
> Key: SPARK-25352
> URL: https://issues.apache.org/jira/browse/SPARK-25352
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>
> We have an optimization on global limit that evenly distributes the limit rows across
> all partitions. This optimization doesn't work for ordered results.
> A query ending with sort + limit is in most cases performed by
> `TakeOrderedAndProjectExec`.
> But if the limit number is bigger than `SQLConf.TOP_K_SORT_FALLBACK_THRESHOLD`,
> a global limit is used instead. In that case, we need to perform an ordered global limit.
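A small sketch (mine, not from the ticket) of triggering the fallback, assuming the threshold is exposed as spark.sql.execution.topKSortFallbackThreshold (the key behind `SQLConf.TOP_K_SORT_FALLBACK_THRESHOLD`):

{code:scala}
import org.apache.spark.sql.functions.col

// Make the threshold smaller than the limit so the TakeOrderedAndProjectExec
// path is skipped and the query falls back to a global limit.
spark.conf.set("spark.sql.execution.topKSortFallbackThreshold", 100L)

val df = spark.range(0, 1000000).orderBy(col("id").desc).limit(1000)

// explain() shows whether TakeOrderedAndProjectExec or a GlobalLimit is used;
// with limit (1000) > threshold (100) it is the global limit, which must still
// respect the ordering.
df.explain()
{code}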



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25352) Perform ordered global limit when limit number is bigger than topKSortFallbackThreshold

2018-09-12 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16613032#comment-16613032
 ] 

Apache Spark commented on SPARK-25352:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/22409

> Perform ordered global limit when limit number is bigger than 
> topKSortFallbackThreshold
> ---
>
> Key: SPARK-25352
> URL: https://issues.apache.org/jira/browse/SPARK-25352
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>
> We have an optimization on global limit that evenly distributes the limit rows across
> all partitions. This optimization doesn't work for ordered results.
> A query ending with sort + limit is in most cases performed by
> `TakeOrderedAndProjectExec`.
> But if the limit number is bigger than `SQLConf.TOP_K_SORT_FALLBACK_THRESHOLD`,
> a global limit is used instead. In that case, we need to perform an ordered global limit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25295) Pod names conflicts in client mode, if previous submission was not a clean shutdown.

2018-09-12 Thread Yinan Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yinan Li resolved SPARK-25295.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

> Pod names conflicts in client mode, if previous submission was not a clean 
> shutdown.
> 
>
> Key: SPARK-25295
> URL: https://issues.apache.org/jira/browse/SPARK-25295
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Prashant Sharma
>Priority: Major
> Fix For: 2.4.0
>
>
> If the previous job was killed somehow, e.g. by disconnecting the client, it
> leaves behind executor pods named spark-exec-#, which cause naming
> conflicts and failures for the next job submission.
> io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: 
> POST at: https://:6443/api/v1/namespaces/default/pods. Message: pods 
> "spark-exec-4" already exists. Received status: Status(apiVersion=v1, 
> code=409, details=StatusDetails(causes=[], group=null, kind=pods, 
> name=spark-exec-4, retryAfterSeconds=null, uid=null, 
> additionalProperties={}), kind=Status, message=pods "spark-exec-4" already 
> exists, metadata=ListMeta(resourceVersion=null, selfLink=null, 
> additionalProperties={}), reason=AlreadyExists, status=Failure, 
> additionalProperties={}).
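A hedged workaround sketch (not from the report): if the Spark build in use exposes spark.kubernetes.executor.podNamePrefix (check the Kubernetes configuration docs for your version), giving each submission a unique prefix keeps leftover spark-exec-# pods from colliding with the next job:

{code:scala}
import org.apache.spark.sql.SparkSession

// A unique prefix per submission means a new job's executor pods cannot clash
// with pods left behind by a previous, uncleanly shut down client-mode job.
val spark = SparkSession.builder()
  .appName("client-mode-job")
  .config("spark.kubernetes.executor.podNamePrefix", s"myjob-${System.currentTimeMillis}")
  .getOrCreate()
{code}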



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25415) Make plan change log in RuleExecutor configurable by SQLConf

2018-09-12 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25415:

Priority: Major  (was: Minor)

> Make plan change log in RuleExecutor configurable by SQLConf
> 
>
> Key: SPARK-25415
> URL: https://issues.apache.org/jira/browse/SPARK-25415
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maryann Xue
>Priority: Major
> Fix For: 3.0.0
>
>
> In RuleExecutor, after applying a rule, if the plan has changed, the before 
> and after plan will be logged using level "trace". At times, however, such 
> information can be very helpful for debugging, so making the log level 
> configurable in SQLConf would allow users to turn on the plan change log 
> independently and save the trouble of tweaking log4j settings.
> Meanwhile, filtering plan change log for specific rules can also be very 
> useful.
> So I propose adding two confs:
> 1. spark.sql.optimizer.planChangeLog.level - set a specific log level for 
> logging plan changes after a rule is applied.
> 2. spark.sql.optimizer.planChangeLog.rules - enable plan change logging only 
> for a set of specified rules, separated by commas.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25415) Make plan change log in RuleExecutor configurable by SQLConf

2018-09-12 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-25415.
-
   Resolution: Fixed
 Assignee: Maryann Xue
Fix Version/s: 3.0.0

> Make plan change log in RuleExecutor configurable by SQLConf
> 
>
> Key: SPARK-25415
> URL: https://issues.apache.org/jira/browse/SPARK-25415
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maryann Xue
>Assignee: Maryann Xue
>Priority: Major
> Fix For: 3.0.0
>
>
> In RuleExecutor, after applying a rule, if the plan has changed, the before 
> and after plan will be logged using level "trace". At times, however, such 
> information can be very helpful for debugging, so making the log level 
> configurable in SQLConf would allow users to turn on the plan change log 
> independently and save the trouble of tweaking log4j settings.
> Meanwhile, filtering plan change log for specific rules can also be very 
> useful.
> So I propose adding two confs:
> 1. spark.sql.optimizer.planChangeLog.level - set a specific log level for 
> logging plan changes after a rule is applied.
> 2. spark.sql.optimizer.planChangeLog.rules - enable plan change logging only 
> for a set of specified rules, separated by commas.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25397) SparkSession.conf fails when given default value with Python 3

2018-09-12 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612994#comment-16612994
 ] 

Hyukjin Kwon commented on SPARK-25397:
--

[~josephkb], do you want to backport this bit or just resolve this? Either way 
sounds okay to me.

> SparkSession.conf fails when given default value with Python 3
> --
>
> Key: SPARK-25397
> URL: https://issues.apache.org/jira/browse/SPARK-25397
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Spark 2.3.1 has a Python 3 incompatibility when requesting a conf value from
> SparkSession with a non-string default value. Reproduce via a
> SparkSession call:
> {{spark.conf.get("myConf", False)}}
> This gives the error:
> {code}
> >>> spark.conf.get("myConf", False)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File 
> "/Users/josephkb/work/spark-bin/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/conf.py",
>  line 51, in get
> self._checkType(default, "default")
>   File 
> "/Users/josephkb/work/spark-bin/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/conf.py",
>  line 62, in _checkType
> if not isinstance(obj, str) and not isinstance(obj, unicode):
> *NameError: name 'unicode' is not defined*
> {code}
> The offending line in Spark's branch-2.3 is in
> https://github.com/apache/spark/blob/branch-2.3/python/pyspark/sql/conf.py,
> which uses the name {{unicode}}, a name that does not exist in Python 3.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25378) ArrayData.toArray(StringType) assume UTF8String in 2.4

2018-09-12 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612965#comment-16612965
 ] 

Hyukjin Kwon commented on SPARK-25378:
--

{quote}
If it is not pubic, why didn't we hide it in the first place?
{quote}

Because we already state that the package itself is not meant to be public:
https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala#L21-L22

These modifiers were removed in SPARK-16813 for this reason.

> ArrayData.toArray(StringType) assume UTF8String in 2.4
> --
>
> Key: SPARK-25378
> URL: https://issues.apache.org/jira/browse/SPARK-25378
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> The following code works in 2.3.1 but failed in 2.4.0-SNAPSHOT:
> {code}
> import org.apache.spark.sql.catalyst.util._
> import org.apache.spark.sql.types.StringType
> ArrayData.toArrayData(Array("a", "b")).toArray[String](StringType)
> res0: Array[String] = Array(a, b)
> {code}
> In 2.4.0-SNAPSHOT, the error is
> {code}java.lang.ClassCastException: java.lang.String cannot be cast to 
> org.apache.spark.unsafe.types.UTF8String
>   at 
> org.apache.spark.sql.catalyst.util.GenericArrayData.getUTF8String(GenericArrayData.scala:75)
>   at 
> org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at 
> org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at org.apache.spark.sql.catalyst.util.ArrayData.toArray(ArrayData.scala:178)
>   ... 51 elided
> {code}
> cc: [~cloud_fan] [~yogeshg]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25378) ArrayData.toArray(StringType) assume UTF8String in 2.4

2018-09-12 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612947#comment-16612947
 ] 

Wenchen Fan commented on SPARK-25378:
-

[~viirya] Can you take a look and see how hard it is to fix it?

After a quick look, I think this works in 2.3 if and only if the
`GenericArrayData` is created with `Array[String]` (i.e. a malformed
ArrayData) and we wrongly call the `toArray[String](StringType)` method.

A quick solution is to revert SPARK-23875 from 2.4, but then we sacrifice
performance to retain a buggy but backward-compatible behavior. So we need to
make a trade-off here.
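For context, a sketch (mine, not from the discussion) of the well-formed usage, where the ArrayData holds UTF8String values and is read back as UTF8String, which avoids the cast error:

{code:scala}
import org.apache.spark.sql.catalyst.util.ArrayData
import org.apache.spark.sql.types.StringType
import org.apache.spark.unsafe.types.UTF8String

// Internally, string elements are expected to be UTF8String, so build the
// ArrayData that way and extract the elements as UTF8String.
val data = ArrayData.toArrayData(Array("a", "b").map(UTF8String.fromString))
val out  = data.toArray[UTF8String](StringType).map(_.toString)
// out: Array[String] = Array(a, b)
{code}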

> ArrayData.toArray(StringType) assume UTF8String in 2.4
> --
>
> Key: SPARK-25378
> URL: https://issues.apache.org/jira/browse/SPARK-25378
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> The following code works in 2.3.1 but failed in 2.4.0-SNAPSHOT:
> {code}
> import org.apache.spark.sql.catalyst.util._
> import org.apache.spark.sql.types.StringType
> ArrayData.toArrayData(Array("a", "b")).toArray[String](StringType)
> res0: Array[String] = Array(a, b)
> {code}
> In 2.4.0-SNAPSHOT, the error is
> {code}java.lang.ClassCastException: java.lang.String cannot be cast to 
> org.apache.spark.unsafe.types.UTF8String
>   at 
> org.apache.spark.sql.catalyst.util.GenericArrayData.getUTF8String(GenericArrayData.scala:75)
>   at 
> org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at 
> org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at org.apache.spark.sql.catalyst.util.ArrayData.toArray(ArrayData.scala:178)
>   ... 51 elided
> {code}
> cc: [~cloud_fan] [~yogeshg]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25357) Add metadata to SparkPlanInfo to dump more information like file path to event log

2018-09-12 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-25357:
---

Assignee: Lantao Jin

> Add metadata to SparkPlanInfo to dump more information like file path to 
> event log
> --
>
> Key: SPARK-25357
> URL: https://issues.apache.org/jira/browse/SPARK-25357
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Lantao Jin
>Assignee: Lantao Jin
>Priority: Minor
> Fix For: 2.3.2, 2.4.0
>
>
> The field {{metadata}} was removed from {{SparkPlanInfo}} in SPARK-17701.
> Correspondingly, this field was also removed from the
> {{SparkListenerSQLExecutionStart}} event in the Spark event log. If we want to
> analyze the event log to get fields wider than 100 characters (e.g. the Location
> or ReadSchema of a FileScan), they are abbreviated in the {{simpleString}} of the
> SparkPlanInfo JSON or in the {{physicalPlanDescription}} JSON.
> Before 2.3, the SparkListenerSQLExecutionStart fragment in the event log
> contained the metadata field:
> {quote}Location: 
> InMemoryFileIndex[hdfs://hercules/sys/edw/prs_idm/idm_cbt_am_t/cbt/cbt_acct_prfl_info/snapshot/dt...,
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct"
> {quote}
> So I add this field back to the SparkPlanInfo class. Then it will log the
> metadata out to the event log. Intact information in the event log is very
> useful for offline job analysis.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25402) Null handling in BooleanSimplification

2018-09-12 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-25402.
-
   Resolution: Fixed
Fix Version/s: 2.4.0
   2.3.2

> Null handling in BooleanSimplification
> --
>
> Key: SPARK-25402
> URL: https://issues.apache.org/jira/browse/SPARK-25402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Blocker
> Fix For: 2.3.2, 2.4.0
>
>
> SPARK-20350 introduced a bug in BooleanSimplification's null handling. For
> example, the following case returns a wrong answer:
> {code}
> val schema = StructType.fromDDL("a boolean, b int")
> val rows = Seq(Row(null, 1))
> val rdd = sparkContext.parallelize(rows)
> val df = spark.createDataFrame(rdd, schema)
> checkAnswer(df.where("(NOT a) OR a"), Seq.empty)
> {code}
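A short sketch (not from the ticket) of the three-valued-logic behavior the test above relies on: with a null boolean a, (NOT a) OR a evaluates to NULL rather than true, so the filter must return no rows.

{code:scala}
// NULL OR (NOT NULL) is NULL under SQL three-valued logic, so the predicate is
// not satisfied and the row has to be filtered out; rewriting (NOT a) OR a to
// literal true would wrongly keep it.
spark.sql("SELECT (NOT CAST(NULL AS BOOLEAN)) OR CAST(NULL AS BOOLEAN) AS v").show()
// +----+
// |   v|
// +----+
// |null|
// +----+
{code}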



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25357) Add metadata to SparkPlanInfo to dump more information like file path to event log

2018-09-12 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-25357.
-
   Resolution: Fixed
Fix Version/s: 2.3.2
   2.4.0

Issue resolved by pull request 22353
[https://github.com/apache/spark/pull/22353]

> Add metadata to SparkPlanInfo to dump more information like file path to 
> event log
> --
>
> Key: SPARK-25357
> URL: https://issues.apache.org/jira/browse/SPARK-25357
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Lantao Jin
>Priority: Minor
> Fix For: 2.4.0, 2.3.2
>
>
> The field {{metadata}} was removed from {{SparkPlanInfo}} in SPARK-17701.
> Correspondingly, this field was also removed from the
> {{SparkListenerSQLExecutionStart}} event in the Spark event log. If we want to
> analyze the event log to get fields wider than 100 characters (e.g. the Location
> or ReadSchema of a FileScan), they are abbreviated in the {{simpleString}} of the
> SparkPlanInfo JSON or in the {{physicalPlanDescription}} JSON.
> Before 2.3, the SparkListenerSQLExecutionStart fragment in the event log
> contained the metadata field:
> {quote}Location: 
> InMemoryFileIndex[hdfs://hercules/sys/edw/prs_idm/idm_cbt_am_t/cbt/cbt_acct_prfl_info/snapshot/dt...,
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct"
> {quote}
> So I add this field back to the SparkPlanInfo class. Then it will log the
> metadata out to the event log. Intact information in the event log is very
> useful for offline job analysis.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25387) Malformed CSV causes NPE

2018-09-12 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-25387.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22374
[https://github.com/apache/spark/pull/22374]

> Malformed CSV causes NPE
> 
>
> Key: SPARK-25387
> URL: https://issues.apache.org/jira/browse/SPARK-25387
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>
> Loading malformed CSV files or a malformed dataset can cause a NullPointerException;
> for example, the code:
> {code:scala}
> val schema = StructType(StructField("a", IntegerType) :: Nil)
> val input = spark.createDataset(Seq("\u\u\u0001234"))
> spark.read.schema(schema).csv(input).collect()
> {code} 
> crashes with the exception:
> {code:java}
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert(UnivocityParser.scala:219)
>   at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:210)
>   at 
> org.apache.spark.sql.DataFrameReader$$anonfun$11$$anonfun$12.apply(DataFrameReader.scala:523)
>   at 
> org.apache.spark.sql.DataFrameReader$$anonfun$11$$anonfun$12.apply(DataFrameReader.scala:523)
>   at 
> org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:68)
> {code}
> If schema is not specified, the following exception is thrown:
> {code:java}
> java.lang.NullPointerException
>   at 
> scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:192)
>   at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:192)
>   at 
> scala.collection.IndexedSeqOptimized$class.zipWithIndex(IndexedSeqOptimized.scala:99)
>   at 
> scala.collection.mutable.ArrayOps$ofRef.zipWithIndex(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVDataSource.makeSafeHeader(CSVDataSource.scala:109)
>   at 
> org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.inferFromDataset(CSVDataSource.scala:247)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25387) Malformed CSV causes NPE

2018-09-12 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-25387:
---

Assignee: Maxim Gekk

> Malformed CSV causes NPE
> 
>
> Key: SPARK-25387
> URL: https://issues.apache.org/jira/browse/SPARK-25387
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>
> Loading malformed CSV files or a malformed dataset can cause a NullPointerException;
> for example, the code:
> {code:scala}
> val schema = StructType(StructField("a", IntegerType) :: Nil)
> val input = spark.createDataset(Seq("\u\u\u0001234"))
> spark.read.schema(schema).csv(input).collect()
> {code} 
> crashes with the exception:
> {code:java}
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert(UnivocityParser.scala:219)
>   at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:210)
>   at 
> org.apache.spark.sql.DataFrameReader$$anonfun$11$$anonfun$12.apply(DataFrameReader.scala:523)
>   at 
> org.apache.spark.sql.DataFrameReader$$anonfun$11$$anonfun$12.apply(DataFrameReader.scala:523)
>   at 
> org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:68)
> {code}
> If schema is not specified, the following exception is thrown:
> {code:java}
> java.lang.NullPointerException
>   at 
> scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:192)
>   at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:192)
>   at 
> scala.collection.IndexedSeqOptimized$class.zipWithIndex(IndexedSeqOptimized.scala:99)
>   at 
> scala.collection.mutable.ArrayOps$ofRef.zipWithIndex(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVDataSource.makeSafeHeader(CSVDataSource.scala:109)
>   at 
> org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.inferFromDataset(CSVDataSource.scala:247)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23820) Allow the long form of call sites to be recorded in the log

2018-09-12 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23820.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22398
[https://github.com/apache/spark/pull/22398]

> Allow the long form of call sites to be recorded in the log
> ---
>
> Key: SPARK-23820
> URL: https://issues.apache.org/jira/browse/SPARK-23820
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Michael Mior
>Assignee: Michael Mior
>Priority: Trivial
> Fix For: 2.4.0
>
>
> It would be nice if the long form of the callsite information could be 
> included in the log. An example of what I'm proposing is here: 
> https://github.com/michaelmior/spark/commit/4b4076cfb1d51ceb20fd2b0a3b1b5be2aebb6416



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25291) Flakiness of tests in terms of executor memory (SecretsTestSuite)

2018-09-12 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612894#comment-16612894
 ] 

Stavros Kontopoulos edited comment on SPARK-25291 at 9/13/18 12:41 AM:
---

[~ifilonenko] I can have a look. It's a bit weird, but kind of expected, as these
tests are the only ones that use the fabric8io client to connect to the running
pod. There is no good way to do this right now.


was (Author: skonto):
[~ifilonenko] I can have a look. It's a bit weird.

> Flakiness of tests in terms of executor memory (SecretsTestSuite)
> -
>
> Key: SPARK-25291
> URL: https://issues.apache.org/jira/browse/SPARK-25291
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Ilan Filonenko
>Priority: Major
>
> SecretsTestSuite shows flakiness in how the executor memory is set:
> Run SparkPi with env and mount secrets. *** FAILED ***
>  "[884]Mi" did not equal "[1408]Mi" (KubernetesSuite.scala:272)
> This happens when run with the default settings.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25291) Flakiness of tests in terms of executor memory (SecretsTestSuite)

2018-09-12 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612894#comment-16612894
 ] 

Stavros Kontopoulos edited comment on SPARK-25291 at 9/13/18 12:41 AM:
---

[~ifilonenko] I can have a look. It's a bit weird, but kind of expected, as these
tests are the only ones that use the fabric8io client to connect to the running
pod. There is no good way to do this right now.

I will try to debug it.


was (Author: skonto):
[~ifilonenko] I can have a look. It's a bit weird, but kind of expected, as these
tests are the only ones that use the fabric8io client to connect to the running
pod. There is no good way to do this right now.

> Flakiness of tests in terms of executor memory (SecretsTestSuite)
> -
>
> Key: SPARK-25291
> URL: https://issues.apache.org/jira/browse/SPARK-25291
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Ilan Filonenko
>Priority: Major
>
> SecretsTestSuite shows flakiness in how the executor memory is set:
> Run SparkPi with env and mount secrets. *** FAILED ***
>  "[884]Mi" did not equal "[1408]Mi" (KubernetesSuite.scala:272)
> This happens when run with the default settings.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25291) Flakiness of tests in terms of executor memory (SecretsTestSuite)

2018-09-12 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612894#comment-16612894
 ] 

Stavros Kontopoulos commented on SPARK-25291:
-

[~ifilonenko] I can have a look. It's a bit weird.

> Flakiness of tests in terms of executor memory (SecretsTestSuite)
> -
>
> Key: SPARK-25291
> URL: https://issues.apache.org/jira/browse/SPARK-25291
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Ilan Filonenko
>Priority: Major
>
> SecretsTestSuite shows flakiness in how the executor memory is set:
> Run SparkPi with env and mount secrets. *** FAILED ***
>  "[884]Mi" did not equal "[1408]Mi" (KubernetesSuite.scala:272)
> This happens when run with the default settings.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23012) Support for predicate pushdown and partition pruning when left joining large Hive tables

2018-09-12 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612881#comment-16612881
 ] 

Yuming Wang commented on SPARK-23012:
-

It seems the following PR resolves your issue: 
https://github.com/apache/spark/pull/20816

> Support for predicate pushdown and partition pruning when left joining large 
> Hive tables
> 
>
> Key: SPARK-23012
> URL: https://issues.apache.org/jira/browse/SPARK-23012
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.2.0
>Reporter: Rick Kramer
>Priority: Major
>
> We have a Hive view that left outer joins several large, partitioned ORC
> Hive tables together on date. When the view is used in a Hive query, Hive
> pushes the date predicates down into the joins and prunes the partitions for all
> tables. When I use this view from pyspark, the predicate is only used to
> prune the left-most table, and all partitions of the additional tables are
> selected.
> For example, consider two partitioned hive tables a & b joined in a view:
> create table a (
>a_val string
> )
> partitioned by (ds string)
> stored as orc;
> create table b (
>b_val string
> )
> partitioned by (ds string)
> stored as orc;
> create view example_view as
> select
> a_val
> , b_val
> , ds
> from a 
> left outer join b on b.ds = a.ds
> Then in pyspark you might try to query from the view filtering on ds:
> spark.table('example_view').filter(F.col('ds') == '2018-01-01')
> If tables a and b are large, this results in a plan that filters a on ds =
> 2018-01-01 but scans all partitions of table b.
> If the join in the view is changed to an inner join, the predicate gets 
> pushed down to a & b and the partitions are pruned as you'd expect.
> In practice, the view is fairly complex and contains a lot of business logic 
> we'd prefer not to replicate in pyspark if we can avoid it.
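A quick way (my sketch, not from the report) to see whether the date predicate reaches table b is to inspect the partition filters in the physical plan; the Scala equivalent of the pyspark call above:

{code:scala}
import org.apache.spark.sql.functions.col

// The PartitionFilters / partition pruning predicates shown for table b's scan
// reveal whether ds = '2018-01-01' was pushed past the left outer join or
// whether b is scanned in full.
spark.table("example_view")
  .filter(col("ds") === "2018-01-01")
  .explain(true)
{code}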



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25416) ArrayPosition function may return incorrect result when right expression is implicitly downcasted.

2018-09-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25416:


Assignee: Apache Spark

> ArrayPosition function may return incorrect result when right expression is 
> implicitly downcasted.
> --
>
> Key: SPARK-25416
> URL: https://issues.apache.org/jira/browse/SPARK-25416
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dilip Biswal
>Assignee: Apache Spark
>Priority: Major
>
> In ArrayPosition, we currently cast the right-hand side expression to match
> the element type of the left-hand side array. This may result in downcasting
> and may return a wrong or questionable result.
> Example :
> spark-sql> select array_position(array(1), 1.34);
> 1
> spark-sql> select array_position(array(1), 'foo');
> null
> We should safely coerce both left and right hand side expressions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25416) ArrayPosition function may return incorrect result when right expression is implicitly downcasted.

2018-09-12 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612817#comment-16612817
 ] 

Apache Spark commented on SPARK-25416:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/22407

> ArrayPosition function may return incorrect result when right expression is 
> implicitly downcasted.
> --
>
> Key: SPARK-25416
> URL: https://issues.apache.org/jira/browse/SPARK-25416
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dilip Biswal
>Priority: Major
>
> In ArrayPosition, we currently cast the right-hand side expression to match
> the element type of the left-hand side array. This may result in downcasting
> and may return a wrong or questionable result.
> Example :
> spark-sql> select array_position(array(1), 1.34);
> 1
> spark-sql> select array_position(array(1), 'foo');
> null
> We should safely coerce both left and right hand side expressions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25416) ArrayPosition function may return incorrect result when right expression is implicitly downcasted.

2018-09-12 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612818#comment-16612818
 ] 

Apache Spark commented on SPARK-25416:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/22407

> ArrayPosition function may return incorrect result when right expression is 
> implicitly downcasted.
> --
>
> Key: SPARK-25416
> URL: https://issues.apache.org/jira/browse/SPARK-25416
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dilip Biswal
>Priority: Major
>
> In ArrayPosition, we currently cast the right-hand side expression to match
> the element type of the left-hand side array. This may result in downcasting
> and may return a wrong or questionable result.
> Example :
> spark-sql> select array_position(array(1), 1.34);
> 1
> spark-sql> select array_position(array(1), 'foo');
> null
> We should safely coerce both left and right hand side expressions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25416) ArrayPosition function may return incorrect result when right expression is implicitly downcasted.

2018-09-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25416:


Assignee: (was: Apache Spark)

> ArrayPosition function may return incorrect result when right expression is 
> implicitly downcasted.
> --
>
> Key: SPARK-25416
> URL: https://issues.apache.org/jira/browse/SPARK-25416
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dilip Biswal
>Priority: Major
>
> In ArrayPosition, we currently cast the right-hand side expression to match
> the element type of the left-hand side array. This may result in downcasting
> and may return a wrong or questionable result.
> Example :
> spark-sql> select array_position(array(1), 1.34);
> 1
> spark-sql> select array_position(array(1), 'foo');
> null
> We should safely coerce both left and right hand side expressions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25417) ArrayContains function may return incorrect result when right expression is implicitly down casted

2018-09-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25417:


Assignee: Apache Spark

> ArrayContains function may return incorrect result when right expression is 
> implicitly down casted
> --
>
> Key: SPARK-25417
> URL: https://issues.apache.org/jira/browse/SPARK-25417
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dilip Biswal
>Assignee: Apache Spark
>Priority: Major
>
> In ArrayContains, we currently cast the right-hand side expression to match
> the element type of the left-hand side array. This may result in downcasting
> and may return a wrong or questionable result.
> Example:
> {code:java}
> spark-sql> select array_contains(array(1), 1.34);
> true
> {code}
> {code:java}
> spark-sql> select array_contains(array(1), 'foo');
> null
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25417) ArrayContains function may return incorrect result when right expression is implicitly down casted

2018-09-12 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612816#comment-16612816
 ] 

Apache Spark commented on SPARK-25417:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/22408

> ArrayContains function may return incorrect result when right expression is 
> implicitly down casted
> --
>
> Key: SPARK-25417
> URL: https://issues.apache.org/jira/browse/SPARK-25417
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dilip Biswal
>Priority: Major
>
> In ArrayContains, we currently cast the right-hand side expression to match
> the element type of the left-hand side array. This may result in downcasting
> and may return a wrong or questionable result.
> Example:
> {code:java}
> spark-sql> select array_contains(array(1), 1.34);
> true
> {code}
> {code:java}
> spark-sql> select array_contains(array(1), 'foo');
> null
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25417) ArrayContains function may return incorrect result when right expression is implicitly down casted

2018-09-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25417:


Assignee: (was: Apache Spark)

> ArrayContains function may return incorrect result when right expression is 
> implicitly down casted
> --
>
> Key: SPARK-25417
> URL: https://issues.apache.org/jira/browse/SPARK-25417
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dilip Biswal
>Priority: Major
>
> In ArrayContains, we currently cast the right-hand side expression to match
> the element type of the left-hand side array. This may result in downcasting
> and may return a wrong or questionable result.
> Example:
> {code:java}
> spark-sql> select array_contains(array(1), 1.34);
> true
> {code}
> {code:java}
> spark-sql> select array_contains(array(1), 'foo');
> null
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25417) ArrayContains function may return incorrect result when right expression is implicitly down casted

2018-09-12 Thread Dilip Biswal (JIRA)
Dilip Biswal created SPARK-25417:


 Summary: ArrayContains function may return incorrect result when 
right expression is implicitly down casted
 Key: SPARK-25417
 URL: https://issues.apache.org/jira/browse/SPARK-25417
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Dilip Biswal


In ArrayContains, we currently cast the right-hand side expression to match the
element type of the left-hand side array. This may result in downcasting and
may return a wrong or questionable result.

Example:

{code:java}
spark-sql> select array_contains(array(1), 1.34);
true
{code}

{code:java}
spark-sql> select array_contains(array(1), 'foo');
null
{code}
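Until the coercion is made safe, a hedged workaround sketch (mine, not part of the report): cast the array up to the comparison type so the right-hand side is not silently downcast.

{code:scala}
// With implicit downcasting, 1.34 becomes 1 and array_contains(array(1), 1.34)
// answers true. Upcasting the array to array<double> keeps the comparison exact
// and yields false as expected.
spark.sql("SELECT array_contains(CAST(array(1) AS ARRAY<DOUBLE>), 1.34)").show()
{code}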



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25415) Make plan change log in RuleExecutor configurable by SQLConf

2018-09-12 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612804#comment-16612804
 ] 

Apache Spark commented on SPARK-25415:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/22407

> Make plan change log in RuleExecutor configurable by SQLConf
> 
>
> Key: SPARK-25415
> URL: https://issues.apache.org/jira/browse/SPARK-25415
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maryann Xue
>Priority: Minor
>
> In RuleExecutor, after applying a rule, if the plan has changed, the before 
> and after plan will be logged using level "trace". At times, however, such 
> information can be very helpful for debugging, so making the log level 
> configurable in SQLConf would allow users to turn on the plan change log 
> independently and save the trouble of tweaking log4j settings.
> Meanwhile, filtering plan change log for specific rules can also be very 
> useful.
> So I propose adding two confs:
> 1. spark.sql.optimizer.planChangeLog.level - set a specific log level for 
> logging plan changes after a rule is applied.
> 2. spark.sql.optimizer.planChangeLog.rules - enable plan change logging only 
> for a set of specified rules, separated by commas.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25415) Make plan change log in RuleExecutor configurable by SQLConf

2018-09-12 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612803#comment-16612803
 ] 

Apache Spark commented on SPARK-25415:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/22407

> Make plan change log in RuleExecutor configurable by SQLConf
> 
>
> Key: SPARK-25415
> URL: https://issues.apache.org/jira/browse/SPARK-25415
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maryann Xue
>Priority: Minor
>
> In RuleExecutor, after applying a rule, if the plan has changed, the before 
> and after plan will be logged using level "trace". At times, however, such 
> information can be very helpful for debugging, so making the log level 
> configurable in SQLConf would allow users to turn on the plan change log 
> independently and save the trouble of tweaking log4j settings.
> Meanwhile, filtering plan change log for specific rules can also be very 
> useful.
> So I propose adding two confs:
> 1. spark.sql.optimizer.planChangeLog.level - set a specific log level for 
> logging plan changes after a rule is applied.
> 2. spark.sql.optimizer.planChangeLog.rules - enable plan change logging only 
> for a set of specified rules, separated by commas.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25415) Make plan change log in RuleExecutor configurable by SQLConf

2018-09-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25415:


Assignee: (was: Apache Spark)

> Make plan change log in RuleExecutor configurable by SQLConf
> 
>
> Key: SPARK-25415
> URL: https://issues.apache.org/jira/browse/SPARK-25415
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maryann Xue
>Priority: Minor
>
> In RuleExecutor, after applying a rule, if the plan has changed, the before 
> and after plan will be logged using level "trace". At times, however, such 
> information can be very helpful for debugging, so making the log level 
> configurable in SQLConf would allow users to turn on the plan change log 
> independently and save the trouble of tweaking log4j settings.
> Meanwhile, filtering plan change log for specific rules can also be very 
> useful.
> So I propose adding two confs:
> 1. spark.sql.optimizer.planChangeLog.level - set a specific log level for 
> logging plan changes after a rule is applied.
> 2. spark.sql.optimizer.planChangeLog.rules - enable plan change logging only 
> for a set of specified rules, separated by commas.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25415) Make plan change log in RuleExecutor configurable by SQLConf

2018-09-12 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612766#comment-16612766
 ] 

Apache Spark commented on SPARK-25415:
--

User 'maryannxue' has created a pull request for this issue:
https://github.com/apache/spark/pull/22406

> Make plan change log in RuleExecutor configurable by SQLConf
> 
>
> Key: SPARK-25415
> URL: https://issues.apache.org/jira/browse/SPARK-25415
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maryann Xue
>Priority: Minor
>
> In RuleExecutor, after applying a rule, if the plan has changed, the before 
> and after plan will be logged using level "trace". At times, however, such 
> information can be very helpful for debugging, so making the log level 
> configurable in SQLConf would allow users to turn on the plan change log 
> independently and save the trouble of tweaking log4j settings.
> Meanwhile, filtering plan change log for specific rules can also be very 
> useful.
> So I propose adding two confs:
> 1. spark.sql.optimizer.planChangeLog.level - set a specific log level for 
> logging plan changes after a rule is applied.
> 2. spark.sql.optimizer.planChangeLog.rules - enable plan change logging only 
> for a set of specified rules, separated by commas.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25415) Make plan change log in RuleExecutor configurable by SQLConf

2018-09-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25415:


Assignee: Apache Spark

> Make plan change log in RuleExecutor configurable by SQLConf
> 
>
> Key: SPARK-25415
> URL: https://issues.apache.org/jira/browse/SPARK-25415
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maryann Xue
>Assignee: Apache Spark
>Priority: Minor
>
> In RuleExecutor, after applying a rule, if the plan has changed, the before 
> and after plan will be logged using level "trace". At times, however, such 
> information can be very helpful for debugging, so making the log level 
> configurable in SQLConf would allow users to turn on the plan change log 
> independently and save the trouble of tweaking log4j settings.
> Meanwhile, filtering plan change log for specific rules can also be very 
> useful.
> So I propose adding two confs:
> 1. spark.sql.optimizer.planChangeLog.level - set a specific log level for 
> logging plan changes after a rule is applied.
> 2. spark.sql.optimizer.planChangeLog.rules - enable plan change logging only 
> for a set of specified rules, separated by commas.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25416) ArrayPosition function may return incorrect result when right expression is implicitly downcasted.

2018-09-12 Thread Dilip Biswal (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dilip Biswal updated SPARK-25416:
-
Summary: ArrayPosition function may return incorrect result when right 
expression is implicitly downcasted.  (was: ArrayPosition may return incorrect 
result when right expression is downcasted.)

> ArrayPosition function may return incorrect result when right expression is 
> implicitly downcasted.
> --
>
> Key: SPARK-25416
> URL: https://issues.apache.org/jira/browse/SPARK-25416
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dilip Biswal
>Priority: Major
>
> In ArrayPosition, we currently cast the right-hand side expression to match
> the element type of the left-hand side array. This may result in downcasting
> and may return a wrong or questionable result.
> Example :
> spark-sql> select array_position(array(1), 1.34);
> 1
> spark-sql> select array_position(array(1), 'foo');
> null
> We should safely coerce both left and right hand side expressions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25416) ArrayPosition may return incorrect result when right expression is downcasted.

2018-09-12 Thread Dilip Biswal (JIRA)
Dilip Biswal created SPARK-25416:


 Summary: ArrayPosition may return incorrect result when right 
expression is downcasted.
 Key: SPARK-25416
 URL: https://issues.apache.org/jira/browse/SPARK-25416
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Dilip Biswal


In ArrayPosition, we currently cast the right-hand side expression to match the
element type of the left-hand side array. This may result in downcasting and
may return a wrong or questionable result.

Example :
spark-sql> select array_position(array(1), 1.34);
1

spark-sql> select array_position(array(1), 'foo');
null

We should safely coerce both left and right hand side expressions.
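As a hedged workaround sketch (mine) until both sides are coerced safely, casting the array up instead of letting the right-hand side be downcast gives the expected answer:

{code:scala}
// array_position(array(1), 1.34) silently downcasts 1.34 to 1 and returns 1.
// With the array upcast to array<double>, 1.34 is not found and the result is 0.
spark.sql("SELECT array_position(CAST(array(1) AS ARRAY<DOUBLE>), 1.34)").show()
{code}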




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25415) Make plan change log in RuleExecutor configurable by SQLConf

2018-09-12 Thread Maryann Xue (JIRA)
Maryann Xue created SPARK-25415:
---

 Summary: Make plan change log in RuleExecutor configurable by 
SQLConf
 Key: SPARK-25415
 URL: https://issues.apache.org/jira/browse/SPARK-25415
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1
Reporter: Maryann Xue


In RuleExecutor, after applying a rule, if the plan has changed, the before and 
after plan will be logged using level "trace". At times, however, such 
information can be very helpful for debugging, so making the log level 
configurable in SQLConf would allow users to turn on the plan change log 
independently and save the trouble of tweaking log4j settings.
Meanwhile, filtering plan change log for specific rules can also be very useful.
So I propose adding two confs:
1. spark.sql.optimizer.planChangeLog.level - set a specific log level for 
logging plan changes after a rule is applied.
2. spark.sql.optimizer.planChangeLog.rules - enable plan change logging only 
for a set of specified rules, separated by commas.
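A usage sketch of the proposed confs (conf names as proposed above, rule names are only illustrative; availability depends on the eventual PR):

{code:scala}
// Log plan changes at WARN, and only for the listed optimizer rules, without
// touching log4j configuration.
spark.conf.set("spark.sql.optimizer.planChangeLog.level", "WARN")
spark.conf.set("spark.sql.optimizer.planChangeLog.rules",
  "org.apache.spark.sql.catalyst.optimizer.PushDownPredicate," +
  "org.apache.spark.sql.catalyst.optimizer.ColumnPruning")

// Any query that goes through the optimizer would then emit the change log.
spark.range(10).filter("id > 5").select("id").collect()
{code}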





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25295) Pod names conflicts in client mode, if previous submission was not a clean shutdown.

2018-09-12 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612662#comment-16612662
 ] 

Apache Spark commented on SPARK-25295:
--

User 'skonto' has created a pull request for this issue:
https://github.com/apache/spark/pull/22405

> Pod names conflicts in client mode, if previous submission was not a clean 
> shutdown.
> 
>
> Key: SPARK-25295
> URL: https://issues.apache.org/jira/browse/SPARK-25295
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Prashant Sharma
>Priority: Major
>
> If the previous job was killed somehow, for example by disconnecting the client, it 
> leaves behind the executor pods named spark-exec-#, which cause naming 
> conflicts and failures for the next job submission.
> io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: 
> POST at: https://:6443/api/v1/namespaces/default/pods. Message: pods 
> "spark-exec-4" already exists. Received status: Status(apiVersion=v1, 
> code=409, details=StatusDetails(causes=[], group=null, kind=pods, 
> name=spark-exec-4, retryAfterSeconds=null, uid=null, 
> additionalProperties={}), kind=Status, message=pods "spark-exec-4" already 
> exists, metadata=ListMeta(resourceVersion=null, selfLink=null, 
> additionalProperties={}), reason=AlreadyExists, status=Failure, 
> additionalProperties={}).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25295) Pod names conflicts in client mode, if previous submission was not a clean shutdown.

2018-09-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25295:


Assignee: Apache Spark

> Pod names conflicts in client mode, if previous submission was not a clean 
> shutdown.
> 
>
> Key: SPARK-25295
> URL: https://issues.apache.org/jira/browse/SPARK-25295
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Prashant Sharma
>Assignee: Apache Spark
>Priority: Major
>
> If the previous job was killed somehow, for example by disconnecting the client, it 
> leaves behind the executor pods named spark-exec-#, which cause naming 
> conflicts and failures for the next job submission.
> io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: 
> POST at: https://:6443/api/v1/namespaces/default/pods. Message: pods 
> "spark-exec-4" already exists. Received status: Status(apiVersion=v1, 
> code=409, details=StatusDetails(causes=[], group=null, kind=pods, 
> name=spark-exec-4, retryAfterSeconds=null, uid=null, 
> additionalProperties={}), kind=Status, message=pods "spark-exec-4" already 
> exists, metadata=ListMeta(resourceVersion=null, selfLink=null, 
> additionalProperties={}), reason=AlreadyExists, status=Failure, 
> additionalProperties={}).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25295) Pod names conflicts in client mode, if previous submission was not a clean shutdown.

2018-09-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25295:


Assignee: (was: Apache Spark)

> Pod names conflicts in client mode, if previous submission was not a clean 
> shutdown.
> 
>
> Key: SPARK-25295
> URL: https://issues.apache.org/jira/browse/SPARK-25295
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Prashant Sharma
>Priority: Major
>
> If the previous job was killed somehow, for example by disconnecting the client, it 
> leaves behind the executor pods named spark-exec-#, which cause naming 
> conflicts and failures for the next job submission.
> io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: 
> POST at: https://:6443/api/v1/namespaces/default/pods. Message: pods 
> "spark-exec-4" already exists. Received status: Status(apiVersion=v1, 
> code=409, details=StatusDetails(causes=[], group=null, kind=pods, 
> name=spark-exec-4, retryAfterSeconds=null, uid=null, 
> additionalProperties={}), kind=Status, message=pods "spark-exec-4" already 
> exists, metadata=ListMeta(resourceVersion=null, selfLink=null, 
> additionalProperties={}), reason=AlreadyExists, status=Failure, 
> additionalProperties={}).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20350) Apply Complementation Laws during boolean expression simplification

2018-09-12 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-20350:
--
Component/s: (was: Optimizer)
 SQL

> Apply Complementation Laws during boolean expression simplification
> ---
>
> Key: SPARK-20350
> URL: https://issues.apache.org/jira/browse/SPARK-20350
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Michael Styles
>Assignee: Michael Styles
>Priority: Major
> Fix For: 2.2.0, 2.3.0
>
>
> Apply Complementation Laws during boolean expression simplification.
> * A AND NOT(A) == FALSE
> * A OR NOT(A) == TRUE



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-20799) Unable to infer schema for ORC/Parquet on S3N when secrets are in the URL

2018-09-12 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-20799.
-

> Unable to infer schema for ORC/Parquet on S3N when secrets are in the URL
> -
>
> Key: SPARK-20799
> URL: https://issues.apache.org/jira/browse/SPARK-20799
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: Hadoop 2.8.0 binaries
>Reporter: Jork Zijlstra
>Priority: Minor
>
> We are getting the following exception: 
> {code}org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. 
> It must be specified manually.{code}
> Combining the following factors will cause it:
> - Use S3
> - Use the ORC format
> - Don't apply partitioning on the data
> - Embed AWS credentials in the path
> The problem is in the PartitioningAwareFileIndex def allFiles()
> {code}
> leafDirToChildrenFiles.get(qualifiedPath)
>   .orElse { leafFiles.get(qualifiedPath).map(Array(_)) }
>   .getOrElse(Array.empty)
> {code}
> leafDirToChildrenFiles uses the path WITHOUT credentials as its key while the 
> qualifiedPath contains the path WITH credentials.
> So leafDirToChildrenFiles.get(qualifiedPath) doesn't find any files, no 
> data is read, and the schema cannot be inferred.
> Spark does log the S3xLoginHelper:90 warning ("The Filesystem URI contains login 
> details. This is insecure and may be unsupported in future."), but that warning 
> should not mean the read stops working.
> Workaround:
> Move the AWS credentials from the path to the SparkSession
> {code}
> SparkSession.builder
>   .config("spark.hadoop.fs.s3n.awsAccessKeyId", {awsAccessKeyId})
>   .config("spark.hadoop.fs.s3n.awsSecretAccessKey", {awsSecretAccessKey})
> {code}
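
A self-contained sketch of the workaround described above (the bucket name and environment variables are placeholders, not values from this ticket):

{code}
// Sketch: supply the s3n credentials through the Hadoop configuration instead of the
// path, so the FileIndex keys and the qualified paths agree.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("orc-on-s3n-without-secrets-in-url")
  .config("spark.hadoop.fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
  .config("spark.hadoop.fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))
  .getOrCreate()

// The path no longer embeds user:secret, so schema inference works again.
val df = spark.read.orc("s3n://my-bucket/path/to/orc/")
df.printSchema()
{code}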



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20799) Unable to infer schema for ORC/Parquet on S3N when secrets are in the URL

2018-09-12 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-20799.
---
Resolution: Won't Fix

> Unable to infer schema for ORC/Parquet on S3N when secrets are in the URL
> -
>
> Key: SPARK-20799
> URL: https://issues.apache.org/jira/browse/SPARK-20799
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: Hadoop 2.8.0 binaries
>Reporter: Jork Zijlstra
>Priority: Minor
>
> We are getting the following exception: 
> {code}org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. 
> It must be specified manually.{code}
> Combining the following factors will cause it:
> - Use S3
> - Use the ORC format
> - Don't apply partitioning on the data
> - Embed AWS credentials in the path
> The problem is in the PartitioningAwareFileIndex def allFiles()
> {code}
> leafDirToChildrenFiles.get(qualifiedPath)
>   .orElse { leafFiles.get(qualifiedPath).map(Array(_)) }
>   .getOrElse(Array.empty)
> {code}
> leafDirToChildrenFiles uses the path WITHOUT credentials as its key while the 
> qualifiedPath contains the path WITH credentials.
> So leafDirToChildrenFiles.get(qualifiedPath) doesn't find any files, no 
> data is read, and the schema cannot be inferred.
> Spark does log the S3xLoginHelper:90 warning ("The Filesystem URI contains login 
> details. This is insecure and may be unsupported in future."), but that warning 
> should not mean the read stops working.
> Workaround:
> Move the AWS credentials from the path to the SparkSession
> {code}
> SparkSession.builder
>   .config("spark.hadoop.fs.s3n.awsAccessKeyId", {awsAccessKeyId})
>   .config("spark.hadoop.fs.s3n.awsSecretAccessKey", {awsSecretAccessKey})
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20799) Unable to infer schema for ORC/Parquet on S3N when secrets are in the URL

2018-09-12 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612642#comment-16612642
 ] 

Dongjoon Hyun commented on SPARK-20799:
---

+1 for closing this as a WONTFIX.

> Unable to infer schema for ORC/Parquet on S3N when secrets are in the URL
> -
>
> Key: SPARK-20799
> URL: https://issues.apache.org/jira/browse/SPARK-20799
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: Hadoop 2.8.0 binaries
>Reporter: Jork Zijlstra
>Priority: Minor
>
> We are getting the following exception: 
> {code}org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. 
> It must be specified manually.{code}
> Combining the following factors will cause it:
> - Use S3
> - Use the ORC format
> - Don't apply partitioning on the data
> - Embed AWS credentials in the path
> The problem is in the PartitioningAwareFileIndex def allFiles()
> {code}
> leafDirToChildrenFiles.get(qualifiedPath)
>   .orElse { leafFiles.get(qualifiedPath).map(Array(_)) }
>   .getOrElse(Array.empty)
> {code}
> leafDirToChildrenFiles uses the path WITHOUT credentials as its key while the 
> qualifiedPath contains the path WITH credentials.
> So leafDirToChildrenFiles.get(qualifiedPath) doesn't find any files, no 
> data is read, and the schema cannot be inferred.
> Spark does log the S3xLoginHelper:90 warning ("The Filesystem URI contains login 
> details. This is insecure and may be unsupported in future."), but that warning 
> should not mean the read stops working.
> Workaround:
> Move the AWS credentials from the path to the SparkSession
> {code}
> SparkSession.builder
>   .config("spark.hadoop.fs.s3n.awsAccessKeyId", {awsAccessKeyId})
>   .config("spark.hadoop.fs.s3n.awsSecretAccessKey", {awsSecretAccessKey})
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25402) Null handling in BooleanSimplification

2018-09-12 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612530#comment-16612530
 ] 

Apache Spark commented on SPARK-25402:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/22403

> Null handling in BooleanSimplification
> --
>
> Key: SPARK-25402
> URL: https://issues.apache.org/jira/browse/SPARK-25402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Blocker
>
> SPARK-20350 introduced a bug in BooleanSimplification's null handling. For 
> example, the following case returns a wrong answer.
> {code}
> val schema = StructType.fromDDL("a boolean, b int")
> val rows = Seq(Row(null, 1))
> val rdd = sparkContext.parallelize(rows)
> val df = spark.createDataFrame(rdd, schema)
> checkAnswer(df.where("(NOT a) OR a"), Seq.empty)
> {code}
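
For context, a small illustration of why the simplification is unsound for nullable inputs (a sketch, assuming a SparkSession named `spark`):

{code}
// Under SQL three-valued logic, NOT(NULL) is NULL and NULL OR NULL is NULL, so a
// WHERE clause that evaluates to NULL drops the row. Folding (NOT a) OR a to TRUE
// is therefore only valid when `a` is known to be non-nullable.
spark.sql("SELECT (NOT CAST(NULL AS BOOLEAN)) OR CAST(NULL AS BOOLEAN)").show()
// prints NULL, not true
{code}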



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25402) Null handling in BooleanSimplification

2018-09-12 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612529#comment-16612529
 ] 

Apache Spark commented on SPARK-25402:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/22403

> Null handling in BooleanSimplification
> --
>
> Key: SPARK-25402
> URL: https://issues.apache.org/jira/browse/SPARK-25402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Blocker
>
> SPARK-20350 introduced a bug in BooleanSimplification's null handling. For 
> example, the following case returns a wrong answer.
> {code}
> val schema = StructType.fromDDL("a boolean, b int")
> val rows = Seq(Row(null, 1))
> val rdd = sparkContext.parallelize(rows)
> val df = spark.createDataFrame(rdd, schema)
> checkAnswer(df.where("(NOT a) OR a"), Seq.empty)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25363) Schema pruning doesn't work if nested column is used in where clause

2018-09-12 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-25363.
-
   Resolution: Fixed
Fix Version/s: 2.4.0
   3.0.0

Issue resolved by pull request 22357
[https://github.com/apache/spark/pull/22357]

> Schema pruning doesn't work if nested column is used in where clause
> 
>
> Key: SPARK-25363
> URL: https://issues.apache.org/jira/browse/SPARK-25363
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 3.0.0, 2.4.0
>
>
> Schema pruning doesn't work if a nested column is used in the where clause.
> For example,
> {code}
> sql("select name.first from contacts where name.first = 'David'")
> == Physical Plan ==
> *(1) Project [name#19.first AS first#40]
> +- *(1) Filter (isnotnull(name#19) && (name#19.first = David))
>+- *(1) FileScan parquet [name#19] Batched: false, Format: Parquet, 
> PartitionFilters: [], 
> PushedFilters: [IsNotNull(name)], ReadSchema: 
> struct>
> {code}
> In the above query plan, the scan node reads the entire schema of the `name` column.
> This issue is reported by:
> https://github.com/apache/spark/pull/21320#issuecomment-419290197
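
A quick way to check the behaviour after the fix (a sketch; the table and column names follow the example above, the conf key comes from the 2.4 nested-pruning work and should be verified for your version, and the pruned ReadSchema shown in the comment is the expected shape, not verified output):

{code}
// Sketch: with nested-column pruning enabled and this fix applied, the scan should
// only request the accessed leaf field, e.g. ReadSchema: struct<name:struct<first:string>>.
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
spark.sql("select name.first from contacts where name.first = 'David'").explain()
{code}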



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25321) ML, Graph 2.4 QA: API: New Scala APIs, docs

2018-09-12 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612436#comment-16612436
 ] 

Joseph K. Bradley commented on SPARK-25321:
---

You're right; these are breaking changes.  If we're sticking with the rules, 
then we should revert these in branch-2.4, but we could keep them in master if 
the next release is 3.0.  Is it easy to revert these PRs, or have they 
collected conflicts by now?

> ML, Graph 2.4 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-25321
> URL: https://issues.apache.org/jira/browse/SPARK-25321
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX. Take note of:
>  * Protected/public classes or methods. If access can be more private, then 
> it should be.
>  * Also look for non-sealed traits.
>  * Documentation: Missing? Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue. 
> For *user guide issues* link the new JIRAs to the relevant user guide QA issue



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25380) Generated plans occupy over 50% of Spark driver memory

2018-09-12 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612433#comment-16612433
 ] 

Marcelo Vanzin commented on SPARK-25380:


Yep. That's a 200MB plan description string...

> Generated plans occupy over 50% of Spark driver memory
> --
>
> Key: SPARK-25380
> URL: https://issues.apache.org/jira/browse/SPARK-25380
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment: Spark 2.3.1 (AWS emr-5.16.0)
>  
>Reporter: Michael Spector
>Priority: Minor
> Attachments: Screen Shot 2018-09-06 at 23.19.56.png, Screen Shot 
> 2018-09-12 at 8.20.05.png, heapdump_OOM.png
>
>
> When debugging an OOM exception during a long run of a Spark application (many 
> iterations of the same code) I've found that generated plans occupy most of 
> the driver memory. I'm not sure whether this is a memory leak or not, but it 
> would be helpful if old plans could be purged from memory anyway.
> Attached are screenshots of OOM heap dump opened in JVisualVM.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25378) ArrayData.toArray(StringType) assume UTF8String in 2.4

2018-09-12 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612399#comment-16612399
 ] 

Xiangrui Meng commented on SPARK-25378:
---

Comments from [~vomjom] at https://github.com/tensorflow/ecosystem/pull/100:

{quote}
We currently only do releases along with TensorFlow releases, and the next one 
that'll include this is TF 1.12.
{quote}

This means Spark+TF users cannot migrate to Spark 2.4 until TF 1.12 is 
released. I think we need to decide based on the impact instead of just saying 
"this is not a public API". If it is not public, why didn't we hide it in the 
first place? And as [~cloud_fan] mentioned, it is hard to implement a data source 
without touching those "private" APIs.

> ArrayData.toArray(StringType) assume UTF8String in 2.4
> --
>
> Key: SPARK-25378
> URL: https://issues.apache.org/jira/browse/SPARK-25378
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> The following code works in 2.3.1 but failed in 2.4.0-SNAPSHOT:
> {code}
> import org.apache.spark.sql.catalyst.util._
> import org.apache.spark.sql.types.StringType
> ArrayData.toArrayData(Array("a", "b")).toArray[String](StringType)
> res0: Array[String] = Array(a, b)
> {code}
> In 2.4.0-SNAPSHOT, the error is
> {code}java.lang.ClassCastException: java.lang.String cannot be cast to 
> org.apache.spark.unsafe.types.UTF8String
>   at 
> org.apache.spark.sql.catalyst.util.GenericArrayData.getUTF8String(GenericArrayData.scala:75)
>   at 
> org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at 
> org.apache.spark.sql.catalyst.InternalRow$$anonfun$getAccessor$8.apply(InternalRow.scala:136)
>   at org.apache.spark.sql.catalyst.util.ArrayData.toArray(ArrayData.scala:178)
>   ... 51 elided
> {code}
> cc: [~cloud_fan] [~yogeshg]
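
One possible adaptation for callers hitting this on 2.4 (a sketch only, using the catalyst internals discussed above; not an endorsement of relying on these APIs):

{code}
// Sketch: on 2.4 the StringType accessor assumes UTF8String, so code that used to
// store java.lang.String values can convert explicitly before building the ArrayData.
import org.apache.spark.sql.catalyst.util.ArrayData
import org.apache.spark.sql.types.StringType
import org.apache.spark.unsafe.types.UTF8String

val data = ArrayData.toArrayData(Array("a", "b").map(UTF8String.fromString))
data.toArray[UTF8String](StringType)   // Array(a, b) as UTF8String values
{code}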



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25295) Pod names conflicts in client mode, if previous submission was not a clean shutdown.

2018-09-12 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612337#comment-16612337
 ] 

Stavros Kontopoulos commented on SPARK-25295:
-

Guys, I started working on a short fix.

> Pod names conflicts in client mode, if previous submission was not a clean 
> shutdown.
> 
>
> Key: SPARK-25295
> URL: https://issues.apache.org/jira/browse/SPARK-25295
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Prashant Sharma
>Priority: Major
>
> If the previous job was killed somehow, for example by disconnecting the client, it 
> leaves behind the executor pods named spark-exec-#, which cause naming 
> conflicts and failures for the next job submission.
> io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: 
> POST at: https://:6443/api/v1/namespaces/default/pods. Message: pods 
> "spark-exec-4" already exists. Received status: Status(apiVersion=v1, 
> code=409, details=StatusDetails(causes=[], group=null, kind=pods, 
> name=spark-exec-4, retryAfterSeconds=null, uid=null, 
> additionalProperties={}), kind=Status, message=pods "spark-exec-4" already 
> exists, metadata=ListMeta(resourceVersion=null, selfLink=null, 
> additionalProperties={}), reason=AlreadyExists, status=Failure, 
> additionalProperties={}).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25352) Perform ordered global limit when limit number is bigger than topKSortFallbackThreshold

2018-09-12 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-25352:
---

Assignee: Liang-Chi Hsieh

> Perform ordered global limit when limit number is bigger than 
> topKSortFallbackThreshold
> ---
>
> Key: SPARK-25352
> URL: https://issues.apache.org/jira/browse/SPARK-25352
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>
> We have an optimization on global limit that evenly distributes limit rows across 
> all partitions. This optimization doesn't work for ordered results.
> A query ending with sort + limit is in most cases executed by 
> `TakeOrderedAndProjectExec`.
> But if the limit is bigger than `SQLConf.TOP_K_SORT_FALLBACK_THRESHOLD`, 
> a global limit is used instead. In that case, we need to perform an ordered global limit.
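
A small sketch of how to observe the fallback (the conf key mirrors `SQLConf.TOP_K_SORT_FALLBACK_THRESHOLD`; the threshold value here is arbitrarily low for illustration):

{code}
// Sketch: push the limit above the top-K threshold so the planner falls back from
// TakeOrderedAndProjectExec to a LocalLimit/GlobalLimit pair, which is where the
// ordered-global-limit behaviour described above matters.
import org.apache.spark.sql.functions.col

spark.conf.set("spark.sql.execution.topKSortFallbackThreshold", 100)
spark.range(1000).orderBy(col("id").desc).limit(500).explain()
{code}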



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25352) Perform ordered global limit when limit number is bigger than topKSortFallbackThreshold

2018-09-12 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-25352.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22344
[https://github.com/apache/spark/pull/22344]

> Perform ordered global limit when limit number is bigger than 
> topKSortFallbackThreshold
> ---
>
> Key: SPARK-25352
> URL: https://issues.apache.org/jira/browse/SPARK-25352
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>
> We have an optimization on global limit that evenly distributes limit rows across 
> all partitions. This optimization doesn't work for ordered results.
> A query ending with sort + limit is in most cases executed by 
> `TakeOrderedAndProjectExec`.
> But if the limit is bigger than `SQLConf.TOP_K_SORT_FALLBACK_THRESHOLD`, 
> a global limit is used instead. In that case, we need to perform an ordered global limit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24627) [Spark2.3.0] After HDFS Token expire kinit not able to submit job using beeline

2018-09-12 Thread Ayush Anubhava (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612085#comment-16612085
 ] 

Ayush Anubhava edited comment on SPARK-24627 at 9/12/18 12:53 PM:
--

Check the principal name given in the spark-defaults conf on the driver side.

The principal name should include the realm so that, at the time of renewal, the 
HDFS delegation token can be given to Spark.


was (Author: ayush007):
Check the the principal name given in spark-default conf in driver side.

> [Spark2.3.0] After HDFS Token expire kinit not able to submit job using 
> beeline
> ---
>
> Key: SPARK-24627
> URL: https://issues.apache.org/jira/browse/SPARK-24627
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: OS: SUSE11
> Spark Version: 2.3.0 
> Hadoop: 2.8.3
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> Steps:
> A beeline session was active.
> 1. Launch spark-beeline 
> 2. create table alt_s1 (time timestamp, name string, isright boolean, 
> datetoday date, num binary, height double, score float, decimaler 
> decimal(10,0), id tinyint, age int, license bigint, length smallint) row 
> format delimited fields terminated by ',';
> 3. load data local inpath '/opt/typeddata60.txt' into table alt_s1;
> 4. show tables;( Table listed successfully )
> 5. select * from alt_s1;
> Throws HDFS_DELEGATION_TOKEN Exception
> 0: jdbc:hive2://10.18.18.214:23040/default> select * from alt_s1;
> Error: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 1 in stage 22.0 failed 4 times, most recent failure: Lost task 1.3 in 
> stage 22.0 (TID 106, blr123110, executor 1): 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (HDFS_DELEGATION_TOKEN token 7 for spark) can't be found in cache
> at org.apache.hadoop.ipc.Client.call(Client.java:1475)
> at org.apache.hadoop.ipc.Client.call(Client.java:1412)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy15.getBlockLocations(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:255)
> at sun.reflect.GeneratedMethodAccessor44.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy16.getBlockLocations(Unknown Source)
> at 
> org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1226)
> at 
> org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1213)
> at 
> org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1201)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:306)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:272)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.(DFSInputStream.java:264)
> at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1526)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:304)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:312)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
> at 
> org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:109)
> at 
> org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
> at 
> org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:257)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:256)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> **Note: Even after kinit spark/hadoop  token is not getting renewed.**
> Now Launch spark sql session ( Select * from alt_s1 ) is successful.
> 1. Launch spark-sql
> 2.spark-sql> select * from 

[jira] [Commented] (SPARK-24627) [Spark2.3.0] After HDFS Token expire kinit not able to submit job using beeline

2018-09-12 Thread Ayush Anubhava (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612085#comment-16612085
 ] 

Ayush Anubhava commented on SPARK-24627:


Check the principal name given in the spark-defaults conf on the driver side.

> [Spark2.3.0] After HDFS Token expire kinit not able to submit job using 
> beeline
> ---
>
> Key: SPARK-24627
> URL: https://issues.apache.org/jira/browse/SPARK-24627
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: OS: SUSE11
> Spark Version: 2.3.0 
> Hadoop: 2.8.3
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> Steps:
> A beeline session was active.
> 1. Launch spark-beeline 
> 2. create table alt_s1 (time timestamp, name string, isright boolean, 
> datetoday date, num binary, height double, score float, decimaler 
> decimal(10,0), id tinyint, age int, license bigint, length smallint) row 
> format delimited fields terminated by ',';
> 3. load data local inpath '/opt/typeddata60.txt' into table alt_s1;
> 4. show tables;( Table listed successfully )
> 5. select * from alt_s1;
> Throws HDFS_DELEGATION_TOKEN Exception
> 0: jdbc:hive2://10.18.18.214:23040/default> select * from alt_s1;
> Error: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 1 in stage 22.0 failed 4 times, most recent failure: Lost task 1.3 in 
> stage 22.0 (TID 106, blr123110, executor 1): 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  token (HDFS_DELEGATION_TOKEN token 7 for spark) can't be found in cache
> at org.apache.hadoop.ipc.Client.call(Client.java:1475)
> at org.apache.hadoop.ipc.Client.call(Client.java:1412)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy15.getBlockLocations(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:255)
> at sun.reflect.GeneratedMethodAccessor44.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy16.getBlockLocations(Unknown Source)
> at 
> org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1226)
> at 
> org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1213)
> at 
> org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1201)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:306)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:272)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.(DFSInputStream.java:264)
> at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1526)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:304)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:312)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
> at 
> org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:109)
> at 
> org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
> at 
> org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:257)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:256)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:214)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> **Note: Even after kinit spark/hadoop  token is not getting renewed.**
> Now Launch spark sql session ( Select * from alt_s1 ) is successful.
> 1. Launch spark-sql
> 2.spark-sql> select * from alt_s1;
> 2018-06-22 14:24:04 INFO  HiveMetaStore:746 - 0: get_table : db=test_one 
> tbl=alt_s1
> 2018-06-22 14:24:04 INFO  audit:371 - ugi=spark/had...@hadoop.com   
> ip=unknown-ip-addr  cmd=get_table : db=test_one tbl=alt_s1
> 2018-06-22 14:24:04 INFO  

[jira] [Resolved] (SPARK-25371) Vector Assembler with no input columns leads to opaque error

2018-09-12 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-25371.
-
   Resolution: Fixed
 Assignee: Marco Gaido
Fix Version/s: 2.4.0
   2.3.2

> Vector Assembler with no input columns leads to opaque error
> 
>
> Key: SPARK-25371
> URL: https://issues.apache.org/jira/browse/SPARK-25371
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Victor Alor
>Assignee: Marco Gaido
>Priority: Trivial
> Fix For: 2.3.2, 2.4.0
>
>
> When `VectorAssembler` is given an empty array as its inputCols, it throws 
> an opaque error. In versions before 2.3, `VectorAssembler` simply 
> appends a column containing empty vectors. 
>  
> {code:java}
> import org.apache.spark.ml.feature.VectorAssembler
> 
> val inputCols = Array[String]()   // empty input columns
> val vectorAssembler = new VectorAssembler()
>   .setInputCols(inputCols)
>   .setOutputCol("A")
> vectorAssembler.transform(df)     // VectorAssembler is a Transformer, so no fit() step
> {code}
> In versions 2.3 and later this throws the exception below
> {code:java}
> org.apache.spark.sql.AnalysisException: cannot resolve 'named_struct()' due 
> to data type mismatch: input to function named_struct requires at least one 
> argument;;
> {code}
> Whereas in versions less than 2.3 it just adds a column containing an empty 
> vector.
> I'm not certain if this is an intentional choice or an actual bug. If this is 
> a bug, the `VectorAssembler` should be modified to append an empty vector 
> column if it detects no inputCols.
>  
> If it is a design decision it would be nice to throw a human readable 
> exception explicitly stating inputColumns must not be empty. The current 
> error is somewhat opaque.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2018-09-12 Thread Evelyn Bayes (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612026#comment-16612026
 ] 

Evelyn Bayes commented on SPARK-25150:
--

Hey Peter, don't stress it. I'm new to the community as well, but I've been a 
bit busy, so all good :)

> Joining DataFrames derived from the same source yields confusing/incorrect 
> results
> --
>
> Key: SPARK-25150
> URL: https://issues.apache.org/jira/browse/SPARK-25150
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Nicholas Chammas
>Priority: Major
> Attachments: output-with-implicit-cross-join.txt, 
> output-without-implicit-cross-join.txt, persons.csv, states.csv, 
> zombie-analysis.py
>
>
> I have two DataFrames, A and B. From B, I have derived two additional 
> DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very 
> confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these
> relations, or: enable implicit cartesian products by setting the configuration
> variable spark.sql.crossJoin.enabled=true;
> {code}
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, 
> Spark appears to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of 
> bug here. The "join condition is missing" error is confusing and doesn't make 
> sense to me, and the seemingly incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and 
> without the implicit cross join enabled.
> I realize the join I've written is not correct in the sense that it should be 
> left outer join instead of an inner join (since some of the aggregates are 
> not available for all states), but that doesn't explain Spark's behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20153) Support Multiple aws credentials in order to access multiple Hive on S3 table in spark application

2018-09-12 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612001#comment-16612001
 ] 

Steve Loughran commented on SPARK-20153:


bq. Amazon EMR does not currently support use of the Apache Hadoop S3A file 
system."

The Amazon EMR team is free to copy and paste any parts of the ASF-licensed 
s3a code into their own closed-source connector to S3. The best thing you can 
do here is ask them to do so.

The URL about S3A in EMR has changed, BTW; it's now a footnote in 
[https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-file-systems.html]


> Support Multiple aws credentials in order to access multiple Hive on S3 table 
> in spark application 
> ---
>
> Key: SPARK-20153
> URL: https://issues.apache.org/jira/browse/SPARK-20153
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Franck Tago
>Priority: Minor
>
> I need to access multiple Hive tables in my Spark application, where each Hive 
> table is 
> 1- an external table with data sitting on S3, and
> 2- owned by a different AWS user, so I need to provide different 
> AWS credentials. 
> I am familiar with setting the AWS credentials in the Hadoop configuration 
> object, but that does not really help me because I can only set one pair of 
> (fs.s3a.awsAccessKeyId, fs.s3a.awsSecretAccessKey).
> From my research, there is no easy or elegant way to do this in Spark.
> Why is that?
> How do I address this use case?
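
One route that may fit this use case is S3A's per-bucket configuration (available in recent Hadoop releases); a sketch with placeholder bucket names and environment variables, and assuming an S3A-capable Hadoop build rather than EMR's closed-source connector:

{code}
// Sketch: give each bucket its own credentials via S3A per-bucket configuration.
// Bucket names and env vars are placeholders, not values from this ticket.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("multi-credential-hive-on-s3")
  .config("spark.hadoop.fs.s3a.bucket.team-a-data.access.key", sys.env("TEAM_A_ACCESS_KEY"))
  .config("spark.hadoop.fs.s3a.bucket.team-a-data.secret.key", sys.env("TEAM_A_SECRET_KEY"))
  .config("spark.hadoop.fs.s3a.bucket.team-b-data.access.key", sys.env("TEAM_B_ACCESS_KEY"))
  .config("spark.hadoop.fs.s3a.bucket.team-b-data.secret.key", sys.env("TEAM_B_SECRET_KEY"))
  .enableHiveSupport()
  .getOrCreate()
{code}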



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20799) Unable to infer schema for ORC/Parquet on S3N when secrets are in the URL

2018-09-12 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611995#comment-16611995
 ] 

Steve Loughran commented on SPARK-20799:


Update: Hadoop 3.3+ will remove all support for user:secret in S3A URIs because 
it's impossible to keep those secrets out of logs, and logs get everywhere. No 
plans to backport that, though HADOOP-15747 will, so people get the specific 
Hadoop version where this dangerous feature gets pulled.

I propose closing this as a WONTFIX.

> Unable to infer schema for ORC/Parquet on S3N when secrets are in the URL
> -
>
> Key: SPARK-20799
> URL: https://issues.apache.org/jira/browse/SPARK-20799
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: Hadoop 2.8.0 binaries
>Reporter: Jork Zijlstra
>Priority: Minor
>
> We are getting the following exception: 
> {code}org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. 
> It must be specified manually.{code}
> Combining the following factors will cause it:
> - Use S3
> - Use the ORC format
> - Don't apply partitioning on the data
> - Embed AWS credentials in the path
> The problem is in the PartitioningAwareFileIndex def allFiles()
> {code}
> leafDirToChildrenFiles.get(qualifiedPath)
>   .orElse { leafFiles.get(qualifiedPath).map(Array(_)) }
>   .getOrElse(Array.empty)
> {code}
> leafDirToChildrenFiles uses the path WITHOUT credentials as its key while the 
> qualifiedPath contains the path WITH credentials.
> So leafDirToChildrenFiles.get(qualifiedPath) doesn't find any files, no 
> data is read, and the schema cannot be inferred.
> Spark does log the S3xLoginHelper:90 warning ("The Filesystem URI contains login 
> details. This is insecure and may be unsupported in future."), but that warning 
> should not mean the read stops working.
> Workaround:
> Move the AWS credentials from the path to the SparkSession
> {code}
> SparkSession.builder
>   .config("spark.hadoop.fs.s3n.awsAccessKeyId", {awsAccessKeyId})
>   .config("spark.hadoop.fs.s3n.awsSecretAccessKey", {awsSecretAccessKey})
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25414) The numInputRows metrics can be incorrect for streaming self-join

2018-09-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25414:


Assignee: Wenchen Fan  (was: Apache Spark)

> The numInputRows metrics can be incorrect for streaming self-join
> -
>
> Key: SPARK-25414
> URL: https://issues.apache.org/jira/browse/SPARK-25414
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25414) The numInputRows metrics can be incorrect for streaming self-join

2018-09-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25414:


Assignee: Apache Spark  (was: Wenchen Fan)

> The numInputRows metrics can be incorrect for streaming self-join
> -
>
> Key: SPARK-25414
> URL: https://issues.apache.org/jira/browse/SPARK-25414
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25414) The numInputRows metrics can be incorrect for streaming self-join

2018-09-12 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611932#comment-16611932
 ] 

Apache Spark commented on SPARK-25414:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/22402

> The numInputRows metrics can be incorrect for streaming self-join
> -
>
> Key: SPARK-25414
> URL: https://issues.apache.org/jira/browse/SPARK-25414
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25414) The numInputRows metrics can be incorrect for streaming self-join

2018-09-12 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-25414:
---

 Summary: The numInputRows metrics can be incorrect for streaming 
self-join
 Key: SPARK-25414
 URL: https://issues.apache.org/jira/browse/SPARK-25414
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25413) [2.3.2.rc5 Blocker] Precision Value is going for toss when Avg is done

2018-09-12 Thread sandeep katta (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandeep katta updated SPARK-25413:
--
Attachment: decimalBoundaryDataHive.csv

> [2.3.2.rc5 Blocker] Precision Value is going for toss when Avg is done
> --
>
> Key: SPARK-25413
> URL: https://issues.apache.org/jira/browse/SPARK-25413
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment: Csv FIle content
>  
> 1,23-07-2015,India,aaa1,phone197,ASD69643,1.23457E+16
> 2,24-07-2015,India,aaa2,phone756,ASD42892,1.23457E+16
> 3,25-07-2015,India,aaa3,phone1904,ASD37014,1.23457E+16
> 4,26-07-2015,India,aaa4,phone2435,ASD66902,1.23457E+16
> 5,27-07-2015,India,aaa5,phone2441,ASD90633,2.23457E+16
> 6,28-07-2015,India,aaa6,phone294,ASD59961,3.23457E+16
> 7,29-07-2015,India,aaa7,phone610,ASD14875,4.23457E+16
> 8,30-07-2015,India,aaa8,phone1848,ASD57308,5.23457E+16
> 9,18-07-2015,India,aaa9,phone706,ASD86717,6.23457E+16
> 10,19-07-2015,usa,aaa10,phone685,ASD30505,7.23457E+16
> 11,18-07-2015,china,aaa11,phone1554,ASD26101,8.23457E+16
>Reporter: sandeep katta
>Priority: Blocker
> Attachments: decimalBoundaryDataHive.csv
>
>
> sql("create table if not exists hiveBigDecimal(ID Int, date Timestamp, 
> country String, name String, phonetype String, serialname String, salary 
> decimal(27, 10))row format delimited fields terminated by ','")
> sql(s"LOAD DATA local inpath '$resourcesPath/decimalBoundaryDataHive.csv' 
> INTO table hiveBigDecimal")
> sql("select avg(salary)+10 from hiveBigDecimal").show(false)
>  
> Output with 2.3.1:
> (CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS DECIMAL(32,14)))
> = 37800224355780013.75982042536364
> Output with 2.3.2_RC5:
> (CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS DECIMAL(32,14)))
> = 37800224355780013.75982042536000
> *PS: If I revert SPARK-24957 then the 2.3.1 and 2.3.2_rc5 outputs are the same*
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25413) [2.3.2.rc5 Blocker] Precision Value is going for toss when Avg is done

2018-09-12 Thread sandeep katta (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandeep katta updated SPARK-25413:
--
Attachment: (was: decimalBoundaryDataHive.csv)

> [2.3.2.rc5 Blocker] Precision Value is going for toss when Avg is done
> --
>
> Key: SPARK-25413
> URL: https://issues.apache.org/jira/browse/SPARK-25413
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment: Csv FIle content
>  
> 1,23-07-2015,India,aaa1,phone197,ASD69643,1.23457E+16
> 2,24-07-2015,India,aaa2,phone756,ASD42892,1.23457E+16
> 3,25-07-2015,India,aaa3,phone1904,ASD37014,1.23457E+16
> 4,26-07-2015,India,aaa4,phone2435,ASD66902,1.23457E+16
> 5,27-07-2015,India,aaa5,phone2441,ASD90633,2.23457E+16
> 6,28-07-2015,India,aaa6,phone294,ASD59961,3.23457E+16
> 7,29-07-2015,India,aaa7,phone610,ASD14875,4.23457E+16
> 8,30-07-2015,India,aaa8,phone1848,ASD57308,5.23457E+16
> 9,18-07-2015,India,aaa9,phone706,ASD86717,6.23457E+16
> 10,19-07-2015,usa,aaa10,phone685,ASD30505,7.23457E+16
> 11,18-07-2015,china,aaa11,phone1554,ASD26101,8.23457E+16
>Reporter: sandeep katta
>Priority: Blocker
>
> sql("create table if not exists hiveBigDecimal(ID Int, date Timestamp, 
> country String, name String, phonetype String, serialname String, salary 
> decimal(27, 10))row format delimited fields terminated by ','")
> sql(s"LOAD DATA local inpath '$resourcesPath/decimalBoundaryDataHive.csv' 
> INTO table hiveBigDecimal")
> sql("select avg(salary)+10 from hiveBigDecimal").show(false)
>  
> Output with 2.3.1:
> (CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS DECIMAL(32,14)))
> = 37800224355780013.75982042536364
> Output with 2.3.2_RC5:
> (CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS DECIMAL(32,14)))
> = 37800224355780013.75982042536000
> *PS: If I revert SPARK-24957 then the 2.3.1 and 2.3.2_rc5 outputs are the same*
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25413) [2.3.2.rc5 Blocker] Precision Value is going for toss when Avg is done

2018-09-12 Thread sandeep katta (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandeep katta updated SPARK-25413:
--
Description: 
sql("create table if not exists hiveBigDecimal(ID Int, date Timestamp, country 
String, name String, phonetype String, serialname String, salary decimal(27, 
10))row format delimited fields terminated by ','")

sql(s"LOAD DATA local inpath '$resourcesPath/decimalBoundaryDataHive.csv' INTO 
table hiveBigDecimal")

sql("select avg(salary)+10 from hiveBigDecimal").show(false)

 

Output with 2.3.1:
(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS DECIMAL(32,14)))
= 37800224355780013.75982042536364

Output with 2.3.2_RC5:
(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS DECIMAL(32,14)))
= 37800224355780013.75982042536000

*PS: If I revert SPARK-24957 then the 2.3.1 and 2.3.2_rc5 outputs are the same*

 

  was:
sql("create table if not exists hiveBigDecimal(ID Int, date Timestamp, country 
String, name String, phonetype String, serialname String, salary decimal(27, 
10))row format delimited fields terminated by ','")
 
 sql(s"LOAD DATA local inpath '$resourcesPath/decimalBoundaryDataHive.csv' INTO 
table hiveBigDecimal")
 
 sql("select avg(salary)+10 from hiveBigDecimal").show(fals

 

Output with 2.3.1

++

|(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS 
DECIMAL(32,14)))|

++

|37800224355780013.75982042536364 |

++

OutPut with 2.3.2_RC5

|(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS 
DECIMAL(32,14)))|

++

|37800224355780013.75982042536000   
 |

+

*PS:If I revert SPARK-24957 then 2.3.1 and 2.3.2_rc5 output is same*

 


> [2.3.2.rc5 Blocker] Precision Value is going for toss when Avg is done
> --
>
> Key: SPARK-25413
> URL: https://issues.apache.org/jira/browse/SPARK-25413
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment: Csv FIle content
>  
> 1,23-07-2015,India,aaa1,phone197,ASD69643,1.23457E+16
> 2,24-07-2015,India,aaa2,phone756,ASD42892,1.23457E+16
> 3,25-07-2015,India,aaa3,phone1904,ASD37014,1.23457E+16
> 4,26-07-2015,India,aaa4,phone2435,ASD66902,1.23457E+16
> 5,27-07-2015,India,aaa5,phone2441,ASD90633,2.23457E+16
> 6,28-07-2015,India,aaa6,phone294,ASD59961,3.23457E+16
> 7,29-07-2015,India,aaa7,phone610,ASD14875,4.23457E+16
> 8,30-07-2015,India,aaa8,phone1848,ASD57308,5.23457E+16
> 9,18-07-2015,India,aaa9,phone706,ASD86717,6.23457E+16
> 10,19-07-2015,usa,aaa10,phone685,ASD30505,7.23457E+16
> 11,18-07-2015,china,aaa11,phone1554,ASD26101,8.23457E+16
>Reporter: sandeep katta
>Priority: Blocker
> Attachments: decimalBoundaryDataHive.csv
>
>
> sql("create table if not exists hiveBigDecimal(ID Int, date Timestamp, 
> country String, name String, phonetype String, serialname String, salary 
> decimal(27, 10))row format delimited fields terminated by ','")
> sql(s"LOAD DATA local inpath '$resourcesPath/decimalBoundaryDataHive.csv' 
> INTO table hiveBigDecimal")
> sql("select avg(salary)+10 from hiveBigDecimal").show(false)
>  
> Output with 2.3.1:
> (CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS DECIMAL(32,14)))
> = 37800224355780013.75982042536364
> Output with 2.3.2_RC5:
> (CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS DECIMAL(32,14)))
> = 37800224355780013.75982042536000 

[jira] [Commented] (SPARK-25413) [2.3.2.rc5 Blocker] Precision Value is going for toss when Avg is done

2018-09-12 Thread Ajith S (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611759#comment-16611759
 ] 

Ajith S commented on SPARK-25413:
-

Thank you for reporting the issue, Sandeep. I think the problem is with 
org.apache.spark.sql.catalyst.expressions.aggregate.AverageLike#sumDataType, as 
it increases the precision unnecessarily. I am adding a PR to fix this.

Refer https://github.com/apache/spark/pull/22401

> [2.3.2.rc5 Blocker] Precision Value is going for toss when Avg is done
> --
>
> Key: SPARK-25413
> URL: https://issues.apache.org/jira/browse/SPARK-25413
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment: Csv FIle content
>  
> 1,23-07-2015,India,aaa1,phone197,ASD69643,1.23457E+16
> 2,24-07-2015,India,aaa2,phone756,ASD42892,1.23457E+16
> 3,25-07-2015,India,aaa3,phone1904,ASD37014,1.23457E+16
> 4,26-07-2015,India,aaa4,phone2435,ASD66902,1.23457E+16
> 5,27-07-2015,India,aaa5,phone2441,ASD90633,2.23457E+16
> 6,28-07-2015,India,aaa6,phone294,ASD59961,3.23457E+16
> 7,29-07-2015,India,aaa7,phone610,ASD14875,4.23457E+16
> 8,30-07-2015,India,aaa8,phone1848,ASD57308,5.23457E+16
> 9,18-07-2015,India,aaa9,phone706,ASD86717,6.23457E+16
> 10,19-07-2015,usa,aaa10,phone685,ASD30505,7.23457E+16
> 11,18-07-2015,china,aaa11,phone1554,ASD26101,8.23457E+16
>Reporter: sandeep katta
>Priority: Blocker
> Attachments: decimalBoundaryDataHive.csv
>
>
> sql("create table if not exists hiveBigDecimal(ID Int, date Timestamp, 
> country String, name String, phonetype String, serialname String, salary 
> decimal(27, 10))row format delimited fields terminated by ','")
>  
>  sql(s"LOAD DATA local inpath '$resourcesPath/decimalBoundaryDataHive.csv' 
> INTO table hiveBigDecimal")
>  
>  sql("select avg(salary)+10 from hiveBigDecimal").show(fals
>  
> Output with 2.3.1
> |(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS
> DECIMAL(32,14)))|
> |37800224355780013.75982042536364|
> Output with 2.3.2_RC5
> |(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS
> DECIMAL(32,14)))|
> |37800224355780013.75982042536000|
> *PS: If I revert SPARK-24957 then the 2.3.1 and 2.3.2_RC5 outputs are the same*
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25413) [2.3.2.rc5 Blocker] Precision Value is going for toss when Avg is done

2018-09-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25413:


Assignee: (was: Apache Spark)

> [2.3.2.rc5 Blocker] Precision Value is going for toss when Avg is done
> --
>
> Key: SPARK-25413
> URL: https://issues.apache.org/jira/browse/SPARK-25413
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment: CSV file content
>  
> 1,23-07-2015,India,aaa1,phone197,ASD69643,1.23457E+16
> 2,24-07-2015,India,aaa2,phone756,ASD42892,1.23457E+16
> 3,25-07-2015,India,aaa3,phone1904,ASD37014,1.23457E+16
> 4,26-07-2015,India,aaa4,phone2435,ASD66902,1.23457E+16
> 5,27-07-2015,India,aaa5,phone2441,ASD90633,2.23457E+16
> 6,28-07-2015,India,aaa6,phone294,ASD59961,3.23457E+16
> 7,29-07-2015,India,aaa7,phone610,ASD14875,4.23457E+16
> 8,30-07-2015,India,aaa8,phone1848,ASD57308,5.23457E+16
> 9,18-07-2015,India,aaa9,phone706,ASD86717,6.23457E+16
> 10,19-07-2015,usa,aaa10,phone685,ASD30505,7.23457E+16
> 11,18-07-2015,china,aaa11,phone1554,ASD26101,8.23457E+16
>Reporter: sandeep katta
>Priority: Blocker
> Attachments: decimalBoundaryDataHive.csv
>
>
> sql("create table if not exists hiveBigDecimal(ID Int, date Timestamp, 
> country String, name String, phonetype String, serialname String, salary 
> decimal(27, 10))row format delimited fields terminated by ','")
>  
>  sql(s"LOAD DATA local inpath '$resourcesPath/decimalBoundaryDataHive.csv' 
> INTO table hiveBigDecimal")
>  
>  sql("select avg(salary)+10 from hiveBigDecimal").show(fals
>  
> Output with 2.3.1
> |(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS
> DECIMAL(32,14)))|
> |37800224355780013.75982042536364|
> Output with 2.3.2_RC5
> |(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS
> DECIMAL(32,14)))|
> |37800224355780013.75982042536000|
> *PS: If I revert SPARK-24957 then the 2.3.1 and 2.3.2_RC5 outputs are the same*
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25413) [2.3.2.rc5 Blocker] Precision Value is going for toss when Avg is done

2018-09-12 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611758#comment-16611758
 ] 

Apache Spark commented on SPARK-25413:
--

User 'ajithme' has created a pull request for this issue:
https://github.com/apache/spark/pull/22401

> [2.3.2.rc5 Blocker] Precision Value is going for toss when Avg is done
> --
>
> Key: SPARK-25413
> URL: https://issues.apache.org/jira/browse/SPARK-25413
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment: CSV file content
>  
> 1,23-07-2015,India,aaa1,phone197,ASD69643,1.23457E+16
> 2,24-07-2015,India,aaa2,phone756,ASD42892,1.23457E+16
> 3,25-07-2015,India,aaa3,phone1904,ASD37014,1.23457E+16
> 4,26-07-2015,India,aaa4,phone2435,ASD66902,1.23457E+16
> 5,27-07-2015,India,aaa5,phone2441,ASD90633,2.23457E+16
> 6,28-07-2015,India,aaa6,phone294,ASD59961,3.23457E+16
> 7,29-07-2015,India,aaa7,phone610,ASD14875,4.23457E+16
> 8,30-07-2015,India,aaa8,phone1848,ASD57308,5.23457E+16
> 9,18-07-2015,India,aaa9,phone706,ASD86717,6.23457E+16
> 10,19-07-2015,usa,aaa10,phone685,ASD30505,7.23457E+16
> 11,18-07-2015,china,aaa11,phone1554,ASD26101,8.23457E+16
>Reporter: sandeep katta
>Priority: Blocker
> Attachments: decimalBoundaryDataHive.csv
>
>
> sql("create table if not exists hiveBigDecimal(ID Int, date Timestamp, 
> country String, name String, phonetype String, serialname String, salary 
> decimal(27, 10))row format delimited fields terminated by ','")
>  
>  sql(s"LOAD DATA local inpath '$resourcesPath/decimalBoundaryDataHive.csv' 
> INTO table hiveBigDecimal")
>  
>  sql("select avg(salary)+10 from hiveBigDecimal").show(fals
>  
> Output with 2.3.1
> |(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS
> DECIMAL(32,14)))|
> |37800224355780013.75982042536364|
> Output with 2.3.2_RC5
> |(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS
> DECIMAL(32,14)))|
> |37800224355780013.75982042536000|
> *PS: If I revert SPARK-24957 then the 2.3.1 and 2.3.2_RC5 outputs are the same*
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25413) [2.3.2.rc5 Blocker] Precision Value is going for toss when Avg is done

2018-09-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25413:


Assignee: Apache Spark

> [2.3.2.rc5 Blocker] Precision Value is going for toss when Avg is done
> --
>
> Key: SPARK-25413
> URL: https://issues.apache.org/jira/browse/SPARK-25413
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment: CSV file content
>  
> 1,23-07-2015,India,aaa1,phone197,ASD69643,1.23457E+16
> 2,24-07-2015,India,aaa2,phone756,ASD42892,1.23457E+16
> 3,25-07-2015,India,aaa3,phone1904,ASD37014,1.23457E+16
> 4,26-07-2015,India,aaa4,phone2435,ASD66902,1.23457E+16
> 5,27-07-2015,India,aaa5,phone2441,ASD90633,2.23457E+16
> 6,28-07-2015,India,aaa6,phone294,ASD59961,3.23457E+16
> 7,29-07-2015,India,aaa7,phone610,ASD14875,4.23457E+16
> 8,30-07-2015,India,aaa8,phone1848,ASD57308,5.23457E+16
> 9,18-07-2015,India,aaa9,phone706,ASD86717,6.23457E+16
> 10,19-07-2015,usa,aaa10,phone685,ASD30505,7.23457E+16
> 11,18-07-2015,china,aaa11,phone1554,ASD26101,8.23457E+16
>Reporter: sandeep katta
>Assignee: Apache Spark
>Priority: Blocker
> Attachments: decimalBoundaryDataHive.csv
>
>
> sql("create table if not exists hiveBigDecimal(ID Int, date Timestamp, 
> country String, name String, phonetype String, serialname String, salary 
> decimal(27, 10))row format delimited fields terminated by ','")
>  
>  sql(s"LOAD DATA local inpath '$resourcesPath/decimalBoundaryDataHive.csv' 
> INTO table hiveBigDecimal")
>  
>  sql("select avg(salary)+10 from hiveBigDecimal").show(fals
>  
> Output with 2.3.1
> |(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS
> DECIMAL(32,14)))|
> |37800224355780013.75982042536364|
> Output with 2.3.2_RC5
> |(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS
> DECIMAL(32,14)))|
> |37800224355780013.75982042536000|
> *PS: If I revert SPARK-24957 then the 2.3.1 and 2.3.2_RC5 outputs are the same*
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25413) [2.3.2.rc5 Blocker] Precision Value is going for toss when Avg is done

2018-09-12 Thread sandeep katta (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandeep katta updated SPARK-25413:
--
Priority: Blocker  (was: Minor)

> [2.3.2.rc5 Blocker] Precision Value is going for toss when Avg is done
> --
>
> Key: SPARK-25413
> URL: https://issues.apache.org/jira/browse/SPARK-25413
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment: CSV file content
>  
> 1,23-07-2015,India,aaa1,phone197,ASD69643,1.23457E+16
> 2,24-07-2015,India,aaa2,phone756,ASD42892,1.23457E+16
> 3,25-07-2015,India,aaa3,phone1904,ASD37014,1.23457E+16
> 4,26-07-2015,India,aaa4,phone2435,ASD66902,1.23457E+16
> 5,27-07-2015,India,aaa5,phone2441,ASD90633,2.23457E+16
> 6,28-07-2015,India,aaa6,phone294,ASD59961,3.23457E+16
> 7,29-07-2015,India,aaa7,phone610,ASD14875,4.23457E+16
> 8,30-07-2015,India,aaa8,phone1848,ASD57308,5.23457E+16
> 9,18-07-2015,India,aaa9,phone706,ASD86717,6.23457E+16
> 10,19-07-2015,usa,aaa10,phone685,ASD30505,7.23457E+16
> 11,18-07-2015,china,aaa11,phone1554,ASD26101,8.23457E+16
>Reporter: sandeep katta
>Priority: Blocker
> Attachments: decimalBoundaryDataHive.csv
>
>
> sql("create table if not exists hiveBigDecimal(ID Int, date Timestamp, 
> country String, name String, phonetype String, serialname String, salary 
> decimal(27, 10))row format delimited fields terminated by ','")
>  
>  sql(s"LOAD DATA local inpath '$resourcesPath/decimalBoundaryDataHive.csv' 
> INTO table hiveBigDecimal")
>  
>  sql("select avg(salary)+10 from hiveBigDecimal").show(fals
>  
> Output with 2.3.1
> |(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS
> DECIMAL(32,14)))|
> |37800224355780013.75982042536364|
> Output with 2.3.2_RC5
> |(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS
> DECIMAL(32,14)))|
> |37800224355780013.75982042536000|
> *PS: If I revert SPARK-24957 then the 2.3.1 and 2.3.2_RC5 outputs are the same*
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25413) [2.3.2.rc5 Blocker] Precision Value is going for toss when Avg is done

2018-09-12 Thread sandeep katta (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandeep katta updated SPARK-25413:
--
Attachment: decimalBoundaryDataHive.csv

> [2.3.2.rc5 Blocker] Precision Value is going for toss when Avg is done
> --
>
> Key: SPARK-25413
> URL: https://issues.apache.org/jira/browse/SPARK-25413
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment: CSV file content
>  
> 1,23-07-2015,India,aaa1,phone197,ASD69643,1.23457E+16
> 2,24-07-2015,India,aaa2,phone756,ASD42892,1.23457E+16
> 3,25-07-2015,India,aaa3,phone1904,ASD37014,1.23457E+16
> 4,26-07-2015,India,aaa4,phone2435,ASD66902,1.23457E+16
> 5,27-07-2015,India,aaa5,phone2441,ASD90633,2.23457E+16
> 6,28-07-2015,India,aaa6,phone294,ASD59961,3.23457E+16
> 7,29-07-2015,India,aaa7,phone610,ASD14875,4.23457E+16
> 8,30-07-2015,India,aaa8,phone1848,ASD57308,5.23457E+16
> 9,18-07-2015,India,aaa9,phone706,ASD86717,6.23457E+16
> 10,19-07-2015,usa,aaa10,phone685,ASD30505,7.23457E+16
> 11,18-07-2015,china,aaa11,phone1554,ASD26101,8.23457E+16
>Reporter: sandeep katta
>Priority: Minor
> Attachments: decimalBoundaryDataHive.csv
>
>
> sql("create table if not exists hiveBigDecimal(ID Int, date Timestamp, 
> country String, name String, phonetype String, serialname String, salary 
> decimal(27, 10))row format delimited fields terminated by ','")
>  
>  sql(s"LOAD DATA local inpath '$resourcesPath/decimalBoundaryDataHive.csv' 
> INTO table hiveBigDecimal")
>  
>  sql("select avg(salary)+10 from hiveBigDecimal").show(fals
>  
> Output with 2.3.1
> |(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS
> DECIMAL(32,14)))|
> |37800224355780013.75982042536364|
> Output with 2.3.2_RC5
> |(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS
> DECIMAL(32,14)))|
> |37800224355780013.75982042536000|
> *PS: If I revert SPARK-24957 then the 2.3.1 and 2.3.2_RC5 outputs are the same*
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25413) [2.3.2.rc5 Blocker] Precision Value is going for toss when Avg is done

2018-09-12 Thread sandeep katta (JIRA)
sandeep katta created SPARK-25413:
-

 Summary: [2.3.2.rc5 Blocker] Precision Value is going for toss 
when Avg is done
 Key: SPARK-25413
 URL: https://issues.apache.org/jira/browse/SPARK-25413
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.1
 Environment: CSV file content

 

1,23-07-2015,India,aaa1,phone197,ASD69643,1.23457E+16
2,24-07-2015,India,aaa2,phone756,ASD42892,1.23457E+16
3,25-07-2015,India,aaa3,phone1904,ASD37014,1.23457E+16
4,26-07-2015,India,aaa4,phone2435,ASD66902,1.23457E+16
5,27-07-2015,India,aaa5,phone2441,ASD90633,2.23457E+16
6,28-07-2015,India,aaa6,phone294,ASD59961,3.23457E+16
7,29-07-2015,India,aaa7,phone610,ASD14875,4.23457E+16
8,30-07-2015,India,aaa8,phone1848,ASD57308,5.23457E+16
9,18-07-2015,India,aaa9,phone706,ASD86717,6.23457E+16
10,19-07-2015,usa,aaa10,phone685,ASD30505,7.23457E+16
11,18-07-2015,china,aaa11,phone1554,ASD26101,8.23457E+16
Reporter: sandeep katta


sql("create table if not exists hiveBigDecimal(ID Int, date Timestamp, country 
String, name String, phonetype String, serialname String, salary decimal(27, 
10))row format delimited fields terminated by ','")
 
 sql(s"LOAD DATA local inpath '$resourcesPath/decimalBoundaryDataHive.csv' INTO 
table hiveBigDecimal")
 
 sql("select avg(salary)+10 from hiveBigDecimal").show(fals

 

Output with 2.3.1

|(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS
DECIMAL(32,14)))|
|37800224355780013.75982042536364|

Output with 2.3.2_RC5

|(CAST(avg(salary) AS DECIMAL(32,14)) + CAST(CAST(10 AS DECIMAL(2,0)) AS
DECIMAL(32,14)))|
|37800224355780013.75982042536000|

*PS: If I revert SPARK-24957 then the 2.3.1 and 2.3.2_RC5 outputs are the same*

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21542) Helper functions for custom Python Persistence

2018-09-12 Thread Peter Knight (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611733#comment-16611733
 ] 

Peter Knight commented on SPARK-21542:
--

Thanks for the reply, [~JohnHBauer]. Yes, I am using the @keyword_only decorator 
exactly as in the Stack Overflow example you cite. I'll be interested to see 
your code if you get it working. Thanks.

> Helper functions for custom Python Persistence
> --
>
> Key: SPARK-21542
> URL: https://issues.apache.org/jira/browse/SPARK-21542
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Ajay Saini
>Assignee: Ajay Saini
>Priority: Major
> Fix For: 2.3.0
>
>
> Currently, there is no way to easily persist JSON-serializable parameters in 
> Python only. All parameters in Python are persisted by converting them to 
> Java objects and using the Java persistence implementation. In order to 
> facilitate the creation of custom Python-only pipeline stages, it would be 
> good to have a Python-only persistence framework so that these stages do not 
> need to be implemented in Scala for persistence. 
> This task involves:
> - Adding implementations for DefaultParamsReadable, DefaultParamsWritable, 
> DefaultParamsReader, and DefaultParamsWriter in pyspark.
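
As a rough illustration of the existing Scala-side pattern that this ticket mirrors in pyspark (a sketch only; `NoOpTransformer` is a hypothetical stage, not part of Spark), mixing in the two traits is what gives a custom stage save/load support:

{code:scala}
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable}
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.StructType

// Hypothetical no-op stage: DefaultParamsWritable provides .write/.save, and
// the companion's DefaultParamsReadable provides .load, as long as every Param
// value is JSON-serializable -- the same contract the pyspark helpers requested
// in this ticket are meant to offer for Python-only stages.
class NoOpTransformer(override val uid: String)
    extends Transformer with DefaultParamsWritable {

  def this() = this(Identifiable.randomUID("noOpTransformer"))

  override def transform(dataset: Dataset[_]): DataFrame = dataset.toDF()
  override def transformSchema(schema: StructType): StructType = schema
  override def copy(extra: ParamMap): NoOpTransformer = defaultCopy(extra)
}

// The companion object supplies NoOpTransformer.load(path).
object NoOpTransformer extends DefaultParamsReadable[NoOpTransformer]
{code}

Usage would then look like new NoOpTransformer().write.overwrite().save(path) followed by NoOpTransformer.load(path); the pyspark DefaultParams* classes asked for here would enable the same round trip without a Scala counterpart.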



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20184) performance regression for complex/long sql when enable whole stage codegen

2018-09-12 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611717#comment-16611717
 ] 

Kazuaki Ishizaki commented on SPARK-20184:
--

In {{branch-2.4}}, we still see the performance degradation with whole-stage 
codegen enabled compared to running without it:
{code:java}
OpenJDK 64-Bit Server VM 1.8.0_171-8u171-b11-0ubuntu0.16.04.1-b11 on Linux 
4.4.0-66-generic
Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz
SPARK-20184:              Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
------------------------------------------------------------------------------------
codegen = T                     2915 / 3204          0.0   2915001883.0        1.0X
codegen = F                     1178 / 1368          0.0   1178020462.0        2.5X
{code}
 

> performance regression for complex/long sql when enable whole stage codegen
> ---
>
> Key: SPARK-20184
> URL: https://issues.apache.org/jira/browse/SPARK-20184
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Fei Wang
>Priority: Major
>
> The performance of the following SQL gets much worse in Spark 2.x with 
> whole-stage codegen on than with it off.
> SELECT
>sum(COUNTER_57) 
> ,sum(COUNTER_71) 
> ,sum(COUNTER_3)  
> ,sum(COUNTER_70) 
> ,sum(COUNTER_66) 
> ,sum(COUNTER_75) 
> ,sum(COUNTER_69) 
> ,sum(COUNTER_55) 
> ,sum(COUNTER_63) 
> ,sum(COUNTER_68) 
> ,sum(COUNTER_56) 
> ,sum(COUNTER_37) 
> ,sum(COUNTER_51) 
> ,sum(COUNTER_42) 
> ,sum(COUNTER_43) 
> ,sum(COUNTER_1)  
> ,sum(COUNTER_76) 
> ,sum(COUNTER_54) 
> ,sum(COUNTER_44) 
> ,sum(COUNTER_46) 
> ,DIM_1 
> ,DIM_2 
>   ,DIM_3
> FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;
> Number of rows in aggtable is about 3500.
> Whole-stage codegen on (spark.sql.codegen.wholeStage = true): 40s
> Whole-stage codegen off (spark.sql.codegen.wholeStage = false): 6s
> After some analysis, I think this is related to the huge Java method (a method 
> of thousands of lines) generated by codegen. If I set 
> -XX:-DontCompileHugeMethods, the performance gets much better (about 7s).
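
A minimal sketch of the comparison described above (assuming a SparkSession `spark` with `aggtable` already registered; the cut-down query stands in for the full 20-sum statement):

{code:scala}
// Run the same aggregation with whole-stage codegen on and off and time both.
// spark.sql.codegen.wholeStage is the config named in the report; the query is
// a shortened stand-in for the 20-sum statement quoted above.
def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(f"$label: ${(System.nanoTime() - start) / 1e9}%.1f s")
  result
}

val query =
  "SELECT sum(COUNTER_57), sum(COUNTER_71), DIM_1, DIM_2, DIM_3 " +
  "FROM aggtable GROUP BY DIM_1, DIM_2, DIM_3 LIMIT 100"

spark.conf.set("spark.sql.codegen.wholeStage", "true")
time("codegen on")(spark.sql(query).collect())

spark.conf.set("spark.sql.codegen.wholeStage", "false")
time("codegen off")(spark.sql(query).collect())
{code}

The -XX:-DontCompileHugeMethods observation can be checked the same way by passing that flag through spark.driver.extraJavaOptions / spark.executor.extraJavaOptions before starting the session.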



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org