[jira] [Resolved] (SPARK-25355) Support --proxy-user for Spark on K8s

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-25355.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 27422
[https://github.com/apache/spark/pull/27422]

> Support --proxy-user for Spark on K8s
> -
>
> Key: SPARK-25355
> URL: https://issues.apache.org/jira/browse/SPARK-25355
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Stavros Kontopoulos
>Assignee: Pedro Gonçalves Rossi Rodrigues
>Priority: Major
> Fix For: 3.1.0
>
>
> SPARK-23257 adds kerberized hdfs support for Spark on K8s. A major addition 
> needed is the support for proxy user. A proxy user is impersonated by a 
> superuser who executes operations on behalf of the proxy user. More on this: 
> [https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html]
> [https://github.com/spark-notebook/spark-notebook/blob/master/docs/proxyuser_impersonation.md]
> This has been implemented for Yarn upstream and Spark on Mesos here:
> [https://github.com/mesosphere/spark/pull/26]
> [~ifilonenko] creating this issue according to our discussion.
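For background, a minimal sketch of the Hadoop proxy-user mechanism that --proxy-user builds on (standard Hadoop UserGroupInformation API; the user name "alice" and the checked path are illustrative assumptions):

{code:scala}
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.security.UserGroupInformation

// The logged-in (typically Kerberos-authenticated) superuser.
val realUser  = UserGroupInformation.getLoginUser
// A proxy UGI for "alice", impersonated by the superuser.
val proxyUser = UserGroupInformation.createProxyUser("alice", realUser)

proxyUser.doAs(new PrivilegedExceptionAction[Boolean] {
  // Everything inside doAs runs as "alice", subject to the superuser's
  // hadoop.proxyuser.<superuser>.hosts/groups configuration on the cluster.
  override def run(): Boolean =
    FileSystem.get(new Configuration()).exists(new Path("/user/alice"))
})
{code}

spark-submit already exposes this through its --proxy-user flag; this ticket tracks wiring the same option through the Kubernetes submission path.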






[jira] [Assigned] (SPARK-25355) Support --proxy-user for Spark on K8s

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-25355:
-

Assignee: Pedro Gonçalves Rossi Rodrigues

> Support --proxy-user for Spark on K8s
> -
>
> Key: SPARK-25355
> URL: https://issues.apache.org/jira/browse/SPARK-25355
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Stavros Kontopoulos
>Assignee: Pedro Gonçalves Rossi Rodrigues
>Priority: Major
>
> SPARK-23257 adds kerberized hdfs support for Spark on K8s. A major addition 
> needed is the support for proxy user. A proxy user is impersonated by a 
> superuser who executes operations on behalf of the proxy user. More on this: 
> [https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html]
> [https://github.com/spark-notebook/spark-notebook/blob/master/docs/proxyuser_impersonation.md]
> This has been implemented for Yarn upstream and Spark on Mesos here:
> [https://github.com/mesosphere/spark/pull/26]
> [~ifilonenko] creating this issue according to our discussion.






[jira] [Resolved] (SPARK-31166) UNION map and other maps should not fail

2020-03-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31166.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27926
[https://github.com/apache/spark/pull/27926]

> UNION map and other maps should not fail
> 
>
> Key: SPARK-31166
> URL: https://issues.apache.org/jira/browse/SPARK-31166
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Updated] (SPARK-30873) Handling Node Decommissioning for Yarn cluster manger in Spark

2020-03-16 Thread Saurabh Chawla (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saurabh Chawla updated SPARK-30873:
---
Description: 
In many public cloud environments, node loss (in the case of AWS Spot instances, Spot 
Blocks and GCP preemptible VMs) is a planned and announced activity. 
The cloud provider informs the cluster manager about the upcoming loss of a 
node ahead of time. A few examples are listed here:
a) Spot loss in AWS (2 minutes before the event)
b) GCP preemptible VM loss (30 seconds before the event)
c) AWS Spot Block loss with info on termination time (generally a few tens of 
minutes before decommission, as configured in YARN)

This JIRA tries to make Spark leverage this advance knowledge of future node loss 
and adjust the scheduling of tasks to minimise the impact on 
the application. 
It is well known that when a host is lost, its executors, their running tasks, 
their caches and also their shuffle data are lost. This can result in wasted 
compute and other resources.

The focus here is to build a framework for YARN that can be extended to other 
cluster managers to handle such scenarios.

The framework must handle one or more of the following:
1) Prevent new tasks from starting on any executors on decommissioning nodes. 
2) Decide whether to kill the running tasks so that they can be restarted elsewhere 
(assuming they will not complete within the deadline), or allow them to 
continue in the hope that they will finish within the deadline.
3) Clear the shuffle data entries for the decommissioned node's hostname from the 
MapOutputTracker to prevent shuffle FetchFailed exceptions. The most significant 
advantage of unregistering shuffle outputs is that when Spark schedules the first 
re-attempt to compute the missing blocks, it notices all of the missing blocks 
from decommissioned nodes and recovers in only one attempt. This speeds up the 
recovery process significantly over the current Spark implementation, where 
stages might be rescheduled multiple times to recompute missing shuffle data from 
all nodes, and prevents jobs from being stuck for hours failing and recomputing.
4) Prevent a stage from aborting due to FetchFailed exceptions caused by node 
decommissioning. In Spark there is a number of consecutive stage attempts 
allowed before a stage is aborted; this is controlled by the config 
spark.stage.maxConsecutiveAttempts. Not counting fetch failures caused by the 
decommissioning of nodes towards stage failure improves the reliability of the 
system.

Main components of the change:
1) Get the ClusterInfo update from the Resource Manager -> Application Master 
-> Spark Driver.
2) A DecommissionTracker, which resides inside the driver, tracks all the decommissioned 
nodes and takes the necessary actions and state transitions.
3) Based on the decommissioned-node list, add hooks in the code to achieve:
 a) No new tasks on executors of decommissioning nodes
 b) Removal of the shuffle data mapping info for the node to be decommissioned from the 
MapOutputTracker
 c) Not counting FetchFailures from decommissioned nodes towards stage failure

On receiving the info that a node is to be decommissioned, the below actions need 
to be performed by the DecommissionTracker on the driver:
 * Add an entry for the node in the DecommissionTracker with its termination time and node 
state set to "DECOMMISSIONING".
 * Stop assigning any new tasks to executors on the nodes which are candidates 
for decommissioning. This makes sure that, as the running tasks finish, usage of this 
node slowly dies down.
 * Kill all the executors on the decommissioning nodes after a configurable 
period of time, say "spark.graceful.decommission.executor.leasetimePct". This 
killing ensures two things. Firstly, the task failure will be accounted in the job 
failure count. Secondly, it avoids generating more shuffle data on a node that 
will eventually be lost. The node state is set to "EXECUTOR_DECOMMISSIONED". 
 * Mark shuffle data on the node as unavailable after the 
"spark.qubole.graceful.decommission.shuffedata.leasetimePct" time. This 
ensures that recomputation of the missing shuffle partitions is done early, rather 
than reducers failing with a time-consuming FetchFailure. The node state is set 
to "SHUFFLE_DECOMMISSIONED". 
 * Mark the node as terminated after the termination time. Now the state of the 
node is "TERMINATED".
 * Remove the node's entry from the DecommissionTracker if the same hostname is 
reused (this is not uncommon in many public cloud environments).

This is the life cycle of a node which is decommissioned:
DECOMMISSIONING -> EXECUTOR_DECOMMISSIONED -> SHUFFLEDATA_DECOMMISSIONED -> 
TERMINATED.
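A small, purely hypothetical sketch of that life cycle (the state names come from the 
description above; this is not an existing Spark API):

{code:scala}
object NodeDecommissionState extends Enumeration {
  val Decommissioning, ExecutorDecommissioned, ShuffleDataDecommissioned, Terminated = Value
}

import NodeDecommissionState._

// Allowed forward transitions of the DecommissionTracker state machine:
// each step is driven by the corresponding lease/termination timeout.
val nextState: Map[NodeDecommissionState.Value, NodeDecommissionState.Value] = Map(
  Decommissioning           -> ExecutorDecommissioned,    // executor lease time expires
  ExecutorDecommissioned    -> ShuffleDataDecommissioned, // shuffle-data lease time expires
  ShuffleDataDecommissioned -> Terminated                 // node termination time reached
)
{code}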

*Why do we decommission the executors before decommissioning the shuffle 
data?* - There are 2 reasons why we are exiting the executors before the 
shuffle service:
a) As per the current logic, whenever we receive the node decommissioning notification we 
stop assigning new tasks to the executors running on that node. We give the 
tasks already running on those executors some time to complete before killing 

[jira] [Resolved] (SPARK-28423) merge Scan and Batch/Stream

2020-03-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-28423.
-
  Assignee: (was: Wenchen Fan)
Resolution: Won't Fix

> merge Scan and Batch/Stream
> ---
>
> Key: SPARK-28423
> URL: https://issues.apache.org/jira/browse/SPARK-28423
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Priority: Major
>







[jira] [Updated] (SPARK-28423) merge Scan and Batch/Stream

2020-03-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-28423:

Affects Version/s: (was: 3.1.0)
   3.0.0

> merge Scan and Batch/Stream
> ---
>
> Key: SPARK-28423
> URL: https://issues.apache.org/jira/browse/SPARK-28423
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>







[jira] [Updated] (SPARK-27856) do not forcibly add cast when inserting table

2020-03-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-27856:

Affects Version/s: (was: 3.1.0)
   3.0.0

> do not forcibly add cast when inserting table
> -
>
> Key: SPARK-27856
> URL: https://issues.apache.org/jira/browse/SPARK-27856
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>







[jira] [Updated] (SPARK-29331) create DS v2 Write at physical plan

2020-03-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-29331:

Affects Version/s: (was: 3.1.0)
   3.0.0

> create DS v2 Write at physical plan
> ---
>
> Key: SPARK-29331
> URL: https://issues.apache.org/jira/browse/SPARK-29331
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>







[jira] [Resolved] (SPARK-29331) create DS v2 Write at physical plan

2020-03-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29331.
-
  Assignee: (was: Wenchen Fan)
Resolution: Won't Fix

> create DS v2 Write at physical plan
> ---
>
> Key: SPARK-29331
> URL: https://issues.apache.org/jira/browse/SPARK-29331
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Priority: Major
>







[jira] [Resolved] (SPARK-27856) do not forcibly add cast when inserting table

2020-03-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27856.
-
  Assignee: (was: Wenchen Fan)
Resolution: Won't Fix

This is not needed anymore as we have the ANSI store assignment rule.
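For context, a minimal sketch of the store assignment policy referenced above (the config key exists in Spark 3.0; the table and insert are illustrative):

{code:scala}
// "ANSI", "LEGACY" and "STRICT" control which implicit casts
// are allowed when inserting values into a table.
spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")
spark.sql("CREATE TABLE t (i INT) USING parquet")
spark.sql("INSERT INTO t VALUES (1)")
{code}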

> do not forcibly add cast when inserting table
> -
>
> Key: SPARK-27856
> URL: https://issues.apache.org/jira/browse/SPARK-27856
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Priority: Major
>







[jira] [Updated] (SPARK-25083) remove the type erasure hack in data source scan

2020-03-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-25083:

Affects Version/s: (was: 3.1.0)
   3.0.0

> remove the type erasure hack in data source scan
> 
>
> Key: SPARK-25083
> URL: https://issues.apache.org/jira/browse/SPARK-25083
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>
> It's hacky to pretend an `RDD[ColumnarBatch]` is an `RDD[InternalRow]`. We 
> should make the type explicit.
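A minimal illustration of the pattern in question (ColumnarBatch and InternalRow are internal Spark classes; the two helper functions below are purely hypothetical):

{code:scala}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.vectorized.ColumnarBatch

// The hack: the RDD really contains ColumnarBatch elements, but the static type lies.
def hacky(batches: RDD[ColumnarBatch]): RDD[InternalRow] =
  batches.asInstanceOf[RDD[InternalRow]]

// The explicit alternative: keep the element type visible to callers.
def explicit(batches: RDD[ColumnarBatch]): RDD[ColumnarBatch] =
  batches
{code}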






[jira] [Resolved] (SPARK-25083) remove the type erasure hack in data source scan

2020-03-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-25083.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

This was done after we introduced the columnar execution API.

> remove the type erasure hack in data source scan
> 
>
> Key: SPARK-25083
> URL: https://issues.apache.org/jira/browse/SPARK-25083
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>
> It's hacky to pretend an `RDD[ColumnarBatch]` is an `RDD[InternalRow]`. We 
> should make the type explicit.






[jira] [Updated] (SPARK-31019) make it clear that people can deduplicate map keys

2020-03-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-31019:

Affects Version/s: (was: 3.1.0)
   3.0.0

> make it clear that people can deduplicate map keys
> --
>
> Key: SPARK-31019
> URL: https://issues.apache.org/jira/browse/SPARK-31019
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Resolved] (SPARK-31019) make it clear that people can deduplicate map keys

2020-03-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31019.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

> make it clear that people can deduplicate map keys
> --
>
> Key: SPARK-31019
> URL: https://issues.apache.org/jira/browse/SPARK-31019
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Resolved] (SPARK-26590) make fetch-block-to-disk backward compatible

2020-03-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-26590.
-
  Assignee: (was: Wenchen Fan)
Resolution: Won't Fix

> make fetch-block-to-disk backward compatible
> 
>
> Key: SPARK-26590
> URL: https://issues.apache.org/jira/browse/SPARK-26590
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Priority: Major
>







[jira] [Updated] (SPARK-26590) make fetch-block-to-disk backward compatible

2020-03-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-26590:

Affects Version/s: (was: 3.1.0)
   3.0.0

> make fetch-block-to-disk backward compatible
> 
>
> Key: SPARK-26590
> URL: https://issues.apache.org/jira/browse/SPARK-26590
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>







[jira] [Resolved] (SPARK-29439) DDL commands should not use DataSourceV2Relation

2020-03-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29439.
-
  Assignee: (was: Wenchen Fan)
Resolution: Won't Fix

> DDL commands should not use DataSourceV2Relation
> 
>
> Key: SPARK-29439
> URL: https://issues.apache.org/jira/browse/SPARK-29439
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Priority: Major
>







[jira] [Updated] (SPARK-29439) DDL commands should not use DataSourceV2Relation

2020-03-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-29439:

Affects Version/s: (was: 3.1.0)
   3.0.0

> DDL commands should not use DataSourceV2Relation
> 
>
> Key: SPARK-29439
> URL: https://issues.apache.org/jira/browse/SPARK-29439
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>







[jira] [Updated] (SPARK-31116) PrquetRowConverter does not follow case sensitivity

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31116:
--
Parent: SPARK-25603
Issue Type: Sub-task  (was: Bug)

> PrquetRowConverter does not follow case sensitivity
> ---
>
> Key: SPARK-31116
> URL: https://issues.apache.org/jira/browse/SPARK-31116
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Tae-kyeom, Kim
>Assignee: Tae-kyeom, Kim
>Priority: Blocker
> Fix For: 3.0.0
>
>
> After upgrading the Spark version to 3.0.0-SNAPSHOT, selecting parquet columns 
> throws an exception when matching is expected to be case-insensitive, even though 
> spark.sql.caseSensitive is set to false and the parquet file is read with a 
> case-ignoring schema (i.e. the columns in parquet and the catalyst types are the 
> same up to case).
>  
> To reproduce the error, executing the following code causes a 
> java.lang.IllegalArgumentException:
>  
> {code:scala}
> import org.apache.spark.sql.types._
> val path = "/some/temp/path"
> spark
>   .range(1L)
>   .selectExpr("NAMED_STRUCT('lowercase', id, 'camelCase', id + 1) AS StructColumn")
>   .write.parquet(path)
> val caseInsensitiveSchema = new StructType()
>   .add(
>     "StructColumn",
>     new StructType()
>       .add("LowerCase", LongType)
>       .add("camelcase", LongType))
> spark.read.schema(caseInsensitiveSchema).parquet(path).show()
> {code}
> Then we got the following error:
> {code:java}
> 23:57:09.077 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 
> in stage 215.0 (TID 366)23:57:09.077 ERROR 
> org.apache.spark.executor.Executor: Exception in task 0.0 in stage 215.0 (TID 
> 366)java.lang.IllegalArgumentException: lowercase does not exist. Available: 
> LowerCase, camelcase at 
> org.apache.spark.sql.types.StructType.$anonfun$fieldIndex$1(StructType.scala:306)
>  at scala.collection.immutable.Map$Map2.getOrElse(Map.scala:147) at 
> org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:305) at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter.$anonfun$fieldConverters$1(ParquetRowConverter.scala:182)
>  at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at 
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at 
> scala.collection.TraversableLike.map(TraversableLike.scala:238) at 
> scala.collection.TraversableLike.map$(TraversableLike.scala:231) at 
> scala.collection.AbstractTraversable.map(Traversable.scala:108) at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter.(ParquetRowConverter.scala:181)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter.org$apache$spark$sql$execution$datasources$parquet$ParquetRowConverter$$newConverter(ParquetRowConverter.scala:351)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter.$anonfun$fieldConverters$1(ParquetRowConverter.scala:185)
>  at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at 
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at 
> scala.collection.TraversableLike.map(TraversableLike.scala:238) at 
> scala.collection.TraversableLike.map$(TraversableLike.scala:231) at 
> scala.collection.AbstractTraversable.map(Traversable.scala:108) at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter.(ParquetRowConverter.scala:181)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRecordMaterializer.(ParquetRecordMaterializer.scala:43)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport.prepareForRead(ParquetReadSupport.scala:130)
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:204)
>  at 
> org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:182)
>  at 
> org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:341)
>  at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:116)
>  at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169)
>  at 
> 

[jira] [Updated] (SPARK-29374) do not allow arbitrary v2 catalog as session catalog

2020-03-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-29374:

Affects Version/s: (was: 3.1.0)
   3.0.0

> do not allow arbitrary v2 catalog as session catalog
> 
>
> Key: SPARK-29374
> URL: https://issues.apache.org/jira/browse/SPARK-29374
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>







[jira] [Resolved] (SPARK-29374) do not allow arbitrary v2 catalog as session catalog

2020-03-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29374.
-
  Assignee: (was: Wenchen Fan)
Resolution: Won't Fix

> do not allow arbitrary v2 catalog as session catalog
> 
>
> Key: SPARK-29374
> URL: https://issues.apache.org/jira/browse/SPARK-29374
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Priority: Major
>







[jira] [Updated] (SPARK-29927) Parse timestamps in microsecond precision by `to_timestamp`, `to_unix_timestamp`, `unix_timestamp`

2020-03-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-29927:

Affects Version/s: (was: 3.1.0)
   3.0.0

> Parse timestamps in microsecond precision by `to_timestamp`, 
> `to_unix_timestamp`, `unix_timestamp`
> --
>
> Key: SPARK-29927
> URL: https://issues.apache.org/jira/browse/SPARK-29927
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, the `to_timestamp`, `to_unix_timestamp`, `unix_timestamp` 
> functions use SimpleDateFormat to parse strings to timestamps. 
> SimpleDateFormat can parse only with millisecond precision when a user 
> specifies `SSS` in a pattern. This ticket aims to support parsing up to 
> microsecond precision.
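An illustration of the desired behaviour (the pattern and sample value are assumptions; assumes a SparkSession `spark` and `import spark.implicits._`):

{code:scala}
import org.apache.spark.sql.functions.to_timestamp

val df = Seq("2019-11-14 20:35:30.123456").toDF("ts_str")
// With microsecond support, the full .SSSSSS fraction should survive parsing.
df.select(to_timestamp($"ts_str", "yyyy-MM-dd HH:mm:ss.SSSSSS").as("ts")).show(false)
{code}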






[jira] [Commented] (SPARK-29927) Parse timestamps in microsecond precision by `to_timestamp`, `to_unix_timestamp`, `unix_timestamp`

2020-03-16 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060614#comment-17060614
 ] 

Wenchen Fan commented on SPARK-29927:
-

This is done in 3.0 as we switched to the Java 8 datetime API.

> Parse timestamps in microsecond precision by `to_timestamp`, 
> `to_unix_timestamp`, `unix_timestamp`
> --
>
> Key: SPARK-29927
> URL: https://issues.apache.org/jira/browse/SPARK-29927
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently, the `to_timestamp`, `to_unix_timestamp`, `unix_timestamp` 
> functions use SimpleDateFormat to parse strings to timestamps. 
> SimpleDateFormat can parse only with millisecond precision when a user 
> specifies `SSS` in a pattern. This ticket aims to support parsing up to 
> microsecond precision.






[jira] [Resolved] (SPARK-29927) Parse timestamps in microsecond precision by `to_timestamp`, `to_unix_timestamp`, `unix_timestamp`

2020-03-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29927.
-
Fix Version/s: 3.0.0
 Assignee: Maxim Gekk
   Resolution: Fixed

> Parse timestamps in microsecond precision by `to_timestamp`, 
> `to_unix_timestamp`, `unix_timestamp`
> --
>
> Key: SPARK-29927
> URL: https://issues.apache.org/jira/browse/SPARK-29927
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, the `to_timestamp`, `to_unix_timestamp`, `unix_timestamp` 
> functions use SimpleDateFormat to parse strings to timestamps. 
> SimpleDateFormat can parse only with millisecond precision when a user 
> specifies `SSS` in a pattern. This ticket aims to support parsing up to 
> microsecond precision.






[jira] [Resolved] (SPARK-30307) remove ReusedQueryStageExec

2020-03-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30307.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

> remove ReusedQueryStageExec
> ---
>
> Key: SPARK-30307
> URL: https://issues.apache.org/jira/browse/SPARK-30307
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Updated] (SPARK-30307) remove ReusedQueryStageExec

2020-03-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-30307:

Affects Version/s: (was: 3.1.0)
   3.0.0

> remove ReusedQueryStageExec
> ---
>
> Key: SPARK-30307
> URL: https://issues.apache.org/jira/browse/SPARK-30307
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Created] (SPARK-31170) Spark Cli does not respect hive-site.xml and spark.sql.warehouse.dir

2020-03-16 Thread Kent Yao (Jira)
Kent Yao created SPARK-31170:


 Summary: Spark Cli does not respect hive-site.xml and 
spark.sql.warehouse.dir
 Key: SPARK-31170
 URL: https://issues.apache.org/jira/browse/SPARK-31170
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Kent Yao


In the Spark CLI, we create a Hive CliSessionState and it does not load 
hive-site.xml. So the configurations in hive-site.xml will not take effect 
as they do in other Spark-Hive integration apps.

Also, the warehouse directory is not picked up correctly. If the `default` 
database does not exist, the CliSessionState will create one the first 
time it talks to the metastore. The `Location` of the default DB will be 
neither the value of spark.sql.warehouse.dir nor the user-specified value of 
hive.metastore.warehouse.dir, but the default value of 
hive.metastore.warehouse.dir, which is always `/user/hive/warehouse`.
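For reference, a minimal sketch of the expected behaviour (paths are illustrative):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("warehouse-dir-demo")
  .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse")  // should win over Hive defaults
  .enableHiveSupport()
  .getOrCreate()

// The default database location should point at the configured warehouse dir,
// not at the hardcoded /user/hive/warehouse.
spark.sql("DESCRIBE DATABASE default").show(truncate = false)
{code}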






[jira] [Assigned] (SPARK-25826) Kerberos Support in Kubernetes resource manager

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-25826:
-

Assignee: Ilan Filonenko

> Kerberos Support in Kubernetes resource manager
> ---
>
> Key: SPARK-25826
> URL: https://issues.apache.org/jira/browse/SPARK-25826
> Project: Spark
>  Issue Type: Umbrella
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Ilan Filonenko
>Assignee: Ilan Filonenko
>Priority: Major
>
> This is the umbrella issue for all Kerberos related tasks with relation to 
> Spark on Kubernetes






[jira] [Commented] (SPARK-25826) Kerberos Support in Kubernetes resource manager

2020-03-16 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060603#comment-17060603
 ] 

Dongjoon Hyun commented on SPARK-25826:
---

Hi, [~ifilonenko].
Can we resolve this umbrella JIRA by excluding the test JIRA SPARK-25750 from 
here?

> Kerberos Support in Kubernetes resource manager
> ---
>
> Key: SPARK-25826
> URL: https://issues.apache.org/jira/browse/SPARK-25826
> Project: Spark
>  Issue Type: Umbrella
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Ilan Filonenko
>Priority: Major
>
> This is the umbrella issue for all Kerberos related tasks with relation to 
> Spark on Kubernetes






[jira] [Updated] (SPARK-25355) Support --proxy-user for Spark on K8s

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25355:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Support --proxy-user for Spark on K8s
> -
>
> Key: SPARK-25355
> URL: https://issues.apache.org/jira/browse/SPARK-25355
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> SPARK-23257 adds kerberized hdfs support for Spark on K8s. A major addition 
> needed is the support for proxy user. A proxy user is impersonated by a 
> superuser who executes operations on behalf of the proxy user. More on this: 
> [https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html]
> [https://github.com/spark-notebook/spark-notebook/blob/master/docs/proxyuser_impersonation.md]
> This has been implemented for Yarn upstream and Spark on Mesos here:
> [https://github.com/mesosphere/spark/pull/26]
> [~ifilonenko] creating this issue according to our discussion.






[jira] [Updated] (SPARK-25299) Use remote storage for persisting shuffle data

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25299:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Use remote storage for persisting shuffle data
> --
>
> Key: SPARK-25299
> URL: https://issues.apache.org/jira/browse/SPARK-25299
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: Matt Cheah
>Priority: Major
>  Labels: SPIP
>
> In Spark, the shuffle primitive requires Spark executors to persist data to 
> the local disk of the worker nodes. If executors crash, the external shuffle 
> service can continue to serve the shuffle data that was written beyond the 
> lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the 
> external shuffle service is deployed on every worker node. The shuffle 
> service shares local disk with the executors that run on its node.
> There are some shortcomings with the way shuffle is fundamentally implemented 
> right now. Particularly:
>  * If any external shuffle service process or node becomes unavailable, all 
> applications that had an executor that ran on that node must recompute the 
> shuffle blocks that were lost.
>  * Similarly to the above, the external shuffle service must be kept running 
> at all times, which may waste resources when no applications are using that 
> shuffle service node.
>  * Mounting local storage can prevent users from taking advantage of 
> desirable isolation benefits from using containerized environments, like 
> Kubernetes. We had an external shuffle service implementation in an early 
> prototype of the Kubernetes backend, but it was rejected due to its strict 
> requirement to be able to mount hostPath volumes or other persistent volume 
> setups.
> In the following [architecture discussion 
> document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40]
>  (note: _not_ an SPIP), we brainstorm various high level architectures for 
> improving the external shuffle service in a way that addresses the above 
> problems. The purpose of this umbrella JIRA is to promote additional 
> discussion on how we can approach these problems, both at the architecture 
> level and the implementation level. We anticipate filing sub-issues that 
> break down the tasks that must be completed to achieve this goal.
> Edit June 28 2019: Our SPIP is here: 
> [https://docs.google.com/document/d/1d6egnL6WHOwWZe8MWv3m8n4PToNacdx7n_0iMSWwhCQ/edit]






[jira] [Updated] (SPARK-24818) Ensure all the barrier tasks in the same stage are launched together

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24818:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Ensure all the barrier tasks in the same stage are launched together
> 
>
> Key: SPARK-24818
> URL: https://issues.apache.org/jira/browse/SPARK-24818
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Xingbo Jiang
>Priority: Major
>
> When some executors/hosts are blacklisted, it may happen that only a part of 
> the tasks in the same barrier stage can be launched. We shall detect the case 
> and revert the allocated resource offers.
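For context, a minimal barrier-stage example (standard barrier execution API from Spark 2.4+; the data and partition count are illustrative) showing why all tasks of such a stage must be launched together:

{code:scala}
import org.apache.spark.BarrierTaskContext

val rdd = spark.sparkContext.parallelize(1 to 100, numSlices = 4)
val doubled = rdd.barrier().mapPartitions { iter =>
  val ctx = BarrierTaskContext.get()
  ctx.barrier()   // every task of the stage must reach this point, so all 4 must be running
  iter.map(_ * 2)
}.collect()
{code}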






[jira] [Commented] (SPARK-24432) Add support for dynamic resource allocation

2020-03-16 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060574#comment-17060574
 ] 

Dongjoon Hyun commented on SPARK-24432:
---

I updated the affected version from 3.0.0 to 3.1.0 because Apache Spark doesn't 
allow backporting new features.

> Add support for dynamic resource allocation
> ---
>
> Key: SPARK-24432
> URL: https://issues.apache.org/jira/browse/SPARK-24432
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Yinan Li
>Priority: Major
>
> This is an umbrella ticket for work on adding support for dynamic resource 
> allocation into the Kubernetes mode. This requires a Kubernetes-specific 
> external shuffle service. The feature is available in our fork at 
> github.com/apache-spark-on-k8s/spark.






[jira] [Updated] (SPARK-24432) Add support for dynamic resource allocation

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24432:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Add support for dynamic resource allocation
> ---
>
> Key: SPARK-24432
> URL: https://issues.apache.org/jira/browse/SPARK-24432
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Yinan Li
>Priority: Major
>
> This is an umbrella ticket for work on adding support for dynamic resource 
> allocation into the Kubernetes mode. This requires a Kubernetes-specific 
> external shuffle service. The feature is available in our fork at 
> github.com/apache-spark-on-k8s/spark.






[jira] [Updated] (SPARK-23482) R support for robust regression with Huber loss

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23482:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> R support for robust regression with Huber loss
> ---
>
> Key: SPARK-23482
> URL: https://issues.apache.org/jira/browse/SPARK-23482
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Affects Versions: 3.1.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> Add support for Huber loss for linear regression in the R API. See the linked JIRA 
> for the change in Scala/Java.
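For reference, a sketch of the existing Scala API that the R wrapper would mirror (LinearRegression supports Huber loss since Spark 2.3; the training DataFrame is assumed):

{code:scala}
import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression()
  .setLoss("huber")    // robust regression with Huber loss instead of squared error
  .setEpsilon(1.35)    // shape parameter controlling robustness to outliers
  .setMaxIter(100)

// val model = lr.fit(trainingDF)  // trainingDF: DataFrame with "features" and "label" columns
{code}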






[jira] [Resolved] (SPARK-21505) A dynamic join operator to improve the join reliability

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-21505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-21505.
---
Resolution: Incomplete

I am closing this issue due to long inactivity. Please feel free to reopen it if 
it is still valid.

> A dynamic join operator to improve the join reliability
> ---
>
> Key: SPARK-21505
> URL: https://issues.apache.org/jira/browse/SPARK-21505
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0, 3.0.0
>Reporter: Lin
>Priority: Major
>  Labels: features
>
> As we know, hash join is more efficient than sort-merge join. But today hash 
> join is not so widely used because it may fail with an OutOfMemory (OOM) error 
> due to limited memory resources, data skew, statistics mis-estimation and so 
> on. For example, if we apply shuffle hash join on an unevenly distributed 
> dataset, some partitions might be so large that we cannot build a hash table 
> for a particular partition, causing an OOM error. When OOM happens, current 
> Spark will throw an exception, resulting in job failure. On the 
> other hand, if sort-merge join is used, there will be shuffle, sorting and 
> extra spill, degrading the join. Considering the efficiency 
> of hash join, we want to propose a fallback mechanism to dynamically use hash 
> join or sort-merge join at runtime, at task level, to provide a more reliable 
> join operation.
> This new dynamic join operator internally implements the logic of HashJoin, 
> Iterator Reconstruct, Sort, and MergeJoin. We show the process of this 
> dynamic join method as follows:
> HashJoin: We start by building a hash table on one side of the join partitions. 
> If the hash table is built successfully, the behaviour is the same as the current 
> ShuffledHashJoin operator. 
> Sort: If we fail to build the hash table due to the large partition size, we do 
> SortMergeJoin only on this partition, but we need to rebuild the partition. When OOM 
> happens, a hash table corresponding to a partial part of this partition has 
> been built successfully (e.g. the first 4000 rows of the RDD), and the iterator of 
> this partition is now pointing to the 4001st row of the partition. We reuse this 
> hash table to reconstruct the iterator for the first 4000 rows and 
> concatenate it with the remaining rows of this partition so that we can rebuild this 
> partition completely. On this re-built partition, we apply sorting based on 
> key values.
> MergeJoin: After getting two sorted iterators, we perform a regular merge join 
> against them and emit the records to downstream operators.
> Iterator Reconstruct: The BytesToBytesMap has to be spilled to disk to release 
> the memory for other operators, such as Sort, Join, etc. In addition, it has 
> to be converted to an Iterator, so that it can be concatenated with the remaining 
> items in the original iterator that was used to build the hash table.
> Metadata Population: Necessary metadata, such as sorting keys, join type, 
> etc., has to be populated so that it can be used for the potential Sort and 
> MergeJoin operators.
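A self-contained, conceptual sketch of the fallback idea (plain Scala collections stand in for partitions and the hash table; none of this is the proposal's actual implementation):

{code:scala}
object DynamicJoinSketch {
  type Row = (Int, String)

  // Hash join: build a table on one side, stream the other side through it.
  def hashJoin(build: Seq[Row], probe: Seq[Row]): Seq[(Int, (String, String))] = {
    val table = build.groupBy(_._1)
    probe.flatMap { case (k, p) => table.getOrElse(k, Nil).map { case (_, b) => (k, (b, p)) } }
  }

  // Sort-merge join: group and sort both sides by key, then walk them in lock-step.
  def sortMergeJoin(build: Seq[Row], probe: Seq[Row]): Seq[(Int, (String, String))] = {
    val left  = build.groupBy(_._1).map { case (k, v) => (k, v.map(_._2)) }.toSeq.sortBy(_._1)
    val right = probe.groupBy(_._1).map { case (k, v) => (k, v.map(_._2)) }.toSeq.sortBy(_._1)
    val out = scala.collection.mutable.ArrayBuffer.empty[(Int, (String, String))]
    var i = 0; var j = 0
    while (i < left.length && j < right.length) {
      val (lk, lvs) = left(i); val (rk, rvs) = right(j)
      if (lk < rk) i += 1
      else if (lk > rk) j += 1
      else { for (b <- lvs; p <- rvs) out += ((lk, (b, p))); i += 1; j += 1 }
    }
    out.toSeq
  }

  // Per-partition decision: hash join while the build side fits, otherwise fall back.
  def dynamicJoin(build: Seq[Row], probe: Seq[Row], maxHashRows: Int): Seq[(Int, (String, String))] =
    if (build.size <= maxHashRows) hashJoin(build, probe) else sortMergeJoin(build, probe)

  def main(args: Array[String]): Unit = {
    val small = Seq(1 -> "a", 2 -> "b")
    val big   = (1 to 10).map(i => i -> s"v$i")
    println(dynamicJoin(small, big, maxHashRows = 1000))  // hash join path
    println(dynamicJoin(big, small, maxHashRows = 5))     // sort-merge fallback path
  }
}
{code}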






[jira] [Updated] (SPARK-23443) Spark with Glue as external catalog

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23443:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Spark with Glue as external catalog
> ---
>
> Key: SPARK-23443
> URL: https://issues.apache.org/jira/browse/SPARK-23443
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Ameen Tayyebi
>Priority: Major
>
> AWS Glue Catalog is an external Hive metastore backed by a web service. It 
> allows permanent storage of catalog data for BigData use cases.
> To find out more information about AWS Glue, please consult:
>  * AWS Glue - [https://aws.amazon.com/glue/]
>  * Using Glue as a Metastore catalog for Spark - 
> [https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html]
> Today, the integration of Glue and Spark is through the Hive layer. Glue 
> implements the IMetaStore interface of Hive and for installations of Spark 
> that contain Hive, Glue can be used as the metastore.
> The feature set that Glue supports does not align 1-1 with the set of 
> features that the latest version of Spark supports. For example, the Glue 
> interface supports more advanced partition pruning than the latest version of 
> Hive embedded in Spark.
> To enable a more natural integration with Spark and to allow leveraging 
> latest features of Glue, without being coupled to Hive, a direct integration 
> through Spark's own Catalog API is proposed. This Jira tracks this work.






[jira] [Commented] (SPARK-31148) Switching inner join statement yield different result

2020-03-16 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060572#comment-17060572
 ] 

L. C. Hsieh commented on SPARK-31148:
-

I see. This seems to be an Amazon EMR Spark customization. For such vendor 
customizations, you should ask Amazon for support.

> Switching inner join statement yield different result
> -
>
> Key: SPARK-31148
> URL: https://issues.apache.org/jira/browse/SPARK-31148
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Tanapol Nearunchorn
>Priority: Major
>
> Given this query
>  
> {code:java}
> select r.id
> from sqoop_wongnai.wongnai__w_ref r
> inner join sqoop_wongnai.wongnai__w_photo p on p.root_referrer_id = r.id
> inner join tests.restaurants_id_export_photos i on i.restaurant_id = 
> r.restaurant_id
> where r.restaurant_id = 360
> {code}
> Spark returns nothing, but when I swap the inner join statements on lines 3 
> and 4 there are results returned.
> I guarantee that we have data that match the join condition above.
> Here's the result of EXPLAIN EXTENDED for the above query:
>  
> {code:java}
> == Parsed Logical Plan ==
> 'Project ['r.id]
> +- 'Filter ('r.restaurant_id = 360)
>+- 'Join Inner, ('i.restaurant_id = 'r.restaurant_id)
>   :- 'Join Inner, ('p.root_referrer_id = 'r.id)
>   :  :- 'SubqueryAlias `r`
>   :  :  +- 'UnresolvedRelation `sqoop_wongnai`.`wongnai__w_ref`
>   :  +- 'SubqueryAlias `p`
>   : +- 'UnresolvedRelation `sqoop_wongnai`.`wongnai__w_photo`
>   +- 'SubqueryAlias `i`
>  +- 'UnresolvedRelation `tests`.`restaurants_id_export_photos`
> == Analyzed Logical Plan ==
> id: bigint
> Project [id#1834834L]
> +- Filter (restaurant_id#1834836L = cast(360 as bigint))
>+- Join Inner, (restaurant_id#1834876L = restaurant_id#1834836L)
>   :- Join Inner, (root_referrer_id#1834858L = id#1834834L)
>   :  :- SubqueryAlias `r`
>   :  :  +- SubqueryAlias `sqoop_wongnai`.`wongnai__w_ref`
>   :  : +- HiveTableRelation `sqoop_wongnai`.`wongnai__w_ref`, 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe, [id#1834834L, 
> restaurant_id#1834836L,]
>   :  +- SubqueryAlias `p`
>   : +- SubqueryAlias `sqoop_wongnai`.`wongnai__w_photo`
>   :+- HiveTableRelation `sqoop_wongnai`.`wongnai__w_photo`, 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe, [root_referrer_id#1834858L, ... 
> 14 more fields]
>   +- SubqueryAlias `i`
>  +- SubqueryAlias `tests`.`restaurants_id_export_photos`
> +- HiveTableRelation `tests`.`restaurants_id_export_photos`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> [restaurant_id#1834876L]== Optimized Logical Plan ==
> Project [id#1834834L]
> +- Join Inner, (restaurant_id#1834876L = restaurant_id#1834836L)
>:- Project [id#1834834L, restaurant_id#1834836L]
>:  +- Join Inner, (root_referrer_id#1834858L = id#1834834L)
>: :- Project [id#1834834L, restaurant_id#1834836L]
>: :  +- Filter ((isnotnull(restaurant_id#1834836L) && 
> (restaurant_id#1834836L = 360)) && isnotnull(id#1834834L))
>: : +- InMemoryRelation [id#1834834L, restaurant_id#1834836L], 
> StorageLevel(disk, memory, deserialized, 1 replicas)
>: :   +- Scan hive sqoop_wongnai.wongnai__w_ref [id#87357L, 
> restaurant_id#87359L], HiveTableRelation `sqoop_wongnai`.`wongnai__w_ref`, 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe, [id#87357L, 
> restaurant_id#87359L]
>: +- Project [root_referrer_id#1834858L]
>:+- Filter (isnotnull(root_referrer_id#1834858L) && 
> bloomfilter#1835088 of [id#1834834L] filtering [root_referrer_id#1834858L])
>:   +- InMemoryRelation [root_referrer_id#1834858L], 
> StorageLevel(disk, memory, deserialized, 1 replicas)
>: +- Scan hive sqoop_wongnai.wongnai__w_photo 
> [root_referrer_id#82906L], HiveTableRelation 
> `sqoop_wongnai`.`wongnai__w_photo`, 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe, [root_referrer_id#82906L]
>+- Filter ((isnotnull(restaurant_id#1834876L) && (restaurant_id#1834876L = 
> 360)) && bloomfilter#1835087 of [restaurant_id#1834836L] filtering 
> [restaurant_id#1834876L])
>   +- HiveTableRelation `tests`.`restaurants_id_export_photos`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [restaurant_id#1834876L]
> == Physical Plan ==
> *(7) Project [id#1834834L]
> +- *(7) SortMergeJoin [restaurant_id#1834836L], [restaurant_id#1834876L], 
> Inner
>:- *(5) Sort [restaurant_id#1834836L ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(restaurant_id#1834836L, 200)
>: +- *(4) Project [id#1834834L, restaurant_id#1834836L]
>:+- *(4) SortMergeJoin [id#1834834L], [root_referrer_id#1834858L], 
> Inner
>:   :- 

[jira] [Resolved] (SPARK-31148) Switching inner join statement yield different result

2020-03-16 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh resolved SPARK-31148.
-
Resolution: Won't Fix

> Switching inner join statement yield different result
> -
>
> Key: SPARK-31148
> URL: https://issues.apache.org/jira/browse/SPARK-31148
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Tanapol Nearunchorn
>Priority: Major
>
> Given this query
>  
> {code:java}
> select r.id
> from sqoop_wongnai.wongnai__w_ref r
> inner join sqoop_wongnai.wongnai__w_photo p on p.root_referrer_id = r.id
> inner join tests.restaurants_id_export_photos i on i.restaurant_id = 
> r.restaurant_id
> where r.restaurant_id = 360
> {code}
> Spark returns nothing, but when I swap the inner join statements on lines 3 
> and 4 there are results returned.
> I guarantee that we have data that match the join condition above.
> Here's the result of EXPLAIN EXTENDED for the above query:
>  
> {code:java}
> == Parsed Logical Plan ==
> 'Project ['r.id]
> +- 'Filter ('r.restaurant_id = 360)
>+- 'Join Inner, ('i.restaurant_id = 'r.restaurant_id)
>   :- 'Join Inner, ('p.root_referrer_id = 'r.id)
>   :  :- 'SubqueryAlias `r`
>   :  :  +- 'UnresolvedRelation `sqoop_wongnai`.`wongnai__w_ref`
>   :  +- 'SubqueryAlias `p`
>   : +- 'UnresolvedRelation `sqoop_wongnai`.`wongnai__w_photo`
>   +- 'SubqueryAlias `i`
>  +- 'UnresolvedRelation `tests`.`restaurants_id_export_photos`
> == Analyzed Logical Plan ==
> id: bigint
> Project [id#1834834L]
> +- Filter (restaurant_id#1834836L = cast(360 as bigint))
>+- Join Inner, (restaurant_id#1834876L = restaurant_id#1834836L)
>   :- Join Inner, (root_referrer_id#1834858L = id#1834834L)
>   :  :- SubqueryAlias `r`
>   :  :  +- SubqueryAlias `sqoop_wongnai`.`wongnai__w_ref`
>   :  : +- HiveTableRelation `sqoop_wongnai`.`wongnai__w_ref`, 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe, [id#1834834L, 
> restaurant_id#1834836L,]
>   :  +- SubqueryAlias `p`
>   : +- SubqueryAlias `sqoop_wongnai`.`wongnai__w_photo`
>   :+- HiveTableRelation `sqoop_wongnai`.`wongnai__w_photo`, 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe, [root_referrer_id#1834858L, ... 
> 14 more fields]
>   +- SubqueryAlias `i`
>  +- SubqueryAlias `tests`.`restaurants_id_export_photos`
> +- HiveTableRelation `tests`.`restaurants_id_export_photos`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> [restaurant_id#1834876L]== Optimized Logical Plan ==
> Project [id#1834834L]
> +- Join Inner, (restaurant_id#1834876L = restaurant_id#1834836L)
>:- Project [id#1834834L, restaurant_id#1834836L]
>:  +- Join Inner, (root_referrer_id#1834858L = id#1834834L)
>: :- Project [id#1834834L, restaurant_id#1834836L]
>: :  +- Filter ((isnotnull(restaurant_id#1834836L) && 
> (restaurant_id#1834836L = 360)) && isnotnull(id#1834834L))
>: : +- InMemoryRelation [id#1834834L, restaurant_id#1834836L], 
> StorageLevel(disk, memory, deserialized, 1 replicas)
>: :   +- Scan hive sqoop_wongnai.wongnai__w_ref [id#87357L, 
> restaurant_id#87359L], HiveTableRelation `sqoop_wongnai`.`wongnai__w_ref`, 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe, [id#87357L, 
> restaurant_id#87359L]
>: +- Project [root_referrer_id#1834858L]
>:+- Filter (isnotnull(root_referrer_id#1834858L) && 
> bloomfilter#1835088 of [id#1834834L] filtering [root_referrer_id#1834858L])
>:   +- InMemoryRelation [root_referrer_id#1834858L], 
> StorageLevel(disk, memory, deserialized, 1 replicas)
>: +- Scan hive sqoop_wongnai.wongnai__w_photo 
> [root_referrer_id#82906L], HiveTableRelation 
> `sqoop_wongnai`.`wongnai__w_photo`, 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe, [root_referrer_id#82906L]
>+- Filter ((isnotnull(restaurant_id#1834876L) && (restaurant_id#1834876L = 
> 360)) && bloomfilter#1835087 of [restaurant_id#1834836L] filtering 
> [restaurant_id#1834876L])
>   +- HiveTableRelation `tests`.`restaurants_id_export_photos`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [restaurant_id#1834876L]
> == Physical Plan ==
> *(7) Project [id#1834834L]
> +- *(7) SortMergeJoin [restaurant_id#1834836L], [restaurant_id#1834876L], 
> Inner
>:- *(5) Sort [restaurant_id#1834836L ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(restaurant_id#1834836L, 200)
>: +- *(4) Project [id#1834834L, restaurant_id#1834836L]
>:+- *(4) SortMergeJoin [id#1834834L], [root_referrer_id#1834858L], 
> Inner
>:   :- *(2) Sort [id#1834834L ASC NULLS FIRST], false, 0
>:   :  +- Exchange hashpartitioning(id#1834834L, 200)
>:   : +- *(1) 

[jira] [Resolved] (SPARK-30569) Add DSL functions invoking percentile_approx

2020-03-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30569.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 27278
[https://github.com/apache/spark/pull/27278]

> Add DSL functions invoking percentile_approx
> 
>
> Key: SPARK-30569
> URL: https://issues.apache.org/jira/browse/SPARK-30569
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.1.0
>
>
> Since 2.1 (SPARK-16283) Spark has provided the {{percentile_approx}} SQL
> function; however, it is not exposed through the DSL, with the exception of
> `DataFrameStatFunctions.approxQuantile` (which is eager and doesn't support
> grouping).
> It would be useful to have it exposed through {{o.a.s.sql.functions}} /
> {{pyspark.sql.functions}}.
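> Until such wrappers exist, a minimal workaround (a sketch, assuming an active
> SparkSession {{spark}}; the DataFrame and column names are made up) is to reach
> the SQL function through {{expr}}:
> {code:python}
> import pyspark.sql.functions as F
>
> df = spark.createDataFrame([("a", 1.0), ("a", 2.0), ("b", 3.0)], ["group", "value"])
> # percentile_approx is already available in SQL, so expr() can invoke it per group
> df.groupBy("group").agg(
>     F.expr("percentile_approx(value, 0.5)").alias("approx_median")).show()
> {code}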



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30569) Add DSL functions invoking percentile_approx

2020-03-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30569:


Assignee: Maciej Szymkiewicz

> Add DSL functions invoking percentile_approx
> 
>
> Key: SPARK-30569
> URL: https://issues.apache.org/jira/browse/SPARK-30569
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
>
> Since 2.1 (SPARK-16283) Spark has provided the {{percentile_approx}} SQL
> function; however, it is not exposed through the DSL, with the exception of
> `DataFrameStatFunctions.approxQuantile` (which is eager and doesn't support
> grouping).
> It would be useful to have it exposed through {{o.a.s.sql.functions}} /
> {{pyspark.sql.functions}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-23367) Include python document style checking

2020-03-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-23367:
--
  Assignee: (was: Rekha Joshi)

Reverted at SPARK-31155 and https://github.com/apache/spark/pull/27912

> Include python document style checking
> --
>
> Key: SPARK-23367
> URL: https://issues.apache.org/jira/browse/SPARK-23367
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.1
>Reporter: Rekha Joshi
>Priority: Minor
> Fix For: 3.0.0
>
>
> As per the discussion in [PR#20378|https://github.com/apache/spark/pull/20378],
> this JIRA is to include Python doc style checking in Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31155) Remove pydocstyle tests

2020-03-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31155.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27912
[https://github.com/apache/spark/pull/27912]

> Remove pydocstyle tests
> ---
>
> Key: SPARK-31155
> URL: https://issues.apache.org/jira/browse/SPARK-31155
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Documentation
>Affects Versions: 3.0.0
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
> Fix For: 3.0.0
>
>
> pydocstyle checks have not been running on either Jenkins or GitHub.
> We also seem to be in a [bad
> place|https://github.com/apache/spark/pull/27912#issuecomment-599167117] to
> re-enable them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31155) Remove pydocstyle tests

2020-03-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-31155:


Assignee: Nicholas Chammas

> Remove pydocstyle tests
> ---
>
> Key: SPARK-31155
> URL: https://issues.apache.org/jira/browse/SPARK-31155
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Documentation
>Affects Versions: 3.0.0
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
>
> pydocstyle checks have not been running on either Jenkins or GitHub.
> We also seem to be in a [bad
> place|https://github.com/apache/spark/pull/27912#issuecomment-599167117] to
> re-enable them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31148) Switching inner join statement yield different result

2020-03-16 Thread Tanapol Nearunchorn (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060567#comment-17060567
 ] 

Tanapol Nearunchorn commented on SPARK-31148:
-

I use Spark on Amazon EMR 5.29.0 and found that the bloom filter join
optimization is enabled by default.

https://docs.amazonaws.cn/en_us/emr/latest/ReleaseGuide/emr-spark-performance.html

EMR sets spark.sql.bloomFilterJoin.enabled to true by default.
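If that EMR-specific optimization is suspected, a quick check (a sketch, assuming
the EMR flag can be toggled at runtime like a normal SQL conf) is to turn it off
for the session and re-run the query:

{code:python}
# spark.sql.bloomFilterJoin.enabled is an EMR setting, not an upstream Spark conf
spark.conf.set("spark.sql.bloomFilterJoin.enabled", "false")

spark.sql("""
    select r.id
    from sqoop_wongnai.wongnai__w_ref r
    inner join sqoop_wongnai.wongnai__w_photo p on p.root_referrer_id = r.id
    inner join tests.restaurants_id_export_photos i on i.restaurant_id = r.restaurant_id
    where r.restaurant_id = 360
""").show()
{code}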

> Switching inner join statement yield different result
> -
>
> Key: SPARK-31148
> URL: https://issues.apache.org/jira/browse/SPARK-31148
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Tanapol Nearunchorn
>Priority: Major
>
> Given this query
>  
> {code:java}
> select r.id
> from sqoop_wongnai.wongnai__w_ref r
> inner join sqoop_wongnai.wongnai__w_photo p on p.root_referrer_id = r.id
> inner join tests.restaurants_id_export_photos i on i.restaurant_id = 
> r.restaurant_id
> where r.restaurant_id = 360
> {code}
> Spark returns nothing, but when I swap the inner join statements on lines 3
> and 4, results are returned.
> I guarantee that we have data matching the join condition above.
> Here's the result of explain extended for the above query:
>  
> {code:java}
> == Parsed Logical Plan ==
> 'Project ['r.id]
> +- 'Filter ('r.restaurant_id = 360)
>+- 'Join Inner, ('i.restaurant_id = 'r.restaurant_id)
>   :- 'Join Inner, ('p.root_referrer_id = 'r.id)
>   :  :- 'SubqueryAlias `r`
>   :  :  +- 'UnresolvedRelation `sqoop_wongnai`.`wongnai__w_ref`
>   :  +- 'SubqueryAlias `p`
>   : +- 'UnresolvedRelation `sqoop_wongnai`.`wongnai__w_photo`
>   +- 'SubqueryAlias `i`
>  +- 'UnresolvedRelation `tests`.`restaurants_id_export_photos`
> == Analyzed Logical Plan ==
> id: bigint
> Project [id#1834834L]
> +- Filter (restaurant_id#1834836L = cast(360 as bigint))
>+- Join Inner, (restaurant_id#1834876L = restaurant_id#1834836L)
>   :- Join Inner, (root_referrer_id#1834858L = id#1834834L)
>   :  :- SubqueryAlias `r`
>   :  :  +- SubqueryAlias `sqoop_wongnai`.`wongnai__w_ref`
>   :  : +- HiveTableRelation `sqoop_wongnai`.`wongnai__w_ref`, 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe, [id#1834834L, 
> restaurant_id#1834836L,]
>   :  +- SubqueryAlias `p`
>   : +- SubqueryAlias `sqoop_wongnai`.`wongnai__w_photo`
>   :+- HiveTableRelation `sqoop_wongnai`.`wongnai__w_photo`, 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe, [root_referrer_id#1834858L, ... 
> 14 more fields]
>   +- SubqueryAlias `i`
>  +- SubqueryAlias `tests`.`restaurants_id_export_photos`
> +- HiveTableRelation `tests`.`restaurants_id_export_photos`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> [restaurant_id#1834876L]== Optimized Logical Plan ==
> Project [id#1834834L]
> +- Join Inner, (restaurant_id#1834876L = restaurant_id#1834836L)
>:- Project [id#1834834L, restaurant_id#1834836L]
>:  +- Join Inner, (root_referrer_id#1834858L = id#1834834L)
>: :- Project [id#1834834L, restaurant_id#1834836L]
>: :  +- Filter ((isnotnull(restaurant_id#1834836L) && 
> (restaurant_id#1834836L = 360)) && isnotnull(id#1834834L))
>: : +- InMemoryRelation [id#1834834L, restaurant_id#1834836L], 
> StorageLevel(disk, memory, deserialized, 1 replicas)
>: :   +- Scan hive sqoop_wongnai.wongnai__w_ref [id#87357L, 
> restaurant_id#87359L], HiveTableRelation `sqoop_wongnai`.`wongnai__w_ref`, 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe, [id#87357L, 
> restaurant_id#87359L]
>: +- Project [root_referrer_id#1834858L]
>:+- Filter (isnotnull(root_referrer_id#1834858L) && 
> bloomfilter#1835088 of [id#1834834L] filtering [root_referrer_id#1834858L])
>:   +- InMemoryRelation [root_referrer_id#1834858L], 
> StorageLevel(disk, memory, deserialized, 1 replicas)
>: +- Scan hive sqoop_wongnai.wongnai__w_photo 
> [root_referrer_id#82906L], HiveTableRelation 
> `sqoop_wongnai`.`wongnai__w_photo`, 
> org.apache.hadoop.hive.serde2.avro.AvroSerDe, [root_referrer_id#82906L]
>+- Filter ((isnotnull(restaurant_id#1834876L) && (restaurant_id#1834876L = 
> 360)) && bloomfilter#1835087 of [restaurant_id#1834836L] filtering 
> [restaurant_id#1834876L])
>   +- HiveTableRelation `tests`.`restaurants_id_export_photos`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [restaurant_id#1834876L]
> == Physical Plan ==
> *(7) Project [id#1834834L]
> +- *(7) SortMergeJoin [restaurant_id#1834836L], [restaurant_id#1834876L], 
> Inner
>:- *(5) Sort [restaurant_id#1834836L ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(restaurant_id#1834836L, 200)
>: +- *(4) Project [id#1834834L, 

[jira] [Updated] (SPARK-31169) Random Forest in SparkML 2.3.3 vs 2.4.x

2020-03-16 Thread Nguyen Nhanduc (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nguyen Nhanduc updated SPARK-31169:
---
Labels: MLLib, RandomForest SparkML  (was: )

> Random Forest in SparkML 2.3.3 vs 2.4.x
> ---
>
> Key: SPARK-31169
> URL: https://issues.apache.org/jira/browse/SPARK-31169
> Project: Spark
>  Issue Type: Question
>  Components: ML
>Affects Versions: 2.3.3, 2.4.0, 2.4.3
>Reporter: Nguyen Nhanduc
>Priority: Major
>  Labels: MLLib,, RandomForest, SparkML
> Attachments: spark233.jpg, spark240.jpg, spark243.jpg
>
>
> Hi all,
> When I trained a model with the Random Forest algorithm, I got different
> results across Spark versions, using the same input, label ratio, and
> hyperparameters for every training run. Detailed training results are in the
> attached files. The training results with Spark 2.3.3 are much better, so I
> want to ask whether there have been any changes to random forest (or other
> algorithms) in MLlib?
> Many thanks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31169) Random Forest in SparkML 2.3.3 vs 2.4.x

2020-03-16 Thread Nguyen Nhanduc (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nguyen Nhanduc updated SPARK-31169:
---
Attachment: spark243.jpg

> Random Forest in SparkML 2.3.3 vs 2.4.x
> ---
>
> Key: SPARK-31169
> URL: https://issues.apache.org/jira/browse/SPARK-31169
> Project: Spark
>  Issue Type: Question
>  Components: ML
>Affects Versions: 2.3.3, 2.4.0, 2.4.3
>Reporter: Nguyen Nhanduc
>Priority: Major
> Attachments: spark233.jpg, spark240.jpg, spark243.jpg
>
>
> Hi all,
> When I trained a model with the Random Forest algorithm, I got different
> results across Spark versions, using the same input, label ratio, and
> hyperparameters for every training run. Detailed training results are in the
> attached files. The training results with Spark 2.3.3 are much better, so I
> want to ask whether there have been any changes to random forest (or other
> algorithms) in MLlib?
> Many thanks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31169) Random Forest in SparkML 2.3.3 vs 2.4.x

2020-03-16 Thread Nguyen Nhanduc (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nguyen Nhanduc updated SPARK-31169:
---
Attachment: (was: spark244.jpg)

> Random Forest in SparkML 2.3.3 vs 2.4.x
> ---
>
> Key: SPARK-31169
> URL: https://issues.apache.org/jira/browse/SPARK-31169
> Project: Spark
>  Issue Type: Question
>  Components: ML
>Affects Versions: 2.3.3, 2.4.0, 2.4.3
>Reporter: Nguyen Nhanduc
>Priority: Major
> Attachments: spark233.jpg, spark240.jpg, spark243.jpg
>
>
> Hi all,
> When I trained a model with the Random Forest algorithm, I got different
> results across Spark versions, using the same input, label ratio, and
> hyperparameters for every training run. Detailed training results are in the
> attached files. The training results with Spark 2.3.3 are much better, so I
> want to ask whether there have been any changes to random forest (or other
> algorithms) in MLlib?
> Many thanks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31169) Random Forest in SparkML 2.3.3 vs 2.4.x

2020-03-16 Thread Nguyen Nhanduc (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nguyen Nhanduc updated SPARK-31169:
---
Description: 
Hi all,

When I trained a model with the Random Forest algorithm, I got different
results across Spark versions, using the same input, label ratio, and
hyperparameters for every training run. Detailed training results are in the
attached files. The training results with Spark 2.3.3 are much better, so I
want to ask whether there have been any changes to random forest (or other
algorithms) in MLlib?

Many thanks.

  was:
Hi all,

When I trained a model with the Random Forest algorithm, I got different
results across Spark versions, using the same input, label ratio, and
hyperparameters for every training run. Detailed training results are in the
attached file.


> Random Forest in SparkML 2.3.3 vs 2.4.x
> ---
>
> Key: SPARK-31169
> URL: https://issues.apache.org/jira/browse/SPARK-31169
> Project: Spark
>  Issue Type: Question
>  Components: ML
>Affects Versions: 2.3.3, 2.4.0, 2.4.3
>Reporter: Nguyen Nhanduc
>Priority: Major
> Attachments: spark233.jpg, spark240.jpg, spark244.jpg
>
>
> Hi all,
> When I trained a model with the Random Forest algorithm, I got different
> results across Spark versions, using the same input, label ratio, and
> hyperparameters for every training run. Detailed training results are in the
> attached files. The training results with Spark 2.3.3 are much better, so I
> want to ask whether there have been any changes to random forest (or other
> algorithms) in MLlib?
> Many thanks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31169) Random Forest in SparkML 2.3.3 vs 2.4.x

2020-03-16 Thread Nguyen Nhanduc (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nguyen Nhanduc updated SPARK-31169:
---
Attachment: spark233.jpg
spark240.jpg
spark244.jpg

> Random Forest in SparkML 2.3.3 vs 2.4.x
> ---
>
> Key: SPARK-31169
> URL: https://issues.apache.org/jira/browse/SPARK-31169
> Project: Spark
>  Issue Type: Question
>  Components: ML
>Affects Versions: 2.3.3, 2.4.0, 2.4.3
>Reporter: Nguyen Nhanduc
>Priority: Major
> Attachments: spark233.jpg, spark240.jpg, spark244.jpg
>
>
> Hi all,
> When I trained a model with the Random Forest algorithm, I got different
> results across Spark versions, using the same input, label ratio, and
> hyperparameters for every training run. Detailed training results are in the
> attached file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31169) Random Forest in SparkML 2.3.3 vs 2.4.x

2020-03-16 Thread Nguyen Nhanduc (Jira)
Nguyen Nhanduc created SPARK-31169:
--

 Summary: Random Forest in SparkML 2.3.3 vs 2.4.x
 Key: SPARK-31169
 URL: https://issues.apache.org/jira/browse/SPARK-31169
 Project: Spark
  Issue Type: Question
  Components: ML
Affects Versions: 2.4.3, 2.4.0, 2.3.3
Reporter: Nguyen Nhanduc


Hi all,

When I trained a model with the Random Forest algorithm, I got different
results across Spark versions, using the same input, label ratio, and
hyperparameters for every training run. Detailed training results are in the
attached file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31162) Provide Configuration Parameter to select/enforce the Hive Hash for Bucketing

2020-03-16 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-31162:
-
Affects Version/s: (was: 2.4.5)
   3.1.0

> Provide Configuration Parameter to select/enforce the Hive Hash for Bucketing
> -
>
> Key: SPARK-31162
> URL: https://issues.apache.org/jira/browse/SPARK-31162
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.1.0
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> I couldn't find a configuration parameter to choose Hive hashing instead of
> Spark's default Murmur3 hash when performing a Spark bucketBy operation.
> Following the discussion with [~maropu] and [~hyukjin.kwon], it was suggested
> to open a new JIRA.
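> For reference, the operation in question is the DataFrameWriter bucketing API,
> which today always buckets with Spark's Murmur3 hash (a sketch; the DataFrame,
> table, and column names are made up):
> {code:python}
> # Bucket assignment uses Spark's Murmur3 hash, so the resulting layout is not
> # compatible with a table bucketed by Hive's hash function.
> (df.write
>    .bucketBy(8, "user_id")
>    .sortBy("user_id")
>    .saveAsTable("bucketed_events"))
> {code}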



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31162) Provide Configuration Parameter to select/enforce the Hive Hash for Bucketing

2020-03-16 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060540#comment-17060540
 ] 

Takeshi Yamamuro commented on SPARK-31162:
--

I've checked the original PR ([https://github.com/apache/spark/pull/10498]) that
implemented bucketBy, and I think its comment says nothing about compatibility
with Hive's bucketing schema. But the comment is a bit confusing, so I'll fix it
later.

> Provide Configuration Parameter to select/enforce the Hive Hash for Bucketing
> -
>
> Key: SPARK-31162
> URL: https://issues.apache.org/jira/browse/SPARK-31162
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 2.4.5
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> I couldn't find a configuration parameter to choose Hive hashing instead of
> Spark's default Murmur3 hash when performing a Spark bucketBy operation.
> Following the discussion with [~maropu] and [~hyukjin.kwon], it was suggested
> to open a new JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29929) Allow V2 Datasources to require a data distribution

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29929:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Allow V2 Datasources to require a data distribution
> ---
>
> Key: SPARK-29929
> URL: https://issues.apache.org/jira/browse/SPARK-29929
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Andrew K Long
>Priority: Major
>
> Currently users are unable to specify that their v2 Datasource requires a 
> particular Distribution before inserting data.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26185) add weightCol in python MulticlassClassificationEvaluator

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26185:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> add weightCol in python MulticlassClassificationEvaluator
> -
>
> Key: SPARK-26185
> URL: https://issues.apache.org/jira/browse/SPARK-26185
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, PySpark
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
>
> https://issues.apache.org/jira/browse/SPARK-24101 added weightCol in 
> MulticlassClassificationEvaluator.scala. This Jira will add weightCol in 
> python version of MulticlassClassificationEvaluator.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28629) Capture the missing rules in HiveSessionStateBuilder

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28629:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Capture the missing rules in HiveSessionStateBuilder
> 
>
> Key: SPARK-28629
> URL: https://issues.apache.org/jira/browse/SPARK-28629
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Xiao Li
>Priority: Major
>
> A common mistake for new contributors is forgetting to add the corresponding
> rules to extendedResolutionRules, postHocResolutionRules, and
> extendedCheckRules in HiveSessionStateBuilder. We need to either avoid missing
> the rules or capture them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30444) The same job will be computated for many times when using Dataset.show()

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30444:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> The same job will be computated for many times when using Dataset.show()
> 
>
> Key: SPARK-30444
> URL: https://issues.apache.org/jira/browse/SPARK-30444
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: CacheCheck
>Priority: Major
>
> When I run the example sql.SparkSQLExample, df.show() at line 60 triggers an
> action. On the WebUI, I noticed that this call creates 5 jobs, all of which
> have the same lineage graph with the same RDDs and the same call stacks. That
> means Spark recomputes the job 5 times. But strangely, sqlDF.show() at line
> 123 only creates 1 job.
> I don't know what happens at show() on line 60.
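> One way to confirm whether the extra jobs are pure recomputation (a sketch,
> assuming the DataFrame from the example is available as {{df}}) is to cache it
> before calling show():
> {code:python}
> # If the 5 jobs share a lineage, caching should remove the repeated work:
> # the first action materializes the data and later ones read it from memory.
> df.cache()
> df.show()
> {code}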



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23609) Test EnsureRequirements's test cases to eliminate ShuffleExchange while is not expected

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23609:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Test EnsureRequirements's test cases to eliminate ShuffleExchange while is 
> not expected
> ---
>
> Key: SPARK-23609
> URL: https://issues.apache.org/jira/browse/SPARK-23609
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: caoxuewen
>Priority: Minor
>
> Currently, the test cases in EnsureRequirements that are meant to eliminate a
> ShuffleExchange do not match the stated purpose of the tests. These test cases
> are as follows:
> 1. test("EnsureRequirements eliminates Exchange if child has same
> partitioning")
>    The check condition is that there is no extra ShuffleExchange in the
> physical plan (the exchange count == 2), which is not accurate here.
> 2. test("EnsureRequirements does not eliminate Exchange with different
> partitioning")
>    The purpose of this test is to not eliminate the ShuffleExchange, but its
> test code is the same as test("EnsureRequirements eliminates Exchange if child
> has same partitioning")



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30469) Partition columns should not be involved when calculating sizeInBytes of Project logical plan

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30469:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Partition columns should not be involved when calculating sizeInBytes of 
> Project logical plan
> -
>
> Key: SPARK-30469
> URL: https://issues.apache.org/jira/browse/SPARK-30469
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Hu Fuwang
>Priority: Major
>
> When getting the statistics of a Project logical plan, if CBO is not enabled,
> Spark calls SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode to calculate the
> size in bytes, which computes the ratio of the row size of the project plan to
> that of its child plan.
> The row size is computed based on the output attributes (columns). Currently,
> SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode involves the partition columns
> of the Hive table as well, which is not reasonable, because partition columns
> do not actually contribute to sizeInBytes.
> This may make the sizeInBytes inaccurate.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29927) Parse timestamps in microsecond precision by `to_timestamp`, `to_unix_timestamp`, `unix_timestamp`

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29927:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Parse timestamps in microsecond precision by `to_timestamp`, 
> `to_unix_timestamp`, `unix_timestamp`
> --
>
> Key: SPARK-29927
> URL: https://issues.apache.org/jira/browse/SPARK-29927
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently, the `to_timestamp`, `to_unix_timestamp`, and `unix_timestamp`
> functions use SimpleDateFormat to parse strings to timestamps.
> SimpleDateFormat can parse only up to millisecond precision when a user
> specifies `SSS` in a pattern. This ticket aims to support parsing up to
> microsecond precision.
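> With the proposed behaviour, a pattern carrying more than three sub-second
> digits would be honoured (a sketch, assuming an active SparkSession {{spark}};
> the data and pattern are illustrative):
> {code:python}
> import pyspark.sql.functions as F
>
> df = spark.createDataFrame([("2019-11-15 12:34:56.123456",)], ["ts_str"])
> # Under the proposal the 6-digit fraction is preserved instead of being
> # limited to milliseconds by SimpleDateFormat.
> df.select(F.to_timestamp("ts_str", "yyyy-MM-dd HH:mm:ss.SSSSSS")
>            .alias("ts")).show(truncate=False)
> {code}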



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30273) Add melt() function

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30273:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Add melt() function
> ---
>
> Key: SPARK-30273
> URL: https://issues.apache.org/jira/browse/SPARK-30273
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Shelby Vanhooser
>Priority: Major
>  Labels: PySpark, feature
>
> - Adds melt() functionality based on [this
> implementation|https://stackoverflow.com/a/41673644/12474509]
>  
> [https://github.com/apache/spark/pull/26912/files]
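> A DataFrame-level sketch of the kind of melt() being proposed (adapted from the
> linked approach, not the eventual API; assumes an active SparkSession {{spark}}
> and made-up column names):
> {code:python}
> from pyspark.sql import functions as F
>
> def melt(df, id_vars, value_vars, var_name="variable", value_name="value"):
>     # Pair each value column with its name, collect the pairs into an array,
>     # then explode the array so each (id, column) combination becomes a row.
>     pairs = F.array(*[
>         F.struct(F.lit(c).alias(var_name), F.col(c).alias(value_name))
>         for c in value_vars])
>     return (df.withColumn("_pair", F.explode(pairs))
>               .select(*id_vars, "_pair." + var_name, "_pair." + value_name))
>
> wide = spark.createDataFrame([(1, 2.0, 3.0)], ["id", "a", "b"])
> melt(wide, ["id"], ["a", "b"]).show()
> {code}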



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27093) Honor ParseMode in AvroFileFormat

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27093:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Honor ParseMode in AvroFileFormat
> -
>
> Key: SPARK-27093
> URL: https://issues.apache.org/jira/browse/SPARK-27093
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 3.1.0
>Reporter: Tim Cerexhe
>Priority: Major
>
> The Avro reader is missing the ability to handle malformed or truncated files
> the way the JSON reader does. Currently it throws an exception when it
> encounters any bad or truncated record in an Avro file, causing the entire
> Spark job to fail because of a single dodgy file.
> Ideally the AvroFileFormat would accept a Permissive or DropMalformed
> ParseMode like Spark's JSON format. This would enable the Avro reader to drop
> bad records and continue processing the good records rather than abort the
> entire job.
> Obviously the default could remain FailFastMode, which is the current
> effective behavior, so this wouldn't break any existing users.
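> For comparison, the JSON source already honours a parse mode; the Avro option
> at the end is the proposed behaviour, not something current Spark supports (a
> sketch; the paths are made up):
> {code:python}
> # Existing behaviour: the JSON reader can drop malformed records.
> good_json = (spark.read
>              .option("mode", "DROPMALFORMED")
>              .json("/data/events.json"))
>
> # Proposed (hypothetical) equivalent for Avro -- today the reader simply
> # throws on a corrupt or truncated record:
> # avro_df = (spark.read.format("avro")
> #            .option("mode", "DROPMALFORMED")
> #            .load("/data/events.avro"))
> {code}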



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29439) DDL commands should not use DataSourceV2Relation

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29439:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> DDL commands should not use DataSourceV2Relation
> 
>
> Key: SPARK-29439
> URL: https://issues.apache.org/jira/browse/SPARK-29439
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26987) Add a new method to RowFactory: Row with schema

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26987:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Add a new method to RowFactory: Row with schema
> ---
>
> Key: SPARK-26987
> URL: https://issues.apache.org/jira/browse/SPARK-26987
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> In the Java API, RowFactory is supposed to be the official way to create a
> Row. There is only a "create()" method, which doesn't attach a schema to the
> Row, so the Java API only guarantees using a Row via "access by index", even
> though "access by column name" works in some queries depending on the query
> plan.
>  
> We could add a "createWithSchema()" method to RowFactory so that end users can
> consistently access Row fields by column name.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30046) linalg parity between scala and py sides

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30046:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> linalg parity between scala and py sides 
> -
>
> Key: SPARK-30046
> URL: https://issues.apache.org/jira/browse/SPARK-30046
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Minor
>
> I found that ml.linalg and pyspark.ml.linalg have different methods. For
> example, Scala's ml.linalg has Matrices.zeros(3, 3) to create an all-zero
> matrix, while pyspark.ml.linalg does not have this method.
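> For illustration, the Python side currently has to spell the zero matrix out
> by hand (a sketch of the gap, not a proposed API):
> {code:python}
> from pyspark.ml.linalg import DenseMatrix
>
> # Scala: Matrices.zeros(3, 3) builds an all-zero matrix directly.
> # PySpark has no zeros() helper, so the values must be materialized manually:
> zero_3x3 = DenseMatrix(3, 3, [0.0] * 9)
> print(zero_3x3.toArray())
> {code}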



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30953) InsertAdaptiveSparkPlan should also skip v2 command

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30953:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> InsertAdaptiveSparkPlan should also skip v2 command
> ---
>
> Key: SPARK-30953
> URL: https://issues.apache.org/jira/browse/SPARK-30953
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: wuyi
>Priority: Major
>
> InsertAdaptiveSparkPlan should also skip v2 commands as we did for v1 commands.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28419) A patch for SparkThriftServer support multi-tenant authentication

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28419:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> A patch for SparkThriftServer support multi-tenant authentication
> -
>
> Key: SPARK-28419
> URL: https://issues.apache.org/jira/browse/SPARK-28419
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24611) Clean up OutputCommitCoordinator

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24611:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Clean up OutputCommitCoordinator
> 
>
> Key: SPARK-24611
> URL: https://issues.apache.org/jira/browse/SPARK-24611
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Marcelo Masiero Vanzin
>Priority: Major
>
> This is a follow up to SPARK-24589, to address some issues brought up during 
> review of the change:
> - the DAGScheduler registers all stages with the coordinator, when at first 
> view only result stages need to. That would save memory in the driver.
> - the coordinator can track task IDs instead of the internal "TaskIdentifier" 
> type it uses; that would also save some memory, and also be more accurate.
> - {{TaskCommitDenied}} currently has a "job ID" when it's really a stage ID, 
> and it contains the task attempt number, when it should probably have the 
> task ID instead (like above).
> The latter is an API breakage (in a class tagged as developer API, but 
> still), and also affects data written to event logs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27936) Support local dependency uploading from --py-files

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27936:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Support local dependency uploading from --py-files
> --
>
> Key: SPARK-27936
> URL: https://issues.apache.org/jira/browse/SPARK-27936
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Erik Erlandson
>Priority: Major
>
> Support python dependency uploads, as in SPARK-23153



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29952) Pandas UDFs do not support vectors as input

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29952:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Pandas UDFs do not support vectors as input
> ---
>
> Key: SPARK-29952
> URL: https://issues.apache.org/jira/browse/SPARK-29952
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.1.0
>Reporter: koba
>Priority: Minor
>
> Currently, pandas UDFs do not support columns of vectors as input, only
> columns of arrays. This means that feature columns containing Dense or Sparse
> vectors generated by, for example, CountVectorizer are not supported by pandas
> UDFs out of the box. One needs to convert the vectors into arrays first. This
> was not documented anywhere and I had to find it out by trial and error.
> Below is an example.
>  
> {code:java}
> from pyspark.sql.functions import udf, pandas_udf
> import pyspark.sql.functions as F
> from pyspark.ml.linalg import DenseVector, Vectors, VectorUDT
> from pyspark.sql.types import *
> import numpy as np
> columns = ['features','id']
> vals = [
>  (DenseVector([1, 2, 1, 3]),1),
>  (DenseVector([2, 2, 1, 3]),2)
> ]
> sdf = spark.createDataFrame(vals,columns)
> sdf.show()
> +-+---+
> | features| id|
> +-+---+
> |[1.0,2.0,1.0,3.0]|  1|
> |[2.0,2.0,1.0,3.0]|  2|
> +-+---+
> {code}
> {code:java}
> @udf(returnType=ArrayType(FloatType()))
> def vector_to_array(v):
> # convert column of vectors into column of arrays
> a = v.values.tolist()
> return a
> sdf = sdf.withColumn('features_array',vector_to_array('features'))
> sdf.show()
> sdf.dtypes
> +-+---++
> | features| id|  features_array|
> +-+---++
> |[1.0,2.0,1.0,3.0]|  1|[1.0, 2.0, 1.0, 3.0]|
> |[2.0,2.0,1.0,3.0]|  2|[2.0, 2.0, 1.0, 3.0]|
> +-+---++
> [('features', 'vector'), ('id', 'bigint'), ('features_array', 'array<float>')]
> {code}
> {code:java}
> import pandas as pd
> @pandas_udf(LongType())
> def _pandas_udf(v):
> res = []
> for i in v:
> res.append(i.mean())
> return pd.Series(res)
> sdf.select(_pandas_udf('features_array')).show()
> +---+
> |_pandas_udf(features_array)|
> +---+
> |  1|
> |  2|
> +---+
> {code}
> But If I use the vector column I get the following error.
> {code:java}
> sdf.select(_pandas_udf('features')).show()
> ---
> Py4JJavaError Traceback (most recent call last)
>  in 
>  13 
>  14 
> ---> 15 sdf.select(_pandas_udf('features')).show()
> ~/.pyenv/versions/anaconda3-5.3.1/lib/python3.7/site-packages/pyspark/sql/dataframe.py
>  in show(self, n, truncate, vertical)
> 376 """
> 377 if isinstance(truncate, bool) and truncate:
> --> 378 print(self._jdf.showString(n, 20, vertical))
> 379 else:
> 380 print(self._jdf.showString(n, int(truncate), vertical))
> ~/.pyenv/versions/3.4.4/lib/python3.4/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
>1255 answer = self.gateway_client.send_command(command)
>1256 return_value = get_return_value(
> -> 1257 answer, self.gateway_client, self.target_id, self.name)
>1258 
>1259 for temp_arg in temp_args:
> ~/.pyenv/versions/anaconda3-5.3.1/lib/python3.7/site-packages/pyspark/sql/utils.py
>  in deco(*a, **kw)
>  61 def deco(*a, **kw):
>  62 try:
> ---> 63 return f(*a, **kw)
>  64 except py4j.protocol.Py4JJavaError as e:
>  65 s = e.java_exception.toString()
> ~/.pyenv/versions/3.4.4/lib/python3.4/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py
>  in get_return_value(answer, gateway_client, target_id, name)
> 326 raise Py4JJavaError(
> 327 "An error occurred while calling {0}{1}{2}.\n".
> --> 328 format(target_id, ".", name), value)
> 329 else:
> 330 raise Py4JError(
> Py4JJavaError: An error occurred while calling o2635.showString.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 156.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 156.0 (TID 606, localhost, executor driver): 
> java.lang.UnsupportedOperationException: Unsupported data type: 
> struct<type:tinyint,size:int,indices:array<int>,values:array<double>>
>   at 
> 

[jira] [Updated] (SPARK-30548) Cached blockInfo in BlockMatrix.scala is never released

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30548:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Cached blockInfo in BlockMatrix.scala is never released
> ---
>
> Key: SPARK-30548
> URL: https://issues.apache.org/jira/browse/SPARK-30548
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 3.1.0
>Reporter: CacheCheck
>Priority: Minor
>
> The private variable _blockInfo_ in mllib.linalg.distributed.BlockMatrix is
> never unpersisted once a BlockMatrix instance has been created.
> {code:scala}
>  private lazy val blockInfo = blocks.mapValues(block => (block.numRows, 
> block.numCols)).cache()
> {code}
> I think we should add an API to unpersist this variable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30567) setDelegateCatalog should be called if catalog has implemented CatalogExtension

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30567:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> setDelegateCatalog should be called if catalog has implemented 
> CatalogExtension
> ---
>
> Key: SPARK-30567
> URL: https://issues.apache.org/jira/browse/SPARK-30567
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: yu jiantao
>Priority: Major
>
> CatalogManager.catalog calls Catalogs.load to load a catalog if it is not
> 'spark_catalog'. If that catalog implements CatalogExtension,
> setDelegateCatalog is not called when the catalog is loaded, unlike what we do
> for v2SessionCatalog. This causes confusion for customized session catalogs,
> such as Iceberg's SparkSessionCatalog.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30527) Add IsNotNull filter when use In, InSet and InSubQuery

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30527:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Add IsNotNull filter when use In, InSet and InSubQuery
> --
>
> Key: SPARK-30527
> URL: https://issues.apache.org/jira/browse/SPARK-30527
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25732) Allow specifying a keytab/principal for proxy user for token renewal

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25732:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Allow specifying a keytab/principal for proxy user for token renewal 
> -
>
> Key: SPARK-25732
> URL: https://issues.apache.org/jira/browse/SPARK-25732
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 3.1.0
>Reporter: Marco Gaido
>Priority: Major
>
> As of now, applications submitted with a proxy user fail after 2 weeks due to
> the lack of token renewal. In order to enable renewal, we need the
> keytab/principal of the impersonated user to be specified, so that they are
> available for the token renewal.
> This JIRA proposes to add two parameters, {{--proxy-user-principal}} and
> {{--proxy-user-keytab}}, with the latter allowing a keytab to be located on a
> distributed FS as well, so that applications can be submitted by servers (e.g.
> Livy, Zeppelin) without needing all users' principals to be on that machine.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27655) Persistent the table statistics to metadata after fall back to hdfs

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27655:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Persistent the table statistics to metadata after fall back to hdfs
> ---
>
> Key: SPARK-27655
> URL: https://issues.apache.org/jira/browse/SPARK-27655
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: disableFallBackToHdfs.png, enableFallBackToHdfs.png
>
>
> This is a real case. We need to perform many joins across different tables,
> but the statistics of some tables are incorrect. The job needs 43 min; it only
> needs 3.7 min after setting
> {{spark.sql.statistics.persistentStatsAfterFallBack=true}}.
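> For context, the flag above is toggled like any other SQL conf (a sketch;
> {{spark.sql.statistics.persistentStatsAfterFallBack}} is the setting named in
> this issue and is not part of stock Spark, and the table names are made up):
> {code:python}
> # fallBackToHdfs is an existing conf; the second setting is the one this issue
> # proposes, so that the HDFS-derived size is written back and reused later.
> spark.conf.set("spark.sql.statistics.fallBackToHdfs", "true")
> spark.conf.set("spark.sql.statistics.persistentStatsAfterFallBack", "true")
>
> spark.sql("select * from big_tbl a join other_tbl b on a.k = b.k").explain(True)
> {code}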



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27602) SparkSQL CBO can't get true size of partition table after partition pruning

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27602:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> SparkSQL CBO can't get true size of partition table after partition pruning
> ---
>
> Key: SPARK-27602
> URL: https://issues.apache.org/jira/browse/SPARK-27602
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
> Attachments: image-2019-05-05-11-46-41-240.png
>
>
> While extracting the cost of a SQL query for my own cost framework, I found
> that CBO can't get the true size of a partitioned table when partition pruning
> applies: we only need the corresponding partitions' size, but it just uses the
> whole table's statistics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29250) Upgrade to Hadoop 3.2.1

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29250:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Upgrade to Hadoop 3.2.1
> ---
>
> Key: SPARK-29250
> URL: https://issues.apache.org/jira/browse/SPARK-29250
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30059) Stop AsyncEventQueue when interrupted in dispatch

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30059:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Stop AsyncEventQueue when interrupted in dispatch
> -
>
> Key: SPARK-30059
> URL: https://issues.apache.org/jira/browse/SPARK-30059
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Wang Shuo
>Priority: Major
>
> SPARK-24309 stops the AsyncEventQueue when it is interrupted in postToAll.
> However, if it's interrupted in AsyncEventQueue#dispatch, the SparkContext
> would be stopped.
> For some queues (e.g. the event log queue), we could stop the AsyncEventQueue
> when interrupted in dispatch, rather than stopping the SparkContext.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23719) Use correct hostname in non-host networking mode in hadoop 3 docker support

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23719:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Use correct hostname in non-host networking mode in hadoop 3 docker support
> ---
>
> Key: SPARK-23719
> URL: https://issues.apache.org/jira/browse/SPARK-23719
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 3.1.0
>Reporter: Mridul Muralidharan
>Priority: Major
>
> The hostname (the node-id's hostname field) specified by the RM in allocated
> containers is the NM_HOST, not the hostname that will be used by the container
> when running with the Docker container executor: the actual container hostname
> is generated at runtime.
> Due to this, Spark executors are unable to launch in non-host networking mode
> when leveraging Docker support in Hadoop 3, because of bind failures: the
> hostname they try to bind to is that of the host machine, not the container.
> We can leverage YARN-7935 to fetch the container's hostname (when available)
> and otherwise fall back to the existing mechanism when running executors.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27640) Avoid duplicate lookups for datasource through provider

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27640:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Avoid duplicate lookups for datasource through provider
> ---
>
> Key: SPARK-27640
> URL: https://issues.apache.org/jira/browse/SPARK-27640
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Minor
>
> Spark SQL uses code like the following to look up a datasource:
> {code:java}
> DataSource.lookupDataSource(source, sparkSession.sqlContext.conf){code}
> But there are some duplicate calls.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27523) Resolve scheme-less event log directory relative to default filesystem

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27523:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Resolve scheme-less event log directory relative to default filesystem
> --
>
> Key: SPARK-27523
> URL: https://issues.apache.org/jira/browse/SPARK-27523
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Mikayla Konst
>Priority: Major
>
> Currently, if I specify a default filesystem:
> fs.defaultFS=hdfs://host
> And a scheme-less event log directory:
> spark.eventLog.dir=/some/dir
> Then the event log directory is always resolved with respect to the local 
> filesystem.
> I propose that it should be resolved with respect to the default filesystem.
> Targeting Spark 3.0.0 since this is technically a breaking change.
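> For reference, this is how the two settings from this report might be supplied
> when building a session (a sketch; the host and paths are illustrative):
> {code:python}
> from pyspark.sql import SparkSession
>
> spark = (SparkSession.builder
>          .config("spark.hadoop.fs.defaultFS", "hdfs://host")
>          .config("spark.eventLog.enabled", "true")
>          # scheme-less: currently resolved against the local filesystem
>          .config("spark.eventLog.dir", "/some/dir")
>          .getOrCreate())
> {code}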



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30846) Add AccumulatorV2 API in JavaSparkContext

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30846:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Add AccumulatorV2 API in JavaSparkContext
> -
>
> Key: SPARK-30846
> URL: https://issues.apache.org/jira/browse/SPARK-30846
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: wuyi
>Priority: Major
>
> AccumulatorV2 API is missing in JavaSparkContext.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29586) spark jdbc method param lowerBound and upperBound DataType wrong

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29586:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> spark jdbc method param lowerBound and upperBound DataType wrong
> 
>
> Key: SPARK-29586
> URL: https://issues.apache.org/jira/browse/SPARK-29586
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: daile
>Priority: Major
>
>  
> {code:java}
> private def toBoundValueInWhereClause(
> value: Long,
> columnType: DataType,
> timeZoneId: String): String = {
>   def dateTimeToString(): String = {
> val dateTimeStr = columnType match {
>   case DateType => DateFormatter().format(value.toInt)
>   case TimestampType =>
> val timestampFormatter = TimestampFormatter.getFractionFormatter(
>   DateTimeUtils.getZoneId(timeZoneId))
> DateTimeUtils.timestampToString(timestampFormatter, value)
> }
> s"'$dateTimeStr'"
>   }
>   columnType match {
> case _: NumericType => value.toString
> case DateType | TimestampType => dateTimeToString()
>   }
> }{code}
> partitionColumn supports NumericType, DateType, and TimestampType, but the jdbc 
> method only accepts Long for the lower and upper bounds.
>  
> {code:java}
> test("jdbc Suite2") {
>   val df = spark
>     .read
>     .option("partitionColumn", "B")
>     .option("lowerBound", "2017-01-01 10:00:00")
>     .option("upperBound", "2019-01-01 10:00:00")
>     .option("numPartitions", 5)
>     .jdbc(urlWithUserAndPass, "TEST.TIMETYPES", new Properties())
>   df.printSchema()
>   df.show()
> }
> {code}
> This variant works as expected.
>  
> {code:java}
> test("jdbc Suite") { val df = spark.read.jdbc(urlWithUserAndPass, 
> "TEST.TIMETYPES", "B", 1571899768024L, 1571899768024L, 5, new Properties()) 
> df.printSchema() df.show() }
> {code}
>  
> {code:java}
> java.lang.IllegalArgumentException: Cannot parse the bound value 
> 1571899768024 as datejava.lang.IllegalArgumentException: Cannot parse the 
> bound value 1571899768024 as date at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.$anonfun$toInternalBoundValue$1(JDBCRelation.scala:184)
>  at scala.Option.getOrElse(Option.scala:189) at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.parse$1(JDBCRelation.scala:183)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.toInternalBoundValue(JDBCRelation.scala:189)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.columnPartition(JDBCRelation.scala:88)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:36)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:339)
>  at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:240) 
> at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:229)
>  at scala.Option.getOrElse(Option.scala:189) at 
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:229) at 
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:179) at 
> org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:255) at 
> org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:297) at 
> org.apache.spark.sql.jdbc.JDBCSuite.$anonfun$new$186(JDBCSuite.scala:1664) at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
> org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) at 
> org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) at 
> org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at 
> org.scalatest.Transformer.apply(Transformer.scala:22) at 
> org.scalatest.Transformer.apply(Transformer.scala:20) at 
> org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) at 
> org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:149) at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184) at 
> org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196) at 
> org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) at 
> org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196) at 
> org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178) at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:56)
>  at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221) at 
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214) at 
> org.apache.spark.sql.jdbc.JDBCSuite.org$scalatest$BeforeAndAfter$$super$runTest(JDBCSuite.scala:43)
>  at org.scalatest.BeforeAndAfter.runTest(BeforeAndAfter.scala:203) at 
> org.scalatest.BeforeAndAfter.runTest$(BeforeAndAfter.scala:192) at 
> 

[jira] [Updated] (SPARK-29785) Optimize opening a new session of Spark Thrift Server

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29785:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Optimize opening a new session of Spark Thrift Server
> -
>
> Key: SPARK-29785
> URL: https://issues.apache.org/jira/browse/SPARK-29785
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Deegue
>Priority: Major
>
> When we open a new session of the Spark Thrift Server, `use default` is called 
> and a free executor is needed to execute that SQL. This behavior adds ~5 
> seconds to opening a new session, which should only cost ~100ms.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24780) DataFrame.column_name should resolve to a distinct ref

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24780:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> DataFrame.column_name should resolve to a distinct ref
> --
>
> Key: SPARK-24780
> URL: https://issues.apache.org/jira/browse/SPARK-24780
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Priority: Minor
>
> If we join a DataFrame with another DataFrame that shares a column name used in 
> the join conditions (e.g. shared lineage on one of the conditions), then even 
> though the join condition may be written with the full name, the columns 
> returned don't carry the DataFrame alias and as such will create a cross-join.
> For example, this currently works even if both posts_by_sampled_authors and 
> mailing_list_posts_in_reply_to contain both in_reply_to and message_id fields.
>  
> {code:java}
> posts_with_replies = posts_by_sampled_authors.join(
>  mailing_list_posts_in_reply_to,
>  [F.col("mailing_list_posts_in_reply_to.in_reply_to") == 
> F.col("posts_by_sampled_authors.message_id")],
>  "inner"){code}
>  
> But a similarly written expression:
> {code:java}
> posts_with_replies = posts_by_sampled_authors.join(
>  mailing_list_posts_in_reply_to,
>  [mailing_list_posts_in_reply_to.in_reply_to == 
> posts_by_sampled_authors.message_id],
>  "inner"){code}
> will fail.
>  
> I'm not super sure what's going on inside the resolution that's causing it 
> to get confused.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28867) InMemoryStore checkpoint to speed up replay log file in HistoryServer

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28867:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> InMemoryStore checkpoint to speed up replay log file in HistoryServer
> -
>
> Key: SPARK-28867
> URL: https://issues.apache.org/jira/browse/SPARK-28867
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: wuyi
>Priority: Major
>
> The HistoryServer can currently be very slow replaying a large log file the 
> first time, and it always re-replays an in-progress log file after it changes. 
> We could periodically checkpoint the InMemoryStore to speed up replaying log 
> files.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25927) Fix number of partitions returned by outputPartitioning

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25927:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Fix number of partitions returned by outputPartitioning
> ---
>
> Key: SPARK-25927
> URL: https://issues.apache.org/jira/browse/SPARK-25927
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently the default number of partitions returned by the outputPartitioning() 
> method of SparkPlan is 0, which doesn't reflect the actual number of partitions 
> for some execution nodes. To return the actual number, the outputPartitioning() 
> method needs to be overridden in the child nodes and implemented properly.
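> As a hedged illustration only (not a node from an actual patch), the override 
> on a hypothetical pass-through unary node could look like the sketch below; 
> SparkPlan's default is UnknownPartitioning(0), which is where the misleading 
> partition count of 0 comes from. This targets the Spark 3.x internal API.
> {code:scala}
> import org.apache.spark.rdd.RDD
> import org.apache.spark.sql.catalyst.InternalRow
> import org.apache.spark.sql.catalyst.expressions.Attribute
> import org.apache.spark.sql.catalyst.plans.physical.Partitioning
> import org.apache.spark.sql.execution.{SparkPlan, UnaryExecNode}
> 
> // Hypothetical pass-through node: it forwards the child's partitioning instead
> // of inheriting SparkPlan's default UnknownPartitioning(0).
> case class PassThroughExec(child: SparkPlan) extends UnaryExecNode {
>   override def output: Seq[Attribute] = child.output
>   override def outputPartitioning: Partitioning = child.outputPartitioning
>   override protected def doExecute(): RDD[InternalRow] = child.execute()
> }
> {code}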



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30497) migrate DESCRIBE TABLE to the new framework

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30497:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> migrate DESCRIBE TABLE to the new framework
> ---
>
> Key: SPARK-30497
> URL: https://issues.apache.org/jira/browse/SPARK-30497
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26238) Set SPARK_CONF_DIR to be ${SPARK_HOME}/conf for K8S

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26238:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Set SPARK_CONF_DIR to be ${SPARK_HOME}/conf for K8S
> ---
>
> Key: SPARK-26238
> URL: https://issues.apache.org/jira/browse/SPARK-26238
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Ilan Filonenko
>Priority: Minor
>
> Set SPARK_CONF_DIR to point to ${SPARK_HOME}/conf as opposed to 
> /opt/spark/conf, which is hard-coded into the Constants. This is the expected 
> behavior according to the Spark docs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25342) Support rolling back a result stage

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25342:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Support rolling back a result stage
> ---
>
> Key: SPARK-25342
> URL: https://issues.apache.org/jira/browse/SPARK-25342
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Priority: Major
>
> This is a follow-up of https://issues.apache.org/jira/browse/SPARK-23243
> To completely fix that problem, Spark needs to be able to roll back a result 
> stage and rerun all the result tasks.
> However, the result stage may do file committing, which currently does not 
> support re-committing a task. We should either support rolling back a committed 
> task, or abort the entire commit and do it again.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30768) Constraints inferred from inequality attributes

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30768:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Constraints inferred from inequality attributes
> ---
>
> Key: SPARK-30768
> URL: https://issues.apache.org/jira/browse/SPARK-30768
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce:
> {code:sql}
> create table SPARK_30768_1(c1 int, c2 int);
> create table SPARK_30768_2(c1 int, c2 int);
> {code}
> *Spark SQL*:
> {noformat}
> spark-sql> explain select t1.* from SPARK_30768_1 t1 join SPARK_30768_2 t2 on 
> (t1.c1 > t2.c1) where t1.c1 = 3;
> == Physical Plan ==
> *(3) Project [c1#5, c2#6]
> +- BroadcastNestedLoopJoin BuildRight, Inner, (c1#5 > c1#7)
>:- *(1) Project [c1#5, c2#6]
>:  +- *(1) Filter (isnotnull(c1#5) AND (c1#5 = 3))
>: +- *(1) ColumnarToRow
>:+- FileScan parquet default.spark_30768_1[c1#5,c2#6] Batched: 
> true, DataFilters: [isnotnull(c1#5), (c1#5 = 3)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(c1), EqualTo(c1,3)], 
> ReadSchema: struct
>+- BroadcastExchange IdentityBroadcastMode, [id=#60]
>   +- *(2) Project [c1#7]
>  +- *(2) Filter isnotnull(c1#7)
> +- *(2) ColumnarToRow
>+- FileScan parquet default.spark_30768_2[c1#7] Batched: true, 
> DataFilters: [isnotnull(c1#7)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(c1)], ReadSchema: 
> struct
> {noformat}
> *Hive* support this feature:
> {noformat}
> hive> explain select t1.* from SPARK_30768_1 t1 join SPARK_30768_2 t2 on 
> (t1.c1 > t2.c1) where t1.c1 = 3;
> Warning: Map Join MAPJOIN[13][bigTable=?] in task 'Stage-3:MAPRED' is a cross 
> product
> OK
> STAGE DEPENDENCIES:
>   Stage-4 is a root stage
>   Stage-3 depends on stages: Stage-4
>   Stage-0 depends on stages: Stage-3
> STAGE PLANS:
>   Stage: Stage-4
> Map Reduce Local Work
>   Alias -> Map Local Tables:
> $hdt$_0:t1
>   Fetch Operator
> limit: -1
>   Alias -> Map Local Operator Tree:
> $hdt$_0:t1
>   TableScan
> alias: t1
> filterExpr: (c1 = 3) (type: boolean)
> Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column 
> stats: NONE
> Filter Operator
>   predicate: (c1 = 3) (type: boolean)
>   Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL 
> Column stats: NONE
>   Select Operator
> expressions: c2 (type: int)
> outputColumnNames: _col1
> Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL 
> Column stats: NONE
> HashTable Sink Operator
>   keys:
> 0
> 1
>   Stage: Stage-3
> Map Reduce
>   Map Operator Tree:
>   TableScan
> alias: t2
> filterExpr: (c1 < 3) (type: boolean)
> Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column 
> stats: NONE
> Filter Operator
>   predicate: (c1 < 3) (type: boolean)
>   Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL 
> Column stats: NONE
>   Select Operator
> Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL 
> Column stats: NONE
> Map Join Operator
>   condition map:
>Inner Join 0 to 1
>   keys:
> 0
> 1
>   outputColumnNames: _col1
>   Statistics: Num rows: 1 Data size: 1 Basic stats: PARTIAL 
> Column stats: NONE
>   Select Operator
> expressions: 3 (type: int), _col1 (type: int)
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 1 Data size: 1 Basic stats: PARTIAL 
> Column stats: NONE
> File Output Operator
>   compressed: false
>   Statistics: Num rows: 1 Data size: 1 Basic stats: 
> PARTIAL Column stats: NONE
>   table:
>   input format: 
> org.apache.hadoop.mapred.SequenceFileInputFormat
>   output format: 
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>   serde: 
> 

[jira] [Updated] (SPARK-29652) AlterTable/DescribeTable should not list the table as a child

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29652:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> AlterTable/DescribeTable should not list the table as a child
> -
>
> Key: SPARK-29652
> URL: https://issues.apache.org/jira/browse/SPARK-29652
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27359) Joins on some array functions can be optimized

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27359:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Joins on some array functions can be optimized
> --
>
> Key: SPARK-27359
> URL: https://issues.apache.org/jira/browse/SPARK-27359
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer, SQL
>Affects Versions: 3.1.0
>Reporter: Nikolas Vanderhoof
>Priority: Minor
>
> I encounter these cases frequently, and implemented the optimization manually 
> (as shown here). If others experience this as well, perhaps it would be good 
> to add appropriate tree transformations into catalyst. 
> h1. Case 1
> A join like this:
> {code:scala}
> left.join(
>   right,
>   arrays_overlap(left("a"), right("b")) // Creates a cartesian product in 
> the logical plan
> )
> {code}
> will produce the same results as:
> {code:scala}
> {
>   val leftPrime = left.withColumn("exploded_a", explode(col("a")))
>   val rightPrime = right.withColumn("exploded_b", explode(col("b")))
>   leftPrime.join(
> rightPrime,
> leftPrime("exploded_a") === rightPrime("exploded_b")
>   // Equijoin doesn't produce cartesian product
>   ).drop("exploded_a", "exploded_b").distinct
> }
> {code}
> h1. Case 2
> A join like this:
> {code:scala}
> left.join(
>   right,
>   array_contains(left("arr"), right("value")) // Cartesian product in logical 
> plan
> )
> {code}
> will produce the same results as:
> {code:scala}
> {
>   val leftPrime = left.withColumn("exploded_arr", explode(col("arr")))
>   leftPrime.join(
> right,
> leftPrime("exploded_arr") === right("value") // Fast equijoin
>   ).drop("exploded_arr").distinct
> }
> {code}
> h1. Case 3
> A join like this:
> {code:scala}
> left.join(
>   right,
>   array_contains(right("arr"), left("value")) // Cartesian product in logical 
> plan
> )
> {code}
> will produce the same results as:
> {code:scala}
> {
>   val rightPrime = right.withColumn("exploded_arr", explode(col("arr")))
>   left.join(
> rightPrime,
> left("value") === rightPrime("exploded_arr") // Fast equijoin
>   ).drop("exploded_arr").distinct
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30828) Improve insertInto behaviour

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30828:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Improve insertInto behaviour
> 
>
> Key: SPARK-30828
> URL: https://issues.apache.org/jira/browse/SPARK-30828
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.1.0
>Reporter: German Schiavon Matteo
>Priority: Minor
>
> Currently, when you call *_insertInto_* to add a DataFrame to an existing 
> table, the only safety check is that the number of columns matches; the column 
> order doesn't matter. In addition, the message when the number of columns 
> doesn't match is not very helpful, especially when you have a lot of columns:
> {code:java}
>  org.apache.spark.sql.AnalysisException: `default`.`table` requires that the 
> data to be inserted have the same number of columns as the target table: 
> target table has 2 column(s) but the inserted data has 1 column(s), including 
> 0 partition column(s) having constant value(s).; {code}
> I think a standard column check would be very helpful, just like in most 
> other cases in Spark:
>  
> {code:java}
> "cannot resolve 'p2' given input columns: [id, p1];"  
> {code}
>  
>  
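> A small spark-shell sketch of the current by-position behaviour (the table and 
> column names are made up for illustration):
> {code:scala}
> import org.apache.spark.sql.SparkSession
> 
> val spark = SparkSession.builder().master("local[*]").appName("insertInto-order").getOrCreate()
> import spark.implicits._
> 
> spark.sql("CREATE TABLE target(id INT, p1 INT) USING parquet")
> 
> // The DataFrame columns are (p1, id) -- reversed relative to the table schema.
> // insertInto matches columns by position, so this is accepted silently and the
> // values land in the wrong columns instead of failing with a per-column error.
> Seq((10, 1), (20, 2)).toDF("p1", "id").write.insertInto("target")
> 
> spark.table("target").show()
> {code}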



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28149) Disable negative DNS caching

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28149:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Disable negative DNS caching
> -
>
> Key: SPARK-28149
> URL: https://issues.apache.org/jira/browse/SPARK-28149
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Jose Luis Pedrosa
>Priority: Minor
>
> By default, the JVM caches DNS resolution failures; they are cached for 10 
> seconds.
> The Alpine JDK used in the Kubernetes images has a default timeout of 5 
> seconds.
> This means that in clusters with slow init times (network sidecar pods, slow 
> network start-up) the executor will never run, because the first attempt to 
> connect to the driver will fail, that failure will be cached, and the retries 
> will then happen in a tight loop without actually resolving the address again.
>  
> The proposed implementation is to have entrypoint.sh (which is exclusive to 
> k8s) alter the file that controls DNS caching and disable negative caching 
> when an environment variable such as "DISABLE_DNS_NEGATIVE_CACHING" is defined. 
>  
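> For reference, the JVM-level knob involved is the java.security property 
> networkaddress.cache.negative.ttl. A hedged sketch of disabling it 
> programmatically is below; entrypoint.sh would instead edit the java.security 
> file before the JVM starts.
> {code:scala}
> // Failed DNS lookups are cached for networkaddress.cache.negative.ttl seconds
> // (commonly 10). Setting it to 0 disables negative caching; this must run
> // before the first failed lookup gets cached.
> java.security.Security.setProperty("networkaddress.cache.negative.ttl", "0")
> {code}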



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26344) Support for flexVolume mount for Kubernetes

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26344:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Support for flexVolume mount for Kubernetes
> ---
>
> Key: SPARK-26344
> URL: https://issues.apache.org/jira/browse/SPARK-26344
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Eric Carlson
>Priority: Minor
>
> Currently only hostPath, emptyDir, and PVC volume types are accepted for 
> Kubernetes-deployed drivers and executors.
> flexVolume types allow for pluggable volume drivers to be used in Kubernetes 
> - a widely used example of this is the Rook deployment of CephFS, which 
> provides a POSIX-compliant distributed filesystem integrated into K8s.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27708) Add documentation for v2 data sources

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27708:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Add documentation for v2 data sources
> -
>
> Key: SPARK-27708
> URL: https://issues.apache.org/jira/browse/SPARK-27708
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Ryan Blue
>Priority: Major
>  Labels: documentation
>
> Before the 3.0 release, the new v2 data sources should be documented. This 
> includes:
>  * How to plug in catalog implementations
>  * Catalog plugin configuration
>  * Multi-part identifier behavior
>  * Partition transforms
>  * Table properties that are used to pass table info (e.g. "provider")



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29492) SparkThriftServer can't support jar class as table serde class when executestatement in sync mode

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29492:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> SparkThriftServer  can't support jar class as table serde class when 
> executestatement in sync mode
> --
>
> Key: SPARK-29492
> URL: https://issues.apache.org/jira/browse/SPARK-29492
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
>
> Add a unit test in HiveThriftBinaryServerSuite:
> {code}
>   test("jar in sync mode") {
> withCLIServiceClient { client =>
>   val user = System.getProperty("user.name")
>   val sessionHandle = client.openSession(user, "")
>   val confOverlay = new java.util.HashMap[java.lang.String, 
> java.lang.String]
>   val jarFile = HiveTestJars.getHiveHcatalogCoreJar().getCanonicalPath
>   Seq(s"ADD JAR $jarFile",
> "CREATE TABLE smallKV(key INT, val STRING)",
> s"LOAD DATA LOCAL INPATH '${TestData.smallKv}' OVERWRITE INTO TABLE 
> smallKV")
> .foreach(query => client.executeStatement(sessionHandle, query, 
> confOverlay))
>   client.executeStatement(sessionHandle,
> """CREATE TABLE addJar(key string)
>   |ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
> """.stripMargin, confOverlay)
>   client.executeStatement(sessionHandle,
> "INSERT INTO TABLE addJar SELECT 'k1' as key FROM smallKV limit 1", 
> confOverlay)
>   val operationHandle = client.executeStatement(
> sessionHandle,
> "SELECT key FROM addJar",
> confOverlay)
>   // Fetch result first time
>   assertResult(1, "Fetching result first time from next row") {
> val rows_next = client.fetchResults(
>   operationHandle,
>   FetchOrientation.FETCH_NEXT,
>   1000,
>   FetchType.QUERY_OUTPUT)
> rows_next.numRows()
>   }
> }
>   }
> {code}
> Running it produces a ClassNotFound error.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25166) Reduce the number of write operations for shuffle write.

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25166:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Reduce the number of write operations for shuffle write.
> 
>
> Key: SPARK-25166
> URL: https://issues.apache.org/jira/browse/SPARK-25166
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: liuxian
>Priority: Minor
>
> Currently, only one record is written to a buffer each time, which increases 
> the number of copies.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28940) Subquery reuse across all subquery levels

2020-03-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28940:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Subquery reuse across all subquery levels
> -
>
> Key: SPARK-28940
> URL: https://issues.apache.org/jira/browse/SPARK-28940
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Peter Toth
>Priority: Major
>
> Currently subquery reuse doesn't work across all subquery levels.
> Here is an example query:
> {noformat}
> SELECT (SELECT avg(key) FROM testData), (SELECT (SELECT avg(key) FROM 
> testData))
> FROM testData
> LIMIT 1
> {noformat}
> where the plan now is:
> {noformat}
> CollectLimit 1
> +- *(1) Project [Subquery scalar-subquery#268, [id=#231] AS 
> scalarsubquery()#276, Subquery scalar-subquery#270, [id=#266] AS 
> scalarsubquery()#277]
>:  :- Subquery scalar-subquery#268, [id=#231]
>:  :  +- *(2) HashAggregate(keys=[], functions=[avg(cast(key#13 as 
> bigint))], output=[avg(key)#272])
>:  : +- Exchange SinglePartition, true, [id=#227]
>:  :+- *(1) HashAggregate(keys=[], 
> functions=[partial_avg(cast(key#13 as bigint))], output=[sum#282, count#283L])
>:  :   +- *(1) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13]
>:  :  +- Scan[obj#12]
>:  +- Subquery scalar-subquery#270, [id=#266]
>: +- *(1) Project [Subquery scalar-subquery#269, [id=#263] AS 
> scalarsubquery()#275]
>::  +- Subquery scalar-subquery#269, [id=#263]
>:: +- *(2) HashAggregate(keys=[], functions=[avg(cast(key#13 
> as bigint))], output=[avg(key)#274])
>::+- Exchange SinglePartition, true, [id=#259]
>::   +- *(1) HashAggregate(keys=[], 
> functions=[partial_avg(cast(key#13 as bigint))], output=[sum#286, count#287L])
>::  +- *(1) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13]
>:: +- Scan[obj#12]
>:+- *(1) Scan OneRowRelation[]
>+- *(1) SerializeFromObject
>   +- Scan[obj#12]
> {noformat}
> but it could be:
> {noformat}
> CollectLimit 1
> +- *(1) Project [ReusedSubquery Subquery scalar-subquery#241, [id=#148] AS 
> scalarsubquery()#248, Subquery scalar-subquery#242, [id=#164] AS 
> scalarsubquery()#249]
>:  :- ReusedSubquery Subquery scalar-subquery#241, [id=#148]
>:  +- Subquery scalar-subquery#242, [id=#164]
>: +- *(1) Project [Subquery scalar-subquery#241, [id=#148] AS 
> scalarsubquery()#247]
>::  +- Subquery scalar-subquery#241, [id=#148]
>:: +- *(2) HashAggregate(keys=[], functions=[avg(cast(key#13 
> as bigint))], output=[avg(key)#246])
>::+- Exchange SinglePartition, true, [id=#144]
>::   +- *(1) HashAggregate(keys=[], 
> functions=[partial_avg(cast(key#13 as bigint))], output=[sum#258, count#259L])
>::  +- *(1) SerializeFromObject 
> [knownnotnull(assertnotnull(input[0, 
> org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13]
>:: +- Scan[obj#12]
>:+- *(1) Scan OneRowRelation[]
>+- *(1) SerializeFromObject
>   +- Scan[obj#12]
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


