[jira] [Created] (SPARK-27755) Update zstd-jni to 1.4.0-1

2019-05-16 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-27755:
-

 Summary: Update zstd-jni to 1.4.0-1
 Key: SPARK-27755
 URL: https://issues.apache.org/jira/browse/SPARK-27755
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun









[jira] [Resolved] (SPARK-27634) deleteCheckpointOnStop should be configurable

2019-05-16 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27634.
---
Resolution: Duplicate

> deleteCheckpointOnStop should be configurable
> -
>
> Key: SPARK-27634
> URL: https://issues.apache.org/jira/browse/SPARK-27634
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.2
>Reporter: Yu Wang
>Priority: Minor
> Attachments: SPARK-27634.patch
>
>
> We need to delete the checkpoint files after running the streaming application 
> multiple times, so deleteCheckpointOnStop should be configurable.






[jira] [Resolved] (SPARK-27752) Update lz4-java from 1.5.1 to 1.6.0

2019-05-16 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27752.
---
   Resolution: Fixed
 Assignee: Kazuaki Ishizaki
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/24629

> Update lz4-java from 1.5.1 to 1.6.0
> ---
>
> Key: SPARK-27752
> URL: https://issues.apache.org/jira/browse/SPARK-27752
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 3.0.0
>
>
> Update lz4-java that is available from https://github.com/lz4/lz4-java.






[jira] [Resolved] (SPARK-27718) incorrect result from pagerank

2019-05-16 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-27718.
---
Resolution: Not A Problem

> incorrect result from pagerank
> --
>
> Key: SPARK-27718
> URL: https://issues.apache.org/jira/browse/SPARK-27718
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 2.4.1
>Reporter: De-En Lin
>Priority: Minor
> Attachments: 螢幕快照 2019-05-16 上午10.09.45.png
>
>
> When I executed /examples/src/main/python/pagerank.py, 
> the result was as follows:
>  
> {code:java}
> 1 has rank: 0.5821576292853757.
> 2 has rank: 0.3361551945789305.
> 3 has rank: 0.3361551945789305.
> 4 has rank: 0.3361551945789305.
> {code}
>  
> However, running the same graph through networkx's PageRank gives the 
> following result:
> {code:java}
> {1: 0.4797305739863632, 2: 0.1734231420045456, 3: 0.1734231420045456, 4: 
> 0.1734231420045456}
> {code}






[jira] [Resolved] (SPARK-27751) buildReader is now protected

2019-05-16 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27751.
--
Resolution: Invalid

> buildReader is now protected
> 
>
> Key: SPARK-27751
> URL: https://issues.apache.org/jira/browse/SPARK-27751
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Geet Kumar
>Priority: Major
>
> I have recently upgraded to spark 2.4.0 and was relying on the `buildReader` 
> method. It originally was public and now it is protected. 
> What was the reason for this change?
> The only workaround I can see is to use `buildReaderWithPartitionValues` 
> which remains public. Any plans to revert `buildReader` to be public again?
> The change was made here: [https://github.com/apache/spark/pull/17253/files]






[jira] [Commented] (SPARK-27751) buildReader is now protected

2019-05-16 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841821#comment-16841821
 ] 

Hyukjin Kwon commented on SPARK-27751:
--

All the classes in the `execution` package are subject to being private, as 
documented in {{package.scala}}. We don't maintain compatibility there.

> buildReader is now protected
> 
>
> Key: SPARK-27751
> URL: https://issues.apache.org/jira/browse/SPARK-27751
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Geet Kumar
>Priority: Major
>
> I have recently upgraded to spark 2.4.0 and was relying on the `buildReader` 
> method. It originally was public and now it is protected. 
> What was the reason for this change?
> The only workaround I can see is to use `buildReaderWithPartitionValues` 
> which remains public. Any plans to revert `buildReader` to be public again?
> The change was made here: [https://github.com/apache/spark/pull/17253/files]






[jira] [Commented] (SPARK-27733) Upgrade to Avro 1.9.x

2019-05-16 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841811#comment-16841811
 ] 

Hyukjin Kwon commented on SPARK-27733:
--

Then it should be blocked by a ticket that targets the Hive upgrade.

> Upgrade to Avro 1.9.x
> -
>
> Key: SPARK-27733
> URL: https://issues.apache.org/jira/browse/SPARK-27733
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 3.0.0
>Reporter: Ismaël Mejía
>Priority: Minor
>
> Avro 1.9.0 was released with many nice features, including reduced size (1 MB 
> less), removed dependencies (no paranamer, no shaded Guava), and security 
> updates, so it is probably a worthwhile upgrade.






[jira] [Comment Edited] (SPARK-27733) Upgrade to Avro 1.9.x

2019-05-16 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841811#comment-16841811
 ] 

Hyukjin Kwon edited comment on SPARK-27733 at 5/16/19 11:34 PM:


Then it should be blocked by a ticket that targets the Hive upgrade within Spark.


was (Author: hyukjin.kwon):
Then it should be blocked by a ticket that targets the Hive upgrade.

> Upgrade to Avro 1.9.x
> -
>
> Key: SPARK-27733
> URL: https://issues.apache.org/jira/browse/SPARK-27733
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 3.0.0
>Reporter: Ismaël Mejía
>Priority: Minor
>
> Avro 1.9.0 was released with many nice features, including reduced size (1 MB 
> less), removed dependencies (no paranamer, no shaded Guava), and security 
> updates, so it is probably a worthwhile upgrade.






[jira] [Resolved] (SPARK-27576) table capabilty to skip the output column resolution

2019-05-16 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27576.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/24469

> table capabilty to skip the output column resolution
> 
>
> Key: SPARK-27576
> URL: https://issues.apache.org/jira/browse/SPARK-27576
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Resolved] (SPARK-27735) Interval string in upper case is not supported in Trigger

2019-05-16 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27735.
---
   Resolution: Fixed
Fix Version/s: 3.0.0
   2.4.4
   2.3.4

This is resolved via https://github.com/apache/spark/pull/24619

> Interval string in upper case is not supported in Trigger
> -
>
> Key: SPARK-27735
> URL: https://issues.apache.org/jira/browse/SPARK-27735
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.3.4, 2.4.4, 3.0.0
>
>
> Some APIs in Structured Streaming require the user to specify an interval. 
> Right now these APIs don't accept upper-case strings.
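A minimal sketch in Scala of the interval-string API the description refers to; Trigger.ProcessingTime is an existing Structured Streaming API, while the claim that the upper-case form fails is taken from the description above rather than verified here:

{code}
import org.apache.spark.sql.streaming.Trigger

// Lower-case interval strings are accepted.
val ok = Trigger.ProcessingTime("10 seconds")

// Per this ticket, the upper-case form was rejected before the fix:
// val fails = Trigger.ProcessingTime("10 SECONDS")
{code}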






[jira] [Assigned] (SPARK-27754) Introduce spark on k8s config for driver request cores

2019-05-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27754:


Assignee: (was: Apache Spark)

> Introduce spark on k8s config for driver request cores
> --
>
> Key: SPARK-27754
> URL: https://issues.apache.org/jira/browse/SPARK-27754
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Arun Mahadevan
>Priority: Minor
>
> Spark on k8s supports a config for specifying the executor CPU requests
> (spark.kubernetes.executor.request.cores), but a similar config is missing
> for the driver. Apparently `spark.driver.cores` works, but it's not evident
> that this accepts fractional values (it's defined as an Integer config but
> apparently accepts decimals). To keep in sync with the executor config, a
> similar driver config can be introduced (spark.kubernetes.driver.request.cores)
> for explicitly specifying the driver CPU requests. If not provided, the value
> will default to `spark.driver.cores` as before.






[jira] [Assigned] (SPARK-27754) Introduce spark on k8s config for driver request cores

2019-05-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27754:


Assignee: Apache Spark

> Introduce spark on k8s config for driver request cores
> --
>
> Key: SPARK-27754
> URL: https://issues.apache.org/jira/browse/SPARK-27754
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Arun Mahadevan
>Assignee: Apache Spark
>Priority: Minor
>
> Spark on k8s supports a config for specifying the executor CPU requests
> (spark.kubernetes.executor.request.cores), but a similar config is missing
> for the driver. Apparently `spark.driver.cores` works, but it's not evident
> that this accepts fractional values (it's defined as an Integer config but
> apparently accepts decimals). To keep in sync with the executor config, a
> similar driver config can be introduced (spark.kubernetes.driver.request.cores)
> for explicitly specifying the driver CPU requests. If not provided, the value
> will default to `spark.driver.cores` as before.






[jira] [Created] (SPARK-27754) Introduce spark on k8s config for driver request cores

2019-05-16 Thread Arun Mahadevan (JIRA)
Arun Mahadevan created SPARK-27754:
--

 Summary: Introduce spark on k8s config for driver request cores
 Key: SPARK-27754
 URL: https://issues.apache.org/jira/browse/SPARK-27754
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.0.0
Reporter: Arun Mahadevan


Spark on k8s supports a config for specifying the executor CPU requests
(spark.kubernetes.executor.request.cores), but a similar config is missing
for the driver. Apparently `spark.driver.cores` works, but it's not evident
that this accepts fractional values (it's defined as an Integer config but
apparently accepts decimals). To keep in sync with the executor config, a
similar driver config can be introduced (spark.kubernetes.driver.request.cores)
for explicitly specifying the driver CPU requests. If not provided, the value
will default to `spark.driver.cores` as before.
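A hedged sketch of how the proposed key could sit next to the existing executor-side one; spark.kubernetes.driver.request.cores is the config proposed by this ticket, not an existing Spark setting, and the values are illustrative:

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // existing executor-side request; already accepts fractional values
  .set("spark.kubernetes.executor.request.cores", "0.5")
  // proposed driver-side counterpart; would fall back to spark.driver.cores if unset
  .set("spark.kubernetes.driver.request.cores", "0.5")
{code}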






[jira] [Assigned] (SPARK-27752) Update lz4-java from 1.5.1 to 1.6.0

2019-05-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27752:


Assignee: Apache Spark

> Update lz4-java from 1.5.1 to 1.6.0
> ---
>
> Key: SPARK-27752
> URL: https://issues.apache.org/jira/browse/SPARK-27752
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>Priority: Major
>
> Update lz4-java that is available from https://github.com/lz4/lz4-java.






[jira] [Assigned] (SPARK-27752) Update lz4-java from 1.5.1 to 1.6.0

2019-05-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27752:


Assignee: (was: Apache Spark)

> Update lz4-java from 1.5.1 to 1.6.0
> ---
>
> Key: SPARK-27752
> URL: https://issues.apache.org/jira/browse/SPARK-27752
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> Update lz4-java that is available from https://github.com/lz4/lz4-java.






[jira] [Commented] (SPARK-27736) Improve handling of FetchFailures caused by ExternalShuffleService losing track of executor registrations

2019-05-16 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841614#comment-16841614
 ] 

Thomas Graves commented on SPARK-27736:
---

To clarify my last suggestion: I mean each executor reports the fetch failure back 
to the driver, and the driver can see that multiple fetch failures happened against 
the same host for different executors' output and then choose to invalidate all 
the output on that host once X fetches have already failed. There are other things 
the driver could use that information for as well.
 

> Improve handling of FetchFailures caused by ExternalShuffleService losing 
> track of executor registrations
> -
>
> Key: SPARK-27736
> URL: https://issues.apache.org/jira/browse/SPARK-27736
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Priority: Minor
>
> This ticket describes a fault-tolerance edge-case which can cause Spark jobs 
> to fail if a single external shuffle service process reboots and fails to 
> recover the list of registered executors (something which can happen when 
> using YARN if NodeManager recovery is disabled) _and_ the Spark job has a 
> large number of executors per host.
> I believe this problem can be worked around today via a change of 
> configurations, but I'm filing this issue to (a) better document this 
> problem, and (b) propose either a change of default configurations or 
> additional DAGScheduler logic to better handle this failure mode.
> h2. Problem description
> The external shuffle service process is _mostly_ stateless except for a map 
> tracking the set of registered applications and executors.
> When processing a shuffle fetch request, the shuffle service first checks 
> whether the requested block ID's executor is registered; if it's not 
> registered then the shuffle service throws an exception like 
> {code:java}
> java.lang.RuntimeException: Executor is not registered 
> (appId=application_1557557221330_6891, execId=428){code}
> and this exception becomes a {{FetchFailed}} error in the executor requesting 
> the shuffle block.
> In normal operation this error should not occur because executors shouldn't 
> be mis-routing shuffle fetch requests. However, this _can_ happen if the 
> shuffle service crashes and restarts, causing it to lose its in-memory 
> executor registration state. With YARN this state can be recovered from disk 
> if YARN NodeManager recovery is enabled (using the mechanism added in 
> SPARK-9439), but I don't believe that we perform state recovery in Standalone 
> and Mesos modes (see SPARK-24223).
> If state cannot be recovered then map outputs cannot be served (even though 
> the files probably still exist on disk). In theory, this shouldn't cause 
> Spark jobs to fail because we can always redundantly recompute lost / 
> unfetchable map outputs.
> However, in practice this can cause total job failures in deployments where 
> the node with the failed shuffle service was running a large number of 
> executors: by default, the DAGScheduler unregisters map outputs _only from the 
> individual executor whose shuffle blocks could not be fetched_ (see 
> [code|https://github.com/apache/spark/blame/bfb3ffe9b33a403a1f3b6f5407d34a477ce62c85/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1643]),
>  so it can take several rounds of failed stage attempts to fail and clear 
> output from all executors on the faulty host. If the number of executors on a 
> host is greater than the stage retry limit then this can exhaust stage retry 
> attempts and cause job failures.
> This "multiple rounds of recomputation to discover all failed executors on a 
> host" problem was addressed by SPARK-19753, which added a 
> {{spark.files.fetchFailure.unRegisterOutputOnHost}} configuration which 
> promotes executor fetch failures into host-wide fetch failures (clearing 
> output from all neighboring executors upon a single failure). However, that 
> configuration is {{false}} by default.
> h2. Potential solutions
> I have a few ideas about how we can improve this situation:
>  - Update the [YARN external shuffle service 
> documentation|https://spark.apache.org/docs/latest/running-on-yarn.html#configuring-the-external-shuffle-service]
>  to recommend enabling node manager recovery.
>  - Consider defaulting {{spark.files.fetchFailure.unRegisterOutputOnHost}} to 
> {{true}}. This would improve out-of-the-box resiliency for large clusters. 
> The trade-off here is a reduction of efficiency in case there are transient 
> "false positive" fetch failures, but I suspect this case may be unlikely in 
> practice (so the change of default could be an acceptable trade-off). See 
> [prior 

[jira] [Commented] (SPARK-27736) Improve handling of FetchFailures caused by ExternalShuffleService losing track of executor registrations

2019-05-16 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841593#comment-16841593
 ] 

Thomas Graves commented on SPARK-27736:
---

Yeah, we always ran YARN with NodeManager recovery on, but that doesn't help 
standalone mode unless you implement something similar. Either way, I think 
documenting it for YARN is a good idea.

We used to see transient fetch failures all the time because of temporary 
spikes in disk usage, so I would be hesitant to turn 
spark.files.fetchFailure.unRegisterOutputOnHost on by default; on the other 
hand, users could turn it back off too, so it depends on what people think is 
most common.

I don't think you can assume that the death of the shuffle service (the NM on 
YARN) implies the death of the executor. We have seen NodeManagers go down with 
an OOM while the executors stay up. Without the NM there, there isn't really 
anything to clean up the containers on it. You will obviously get fetch 
failures from that node if it does go down.

Your last option seems like the best of those, but as you mention it could get 
a bit ugly with the string matching.

The other thing you can do is start tracking those fetch failures and have the 
driver make a more informed decision based on them. This is work we had started 
at my previous employer but never had time to finish. It's a much bigger change 
but really what we should be doing. It would allow us to make better decisions 
about blacklisting and to see whether it was the map or the reduce node that 
had issues, etc.

 

> Improve handling of FetchFailures caused by ExternalShuffleService losing 
> track of executor registrations
> -
>
> Key: SPARK-27736
> URL: https://issues.apache.org/jira/browse/SPARK-27736
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Priority: Minor
>
> This ticket describes a fault-tolerance edge-case which can cause Spark jobs 
> to fail if a single external shuffle service process reboots and fails to 
> recover the list of registered executors (something which can happen when 
> using YARN if NodeManager recovery is disabled) _and_ the Spark job has a 
> large number of executors per host.
> I believe this problem can be worked around today via a change of 
> configurations, but I'm filing this issue to (a) better document this 
> problem, and (b) propose either a change of default configurations or 
> additional DAGScheduler logic to better handle this failure mode.
> h2. Problem description
> The external shuffle service process is _mostly_ stateless except for a map 
> tracking the set of registered applications and executors.
> When processing a shuffle fetch request, the shuffle service first checks 
> whether the requested block ID's executor is registered; if it's not 
> registered then the shuffle service throws an exception like 
> {code:java}
> java.lang.RuntimeException: Executor is not registered 
> (appId=application_1557557221330_6891, execId=428){code}
> and this exception becomes a {{FetchFailed}} error in the executor requesting 
> the shuffle block.
> In normal operation this error should not occur because executors shouldn't 
> be mis-routing shuffle fetch requests. However, this _can_ happen if the 
> shuffle service crashes and restarts, causing it to lose its in-memory 
> executor registration state. With YARN this state can be recovered from disk 
> if YARN NodeManager recovery is enabled (using the mechanism added in 
> SPARK-9439), but I don't believe that we perform state recovery in Standalone 
> and Mesos modes (see SPARK-24223).
> If state cannot be recovered then map outputs cannot be served (even though 
> the files probably still exist on disk). In theory, this shouldn't cause 
> Spark jobs to fail because we can always redundantly recompute lost / 
> unfetchable map outputs.
> However, in practice this can cause total job failures in deployments where 
> the node with the failed shuffle service was running a large number of 
> executors: by default, the DAGScheduler unregisters map outputs _only from the 
> individual executor whose shuffle blocks could not be fetched_ (see 
> [code|https://github.com/apache/spark/blame/bfb3ffe9b33a403a1f3b6f5407d34a477ce62c85/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1643]),
>  so it can take several rounds of failed stage attempts to fail and clear 
> output from all executors on the faulty host. If the number of executors on a 
> host is greater than the stage retry limit then this can exhaust stage retry 
> attempts and cause job failures.
> This "multiple rounds of recomputation to discover all failed executors on a 
> host" problem was addressed by SPARK-19753, which added a 
> 

[jira] [Created] (SPARK-27753) Support SQL expressions for interval parameter in Structured Streaming

2019-05-16 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-27753:


 Summary: Support SQL expressions for interval parameter in 
Structured Streaming
 Key: SPARK-27753
 URL: https://issues.apache.org/jira/browse/SPARK-27753
 Project: Spark
  Issue Type: New Feature
  Components: Structured Streaming
Affects Versions: 2.4.3
Reporter: Shixiong Zhu


Structured Streaming has several methods that accept an interval string. It 
would be great if we could use the parser to parse it, so that we can also 
support SQL expressions.
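For context, a minimal Scala sketch of two existing methods that take such interval strings today (a watermark delay and a processing-time trigger); which methods would actually gain SQL-expression support is not specified in this ticket:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().master("local[*]").appName("interval-strings").getOrCreate()

val query = spark.readStream.format("rate").load()
  .withWatermark("timestamp", "10 minutes")        // interval string
  .writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("10 seconds"))   // interval string
  .start()
{code}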






[jira] [Created] (SPARK-27752) Update lz4-java from 1.5.2 to 1.6.0

2019-05-16 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-27752:


 Summary: Update lz4-java from 1.5.2 to 1.6.0
 Key: SPARK-27752
 URL: https://issues.apache.org/jira/browse/SPARK-27752
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Kazuaki Ishizaki


Update lz4-java that is available from https://github.com/lz4/lz4-java.







[jira] [Updated] (SPARK-27752) Update lz4-java from 1.5.1 to 1.6.0

2019-05-16 Thread Kazuaki Ishizaki (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-27752:
-
Summary: Update lz4-java from 1.5.1 to 1.6.0  (was: Update lz4-java from 
1.5.2 to 1.6.0)

> Update lz4-java from 1.5.1 to 1.6.0
> ---
>
> Key: SPARK-27752
> URL: https://issues.apache.org/jira/browse/SPARK-27752
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> Update lz4-java that is available from https://github.com/lz4/lz4-java.






[jira] [Created] (SPARK-27751) buildReader is now protected

2019-05-16 Thread Geet Kumar (JIRA)
Geet Kumar created SPARK-27751:
--

 Summary: buildReader is now protected
 Key: SPARK-27751
 URL: https://issues.apache.org/jira/browse/SPARK-27751
 Project: Spark
  Issue Type: Question
  Components: Spark Core
Affects Versions: 2.4.3
Reporter: Geet Kumar


I have recently upgraded to spark 2.4.0 and was relying on the `buildReader` 
method. It originally was public and now it is protected. 

What was the reason for this change?

The only workaround I can see is to use `buildReaderWithPartitionValues` which 
remains public. Any plans to revert `buildReader` to be public again?

The change was made here: [https://github.com/apache/spark/pull/17253/files]






[jira] [Assigned] (SPARK-27749) Fix hadoop-3.2 hive-thriftserver module test issue

2019-05-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27749:


Assignee: Apache Spark

> Fix hadoop-3.2 hive-thriftserver module test issue
> --
>
> Key: SPARK-27749
> URL: https://issues.apache.org/jira/browse/SPARK-27749
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Assigned] (SPARK-27749) Fix hadoop-3.2 hive-thriftserver module test issue

2019-05-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27749:


Assignee: (was: Apache Spark)

> Fix hadoop-3.2 hive-thriftserver module test issue
> --
>
> Key: SPARK-27749
> URL: https://issues.apache.org/jira/browse/SPARK-27749
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>







[jira] [Resolved] (SPARK-27745) build/mvn take wrong scala version when compile for scala 2.12

2019-05-16 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-27745.

Resolution: Not A Bug

You need to run {{./dev/change-scala-version.sh}} first. Pretty sure this is in 
the documentation.

> build/mvn take wrong scala version when compile for scala 2.12
> --
>
> Key: SPARK-27745
> URL: https://issues.apache.org/jira/browse/SPARK-27745
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.3
>Reporter: Izek Greenfield
>Priority: Major
>
> In `build/mvn`, the line
> local scala_binary_version=`grep "scala.binary.version" "${_DIR}/../pom.xml" 
> | head -n1 | awk -F '[<>]' '{print $3}'`
> greps the pom, which contains 2.11, so even if I set -Pscala-2.12 it still 
> picks up 2.11.






[jira] [Comment Edited] (SPARK-27373) Design: Kubernetes support for GPU-aware scheduling

2019-05-16 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841506#comment-16841506
 ] 

Thomas Graves edited comment on SPARK-27373 at 5/16/19 4:21 PM:


For the Kubernetes side, there are two options for requesting containers: 1) pod 
templates, 2) the normal spark and spark.kubernetes configs.

To add the Spark resource support, we can take the Spark configs 
spark.\{driver/executor}.resource.\{resourceName}.count and combine them with a 
new config for the vendor name, like 
spark.\{driver/executor}.resource.\{resourceName}.vendor, to match the device 
plugin support from k8s 
([https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/]) 
and add that to the PodBuilder.

We could make the vendor config Kubernetes-specific, but I'm thinking we leave 
it generic and just state that it's only supported on Kubernetes right now. 
Depending on the setup, I could see this being useful for, say, YARN, since YARN 
supports attributes and the vendor could be an attribute.

Spark already has functionality to override and add certain things in the pod 
templates, so we can use similar functionality with the resources. That way we 
can support both the pod templates and the configs the same way.


was (Author: tgraves):
For the Kubernetes side, there are two options for requesting containers: 1) pod 
templates, 2) the normal spark and spark.kubernetes configs.

To add the Spark resource support, we can take the Spark configs 
spark.\{driver/executor}.resource.\{resourceName}.count and combine them with a 
new config for the vendor name, like 
spark.\{driver/executor}.resource.\{resourceName}.vendor, to match the device 
plugin support from k8s 
([https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/]) 
and add that to the PodBuilder.

Spark already has functionality to override and add certain things in the pod 
templates, so we can use similar functionality with the resources. That way we 
can support both the pod templates and the configs the same way.

> Design: Kubernetes support for GPU-aware scheduling
> ---
>
> Key: SPARK-27373
> URL: https://issues.apache.org/jira/browse/SPARK-27373
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Thomas Graves
>Priority: Major
>







[jira] [Commented] (SPARK-27373) Design: Kubernetes support for GPU-aware scheduling

2019-05-16 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841506#comment-16841506
 ] 

Thomas Graves commented on SPARK-27373:
---

For the Kubernetes side, there are two options for requesting containers: 1) pod 
templates, 2) the normal spark and spark.kubernetes configs.

To add the Spark resource support, we can take the Spark configs 
spark.\{driver/executor}.resource.\{resourceName}.count and combine them with a 
new config for the vendor name, like 
spark.\{driver/executor}.resource.\{resourceName}.vendor, to match the device 
plugin support from k8s 
([https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/]) 
and add that to the PodBuilder.

Spark already has functionality to override and add certain things in the pod 
templates, so we can use similar functionality with the resources. That way we 
can support both the pod templates and the configs the same way.
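A hedged sketch of the config pairing discussed above; the .vendor suffix follows the naming proposed in this comment and is not a released config, and "nvidia.com" is only an example device-plugin vendor:

{code}
import org.apache.spark.SparkConf

// Resource count plus the proposed vendor key, which would let the pod request
// map onto a device-plugin resource name such as "nvidia.com/gpu".
val conf = new SparkConf()
  .set("spark.executor.resource.gpu.count", "1")
  .set("spark.executor.resource.gpu.vendor", "nvidia.com") // proposed, k8s-only for now
{code}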

> Design: Kubernetes support for GPU-aware scheduling
> ---
>
> Key: SPARK-27373
> URL: https://issues.apache.org/jira/browse/SPARK-27373
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Thomas Graves
>Priority: Major
>







[jira] [Assigned] (SPARK-27373) Design: Kubernetes support for GPU-aware scheduling

2019-05-16 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-27373:
-

Assignee: Thomas Graves

> Design: Kubernetes support for GPU-aware scheduling
> ---
>
> Key: SPARK-27373
> URL: https://issues.apache.org/jira/browse/SPARK-27373
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Thomas Graves
>Priority: Major
>







[jira] [Created] (SPARK-27750) Standalone scheduler - ability to prioritize applications over drivers, many drivers act like Denial of Service

2019-05-16 Thread t oo (JIRA)
t oo created SPARK-27750:


 Summary: Standalone scheduler - ability to prioritize applications 
over drivers, many drivers act like Denial of Service
 Key: SPARK-27750
 URL: https://issues.apache.org/jira/browse/SPARK-27750
 Project: Spark
  Issue Type: New Feature
  Components: Scheduler
Affects Versions: 2.4.3, 2.3.3
Reporter: t oo


If I submit 1000 spark-submit drivers, they consume all the cores on my 
cluster (essentially acting like a Denial of Service) and no Spark 
'application' gets to run, since the cores are all consumed by the 'drivers'. 
This feature is about having the ability to prioritize applications over 
drivers so that at least some 'applications' can start running. I guess it 
would be like: if (driver.state = 'submitted' and (exists some app.state = 
'submitted')) then set app.state = 'running';

if all apps have app.state = 'running' then set driver.state = 'submitted'.
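A minimal Scala sketch of the prioritization being asked for, using hypothetical WaitingApp/WaitingDriver types; this is not how the standalone Master is actually structured, it only illustrates "give cores to applications before drivers":

{code}
case class WaitingApp(id: String, coresWanted: Int)
case class WaitingDriver(id: String, coresWanted: Int)

// Hand out free cores to waiting applications first; drivers only get whatever
// is left, so a flood of submitted drivers cannot starve every application.
def assignCores(free: Int,
                apps: Seq[WaitingApp],
                drivers: Seq[WaitingDriver]): Map[String, Int] = {
  var remaining = free
  val grants = scala.collection.mutable.Map.empty[String, Int]
  for (a <- apps if remaining >= a.coresWanted) {
    grants(a.id) = a.coresWanted
    remaining -= a.coresWanted
  }
  for (d <- drivers if remaining >= d.coresWanted) {
    grants(d.id) = d.coresWanted
    remaining -= d.coresWanted
  }
  grants.toMap
}
{code}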






[jira] [Assigned] (SPARK-27748) Kafka consumer/producer password/token redaction

2019-05-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27748:


Assignee: (was: Apache Spark)

> Kafka consumer/producer password/token redaction
> 
>
> Key: SPARK-27748
> URL: https://issues.apache.org/jira/browse/SPARK-27748
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Priority: Major
>







[jira] [Assigned] (SPARK-27748) Kafka consumer/producer password/token redaction

2019-05-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27748:


Assignee: Apache Spark

> Kafka consumer/producer password/token redaction
> 
>
> Key: SPARK-27748
> URL: https://issues.apache.org/jira/browse/SPARK-27748
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Created] (SPARK-27749) Fix hadoop-3.2 hive-thriftserver module test issue

2019-05-16 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-27749:
---

 Summary: Fix hadoop-3.2 hive-thriftserver module test issue
 Key: SPARK-27749
 URL: https://issues.apache.org/jira/browse/SPARK-27749
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang









[jira] [Created] (SPARK-27748) Kafka consumer/producer password/token redaction

2019-05-16 Thread Gabor Somogyi (JIRA)
Gabor Somogyi created SPARK-27748:
-

 Summary: Kafka consumer/producer password/token redaction
 Key: SPARK-27748
 URL: https://issues.apache.org/jira/browse/SPARK-27748
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: Gabor Somogyi









[jira] [Assigned] (SPARK-27747) add a logical plan link in the physical plan

2019-05-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27747:


Assignee: Wenchen Fan  (was: Apache Spark)

> add a logical plan link in the physical plan
> 
>
> Key: SPARK-27747
> URL: https://issues.apache.org/jira/browse/SPARK-27747
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>







[jira] [Assigned] (SPARK-27747) add a logical plan link in the physical plan

2019-05-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27747:


Assignee: Apache Spark  (was: Wenchen Fan)

> add a logical plan link in the physical plan
> 
>
> Key: SPARK-27747
> URL: https://issues.apache.org/jira/browse/SPARK-27747
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Created] (SPARK-27746) add a logical plan link in the physical plan

2019-05-16 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-27746:
---

 Summary: add a logical plan link in the physical plan
 Key: SPARK-27746
 URL: https://issues.apache.org/jira/browse/SPARK-27746
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan









[jira] [Created] (SPARK-27747) add a logical plan link in the physical plan

2019-05-16 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-27747:
---

 Summary: add a logical plan link in the physical plan
 Key: SPARK-27747
 URL: https://issues.apache.org/jira/browse/SPARK-27747
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan









[jira] [Resolved] (SPARK-27377) Upgrade YARN to 3.1.2+ to support GPU

2019-05-16 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-27377.
---
Resolution: Fixed

> Upgrade YARN to 3.1.2+ to support GPU
> -
>
> Key: SPARK-27377
> URL: https://issues.apache.org/jira/browse/SPARK-27377
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> This task should be covered by SPARK-23710. Just a placeholder here.






[jira] [Commented] (SPARK-27376) Design: YARN supports Spark GPU-aware scheduling

2019-05-16 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841404#comment-16841404
 ] 

Thomas Graves commented on SPARK-27376:
---

[~mengxr] [~jiangxb]

Thoughts on my proposal above to rename the user-facing resource config from 
.count to .amount and also add it to the existing YARN configs?

> Design: YARN supports Spark GPU-aware scheduling
> 
>
> Key: SPARK-27376
> URL: https://issues.apache.org/jira/browse/SPARK-27376
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>







[jira] [Commented] (SPARK-27733) Upgrade to Avro 1.9.x

2019-05-16 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841410#comment-16841410
 ] 

Nandor Kollar commented on SPARK-27733:
---

For example, the HiveCatalogedDDLSuite "create hive serde table with 
DataFrameWriter.saveAsTable" test failed with:
{code}
An exception or error caused a run to abort: 
org.apache.avro.Schema$Field.<init>(Ljava/lang/String;Lorg/apache/avro/Schema;Ljava/lang/String;Lorg/codehaus/jackson/JsonNode;)V
 
java.lang.NoSuchMethodError: 
org.apache.avro.Schema$Field.<init>(Ljava/lang/String;Lorg/apache/avro/Schema;Ljava/lang/String;Lorg/codehaus/jackson/JsonNode;)V
at 
org.apache.hadoop.hive.serde2.avro.TypeInfoToSchema.createAvroField(TypeInfoToSchema.java:76)
at 
org.apache.hadoop.hive.serde2.avro.TypeInfoToSchema.convert(TypeInfoToSchema.java:61)
at 
org.apache.hadoop.hive.serde2.avro.AvroSerDe.getSchemaFromCols(AvroSerDe.java:150)
at 
org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:109)
at 
org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:80)
at 
org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:521)
at 
org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:391)
{code}

This is a problem with Hive version 1.2.1.

> Upgrade to Avro 1.9.x
> -
>
> Key: SPARK-27733
> URL: https://issues.apache.org/jira/browse/SPARK-27733
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 3.0.0
>Reporter: Ismaël Mejía
>Priority: Minor
>
> Avro 1.9.0 was released with many nice features, including reduced size (1 MB 
> less), removed dependencies (no paranamer, no shaded Guava), and security 
> updates, so it is probably a worthwhile upgrade.






[jira] [Created] (SPARK-27745) build/mvn take wrong scala version when compile for scala 2.12

2019-05-16 Thread Izek Greenfield (JIRA)
Izek Greenfield created SPARK-27745:
---

 Summary: build/mvn take wrong scala version when compile for scala 
2.12
 Key: SPARK-27745
 URL: https://issues.apache.org/jira/browse/SPARK-27745
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.4.3
Reporter: Izek Greenfield


In `build/mvn`, the line

local scala_binary_version=`grep "scala.binary.version" "${_DIR}/../pom.xml" | 
head -n1 | awk -F '[<>]' '{print $3}'`

greps the pom, which contains 2.11, so even if I set -Pscala-2.12 it still 
picks up 2.11.






[jira] [Commented] (SPARK-27378) spark-submit requests GPUs in YARN mode

2019-05-16 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841412#comment-16841412
 ] 

Thomas Graves commented on SPARK-27378:
---

Spark 3.0 already added support for requesting any resource from YARN via the 
configs spark.yarn.\{executor/driver/am}.resource, so the changes required for 
this Jira are simply to map the new Spark configs 
spark.\{executor/driver}.resource.\{fpga/gpu}.count into the corresponding YARN 
configs. For other resource types we can't map them, though, because we don't 
know what they are called on the YARN side. So for any other resource, users will 
have to specify both configs, spark.yarn.\{executor/driver/am}.resource and 
spark.\{executor/driver}.resource.\{fpga/gpu}. That isn't ideal, but the only 
other option would be to have some sort of mapping the user would pass in. We 
can always add more YARN resource types if YARN adds them. The main two that 
people are interested in seem to be GPU and FPGA anyway, so I think for now this is fine.

> spark-submit requests GPUs in YARN mode
> ---
>
> Key: SPARK-27378
> URL: https://issues.apache.org/jira/browse/SPARK-27378
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Submit, YARN
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>







[jira] [Commented] (SPARK-27379) YARN passes GPU info to Spark executor

2019-05-16 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841409#comment-16841409
 ] 

Thomas Graves commented on SPARK-27379:
---

The way YARN works, it actually doesn't tell the application any info about 
what it was allocated. If you have Hadoop 3.1+ and it is set up for Docker and 
isolation, then it's up to the user to discover what the container has.

So based on that, I'm going to close this.

> YARN passes GPU info to Spark executor
> --
>
> Key: SPARK-27379
> URL: https://issues.apache.org/jira/browse/SPARK-27379
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, YARN
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>







[jira] [Resolved] (SPARK-27379) YARN passes GPU info to Spark executor

2019-05-16 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-27379.
---
Resolution: Invalid
  Assignee: Thomas Graves

> YARN passes GPU info to Spark executor
> --
>
> Key: SPARK-27379
> URL: https://issues.apache.org/jira/browse/SPARK-27379
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, YARN
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Thomas Graves
>Priority: Major
>







[jira] [Commented] (SPARK-27377) Upgrade YARN to 3.1.2+ to support GPU

2019-05-16 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841407#comment-16841407
 ] 

Thomas Graves commented on SPARK-27377:
---

There are enough pieces of the hadoop-3.2 support implemented that this is no 
longer blocking us, so I'm going to close this.

> Upgrade YARN to 3.1.2+ to support GPU
> -
>
> Key: SPARK-27377
> URL: https://issues.apache.org/jira/browse/SPARK-27377
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> This task should be covered by SPARK-23710. Just a placeholder here.






[jira] [Assigned] (SPARK-27376) Design: YARN supports Spark GPU-aware scheduling

2019-05-16 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-27376:
-

Assignee: Thomas Graves

> Design: YARN supports Spark GPU-aware scheduling
> 
>
> Key: SPARK-27376
> URL: https://issues.apache.org/jira/browse/SPARK-27376
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Thomas Graves
>Priority: Major
>







[jira] [Commented] (SPARK-27376) Design: YARN supports Spark GPU-aware scheduling

2019-05-16 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841402#comment-16841402
 ] 

Thomas Graves commented on SPARK-27376:
---

The design is pretty straightforward; there is really only one question, which is 
consistency between the YARN resource configs and the new Spark resource 
configs. See the last paragraph for more details.

Hadoop 3.1 and above is required to get official GPU support. Hadoop can be 
configured to use Docker with isolation so that the containers YARN hands you 
back have the requested GPUs and other resources. YARN does not give you 
information about what it allocated for GPUs; you have to discover it. YARN has 
hardcoded resource types for fpga and gpu; anything else is a user-defined type. 
Spark 3.0 already added support for requesting any resource from YARN via the 
configs spark.yarn.\{executor/driver/am}.resource, so the changes required for 
this Jira are simply to map the new Spark configs 
spark.\{executor/driver}.resource.\{fpga/gpu}.count into the corresponding YARN 
configs. For other resource types we can't map them, though, because we don't 
know what they are called on the YARN side. So for any other resource, users will 
have to specify both configs, spark.yarn.\{executor/driver/am}.resource and 
spark.\{executor/driver}.resource.\{fpga/gpu}. That isn't ideal, but the only 
other option would be to have some sort of mapping the user would pass in. We 
can always add more YARN resource types if YARN adds them. The main two that 
people are interested in seem to be GPU and FPGA anyway, so I think for now this is fine.

For versions below Hadoop 3.1, YARN won't allocate based on GPU, so users on 
Hadoop 2.7, 2.8, etc. could still allocate nodes with GPUs (with YARN node 
labels or other hacks), tell Spark the count, and have Spark auto-discover them; 
Spark will pick up whatever it sees in the container - or really whatever the 
discoveryScript returns - so people could potentially write that script to match 
whatever hacks they have for sharing GPU nodes today.

The flow from the user's point of view would be:

For GPU and FPGA: the user will specify 
spark.\{executor/driver}.resource.\{gpu/fpga}.count and 
spark.\{executor/driver}.resource.\{gpu/fpga}.discoveryScript. The Spark YARN 
code maps these into the corresponding YARN resource config and asks YARN for 
the containers. YARN allocates the containers, and Spark will run the discovery 
script to figure out what it has been allocated.

For other resource types the user will have to specify 
spark.yarn.\{executor/driver/am}.resource as well as 
spark.\{executor/driver}.resource.\{gpu/fpga}.count and 
spark.\{executor/driver}.resource.\{gpu/fpga}.discoveryScript.

The only other thing that is inconsistent is that the 
spark.yarn.\{executor/driver/am}.resource configs don't have a .count on the 
end. Right now that config takes a string as a value and splits it into an 
actual count and a unit. The YARN resource configs were just added in 3.0 and 
haven't been released, so we could potentially change them. We could change the 
Spark user-facing configs 
(spark.\{executor/driver}.resource.\{gpu/fpga}.count) to be similar, to make it 
easier for the user to specify both a count and a unit in one config instead of 
two, but I like the ability to separate them on the discovery side as well. We 
took the .unit support out in the executor pull request, so it isn't there right 
now anyway. We could do the opposite and change the YARN ones to have a .count 
and .unit as well, just to make things consistent, but that makes the user 
specify two configs instead of one. Or the third option would be to have .count 
and .unit and then eventually add a third config that lets the user specify them 
together if we add resources that actually use units.

My thoughts are that for the user-facing configs we change .count to .amount and 
let the user specify units on it. This makes it easier for the user and allows 
us to extend it later if we want. I think we should also change the 
spark.yarn configs to have a .amount, because YARN has already added other 
things like tags and attributes, so if we want to extend the Spark support for 
those it makes more sense to have them as another postfix option: 
spark.yarn...resource.tags=

We can leave everything else that is internal as separate counts and units, and 
since gpu/fpga don't need units we don't need to actually add a unit to our 
ResourceInformation, since we already removed it. 

 

> Design: YARN supports Spark GPU-aware scheduling
> 
>
> Key: SPARK-27376
> URL: https://issues.apache.org/jira/browse/SPARK-27376
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (SPARK-18107) Insert overwrite statement runs much slower in spark-sql than it does in hive-client

2019-05-16 Thread KaiXu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841351#comment-16841351
 ] 

KaiXu commented on SPARK-18107:
---

It seems this issue has not been fixed? I encountered it with Spark 2.4.3; the 
query I ran is from TPC-DS: 
[https://github.com/hortonworks/hive-testbench/blob/hdp3/ddl-tpcds/bin_partitioned/store_sales.sql]

> Insert overwrite statement runs much slower in spark-sql than it does in 
> hive-client
> 
>
> Key: SPARK-18107
> URL: https://issues.apache.org/jira/browse/SPARK-18107
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: spark 2.0.0
> hive 2.0.1
>Reporter: snodawn
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.1.0
>
>
> I find that an insert overwrite statement running in spark-sql or spark-shell takes 
> much more time than it does in hive-client (I start it via 
> apache-hive-2.0.1-bin/bin/hive): Spark costs about ten minutes, but 
> hive-client costs less than 20 seconds.
> These are the steps I took.
> The test SQL is:
> insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21')
> select distinct account_name,role_id,server,'1476979200' as recdate, 'mix' as 
> platform, 'mix' as pid, 'mix' as dev from tbllog_login  where pt='mix_en' and 
>  dt='2016-10-21' ;
> there are 257128 lines of data in tbllog_login with 
> partition(pt='mix_en',dt='2016-10-21')
> PS:
> I'm sure it must be the "insert overwrite" that costs a lot of time in Spark; maybe 
> when doing the overwrite it needs to spend a lot of time in IO or in something 
> else.
> I also compared the execution time of the insert overwrite statement and the 
> insert into statement.
> 1. insert overwrite statement and insert into statement in spark:
> insert overwrite statement costs about 10 minutes
> insert into statement costs about 30 seconds
> 2. insert into statement in spark and insert into statement in hive-client:
> spark costs about 30 seconds
> hive-client costs about 20 seconds
> the difference is small enough that we can ignore it
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27742) Security Support in Sources and Sinks for SS and Batch

2019-05-16 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841332#comment-16841332
 ] 

Gabor Somogyi commented on SPARK-27742:
---

Kafka delegation token support was just added to 3.0, on both the source and 
sink side. Kerberos + SSL are also supported there.
Since I'm involved in streaming, I'm happy to be part of this effort (though I'm 
not sure how much there is to be done).


> Security Support in Sources and Sinks for SS and Batch
> --
>
> Key: SPARK-27742
> URL: https://issues.apache.org/jira/browse/SPARK-27742
> Project: Spark
>  Issue Type: Brainstorming
>  Components: SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> As discussed with [~erikerlandson] on the [Big Data on K8s 
> UG|https://docs.google.com/document/d/1pnF38NF6N5eM8DlK088XUW85Vms4V2uTsGZvSp8MNIA]
>  it would be good to capture current status and identify work that needs to 
> be done for securing Spark when accessing sources and sinks. For example what 
> is the status of SSL, Kerberos support in different scenarios. The big 
> concern nowadays is how to secure data pipelines end-to-end. 
> Note: Not sure if this overlaps with some other ticket. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27720) ConcurrentModificationException on operating with DirectKafkaInputDStream

2019-05-16 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841325#comment-16841325
 ] 

Gabor Somogyi commented on SPARK-27720:
---

[~ov7a] Thanks for your efforts; I've had a look at the provided example + 
stacktrace.
I'm not sure why you've called start on the stream itself (one should call start 
on the StreamingContext only). Please have a look at the official DStream + Kafka 
example 
[here|https://github.com/apache/spark/blob/c6a45e6f67abc99d1953d915b96e65a3e2148cf1/examples/src/main/scala/org/apache/spark/examples/streaming/DirectKafkaWordCount.scala#L79].
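
For reference, a minimal sketch of the intended shape (broker address, topic, and 
batch interval are placeholders; the point is that only the StreamingContext is 
started and stopped):

{code:scala}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val sparkConf = new SparkConf().setAppName("DirectKafkaExample")
val ssc = new StreamingContext(sparkConf, Seconds(10))   // placeholder batch interval

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",               // placeholder broker
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group")

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("topicA"), kafkaParams))

stream.foreachRDD(rdd => println(s"records in batch: ${rdd.count()}"))

ssc.start()              // start the StreamingContext, not the DStream
ssc.awaitTermination()   // shut down via ssc.stop(...) rather than stopping the stream
{code}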

> ConcurrentModificationException on operating with DirectKafkaInputDStream
> -
>
> Key: SPARK-27720
> URL: https://issues.apache.org/jira/browse/SPARK-27720
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.4.1, 2.4.3
>Reporter: ov7a
>Priority: Minor
>
> If a DirectKafkaInputDStream is started in one thread and is being stopped in 
> another thread (e.g. by shutdown hook) a 
> java.util.ConcurrentModificationException (KafkaConsumer is not safe for 
> multi-threaded access) is thrown.
> This happens even if "spark.streaming.kafka.consumer.cache.enabled" is set to 
> "false".
> MWE: https://gist.github.com/ov7a/fc783315ea252a03d51804ce326a13b1
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27334) Support specify scheduler name for executor pods when submit

2019-05-16 Thread Alexander Fedosov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841322#comment-16841322
 ] 

Alexander Fedosov commented on SPARK-27334:
---

Hello [~TommyLike]!
It looks like this ticket relates to this 
[one|https://issues.apache.org/jira/browse/SPARK-24434], where it was decided 
to use the Pod Template approach.
Could you please close this ticket, then?

> Support specify scheduler name for executor pods when submit
> 
>
> Key: SPARK-27334
> URL: https://issues.apache.org/jira/browse/SPARK-27334
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: TommyLike
>Priority: Major
>  Labels: easyfix, features
>
> Currently, there are some external schedulers which bring a lot of great value 
> to Kubernetes scheduling, especially for HPC cases; take a look at 
> *kube-batch* ([https://github.com/kubernetes-sigs/kube-batch]). In order to 
> support it, we had to use a Pod Template, which seems cumbersome. It would be 
> much more convenient if this could be configured via an option such as 
> *"spark.kubernetes.executor.schedulerName"*, just like the others.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13

2019-05-16 Thread Lukas Rytz (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841287#comment-16841287
 ] 

Lukas Rytz commented on SPARK-25075:


In our own interest of testing the Scala 2.13 RCs, I took a stab at compiling 
Spark core on 2.13.

My goal was *not* to get something running, just compiling and seeing what kind 
of breaking changes there are. So I chose a slightly adventurous methodology: I 
updated the {{scalaVersion}} to 2.13.0-pre-06392a5-SNAPSHOT (a very recent 
local build), but forced the {{scalaBinaryVersion}} to 2.12, so that the 2.12 
dependencies end up on the classpath. That way I didn't have to worry about 
missing / incompatible dependencies.

The first step was to avoid using {{scala.Seq}}, I used scalafix to rewrite all 
references of the type {{scala.Seq}} to {{scala.collection.Seq}}. As discussed 
on https://issues.apache.org/jira/browse/SPARK-27681, this is not necessarily 
the best solution, but the easiest.

Here's a list of other breaking changes:
 * {{foo(someMutableOrGenericCollection: _*)}} no longer works, because varargs 
de-sugars to {{scala.Seq}}, so an immutable collection is now required. Calling 
{{.toSeq}} works, but is inefficient; better to build an immutable collection from 
the beginning. For arrays, {{immutable.ArraySeq.unsafeWrapArray}} can be used 
(see the sketch after this list). Maybe the standard library should provide an 
unsafe {{immutable.SeqWrapper}} that wraps a {{collection.Seq}} for the cases 
when the users are certain it's safe.
 * Views are quite different in 2.13, for example {{seq.view}} is no longer a 
{{Seq}}, views are a separate hierarchy. This needs some adjustments, not too 
difficult.
 * Parallel collections are a separate module now 
([https://github.com/scala/scala-parallel-collections]), no longer in the 
standard library. However, you might want to use StreamConverters instead to do 
parallel processing via a Java stream 
([https://github.com/scala/scala/blob/2.13.x/src/library/scala/jdk/StreamConverters.scala]).
 * Subclasses of collections need some adjustments: {{BoundedPriorityQueue}}, 
{{TimeStampedHashMap}}. For example, {{++=}} cannot be overridden anymore, as 
it's a final alias for {{addAll}} now.
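
To make the varargs item concrete, a small sketch of the two options on 2.13 
({{logAll}} is a hypothetical helper, not something from the Spark codebase):

{code:scala}
// On 2.13, varargs expect an immutable Seq, so mutable/generic collections
// need wrapping or copying before being passed with ": _*".
def logAll(msgs: String*): Unit = msgs.foreach(println)

val arr: Array[String] = Array("a", "b")
// Wraps the array without copying (unsafe only if the array is mutated later).
logAll(scala.collection.immutable.ArraySeq.unsafeWrapArray(arr): _*)

val buf = scala.collection.mutable.ArrayBuffer("c", "d")
// .toSeq builds an immutable copy - works everywhere, but pays for the copy.
logAll(buf.toSeq: _*)
{code}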

The other changes are relatively minor. The branch is here - for reference, 
too hacky to be actually useful: 
[https://github.com/lrytz/spark/commits/2.13-experiment]. {{sbt core/compile}} 
passes (except for scalastyle).

Overall, this is more or less what we expected in terms of breaking changes. We 
definitely want to use the time between now and 2.13.0 final to improve the 
migration documentation 
([https://docs.scala-lang.org/overviews/core/collections-migration-213.html]) 
and the scalafix rules ([https://github.com/scala/scala-rewrites] is in its early 
days).

> Build and test Spark against Scala 2.13
> ---
>
> Key: SPARK-25075
> URL: https://issues.apache.org/jira/browse/SPARK-25075
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 3.0.0
>Reporter: Guillaume Massé
>Priority: Major
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.13 milestone.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27744) SubqueryExec thread pool does not preserve thread local properties

2019-05-16 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841284#comment-16841284
 ] 

Apache Spark commented on SPARK-27744:
--

User 'onursatici' has created a pull request for this issue:
https://github.com/apache/spark/pull/24625

> SubqueryExec thread pool does not preserve thread local properties
> --
>
> Key: SPARK-27744
> URL: https://issues.apache.org/jira/browse/SPARK-27744
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Onur Satici
>Priority: Major
>
> SubqueryExec uses a cached thread pool of size 16. After this thread pool 
> reaches its pool size, it will start reusing threads, and submitted tasks 
> would be run on a thread with out of date spark local properties.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27744) SubqueryExec thread pool does not preserve thread local properties

2019-05-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27744:


Assignee: Apache Spark

> SubqueryExec thread pool does not preserve thread local properties
> --
>
> Key: SPARK-27744
> URL: https://issues.apache.org/jira/browse/SPARK-27744
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Onur Satici
>Assignee: Apache Spark
>Priority: Major
>
> SubqueryExec uses a cached thread pool of size 16. After this thread pool 
> reaches its pool size, it will start reusing threads, and submitted tasks 
> would be run on a thread with out of date spark local properties.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27744) SubqueryExec thread pool does not preserve thread local properties

2019-05-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27744:


Assignee: (was: Apache Spark)

> SubqueryExec thread pool does not preserve thread local properties
> --
>
> Key: SPARK-27744
> URL: https://issues.apache.org/jira/browse/SPARK-27744
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Onur Satici
>Priority: Major
>
> SubqueryExec uses a cached thread pool of size 16. After this thread pool 
> reaches its pool size, it will start reusing threads, and submitted tasks 
> would be run on a thread with out of date spark local properties.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27743) alter table: bucketing

2019-05-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27743:


Assignee: (was: Apache Spark)

> alter table: bucketing
> --
>
> Key: SPARK-27743
> URL: https://issues.apache.org/jira/browse/SPARK-27743
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: xzh_dz
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27743) alter table: bucketing

2019-05-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27743:


Assignee: Apache Spark

> alter table: bucketing
> --
>
> Key: SPARK-27743
> URL: https://issues.apache.org/jira/browse/SPARK-27743
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: xzh_dz
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27744) SubqueryExec thread pool does not preserve thread local properties

2019-05-16 Thread Onur Satici (JIRA)
Onur Satici created SPARK-27744:
---

 Summary: SubqueryExec thread pool does not preserve thread local 
properties
 Key: SPARK-27744
 URL: https://issues.apache.org/jira/browse/SPARK-27744
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Onur Satici


SubqueryExec uses a cached thread pool of size 16. After this thread pool 
reaches its pool size, it will start reusing threads, and submitted tasks may 
be run on a thread with out-of-date Spark local properties.
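
For context, a rough sketch of the failure mode and a caller-side workaround 
(the property key and the helper are illustrative only, not the actual 
SubqueryExec code):

{code:scala}
import java.util.concurrent.{Callable, ExecutorService}
import org.apache.spark.SparkContext

// Local properties are per-thread; a reused pool thread keeps whatever it
// inherited when it was created, not what the submitting thread has set since.
// One workaround is to capture the relevant properties on the caller and
// re-apply them on the pool thread before running the task.
def submitWithCallerProps[T](sc: SparkContext, pool: ExecutorService)(body: => T) = {
  val jobGroup = sc.getLocalProperty("spark.jobGroup.id")   // captured on the caller
  pool.submit(new Callable[T] {
    override def call(): T = {
      sc.setLocalProperty("spark.jobGroup.id", jobGroup)    // re-applied on the pool thread
      body
    }
  })
}
{code}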



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27743) alter table: bucketing

2019-05-16 Thread xzh_dz (JIRA)
xzh_dz created SPARK-27743:
--

 Summary: alter table: bucketing
 Key: SPARK-27743
 URL: https://issues.apache.org/jira/browse/SPARK-27743
 Project: Spark
  Issue Type: Wish
  Components: SQL
Affects Versions: 2.4.3
Reporter: xzh_dz






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27722) Remove UnsafeKeyValueSorter

2019-05-16 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27722.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24622
[https://github.com/apache/spark/pull/24622]

> Remove UnsafeKeyValueSorter
> ---
>
> Key: SPARK-27722
> URL: https://issues.apache.org/jira/browse/SPARK-27722
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Shivu Sondur
>Priority: Minor
> Fix For: 3.0.0
>
>
> We just moved the location of classes including {{UnsafeKeyValueSorter}}. 
> After further investigation, I don't find where {{UnsafeKeyValueSorter}} is 
> used.
> If it is not used at all, shall we just remove it from the codebase? 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27722) Remove UnsafeKeyValueSorter

2019-05-16 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-27722:
---

Assignee: Shivu Sondur

> Remove UnsafeKeyValueSorter
> ---
>
> Key: SPARK-27722
> URL: https://issues.apache.org/jira/browse/SPARK-27722
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Shivu Sondur
>Priority: Minor
>
> We just moved the location of classes including {{UnsafeKeyValueSorter}}. 
> After further investigation, I don't find where {{UnsafeKeyValueSorter}} is 
> used.
> If it is not used at all, shall we just remove it from the codebase? 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27742) Security Support in Sources and Sinks for SS and Batch

2019-05-16 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-27742:

Summary: Security Support in Sources and Sinks for SS and Batch  (was: 
Security Support in Sources and Sinks for SS and batch)

> Security Support in Sources and Sinks for SS and Batch
> --
>
> Key: SPARK-27742
> URL: https://issues.apache.org/jira/browse/SPARK-27742
> Project: Spark
>  Issue Type: Brainstorming
>  Components: SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> As discussed with [~erikerlandson] on the [Big Data on K8s 
> UG|https://docs.google.com/document/d/1pnF38NF6N5eM8DlK088XUW85Vms4V2uTsGZvSp8MNIA]
>  it would be good to capture current status and identify work that needs to 
> be done for securing Spark when accessing sources and sinks. For example what 
> is the status of SSL, Kerberos support in different scenarios. The big 
> concern nowadays is how to secure data pipelines end-to-end. 
> Note: Not sure if this overlaps with some other ticket. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27742) Security Support in Sources and Sinks for SS and batch

2019-05-16 Thread Stavros Kontopoulos (JIRA)
Stavros Kontopoulos created SPARK-27742:
---

 Summary: Security Support in Sources and Sinks for SS and batch
 Key: SPARK-27742
 URL: https://issues.apache.org/jira/browse/SPARK-27742
 Project: Spark
  Issue Type: Brainstorming
  Components: SQL, Structured Streaming
Affects Versions: 3.0.0
Reporter: Stavros Kontopoulos


As discussed with [~erikerlandson] on the [Big Data on K8s 
UG|https://docs.google.com/document/d/1pnF38NF6N5eM8DlK088XUW85Vms4V2uTsGZvSp8MNIA]
 it would be good to capture current status and identify work that needs to be 
done for securing Spark when accessing sources and sinks. For example what is 
the status of SSL, Kerberos support in different scenarios. The big concern 
nowadays is how to secure data pipelines end-to-end. 

Note: Not sure if this overlaps with some other ticket. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27733) Upgrade to Avro 1.9.x

2019-05-16 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841193#comment-16841193
 ] 

Nandor Kollar commented on SPARK-27733:
---

[~hyukjin.kwon] I tried to run the Spark tests after the Avro upgrade and saw 
several failures in the spark-hive module, because Hive uses deprecated and 
removed Avro methods.

> Upgrade to Avro 1.9.x
> -
>
> Key: SPARK-27733
> URL: https://issues.apache.org/jira/browse/SPARK-27733
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 3.0.0
>Reporter: Ismaël Mejía
>Priority: Minor
>
> Avro 1.9.0 was released with many nice features including reduced size (1MB 
> less), removed dependencies, no paranamer, no shaded guava, and security 
> updates, so it is probably worth the upgrade.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27733) Upgrade to Avro 1.9.x

2019-05-16 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841189#comment-16841189
 ] 

Hyukjin Kwon commented on SPARK-27733:
--

why is it dependent on Hive's?

> Upgrade to Avro 1.9.x
> -
>
> Key: SPARK-27733
> URL: https://issues.apache.org/jira/browse/SPARK-27733
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Affects Versions: 3.0.0
>Reporter: Ismaël Mejía
>Priority: Minor
>
> Avro 1.9.0 was released with many nice features including reduced size (1MB 
> less), removed dependencies, no paranamer, no shaded guava, and security 
> updates, so it is probably worth the upgrade.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27733) Upgrade to Avro 1.9.x

2019-05-16 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27733:
-
Component/s: (was: Spark Core)
 SQL

> Upgrade to Avro 1.9.x
> -
>
> Key: SPARK-27733
> URL: https://issues.apache.org/jira/browse/SPARK-27733
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 3.0.0
>Reporter: Ismaël Mejía
>Priority: Minor
>
> Avro 1.9.0 was released with many nice features including reduced size (1MB 
> less), removed dependencies, no paranamer, no shaded guava, and security 
> updates, so it is probably worth the upgrade.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27741) Transitivity on predicate pushdown

2019-05-16 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27741.
--
Resolution: Duplicate

> Transitivity on predicate pushdown 
> ---
>
> Key: SPARK-27741
> URL: https://issues.apache.org/jira/browse/SPARK-27741
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: U Shaw
>Priority: Major
>
> When using an inner join, WHERE conditions can be propagated into the join 
> condition, but when using an outer join, even if the conditions are the same, 
> the predicate is only pushed down to the left or right side.
> As follows:
> select * from t1 left join t2 on t1.id=t2.id where t1.id=1
> --> select * from t1 left join t2 on t1.id=t2.id and t2.id=1 where t1.id=1
> Can Catalyst support transitivity on predicate pushdown?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27741) Transitivity on predicate pushdown

2019-05-16 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841184#comment-16841184
 ] 

Hyukjin Kwon commented on SPARK-27741:
--

[~xyxiaoyou], can you check if the same issue exists in a higher version? Let me 
leave this resolved for now.

> Transitivity on predicate pushdown 
> ---
>
> Key: SPARK-27741
> URL: https://issues.apache.org/jira/browse/SPARK-27741
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: U Shaw
>Priority: Major
>
> When using an inner join, WHERE conditions can be propagated into the join 
> condition, but when using an outer join, even if the conditions are the same, 
> the predicate is only pushed down to the left or right side.
> As follows:
> select * from t1 left join t2 on t1.id=t2.id where t1.id=1
> --> select * from t1 left join t2 on t1.id=t2.id and t2.id=1 where t1.id=1
> Can Catalyst support transitivity on predicate pushdown?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27733) Upgrade to Avro 1.9.x

2019-05-16 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SPARK-27733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismaël Mejía updated SPARK-27733:
-
Description: Avro 1.9.0 was released with many nice features including 
reduced size (1MB less), removed dependencies, no paranamer, no shaded guava, 
and security updates, so it is probably worth the upgrade.  (was: Avro 1.9.0 was 
released with many nice features including reduced size 1MB less, and removed 
dependencies, no paranmer, no shaded avro, security updates, so probably a 
worth upgrade.)

> Upgrade to Avro 1.9.x
> -
>
> Key: SPARK-27733
> URL: https://issues.apache.org/jira/browse/SPARK-27733
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Affects Versions: 3.0.0
>Reporter: Ismaël Mejía
>Priority: Minor
>
> Avro 1.9.0 was released with many nice features including reduced size (1MB 
> less), removed dependencies, no paranamer, no shaded guava, and security 
> updates, so it is probably worth the upgrade.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27741) Transitivity on predicate pushdown

2019-05-16 Thread U Shaw (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

U Shaw updated SPARK-27741:
---
Affects Version/s: (was: 2.4.3)
   2.1.1

> Transitivity on predicate pushdown 
> ---
>
> Key: SPARK-27741
> URL: https://issues.apache.org/jira/browse/SPARK-27741
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: U Shaw
>Priority: Major
>
> When using an inner join, WHERE conditions can be propagated into the join 
> condition, but when using an outer join, even if the conditions are the same, 
> the predicate is only pushed down to the left or right side.
> As follows:
> select * from t1 left join t2 on t1.id=t2.id where t1.id=1
> --> select * from t1 left join t2 on t1.id=t2.id and t2.id=1 where t1.id=1
> Can Catalyst support transitivity on predicate pushdown?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24374) SPIP: Support Barrier Execution Mode in Apache Spark

2019-05-16 Thread Ruiguang Pei (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16840122#comment-16840122
 ] 

Ruiguang Pei edited comment on SPARK-24374 at 5/16/19 9:17 AM:
---

Hi, [~mengxr],[~jiangxb1987]

when I'm using Barrier Execution Mode, it seems that I can't partition my data 
into more partitions than the total number of cores; otherwise it throws the 
exception ["Barrier execution mode does not allow run a barrier stage that 
requires more slots than the total number of slots in the cluster currently."].

Suppose I have an extremely large RDD but only 4 cores are available, which 
means that each partition is still too large. Will this cause potential 
performance problems? Do you have any plans to support requesting more slots 
than are available?
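
For reference, a minimal sketch of the pattern that hits this limit (placeholder 
names; capping the partition count at the available slots is the only way I see 
to pass the check today):

{code:scala}
import org.apache.spark.BarrierTaskContext

// Placeholder names: largeRdd stands for the real input RDD.
val slots = sc.defaultParallelism            // e.g. 4 on a 4-core cluster
val capped = largeRdd.repartition(slots)     // more partitions than slots would fail

capped.barrier().mapPartitions { iter =>
  val ctx = BarrierTaskContext.get()
  ctx.barrier()                              // all tasks reach this point together
  iter                                       // each task now processes a (large) partition
}.count()
{code}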


was (Author: ruiguang pei):
Hi, [~mengxr]

when I'm using Barrier Execution Mode, it seems that I can't  partition my data 
more than the number of total cores, otherwise it will throw the exception 
["Barrier execution mode does not allow run a barrier stage that requires more 
slots than the total number of slots in the cluster currently."].

Suppose that I have a extremely large RDD, but only 4 cores are available, 
which means that each partition is still too large. will it takes potential 
performance problems? Do you have some plans to support the scenario that more 
slots can be request than available?

> SPIP: Support Barrier Execution Mode in Apache Spark
> 
>
> Key: SPARK-24374
> URL: https://issues.apache.org/jira/browse/SPARK-24374
> Project: Spark
>  Issue Type: Epic
>  Components: ML, Spark Core
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: Hydrogen, SPIP
> Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf
>
>
> (See details in the linked/attached SPIP doc.)
> {quote}
> The proposal here is to add a new scheduling model to Apache Spark so users 
> can properly embed distributed DL training as a Spark stage to simplify the 
> distributed training workflow. For example, Horovod uses MPI to implement 
> all-reduce to accelerate distributed TensorFlow training. The computation 
> model is different from MapReduce used by Spark. In Spark, a task in a stage 
> doesn’t depend on any other tasks in the same stage, and hence it can be 
> scheduled independently. In MPI, all workers start at the same time and pass 
> messages around. To embed this workload in Spark, we need to introduce a new 
> scheduling model, tentatively named “barrier scheduling”, which launches 
> tasks at the same time and provides users enough information and tooling to 
> embed distributed DL training. Spark can also provide an extra layer of fault 
> tolerance in case some tasks failed in the middle, where Spark would abort 
> all tasks and restart the stage.
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27741) Transitivity on predicate pushdown

2019-05-16 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841087#comment-16841087
 ] 

Yuming Wang commented on SPARK-27741:
-

This should be supported since SPARK-21479.
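
A quick way to check on a newer version (table and column names are the ones 
from the description; the expected plan shape is indicative only):

{code:scala}
// With constraint propagation (SPARK-21479), the optimized plan should show an
// inferred Filter on the t2 side as well, not only on t1.
spark.sql(
  "SELECT * FROM t1 LEFT JOIN t2 ON t1.id = t2.id WHERE t1.id = 1"
).explain(true)
{code}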

> Transitivity on predicate pushdown 
> ---
>
> Key: SPARK-27741
> URL: https://issues.apache.org/jira/browse/SPARK-27741
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: U Shaw
>Priority: Major
>
> When using an inner join, WHERE conditions can be propagated into the join 
> condition, but when using an outer join, even if the conditions are the same, 
> the predicate is only pushed down to the left or right side.
> As follows:
> select * from t1 left join t2 on t1.id=t2.id where t1.id=1
> --> select * from t1 left join t2 on t1.id=t2.id and t2.id=1 where t1.id=1
> Can Catalyst support transitivity on predicate pushdown?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27741) Transitivity on predicate pushdown

2019-05-16 Thread U Shaw (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

U Shaw updated SPARK-27741:
---
Description: 
When using an inner join, WHERE conditions can be propagated into the join 
condition, but when using an outer join, even if the conditions are the same, 
the predicate is only pushed down to the left or right side.
As follows:

select * from t1 left join t2 on t1.id=t2.id where t1.id=1
--> select * from t1 left join t2 on t1.id=t2.id and t2.id=1 where t1.id=1

Can Catalyst support transitivity on predicate pushdown?

> Transitivity on predicate pushdown 
> ---
>
> Key: SPARK-27741
> URL: https://issues.apache.org/jira/browse/SPARK-27741
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: U Shaw
>Priority: Major
>
> When using an inner join, WHERE conditions can be propagated into the join 
> condition, but when using an outer join, even if the conditions are the same, 
> the predicate is only pushed down to the left or right side.
> As follows:
> select * from t1 left join t2 on t1.id=t2.id where t1.id=1
> --> select * from t1 left join t2 on t1.id=t2.id and t2.id=1 where t1.id=1
> Can Catalyst support transitivity on predicate pushdown?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27741) Transitivity on predicate pushdown

2019-05-16 Thread U Shaw (JIRA)
U Shaw created SPARK-27741:
--

 Summary: Transitivity on predicate pushdown 
 Key: SPARK-27741
 URL: https://issues.apache.org/jira/browse/SPARK-27741
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.4.3
Reporter: U Shaw






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org