[jira] [Created] (SPARK-27755) Update zstd-jni to 1.4.0-1
Dongjoon Hyun created SPARK-27755: - Summary: Update zstd-jni to 1.4.0-1 Key: SPARK-27755 URL: https://issues.apache.org/jira/browse/SPARK-27755 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27634) deleteCheckpointOnStop should be configurable
[ https://issues.apache.org/jira/browse/SPARK-27634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27634. --- Resolution: Duplicate > deleteCheckpointOnStop should be configurable > - > > Key: SPARK-27634 > URL: https://issues.apache.org/jira/browse/SPARK-27634 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.2 >Reporter: Yu Wang >Priority: Minor > Attachments: SPARK-27634.patch > > > We need to delete the checkpoint files after running the streaming > application multiple times, so deleteCheckpointOnStop should be configurable. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27752) Updata lz4-java from 1.5.1 to 1.6.0
[ https://issues.apache.org/jira/browse/SPARK-27752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27752. --- Resolution: Fixed Assignee: Kazuaki Ishizaki Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/24629 > Updata lz4-java from 1.5.1 to 1.6.0 > --- > > Key: SPARK-27752 > URL: https://issues.apache.org/jira/browse/SPARK-27752 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Major > Fix For: 3.0.0 > > > Update lz4-java that is available from https://github.com/lz4/lz4-java. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27718) incorrect result from pagerank
[ https://issues.apache.org/jira/browse/SPARK-27718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-27718. --- Resolution: Not A Problem > incorrect result from pagerank > -- > > Key: SPARK-27718 > URL: https://issues.apache.org/jira/browse/SPARK-27718 > Project: Spark > Issue Type: Bug > Components: Examples >Affects Versions: 2.4.1 >Reporter: De-En Lin >Priority: Minor > Attachments: 螢幕快照 2019-05-16 上午10.09.45.png > > > When I executed /examples/src/main/python/pagerank.py, > the result was as follows: > > {code:java} > 1 has rank: 0.5821576292853757. > 2 has rank: 0.3361551945789305. > 3 has rank: 0.3361551945789305. > 4 has rank: 0.3361551945789305. > {code} > > However, running the same graph through networkx's pagerank produces the > following result: > {code:java} > {1: 0.4797305739863632, 2: 0.1734231420045456, 3: 0.1734231420045456, 4: > 0.1734231420045456} > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
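[Editor's note] The "Not A Problem" resolution above comes down to a difference in conventions: the Spark example iterates rank = 0.15 + 0.85 · contribs with no normalization across nodes, while networkx keeps the ranks summing to 1 (and handles dangling nodes). A minimal sketch on a toy two-node graph (not the graph from the report) illustrates this; the function below is a hypothetical reimplementation of the example's update rule, not Spark code:

```python
def spark_style_ranks(links, iters=10):
    """Spark-example-style PageRank: rank = 0.15 + 0.85 * contribs, unnormalized.

    `links` maps each node to the list of nodes it links to.
    """
    ranks = {node: 1.0 for node in links}
    for _ in range(iters):
        # Each node distributes its rank evenly over its outgoing links.
        contribs = {node: 0.0 for node in links}
        for src, dsts in links.items():
            for dst in dsts:
                contribs[dst] += ranks[src] / len(dsts)
        ranks = {node: 0.15 + 0.85 * contribs[node] for node in links}
    return ranks


# Two-node cycle: 1 -> 2 -> 1. The unnormalized ranks converge to 1.0 each
# (summing to 2), whereas a normalized pagerank would give 0.5 each.
ranks = spark_style_ranks({1: [2], 2: [1]})
```

Dividing the unnormalized ranks by their sum recovers the symmetric 0.5/0.5 distribution on this toy graph, which is why neither tool is "wrong"; they just report ranks on different scales.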
[jira] [Resolved] (SPARK-27751) buildReader is now protected
[ https://issues.apache.org/jira/browse/SPARK-27751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-27751. -- Resolution: Invalid > buildReader is now protected > > > Key: SPARK-27751 > URL: https://issues.apache.org/jira/browse/SPARK-27751 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.4.3 >Reporter: Geet Kumar >Priority: Major > > I have recently upgraded to spark 2.4.0 and was relying on the `buildReader` > method. It originally was public and now it is protected. > What was the reason for this change? > The only workaround I can see is to use `buildReaderWithPartitionValues` > which remains public. Any plans to revert `buildReader` to be public again? > The change was made here: [https://github.com/apache/spark/pull/17253/files] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27751) buildReader is now protected
[ https://issues.apache.org/jira/browse/SPARK-27751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841821#comment-16841821 ] Hyukjin Kwon commented on SPARK-27751: -- All the classes in the `execution` package are considered private, as documented in {{package.scala}}. We don't maintain compatibility there. > buildReader is now protected > > > Key: SPARK-27751 > URL: https://issues.apache.org/jira/browse/SPARK-27751 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.4.3 >Reporter: Geet Kumar >Priority: Major > > I have recently upgraded to spark 2.4.0 and was relying on the `buildReader` > method. It originally was public and now it is protected. > What was the reason for this change? > The only workaround I can see is to use `buildReaderWithPartitionValues` > which remains public. Any plans to revert `buildReader` to be public again? > The change was made here: [https://github.com/apache/spark/pull/17253/files] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27733) Upgrade to Avro 1.9.x
[ https://issues.apache.org/jira/browse/SPARK-27733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841811#comment-16841811 ] Hyukjin Kwon commented on SPARK-27733: -- Then it should be blocked by a ticket that targets the Hive upgrade. > Upgrade to Avro 1.9.x > - > > Key: SPARK-27733 > URL: https://issues.apache.org/jira/browse/SPARK-27733 > Project: Spark > Issue Type: Improvement > Components: Build, SQL >Affects Versions: 3.0.0 >Reporter: Ismaël Mejía >Priority: Minor > > Avro 1.9.0 was released with many nice features including reduced size (1MB > less) and removed dependencies (no paranamer, no shaded guava), plus security > updates, so it is probably a worthwhile upgrade. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27733) Upgrade to Avro 1.9.x
[ https://issues.apache.org/jira/browse/SPARK-27733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841811#comment-16841811 ] Hyukjin Kwon edited comment on SPARK-27733 at 5/16/19 11:34 PM: Then it should be blocked by a ticket that targets the Hive upgrade within Spark. was (Author: hyukjin.kwon): Then it should be blocked by a ticktat that targets hive upgrade. > Upgrade to Avro 1.9.x > - > > Key: SPARK-27733 > URL: https://issues.apache.org/jira/browse/SPARK-27733 > Project: Spark > Issue Type: Improvement > Components: Build, SQL >Affects Versions: 3.0.0 >Reporter: Ismaël Mejía >Priority: Minor > > Avro 1.9.0 was released with many nice features including reduced size (1MB > less) and removed dependencies (no paranamer, no shaded guava), plus security > updates, so it is probably a worthwhile upgrade. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27576) table capabilty to skip the output column resolution
[ https://issues.apache.org/jira/browse/SPARK-27576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27576. --- Resolution: Fixed Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/24469 > table capabilty to skip the output column resolution > > > Key: SPARK-27576 > URL: https://issues.apache.org/jira/browse/SPARK-27576 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27735) Interval string in upper case is not supported in Trigger
[ https://issues.apache.org/jira/browse/SPARK-27735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27735. --- Resolution: Fixed Fix Version/s: 3.0.0 2.4.4 2.3.4 This is resolved via https://github.com/apache/spark/pull/24619 > Interval string in upper case is not supported in Trigger > - > > Key: SPARK-27735 > URL: https://issues.apache.org/jira/browse/SPARK-27735 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.3 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Major > Fix For: 2.3.4, 2.4.4, 3.0.0 > > > Some APIs in Structured Streaming requires the user to specify an interval. > Right now these APIs don't accept upper-case strings. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
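[Editor's note] The general direction of the fix above (this is a hypothetical sketch, not Spark's actual interval parser) is to normalize the interval string's case before matching units, so that "10 SECONDS" behaves exactly like "10 seconds":

```python
import re

# Milliseconds per unit; a small illustrative subset of interval units.
_UNIT_MS = {"millisecond": 1, "second": 1000, "minute": 60_000, "hour": 3_600_000}


def interval_to_millis(interval):
    """Parse a simple '<n> <unit>' interval string case-insensitively."""
    # Lower-casing first is the key step: upper-case unit names then match.
    m = re.fullmatch(r"(\d+)\s+(millisecond|second|minute|hour)s?",
                     interval.strip().lower())
    if m is None:
        raise ValueError(f"Invalid interval: {interval!r}")
    return int(m.group(1)) * _UNIT_MS[m.group(2)]
```

With this normalization, `interval_to_millis("10 SECONDS")` and `interval_to_millis("10 seconds")` return the same value.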
[jira] [Assigned] (SPARK-27754) Introduce spark on k8s config for driver request cores
[ https://issues.apache.org/jira/browse/SPARK-27754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27754: Assignee: (was: Apache Spark) > Introduce spark on k8s config for driver request cores > -- > > Key: SPARK-27754 > URL: https://issues.apache.org/jira/browse/SPARK-27754 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Arun Mahadevan >Priority: Minor > > Spark on k8s supports config for specifying the executor cpu requests > (spark.kubernetes.executor.request.cores) but a similar config is missing > for the driver. Apparently `spark.driver.cores` works but its not evident > that this accepts > fractional values (its defined as an Integer config but apparently accepts > decimals). To keep in sync > with the executor config a similar driver config can be > introduced (spark.kubernetes.driver.request.cores) for explicitly specifying > the driver CPU requests. If not provided, the value will default to > `spark.driver.cores` as before. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27754) Introduce spark on k8s config for driver request cores
[ https://issues.apache.org/jira/browse/SPARK-27754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27754: Assignee: Apache Spark > Introduce spark on k8s config for driver request cores > -- > > Key: SPARK-27754 > URL: https://issues.apache.org/jira/browse/SPARK-27754 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Arun Mahadevan >Assignee: Apache Spark >Priority: Minor > > Spark on k8s supports config for specifying the executor cpu requests > (spark.kubernetes.executor.request.cores) but a similar config is missing > for the driver. Apparently `spark.driver.cores` works but its not evident > that this accepts > fractional values (its defined as an Integer config but apparently accepts > decimals). To keep in sync > with the executor config a similar driver config can be > introduced (spark.kubernetes.driver.request.cores) for explicitly specifying > the driver CPU requests. If not provided, the value will default to > `spark.driver.cores` as before. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27754) Introduce spark on k8s config for driver request cores
Arun Mahadevan created SPARK-27754: -- Summary: Introduce spark on k8s config for driver request cores Key: SPARK-27754 URL: https://issues.apache.org/jira/browse/SPARK-27754 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.0.0 Reporter: Arun Mahadevan Spark on k8s supports a config for specifying the executor CPU requests (spark.kubernetes.executor.request.cores), but a similar config is missing for the driver. Apparently `spark.driver.cores` works, but it's not evident that it accepts fractional values (it's defined as an integer config but apparently accepts decimals). To stay in sync with the executor config, a similar driver config can be introduced (spark.kubernetes.driver.request.cores) for explicitly specifying the driver CPU requests. If not provided, the value will default to `spark.driver.cores` as before. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
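[Editor's note] The lookup order proposed in the ticket can be sketched as a simple fallback chain (hypothetical helper, not Spark code; `spark.kubernetes.driver.request.cores` is the config name proposed above):

```python
def driver_request_cores(conf):
    """Resolve the driver's k8s CPU request from a dict of Spark configs.

    Prefer the proposed spark.kubernetes.driver.request.cores; fall back to
    spark.driver.cores; default to "1" if neither is set.
    """
    return (conf.get("spark.kubernetes.driver.request.cores")
            or conf.get("spark.driver.cores")
            or "1")
```

The value is kept as a string because k8s CPU requests may be fractional (e.g. "0.5" or "500m"), which an integer-typed config cannot express cleanly.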
[jira] [Assigned] (SPARK-27752) Updata lz4-java from 1.5.1 to 1.6.0
[ https://issues.apache.org/jira/browse/SPARK-27752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27752: Assignee: Apache Spark > Updata lz4-java from 1.5.1 to 1.6.0 > --- > > Key: SPARK-27752 > URL: https://issues.apache.org/jira/browse/SPARK-27752 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Kazuaki Ishizaki >Assignee: Apache Spark >Priority: Major > > Update lz4-java that is available from https://github.com/lz4/lz4-java. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27752) Updata lz4-java from 1.5.1 to 1.6.0
[ https://issues.apache.org/jira/browse/SPARK-27752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27752: Assignee: (was: Apache Spark) > Updata lz4-java from 1.5.1 to 1.6.0 > --- > > Key: SPARK-27752 > URL: https://issues.apache.org/jira/browse/SPARK-27752 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Kazuaki Ishizaki >Priority: Major > > Update lz4-java that is available from https://github.com/lz4/lz4-java. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27736) Improve handling of FetchFailures caused by ExternalShuffleService losing track of executor registrations
[ https://issues.apache.org/jira/browse/SPARK-27736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841614#comment-16841614 ] Thomas Graves commented on SPARK-27736: --- to clarify my last suggestion, I mean each executor reports back to the driver about the fetch failure and the driver could see that multiple fetch failures happened to that same host for different executors output and then choose to invalidate all the output on that host if X number have already failed to fetch. There are other things the driver could use the information on. > Improve handling of FetchFailures caused by ExternalShuffleService losing > track of executor registrations > - > > Key: SPARK-27736 > URL: https://issues.apache.org/jira/browse/SPARK-27736 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.4.0 >Reporter: Josh Rosen >Priority: Minor > > This ticket describes a fault-tolerance edge-case which can cause Spark jobs > to fail if a single external shuffle service process reboots and fails to > recover the list of registered executors (something which can happen when > using YARN if NodeManager recovery is disabled) _and_ the Spark job has a > large number of executors per host. > I believe this problem can be worked around today via a change of > configurations, but I'm filing this issue to (a) better document this > problem, and (b) propose either a change of default configurations or > additional DAGScheduler logic to better handle this failure mode. > h2. Problem description > The external shuffle service process is _mostly_ stateless except for a map > tracking the set of registered applications and executors. 
> When processing a shuffle fetch request, the shuffle services first checks > whether the requested block ID's executor is registered; if it's not > registered then the shuffle service throws an exception like > {code:java} > java.lang.RuntimeException: Executor is not registered > (appId=application_1557557221330_6891, execId=428){code} > and this exception becomes a {{FetchFailed}} error in the executor requesting > the shuffle block. > In normal operation this error should not occur because executors shouldn't > be mis-routing shuffle fetch requests. However, this _can_ happen if the > shuffle service crashes and restarts, causing it to lose its in-memory > executor registration state. With YARN this state can be recovered from disk > if YARN NodeManager recovery is enabled (using the mechanism added in > SPARK-9439), but I don't believe that we perform state recovery in Standalone > and Mesos modes (see SPARK-24223). > If state cannot be recovered then map outputs cannot be served (even though > the files probably still exist on disk). In theory, this shouldn't cause > Spark jobs to fail because we can always redundantly recompute lost / > unfetchable map outputs. > However, in practice this can cause total job failures in deployments where > the node with the failed shuffle service was running a large number of > executors: by default, the DAGScheduler unregisters map outputs _only from > individual executor whose shuffle blocks could not be fetched_ (see > [code|https://github.com/apache/spark/blame/bfb3ffe9b33a403a1f3b6f5407d34a477ce62c85/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1643]), > so it can take several rounds of failed stage attempts to fail and clear > output from all executors on the faulty host. If the number of executors on a > host is greater than the stage retry limit then this can exhaust stage retry > attempts and cause job failures. 
> This "multiple rounds of recomputation to discover all failed executors on a > host" problem was addressed by SPARK-19753, which added a > {{spark.files.fetchFailure.unRegisterOutputOnHost}} configuration which > promotes executor fetch failures into host-wide fetch failures (clearing > output from all neighboring executors upon a single failure). However, that > configuration is {{false}} by default. > h2. Potential solutions > I have a few ideas about how we can improve this situation: > - Update the [YARN external shuffle service > documentation|https://spark.apache.org/docs/latest/running-on-yarn.html#configuring-the-external-shuffle-service] > to recommend enabling node manager recovery. > - Consider defaulting {{spark.files.fetchFailure.unRegisterOutputOnHost}} to > {{true}}. This would improve out-of-the-box resiliency for large clusters. > The trade-off here is a reduction of efficiency in case there are transient > "false positive" fetch failures, but I suspect this case may be unlikely in > practice (so the change of default could be an acceptable trade-off). See > [prior
[jira] [Commented] (SPARK-27736) Improve handling of FetchFailures caused by ExternalShuffleService losing track of executor registrations
[ https://issues.apache.org/jira/browse/SPARK-27736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841593#comment-16841593 ] Thomas Graves commented on SPARK-27736: --- Yeah, we always ran YARN with NodeManager recovery on, but that doesn't help standalone mode unless you implement something similar. Either way, I think documenting it for YARN is a good idea. We used to see transient fetch failures all the time because of temporary spikes in disk usage, so I would be hesitant to turn on spark.files.fetchFailure.unRegisterOutputOnHost by default; on the other hand, users could turn it back off too, so it depends on what people think is most common. I don't think you can assume the death of the shuffle service (the NM on YARN) implies the death of the executor. We have seen NodeManagers go down with an OOM while the executor stays up. Without the NM there, there isn't really anything to clean up the containers on it. Now you will obviously see fetch failures from that node if it does go down. Your last option seems like the best of those, but as you mention it could get a bit ugly with the string matching. The other thing you can do is start tracking those fetch failures and have the driver make a more informed decision based on that. This is work we had started at my previous employer but never had time to finish. It's a much bigger change but really what we should be doing. It would allow us to make better decisions about blacklisting and to see whether it was the map or the reduce node that had issues, etc.
> Improve handling of FetchFailures caused by ExternalShuffleService losing > track of executor registrations > - > > Key: SPARK-27736 > URL: https://issues.apache.org/jira/browse/SPARK-27736 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.4.0 >Reporter: Josh Rosen >Priority: Minor > > This ticket describes a fault-tolerance edge-case which can cause Spark jobs > to fail if a single external shuffle service process reboots and fails to > recover the list of registered executors (something which can happen when > using YARN if NodeManager recovery is disabled) _and_ the Spark job has a > large number of executors per host. > I believe this problem can be worked around today via a change of > configurations, but I'm filing this issue to (a) better document this > problem, and (b) propose either a change of default configurations or > additional DAGScheduler logic to better handle this failure mode. > h2. Problem description > The external shuffle service process is _mostly_ stateless except for a map > tracking the set of registered applications and executors. > When processing a shuffle fetch request, the shuffle services first checks > whether the requested block ID's executor is registered; if it's not > registered then the shuffle service throws an exception like > {code:java} > java.lang.RuntimeException: Executor is not registered > (appId=application_1557557221330_6891, execId=428){code} > and this exception becomes a {{FetchFailed}} error in the executor requesting > the shuffle block. > In normal operation this error should not occur because executors shouldn't > be mis-routing shuffle fetch requests. However, this _can_ happen if the > shuffle service crashes and restarts, causing it to lose its in-memory > executor registration state. 
With YARN this state can be recovered from disk > if YARN NodeManager recovery is enabled (using the mechanism added in > SPARK-9439), but I don't believe that we perform state recovery in Standalone > and Mesos modes (see SPARK-24223). > If state cannot be recovered then map outputs cannot be served (even though > the files probably still exist on disk). In theory, this shouldn't cause > Spark jobs to fail because we can always redundantly recompute lost / > unfetchable map outputs. > However, in practice this can cause total job failures in deployments where > the node with the failed shuffle service was running a large number of > executors: by default, the DAGScheduler unregisters map outputs _only from > individual executor whose shuffle blocks could not be fetched_ (see > [code|https://github.com/apache/spark/blame/bfb3ffe9b33a403a1f3b6f5407d34a477ce62c85/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1643]), > so it can take several rounds of failed stage attempts to fail and clear > output from all executors on the faulty host. If the number of executors on a > host is greater than the stage retry limit then this can exhaust stage retry > attempts and cause job failures. > This "multiple rounds of recomputation to discover all failed executors on a > host" problem was addressed by SPARK-19753, which added a >
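[Editor's note] The suggestion in the comments above — have the driver count fetch failures per host across executors and invalidate all of a host's map outputs once enough failures accumulate — can be sketched as follows. This is a hypothetical illustration of the idea, not DAGScheduler code; the class name and threshold are made up:

```python
from collections import defaultdict


class HostFetchFailureTracker:
    """Track fetch failures per host; signal host-wide invalidation at a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self._failures = defaultdict(int)

    def record_failure(self, host):
        """Record one fetch failure against `host`.

        Returns True when the host has accumulated enough failures (possibly
        across different executors) that all of its map outputs should be
        unregistered at once, instead of one executor per failed stage attempt.
        """
        self._failures[host] += 1
        return self._failures[host] >= self.threshold
```

With a threshold of, say, 3, the driver would keep the cheap per-executor invalidation for isolated failures but escalate to host-wide invalidation once repeated failures point at the same machine, avoiding the "one failed stage attempt per executor on the host" pattern described in the issue.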
[jira] [Created] (SPARK-27753) Support SQL expressions for interval parameter in Structured Streaming
Shixiong Zhu created SPARK-27753: Summary: Support SQL expressions for interval parameter in Structured Streaming Key: SPARK-27753 URL: https://issues.apache.org/jira/browse/SPARK-27753 Project: Spark Issue Type: New Feature Components: Structured Streaming Affects Versions: 2.4.3 Reporter: Shixiong Zhu Structured Streaming has several methods that accept an interval string. It would be great that we can use the parser to parse it so that we can also support SQL expressions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27752) Updata lz4-java from 1.5.2 to 1.6.0
Kazuaki Ishizaki created SPARK-27752: Summary: Updata lz4-java from 1.5.2 to 1.6.0 Key: SPARK-27752 URL: https://issues.apache.org/jira/browse/SPARK-27752 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: Kazuaki Ishizaki Update lz4-java that is available from https://github.com/lz4/lz4-java. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27752) Updata lz4-java from 1.5.1 to 1.6.0
[ https://issues.apache.org/jira/browse/SPARK-27752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-27752: - Summary: Updata lz4-java from 1.5.1 to 1.6.0 (was: Updata lz4-java from 1.5.2 to 1.6.0) > Updata lz4-java from 1.5.1 to 1.6.0 > --- > > Key: SPARK-27752 > URL: https://issues.apache.org/jira/browse/SPARK-27752 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Kazuaki Ishizaki >Priority: Major > > Update lz4-java that is available from https://github.com/lz4/lz4-java. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27751) buildReader is now protected
Geet Kumar created SPARK-27751: -- Summary: buildReader is now protected Key: SPARK-27751 URL: https://issues.apache.org/jira/browse/SPARK-27751 Project: Spark Issue Type: Question Components: Spark Core Affects Versions: 2.4.3 Reporter: Geet Kumar I have recently upgraded to spark 2.4.0 and was relying on the `buildReader` method. It originally was public and now it is protected. What was the reason for this change? The only workaround I can see is to use `buildReaderWithPartitionValues` which remains public. Any plans to revert `buildReader` to be public again? The change was made here: [https://github.com/apache/spark/pull/17253/files] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27749) Fix hadoop-3.2 hive-thriftserver module test issue
[ https://issues.apache.org/jira/browse/SPARK-27749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27749: Assignee: Apache Spark > Fix hadoop-3.2 hive-thriftserver module test issue > -- > > Key: SPARK-27749 > URL: https://issues.apache.org/jira/browse/SPARK-27749 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27749) Fix hadoop-3.2 hive-thriftserver module test issue
[ https://issues.apache.org/jira/browse/SPARK-27749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27749: Assignee: (was: Apache Spark) > Fix hadoop-3.2 hive-thriftserver module test issue > -- > > Key: SPARK-27749 > URL: https://issues.apache.org/jira/browse/SPARK-27749 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27745) build/mvn take wrong scala version when compile for scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-27745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-27745. Resolution: Not A Bug You need to run {{./dev/change-scala-version.sh}} first. Pretty sure this is in the documentation. > build/mvn take wrong scala version when compile for scala 2.12 > -- > > Key: SPARK-27745 > URL: https://issues.apache.org/jira/browse/SPARK-27745 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.3 >Reporter: Izek Greenfield >Priority: Major > > in `build/mvn` > line: > local scala_binary_version=`grep "scala.binary.version" "${_DIR}/../pom.xml" > | head -n1 | awk -F '[<>]' '{print $3}'` > it greps the pom, whose first match is 2.11, so even if I set -Pscala-2.12 it > will still take 2.11 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
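[Editor's note] The behavior reported above can be reproduced outside the shell script. The `grep ... | head -n1` pipeline effectively returns the first `<scala.binary.version>` in the pom, so a value written later by a profile (or left unchanged until `./dev/change-scala-version.sh` rewrites it) is never seen. A small sketch of that "first match wins" behavior:

```python
import re


def first_scala_binary_version(pom_text):
    """Return the FIRST <scala.binary.version> value in the pom text, or None.

    Mirrors what `grep "scala.binary.version" pom.xml | head -n1` effectively
    does in build/mvn: later occurrences are ignored.
    """
    m = re.search(r"<scala\.binary\.version>([^<]+)</scala\.binary\.version>",
                  pom_text)
    return m.group(1) if m else None
```

If the pom's default value is 2.11, this returns "2.11" even when a 2.12 value appears further down, which is why the `change-scala-version.sh` rewrite must run before the build.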
[jira] [Comment Edited] (SPARK-27373) Design: Kubernetes support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841506#comment-16841506 ] Thomas Graves edited comment on SPARK-27373 at 5/16/19 4:21 PM: For the kubernetes side, it has 2 options for requesting containers: 1) pod templates, 2) through normal spark and spark.kubernetes configs. For adding in the spark resource support, we can take the spark configs spark.\{driver/executor}.resource.\{resourceName}.count and combine this with a new config for the vendor name like spark.\{driver/executor}.resource.\{resourceName}.vendor to match the device plugin support from k8s ( [https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/)] and add that to the PodBuilder. We could make the vendor config kubernetes specific, but I'm thinking we leave it generic and then just state it's only supported on kubernetes right now. Depending on the setup, I could see this being useful for, say, YARN, since YARN supports attributes and vendor could be an attribute. Spark already has functionality to override and add certain things in the pod templates, so we can use similar functionality with the resources. So we can support both the pod templates and the configs the same way. was (Author: tgraves): for the kubernetes side, it has 2 options for requesting containers: 1) pod templates, 2) through normal spark and spark.kubernetes configs For adding in the spark resource support, we can take the spark configs spark.\{driver/executor}.resource.\{resourceName}.count and combine this with a new config for the vendor name like spark.\{driver/executor}.resource.\{resourceName}.vendor to match the device plugin support from k8s ( [https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/)] and add that to the PodBuilder. spark already has functionality to override and add certain things in the pod templates so we can use similar functionality with the resources. 
So we can support both the pod templates and the configs the same way. > Design: Kubernetes support for GPU-aware scheduling > --- > > Key: SPARK-27373 > URL: https://issues.apache.org/jira/browse/SPARK-27373 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Thomas Graves >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27373) Design: Kubernetes support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-27373: - Assignee: Thomas Graves > Design: Kubernetes support for GPU-aware scheduling > --- > > Key: SPARK-27373 > URL: https://issues.apache.org/jira/browse/SPARK-27373 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Thomas Graves >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27750) Standalone scheduler - ability to prioritize applications over drivers, many drivers act like Denial of Service
t oo created SPARK-27750: Summary: Standalone scheduler - ability to prioritize applications over drivers, many drivers act like Denial of Service Key: SPARK-27750 URL: https://issues.apache.org/jira/browse/SPARK-27750 Project: Spark Issue Type: New Feature Components: Scheduler Affects Versions: 2.4.3, 2.3.3 Reporter: t oo If I submit 1000 spark-submit drivers then they consume all the cores on my cluster (essentially it acts like a Denial of Service) and no spark 'application' gets to run since the cores are all consumed by the 'drivers'. This feature is about having the ability to prioritize applications over drivers so that at least some 'applications' can start running. I guess it would be like: if (driver.state = 'submitted' and (exists some app.state = 'submitted')) then hold the driver and set app.state = 'running'; only once all apps have app.state = 'running' may a waiting driver move from 'submitted' to 'running' -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
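The admission rule sketched in the report above could look roughly like the following. This is a hypothetical illustration with invented names and states, not the standalone Master's actual code: one scheduling pass starts every waiting application first, and waiting drivers are only admitted on a pass where no application was still waiting.

```java
// Hypothetical sketch of "applications before drivers" admission; names and
// states are invented for illustration and do not match Spark's Master code.
import java.util.ArrayList;
import java.util.List;

public class PrioritySketch {
    public static class Work {
        public final String id;
        public String state; // "submitted" or "running"
        public Work(String id) { this.id = id; this.state = "submitted"; }
    }

    // One scheduling pass: start every waiting application first; waiting
    // drivers are only admitted on a pass where no application was waiting.
    public static void schedule(List<Work> drivers, List<Work> apps) {
        boolean appsWereWaiting = apps.stream().anyMatch(w -> w.state.equals("submitted"));
        for (Work a : apps) if (a.state.equals("submitted")) a.state = "running";
        if (!appsWereWaiting) {
            for (Work d : drivers) if (d.state.equals("submitted")) d.state = "running";
        }
    }

    public static void main(String[] args) {
        List<Work> drivers = new ArrayList<>(List.of(new Work("driver-1")));
        List<Work> apps = new ArrayList<>(List.of(new Work("app-1")));
        schedule(drivers, apps);
        System.out.println(apps.get(0).state);    // running
        System.out.println(drivers.get(0).state); // submitted: held back this pass
        schedule(drivers, apps);
        System.out.println(drivers.get(0).state); // running
    }
}
```

With 1000 queued drivers, such a rule would let submitted applications acquire cores before any further drivers are launched, avoiding the starvation described above.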
[jira] [Assigned] (SPARK-27748) Kafka consumer/producer password/token redaction
[ https://issues.apache.org/jira/browse/SPARK-27748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27748: Assignee: (was: Apache Spark) > Kafka consumer/producer password/token redaction > > > Key: SPARK-27748 > URL: https://issues.apache.org/jira/browse/SPARK-27748 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27748) Kafka consumer/producer password/token redaction
[ https://issues.apache.org/jira/browse/SPARK-27748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27748: Assignee: Apache Spark > Kafka consumer/producer password/token redaction > > > Key: SPARK-27748 > URL: https://issues.apache.org/jira/browse/SPARK-27748 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27749) Fix hadoop-3.2 hive-thriftserver module test issue
Yuming Wang created SPARK-27749: --- Summary: Fix hadoop-3.2 hive-thriftserver module test issue Key: SPARK-27749 URL: https://issues.apache.org/jira/browse/SPARK-27749 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27748) Kafka consumer/producer password/token redaction
Gabor Somogyi created SPARK-27748: - Summary: Kafka consumer/producer password/token redaction Key: SPARK-27748 URL: https://issues.apache.org/jira/browse/SPARK-27748 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Gabor Somogyi -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27747) add a logical plan link in the physical plan
[ https://issues.apache.org/jira/browse/SPARK-27747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27747: Assignee: Wenchen Fan (was: Apache Spark) > add a logical plan link in the physical plan > > > Key: SPARK-27747 > URL: https://issues.apache.org/jira/browse/SPARK-27747 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27747) add a logical plan link in the physical plan
[ https://issues.apache.org/jira/browse/SPARK-27747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27747: Assignee: Apache Spark (was: Wenchen Fan) > add a logical plan link in the physical plan > > > Key: SPARK-27747 > URL: https://issues.apache.org/jira/browse/SPARK-27747 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27746) add a logical plan link in the physical plan
Wenchen Fan created SPARK-27746: --- Summary: add a logical plan link in the physical plan Key: SPARK-27746 URL: https://issues.apache.org/jira/browse/SPARK-27746 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27747) add a logical plan link in the physical plan
Wenchen Fan created SPARK-27747: --- Summary: add a logical plan link in the physical plan Key: SPARK-27747 URL: https://issues.apache.org/jira/browse/SPARK-27747 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27377) Upgrade YARN to 3.1.2+ to support GPU
[ https://issues.apache.org/jira/browse/SPARK-27377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-27377. --- Resolution: Fixed > Upgrade YARN to 3.1.2+ to support GPU > - > > Key: SPARK-27377 > URL: https://issues.apache.org/jira/browse/SPARK-27377 > Project: Spark > Issue Type: Sub-task > Components: YARN >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > This task should be covered by SPARK-23710. Just a placeholder here. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27376) Design: YARN supports Spark GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841404#comment-16841404 ] Thomas Graves commented on SPARK-27376: --- [~mengxr] [~jiangxb] Thoughts on my proposal above to rename the user facing resource config from .count to .amount and also add it to the existing yarn configs? > Design: YARN supports Spark GPU-aware scheduling > > > Key: SPARK-27376 > URL: https://issues.apache.org/jira/browse/SPARK-27376 > Project: Spark > Issue Type: Sub-task > Components: YARN >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27733) Upgrade to Avro 1.9.x
[ https://issues.apache.org/jira/browse/SPARK-27733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841410#comment-16841410 ] Nandor Kollar commented on SPARK-27733: --- For example HiveCatalogedDDLSuite "create hive serde table with DataFrameWriter.saveAsTable" test failed with {code} An exception or error caused a run to abort: org.apache.avro.Schema$Field.<init>(Ljava/lang/String;Lorg/apache/avro/Schema;Ljava/lang/String;Lorg/codehaus/jackson/JsonNode;)V java.lang.NoSuchMethodError: org.apache.avro.Schema$Field.<init>(Ljava/lang/String;Lorg/apache/avro/Schema;Ljava/lang/String;Lorg/codehaus/jackson/JsonNode;)V at org.apache.hadoop.hive.serde2.avro.TypeInfoToSchema.createAvroField(TypeInfoToSchema.java:76) at org.apache.hadoop.hive.serde2.avro.TypeInfoToSchema.convert(TypeInfoToSchema.java:61) at org.apache.hadoop.hive.serde2.avro.AvroSerDe.getSchemaFromCols(AvroSerDe.java:150) at org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:109) at org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:80) at org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:521) at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:391) {code} This is a problem with Hive version 1.2.1. > Upgrade to Avro 1.9.x > - > > Key: SPARK-27733 > URL: https://issues.apache.org/jira/browse/SPARK-27733 > Project: Spark > Issue Type: Improvement > Components: Build, SQL >Affects Versions: 3.0.0 >Reporter: Ismaël Mejía >Priority: Minor > > Avro 1.9.0 was released with many nice features including reduced size (1MB > less), and removed dependencies, no Paranamer, no shaded guava, security > updates, so probably a worthwhile upgrade. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27745) build/mvn take wrong scala version when compile for scala 2.12
Izek Greenfield created SPARK-27745: --- Summary: build/mvn take wrong scala version when compile for scala 2.12 Key: SPARK-27745 URL: https://issues.apache.org/jira/browse/SPARK-27745 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.4.3 Reporter: Izek Greenfield in `build/mvn` line: local scala_binary_version=`grep "scala.binary.version" "${_DIR}/../pom.xml" | head -n1 | awk -F '[<>]' '{print $3}'` it greps the pom, which yields 2.11, so even if I set -Pscala-2.12 it will still take 2.11 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27378) spark-submit requests GPUs in YARN mode
[ https://issues.apache.org/jira/browse/SPARK-27378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841412#comment-16841412 ] Thomas Graves commented on SPARK-27378: --- Spark 3.0 already added support for requesting any resource from YARN via the configs: spark.yarn.\{executor/driver/am}.resource, so the changes required for this Jira are simply to map the new spark configs: spark.\{executor/driver}.resource.\{fpga/gpu}.count into the corresponding yarn configs. For other resource types we can't map them though because we don't know what they are called on the yarn side. So for any other resource they will have to specify both configs spark.yarn.\{executor/driver/am}.resource and spark.\{executor/driver}.resource.\{fpga/gpu}. That isn't ideal but the only other option would be to have some sort of mapping the user would pass in. We can always map more yarn resource types if YARN adds them. The main 2 resource types people are interested in seem to be gpu and fpga anyway, so I think for now this is fine. > spark-submit requests GPUs in YARN mode > --- > > Key: SPARK-27378 > URL: https://issues.apache.org/jira/browse/SPARK-27378 > Project: Spark > Issue Type: Sub-task > Components: Spark Submit, YARN >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27379) YARN passes GPU info to Spark executor
[ https://issues.apache.org/jira/browse/SPARK-27379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841409#comment-16841409 ] Thomas Graves commented on SPARK-27379: --- The way yarn works is it actually doesn't tell the application any info about what it was allocated. If you have hadoop 3.1+ and it is set up for docker and isolation, then it's up to the user to discover what the container has. So based on that, I'm going to close this. > YARN passes GPU info to Spark executor > -- > > Key: SPARK-27379 > URL: https://issues.apache.org/jira/browse/SPARK-27379 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, YARN >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27379) YARN passes GPU info to Spark executor
[ https://issues.apache.org/jira/browse/SPARK-27379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-27379. --- Resolution: Invalid Assignee: Thomas Graves > YARN passes GPU info to Spark executor > -- > > Key: SPARK-27379 > URL: https://issues.apache.org/jira/browse/SPARK-27379 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, YARN >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Thomas Graves >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27377) Upgrade YARN to 3.1.2+ to support GPU
[ https://issues.apache.org/jira/browse/SPARK-27377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841407#comment-16841407 ] Thomas Graves commented on SPARK-27377: --- There are enough pieces of the hadoop 3.2 support implemented that this is no longer blocking us, so I'm going to close this. > Upgrade YARN to 3.1.2+ to support GPU > - > > Key: SPARK-27377 > URL: https://issues.apache.org/jira/browse/SPARK-27377 > Project: Spark > Issue Type: Sub-task > Components: YARN >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > This task should be covered by SPARK-23710. Just a placeholder here. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27376) Design: YARN supports Spark GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-27376: - Assignee: Thomas Graves > Design: YARN supports Spark GPU-aware scheduling > > > Key: SPARK-27376 > URL: https://issues.apache.org/jira/browse/SPARK-27376 > Project: Spark > Issue Type: Sub-task > Components: YARN >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Thomas Graves >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27376) Design: YARN supports Spark GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841402#comment-16841402 ] Thomas Graves commented on SPARK-27376: --- The design is pretty straightforward; there is really only 1 question, which is consistency between the yarn resource configs and the new spark resource configs, see the last paragraph for more details. It requires Hadoop 3.1 and up to get official GPU support. Hadoop can be configured to use docker with isolation so that the containers yarn hands you back have the requested gpus and other resources. YARN does not give you information about what it allocated for gpus, you have to discover it. YARN has hardcoded resource types for fpga and gpu; anything else is a user defined type. Spark 3.0 already added support for requesting any resource from YARN via the configs: spark.yarn.\{executor/driver/am}.resource, so the changes required for this Jira are simply to map the new spark configs: spark.\{executor/driver}.resource.\{fpga/gpu}.count into the corresponding yarn configs. For other resource types we can't map them though because we don't know what they are called on the yarn side. So for any other resource they will have to specify both configs spark.yarn.\{executor/driver/am}.resource and spark.\{executor/driver}.resource.\{fpga/gpu}. That isn't ideal but the only other option would be to have some sort of mapping the user would pass in. We can always map more yarn resource types if YARN adds them. The main 2 resource types people are interested in seem to be gpu and fpga anyway, so I think for now this is fine. 
For versions < hadoop 3.1 it won't allocate based on GPU, so if they are using hadoop 2.7, 2.8, etc they could still allocate nodes with GPU, with yarn node labels or other hacks, and tell Spark the count and to auto discover them and Spark will pick up whatever it sees in the container - or really whatever the discoveryScript returns, so people could potentially write that script to match whatever hacks they have for sharing gpu nodes now. The flow from the user's point of view would be: For GPU and FPGA: User will specify the spark.\{executor/driver}.resource.\{gpu/fpga}.count and the spark.\{executor/driver}.resource.\{gpu/fpga}.discoveryScript. The spark yarn code maps these into the corresponding yarn resource config and asks yarn for the containers. Yarn allocates the containers and Spark will run the discovery script to figure out what it has for allocations. For other resource types the user will have to specify: spark.yarn.\{executor/driver/am}.resource and spark.\{executor/driver}.resource.\{gpu/fpga}.count and the spark.\{executor/driver}.resource.\{gpu/fpga}.discoveryScript. The only other thing that is inconsistent is the spark.yarn.\{executor/driver/am}.resource configs don't have a .count on the end. Right now that config takes a string as a value and splits that into an actual count and a unit. The yarn resource configs were just added in 3.0 so haven't been released so we could potentially change them. We could change the spark user facing configs ( spark.\{executor/driver}.resource.\{gpu/fpga}.count) to be similar to make it easier for the user to specify both a count and unit in 1 config instead of 2, but I like the ability to separate them on the discovery side as well. We took the .unit support out in the executor pull request so it isn't there right now anyway. We could do the opposite and change the yarn ones to have a .count and .unit as well just to make things consistent but that makes the user have to specify 2 instead of 1. 
Or the third option would be to have the .count and .unit and then eventually have a third one that lets the user specify them together if we add resources that actually use it. My thought is that for the user facing configs we change .count to .amount and let the user specify units on it. This makes it easier for the user and it allows us to extend later if we want. I think we should also change the spark.yarn configs to have a .amount because yarn has already added other things like tags and attributes, so if we want to extend the spark support for those it makes more sense to have those as another postfix option spark.yarn...resource.tags= We can leave everything else that is internal as separate count and units, and since gpu/fpga don't need units we don't need to actually add it to our ResourceInformation since we already removed it. > Design: YARN supports Spark GPU-aware scheduling > > > Key: SPARK-27376 > URL: https://issues.apache.org/jira/browse/SPARK-27376 > Project: Spark > Issue Type: Sub-task > Components: YARN >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > -- This message was sent by
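To make the flow described in this thread concrete, a YARN-mode submission might carry configs like the sketch below. The names follow the proposal in these comments, including the suggested .count-to-.amount rename that was not final at the time; the script paths and the custom resource name are invented for illustration:

```
# GPU request: Spark can map this pair to YARN's built-in GPU resource type,
# so no spark.yarn.*.resource entry is needed for gpu/fpga.
spark.executor.resource.gpu.amount=2
spark.executor.resource.gpu.discoveryScript=/opt/spark/scripts/getGpus.sh

# A resource type Spark cannot map by name must be specified on both sides:
spark.yarn.executor.resource.acmeAccel.amount=1
spark.executor.resource.acmeAccel.amount=1
spark.executor.resource.acmeAccel.discoveryScript=/opt/spark/scripts/findAccel.sh
```

After YARN allocates the containers, Spark would run the discovery script inside each container to learn the actual resource addresses, as the flow above describes.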
[jira] [Commented] (SPARK-18107) Insert overwrite statement runs much slower in spark-sql than it does in hive-client
[ https://issues.apache.org/jira/browse/SPARK-18107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841351#comment-16841351 ] KaiXu commented on SPARK-18107: --- it seems this issue has not been fixed? I encountered this issue with spark 2.4.3, the query I run is from TPC-DS, [https://github.com/hortonworks/hive-testbench/blob/hdp3/ddl-tpcds/bin_partitioned/store_sales.sql] > Insert overwrite statement runs much slower in spark-sql than it does in > hive-client > > > Key: SPARK-18107 > URL: https://issues.apache.org/jira/browse/SPARK-18107 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: spark 2.0.0 > hive 2.0.1 >Reporter: snodawn >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 2.1.0 > > > I find insert overwrite statement running in spark-sql or spark-shell spends > much more time than it does in hive-client (i start it in > apache-hive-2.0.1-bin/bin/hive ), where spark costs about ten minutes but > hive-client just costs less than 20 seconds. > These are the steps I took. > Test sql is : > insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21') > select distinct account_name,role_id,server,'1476979200' as recdate, 'mix' as > platform, 'mix' as pid, 'mix' as dev from tbllog_login where pt='mix_en' and > dt='2016-10-21' ; > there are 257128 lines of data in tbllog_login with > partition(pt='mix_en',dt='2016-10-21') > ps: > I'm sure it must be "insert overwrite" costing a lot of time in spark, may be > when doing overwrite, it need to spend a lot of time in io or in something > else. > I also compare the executing time between insert overwrite statement and > insert into statement. > 1. insert overwrite statement and insert into statement in spark: > insert overwrite statement costs about 10 minutes > insert into statement costs about 30 seconds > 2. 
insert into statement in spark and insert into statement in hive-client: > spark costs about 30 seconds > hive-client costs about 20 seconds > the difference is little that we can ignore > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27742) Security Support in Sources and Sinks for SS and Batch
[ https://issues.apache.org/jira/browse/SPARK-27742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841332#comment-16841332 ] Gabor Somogyi commented on SPARK-27742: --- Kafka delegation token support was just added to 3.0 on both the source and sink side. Kerberos + SSL are also supported there. Since I'm involved in streaming, I'm happy to be part of this effort (though I'm not sure how much is to be done). > Security Support in Sources and Sinks for SS and Batch > -- > > Key: SPARK-27742 > URL: https://issues.apache.org/jira/browse/SPARK-27742 > Project: Spark > Issue Type: Brainstorming > Components: SQL, Structured Streaming >Affects Versions: 3.0.0 >Reporter: Stavros Kontopoulos >Priority: Major > > As discussed with [~erikerlandson] on the [Big Data on K8s > UG|https://docs.google.com/document/d/1pnF38NF6N5eM8DlK088XUW85Vms4V2uTsGZvSp8MNIA] > it would be good to capture current status and identify work that needs to > be done for securing Spark when accessing sources and sinks. For example what > is the status of SSL, Kerberos support in different scenarios. The big > concern nowadays is how to secure data pipelines end-to-end. > Note: Not sure if this overlaps with some other ticket. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27720) ConcurrentModificationException on operating with DirectKafkaInputDStream
[ https://issues.apache.org/jira/browse/SPARK-27720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841325#comment-16841325 ] Gabor Somogyi commented on SPARK-27720: --- [~ov7a] Thanks for your efforts; I've had a look at the provided example + stacktrace. I'm not sure why you've called start on the stream itself (one should call start on the StreamingContext only). Please have a look at the official DStream + Kafka example [here|https://github.com/apache/spark/blob/c6a45e6f67abc99d1953d915b96e65a3e2148cf1/examples/src/main/scala/org/apache/spark/examples/streaming/DirectKafkaWordCount.scala#L79]. > ConcurrentModificationException on operating with DirectKafkaInputDStream > - > > Key: SPARK-27720 > URL: https://issues.apache.org/jira/browse/SPARK-27720 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.4.1, 2.4.3 >Reporter: ov7a >Priority: Minor > > If a DirectKafkaInputDStream is started in one thread and is being stopped in > another thread (e.g. by shutdown hook) a > java.util.ConcurrentModificationException (KafkaConsumer is not safe for > multi-threaded access) is thrown. > This happens even if "spark.streaming.kafka.consumer.cache.enabled" is set to > "false". > MWE: https://gist.github.com/ov7a/fc783315ea252a03d51804ce326a13b1 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27334) Support specify scheduler name for executor pods when submit
[ https://issues.apache.org/jira/browse/SPARK-27334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841322#comment-16841322 ] Alexander Fedosov commented on SPARK-27334: --- Hello [~TommyLike]! It looks like this ticket relates to this [one|https://issues.apache.org/jira/browse/SPARK-24434], where it was decided to use the Pod Template approach. Could you then please close the ticket? > Support specify scheduler name for executor pods when submit > > > Key: SPARK-27334 > URL: https://issues.apache.org/jira/browse/SPARK-27334 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: TommyLike >Priority: Major > Labels: easyfix, features > > Currently, there are some external schedulers which bring a lot of great value > into kubernetes scheduling, especially for the HPC case; take a look at > *kube-batch* ([https://github.com/kubernetes-sigs/kube-batch]). In order to > support it, we had to use a Pod Template, which seems cumbersome. It would be > much more convenient if this could be configured via an option such as > *"spark.kubernetes.executor.schedulerName"* just like others. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841287#comment-16841287 ] Lukas Rytz commented on SPARK-25075: In our own interest of testing the Scala 2.13 RCs, I took a stab at compiling spark core on 2.13. My goal was *not* to get something running, just compiling and seeing what kind of breaking changes there are. So I chose a slightly adventurous methodology: I updated the {{scalaVersion}} to 2.13.0-pre-06392a5-SNAPSHOT (a very recent local build), but forced the {{scalaBinaryVersion}} to 2.12, so that the 2.12 dependencies end up on the classpath. That way I didn't have to worry about missing / incompatible dependencies. The first step was to avoid using {{scala.Seq}}: I used scalafix to rewrite all references of the type {{scala.Seq}} to {{scala.collection.Seq}}. As discussed on https://issues.apache.org/jira/browse/SPARK-27681, this is not necessarily the best solution, but the easiest. Here's a list of other breaking changes: * {{foo(someMutableOrGenericCollection: _*)}} no longer works, because varargs de-sugars to {{scala.Seq}}, so an immutable collection is now required. Calling {{.toSeq}} works, but is inefficient. Better to build an immutable collection from the beginning. For arrays, {{immutable.ArraySeq.unsafeWrapArray}} can be used. Maybe the standard library should provide an unsafe {{immutable.SeqWrapper}} that wraps a {{collection.Seq}} for the cases when the users are certain it's safe. * Views are quite different in 2.13, for example {{seq.view}} is no longer a {{Seq}}; views are a separate hierarchy. This needs some adjustments, not too difficult. * Parallel collections are a separate module now ([https://github.com/scala/scala-parallel-collections]), no longer in the standard library. 
However, you might want to use StreamConverters instead to do parallel processing via a Java stream ([https://github.com/scala/scala/blob/2.13.x/src/library/scala/jdk/StreamConverters.scala]). * Subclasses of collections need some adjustments: {{BoundedPriorityQueue}}, {{TimeStampedHashMap}}. For example, {{++=}} cannot be overridden anymore, as it's a final alias for {{addAll}} now. The other changes are relatively minor. The branch is here - for reference, too hacky to be actually useful: [https://github.com/lrytz/spark/commits/2.13-experiment]. {{sbt core/compile}} passes (except for scalastyle). Overall, this is more or less what we expected in terms of breaking changes. We definitely want to use the time between now and 2.13.0 final to improve migration documentation ([https://docs.scala-lang.org/overviews/core/collections-migration-213.html]) and scalafix rules ([https://github.com/scala/scala-rewrites] is in its early days). > Build and test Spark against Scala 2.13 > --- > > Key: SPARK-25075 > URL: https://issues.apache.org/jira/browse/SPARK-25075 > Project: Spark > Issue Type: Umbrella > Components: Build, Project Infra >Affects Versions: 3.0.0 >Reporter: Guillaume Massé >Priority: Major > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.13 milestone. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27744) SubqueryExec thread pool does not preserve thread local properties
[ https://issues.apache.org/jira/browse/SPARK-27744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841284#comment-16841284 ] Apache Spark commented on SPARK-27744: -- User 'onursatici' has created a pull request for this issue: https://github.com/apache/spark/pull/24625 > SubqueryExec thread pool does not preserve thread local properties > -- > > Key: SPARK-27744 > URL: https://issues.apache.org/jira/browse/SPARK-27744 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Onur Satici >Priority: Major > > SubqueryExec uses a cached thread pool of size 16. After this thread pool > reaches its pool size, it will start reusing threads, and submitted tasks > would be run on a thread with out of date spark local properties. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27744) SubqueryExec thread pool does not preserve thread local properties
[ https://issues.apache.org/jira/browse/SPARK-27744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27744: Assignee: Apache Spark > SubqueryExec thread pool does not preserve thread local properties > -- > > Key: SPARK-27744 > URL: https://issues.apache.org/jira/browse/SPARK-27744 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Onur Satici >Assignee: Apache Spark >Priority: Major > > SubqueryExec uses a cached thread pool of size 16. After this thread pool > reaches its pool size, it will start reusing threads, and submitted tasks > would be run on a thread with out of date spark local properties. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27744) SubqueryExec thread pool does not preserve thread local properties
[ https://issues.apache.org/jira/browse/SPARK-27744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27744: Assignee: (was: Apache Spark) > SubqueryExec thread pool does not preserve thread local properties > -- > > Key: SPARK-27744 > URL: https://issues.apache.org/jira/browse/SPARK-27744 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Onur Satici >Priority: Major > > SubqueryExec uses a cached thread pool of size 16. After this thread pool > reaches its pool size, it will start reusing threads, and submitted tasks > would be run on a thread with out of date spark local properties. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27743) alter table: bucketing
[ https://issues.apache.org/jira/browse/SPARK-27743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27743: Assignee: (was: Apache Spark) > alter table: bucketing > -- > > Key: SPARK-27743 > URL: https://issues.apache.org/jira/browse/SPARK-27743 > Project: Spark > Issue Type: Wish > Components: SQL >Affects Versions: 2.4.3 >Reporter: xzh_dz >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27743) alter table: bucketing
[ https://issues.apache.org/jira/browse/SPARK-27743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27743: Assignee: Apache Spark > alter table: bucketing > -- > > Key: SPARK-27743 > URL: https://issues.apache.org/jira/browse/SPARK-27743 > Project: Spark > Issue Type: Wish > Components: SQL >Affects Versions: 2.4.3 >Reporter: xzh_dz >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27744) SubqueryExec thread pool does not preserve thread local properties
Onur Satici created SPARK-27744: --- Summary: SubqueryExec thread pool does not preserve thread local properties Key: SPARK-27744 URL: https://issues.apache.org/jira/browse/SPARK-27744 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.0.0 Reporter: Onur Satici SubqueryExec uses a cached thread pool of size 16. After this thread pool reaches its pool size, it will start reusing threads, and submitted tasks would be run on a thread with out-of-date Spark local properties. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
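The failure mode in the report — a reused pool thread keeping whatever it inherited when it was created — can be reproduced in miniature. Spark propagates local properties to child threads via an `InheritableThreadLocal`, which copies the parent's value only at thread-creation time; the sketch below relies on that mechanism, with illustrative names (`StaleLocalProps` is not a Spark class, and a pool of size 1 stands in for the 16-thread pool):

```scala
import java.util.concurrent.{Callable, Executors, TimeUnit}

object StaleLocalProps {
  // Stand-in for Spark's local properties: the value is copied into a child
  // thread once, when that thread is constructed - never again.
  val prop = new InheritableThreadLocal[String] {
    override def initialValue(): String = "unset"
  }

  // Stand-in for SubqueryExec's thread pool.
  private val pool = Executors.newFixedThreadPool(1)

  // What the pooled thread observes for `prop` right now.
  def seenByPool(): String =
    pool.submit(new Callable[String] { def call(): String = prop.get() }).get()

  def shutdown(): Unit = {
    pool.shutdown()
    pool.awaitTermination(5, TimeUnit.SECONDS)
  }
}
```

The first `seenByPool()` call creates the pool thread, so it inherits the current value; after the submitting thread updates `prop`, the reused thread still returns the stale copy, which is exactly the class of bug the issue describes.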
[jira] [Created] (SPARK-27743) alter table: bucketing
xzh_dz created SPARK-27743: -- Summary: alter table: bucketing Key: SPARK-27743 URL: https://issues.apache.org/jira/browse/SPARK-27743 Project: Spark Issue Type: Wish Components: SQL Affects Versions: 2.4.3 Reporter: xzh_dz -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27722) Remove UnsafeKeyValueSorter
[ https://issues.apache.org/jira/browse/SPARK-27722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-27722. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24622 [https://github.com/apache/spark/pull/24622] > Remove UnsafeKeyValueSorter > --- > > Key: SPARK-27722 > URL: https://issues.apache.org/jira/browse/SPARK-27722 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Shivu Sondur >Priority: Minor > Fix For: 3.0.0 > > > We just moved the location of classes including {{UnsafeKeyValueSorter}}. > After further investigating, I don't find where {{UnsafeKeyValueSorter}} is > used. > If it is not used at all, shall we just remove it from codebase? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27722) Remove UnsafeKeyValueSorter
[ https://issues.apache.org/jira/browse/SPARK-27722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-27722: --- Assignee: Shivu Sondur > Remove UnsafeKeyValueSorter > --- > > Key: SPARK-27722 > URL: https://issues.apache.org/jira/browse/SPARK-27722 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Shivu Sondur >Priority: Minor > > We just moved the location of classes including {{UnsafeKeyValueSorter}}. > After further investigating, I don't find where {{UnsafeKeyValueSorter}} is > used. > If it is not used at all, shall we just remove it from codebase? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27742) Security Support in Sources and Sinks for SS and Batch
[ https://issues.apache.org/jira/browse/SPARK-27742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-27742: Summary: Security Support in Sources and Sinks for SS and Batch (was: Security Support in Sources and Sinks for SS and batch) > Security Support in Sources and Sinks for SS and Batch > -- > > Key: SPARK-27742 > URL: https://issues.apache.org/jira/browse/SPARK-27742 > Project: Spark > Issue Type: Brainstorming > Components: SQL, Structured Streaming >Affects Versions: 3.0.0 >Reporter: Stavros Kontopoulos >Priority: Major > > As discussed with [~erikerlandson] on the [Big Data on K8s > UG|https://docs.google.com/document/d/1pnF38NF6N5eM8DlK088XUW85Vms4V2uTsGZvSp8MNIA] > it would be good to capture current status and identify work that needs to > be done for securing Spark when accessing sources and sinks. For example what > is the status of SSL, Kerberos support in different scenarios. The big > concern nowadays is how to secure data pipelines end-to-end. > Note: Not sure if this overlaps with some other ticket. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27742) Security Support in Sources and Sinks for SS and batch
Stavros Kontopoulos created SPARK-27742: --- Summary: Security Support in Sources and Sinks for SS and batch Key: SPARK-27742 URL: https://issues.apache.org/jira/browse/SPARK-27742 Project: Spark Issue Type: Brainstorming Components: SQL, Structured Streaming Affects Versions: 3.0.0 Reporter: Stavros Kontopoulos As discussed with [~erikerlandson] on the [Big Data on K8s UG|https://docs.google.com/document/d/1pnF38NF6N5eM8DlK088XUW85Vms4V2uTsGZvSp8MNIA] it would be good to capture current status and identify work that needs to be done for securing Spark when accessing sources and sinks. For example what is the status of SSL, Kerberos support in different scenarios. The big concern nowadays is how to secure data pipelines end-to-end. Note: Not sure if this overlaps with some other ticket. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27733) Upgrade to Avro 1.9.x
[ https://issues.apache.org/jira/browse/SPARK-27733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841193#comment-16841193 ] Nandor Kollar commented on SPARK-27733: --- [~hyukjin.kwon] I tried to run Spark tests after the Avro upgrade, and saw several failures in the spark-hive module, because Hive uses deprecated and removed Avro methods. > Upgrade to Avro 1.9.x > - > > Key: SPARK-27733 > URL: https://issues.apache.org/jira/browse/SPARK-27733 > Project: Spark > Issue Type: Improvement > Components: Build, SQL >Affects Versions: 3.0.0 >Reporter: Ismaël Mejía >Priority: Minor > > Avro 1.9.0 was released with many nice features including reduced size (1MB > less), and removed dependencies, no paranmer, no shaded guava, security > updates, so probably a worth upgrade. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27733) Upgrade to Avro 1.9.x
[ https://issues.apache.org/jira/browse/SPARK-27733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841189#comment-16841189 ] Hyukjin Kwon commented on SPARK-27733: -- why is it dependent on Hive's? > Upgrade to Avro 1.9.x > - > > Key: SPARK-27733 > URL: https://issues.apache.org/jira/browse/SPARK-27733 > Project: Spark > Issue Type: Improvement > Components: Build, Spark Core >Affects Versions: 3.0.0 >Reporter: Ismaël Mejía >Priority: Minor > > Avro 1.9.0 was released with many nice features including reduced size (1MB > less), and removed dependencies, no paranmer, no shaded guava, security > updates, so probably a worth upgrade. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27733) Upgrade to Avro 1.9.x
[ https://issues.apache.org/jira/browse/SPARK-27733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-27733: - Component/s: (was: Spark Core) SQL > Upgrade to Avro 1.9.x > - > > Key: SPARK-27733 > URL: https://issues.apache.org/jira/browse/SPARK-27733 > Project: Spark > Issue Type: Improvement > Components: Build, SQL >Affects Versions: 3.0.0 >Reporter: Ismaël Mejía >Priority: Minor > > Avro 1.9.0 was released with many nice features including reduced size (1MB > less), and removed dependencies, no paranmer, no shaded guava, security > updates, so probably a worth upgrade. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27741) Transitivity on predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-27741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-27741. -- Resolution: Duplicate > Transitivity on predicate pushdown > --- > > Key: SPARK-27741 > URL: https://issues.apache.org/jira/browse/SPARK-27741 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.1.1 >Reporter: U Shaw >Priority: Major > > When using inner join, where conditions can be passed to join on, and when > using outer join, even if the conditions are the same, only the predicate is > pushed down to left or right. > As follows: > select * from t1 left join t2 on t1.id=t2.id where t1.id=1 > --> select * from t1 left join on t1.id=t2.id and t2.id=1 where t1.id=1 > Is Catalyst can support transitivity on predicate pushdown ? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27741) Transitivity on predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-27741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841184#comment-16841184 ] Hyukjin Kwon commented on SPARK-27741: -- [~xyxiaoyou], can you check if the same issue exists in higher version? Let me leave this resolved for now. > Transitivity on predicate pushdown > --- > > Key: SPARK-27741 > URL: https://issues.apache.org/jira/browse/SPARK-27741 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.1.1 >Reporter: U Shaw >Priority: Major > > When using inner join, where conditions can be passed to join on, and when > using outer join, even if the conditions are the same, only the predicate is > pushed down to left or right. > As follows: > select * from t1 left join t2 on t1.id=t2.id where t1.id=1 > --> select * from t1 left join on t1.id=t2.id and t2.id=1 where t1.id=1 > Is Catalyst can support transitivity on predicate pushdown ? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27733) Upgrade to Avro 1.9.x
[ https://issues.apache.org/jira/browse/SPARK-27733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ismaël Mejía updated SPARK-27733: - Description: Avro 1.9.0 was released with many nice features including reduced size (1MB less), and removed dependencies, no paranmer, no shaded guava, security updates, so probably a worth upgrade. (was: Avro 1.9.0 was released with many nice features including reduced size 1MB less, and removed dependencies, no paranmer, no shaded avro, security updates, so probably a worth upgrade.) > Upgrade to Avro 1.9.x > - > > Key: SPARK-27733 > URL: https://issues.apache.org/jira/browse/SPARK-27733 > Project: Spark > Issue Type: Improvement > Components: Build, Spark Core >Affects Versions: 3.0.0 >Reporter: Ismaël Mejía >Priority: Minor > > Avro 1.9.0 was released with many nice features including reduced size (1MB > less), and removed dependencies, no paranmer, no shaded guava, security > updates, so probably a worth upgrade. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27741) Transitivity on predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-27741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] U Shaw updated SPARK-27741: --- Affects Version/s: (was: 2.4.3) 2.1.1 > Transitivity on predicate pushdown > --- > > Key: SPARK-27741 > URL: https://issues.apache.org/jira/browse/SPARK-27741 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.1.1 >Reporter: U Shaw >Priority: Major > > When using inner join, where conditions can be passed to join on, and when > using outer join, even if the conditions are the same, only the predicate is > pushed down to left or right. > As follows: > select * from t1 left join t2 on t1.id=t2.id where t1.id=1 > --> select * from t1 left join on t1.id=t2.id and t2.id=1 where t1.id=1 > Is Catalyst can support transitivity on predicate pushdown ? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24374) SPIP: Support Barrier Execution Mode in Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16840122#comment-16840122 ] Ruiguang Pei edited comment on SPARK-24374 at 5/16/19 9:17 AM: --- Hi, [~mengxr],[~jiangxb1987] when I'm using Barrier Execution Mode, it seems that I can't partition my data more than the number of total cores, otherwise it will throw the exception ["Barrier execution mode does not allow run a barrier stage that requires more slots than the total number of slots in the cluster currently."]. Suppose that I have an extremely large RDD, but only 4 cores are available, which means that each partition is still too large. Will it cause potential performance problems? Do you have some plans to support the scenario that more slots can be requested than available? was (Author: ruiguang pei): Hi, [~mengxr] when I'm using Barrier Execution Mode, it seems that I can't partition my data more than the number of total cores, otherwise it will throw the exception ["Barrier execution mode does not allow run a barrier stage that requires more slots than the total number of slots in the cluster currently."]. Suppose that I have an extremely large RDD, but only 4 cores are available, which means that each partition is still too large. Will it cause potential performance problems? Do you have some plans to support the scenario that more slots can be requested than available? > SPIP: Support Barrier Execution Mode in Apache Spark > > > Key: SPARK-24374 > URL: https://issues.apache.org/jira/browse/SPARK-24374 > Project: Spark > Issue Type: Epic > Components: ML, Spark Core >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Labels: Hydrogen, SPIP > Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf > > > (See details in the linked/attached SPIP doc.)
> {quote} > The proposal here is to add a new scheduling model to Apache Spark so users > can properly embed distributed DL training as a Spark stage to simplify the > distributed training workflow. For example, Horovod uses MPI to implement > all-reduce to accelerate distributed TensorFlow training. The computation > model is different from MapReduce used by Spark. In Spark, a task in a stage > doesn’t depend on any other tasks in the same stage, and hence it can be > scheduled independently. In MPI, all workers start at the same time and pass > messages around. To embed this workload in Spark, we need to introduce a new > scheduling model, tentatively named “barrier scheduling”, which launches > tasks at the same time and provides users enough information and tooling to > embed distributed DL training. Spark can also provide an extra layer of fault > tolerance in case some tasks failed in the middle, where Spark would abort > all tasks and restart the stage. > {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
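The "all workers start at the same time" contract described in the quoted proposal can be mimicked on a single JVM with a `java.util.concurrent.CyclicBarrier` — a toy illustration of the scheduling model only, not Spark's barrier implementation (`BarrierSketch` and its method are hypothetical names):

```scala
import java.util.concurrent.CyclicBarrier

object BarrierSketch {
  // Launch n workers that must all reach the barrier before any proceeds,
  // mimicking a barrier stage where no task runs ahead of the others.
  def run(n: Int): Seq[Int] = {
    val results = new Array[Int](n)
    val barrier = new CyclicBarrier(n)
    val threads = (0 until n).map { i =>
      new Thread(() => {
        barrier.await()     // no worker passes this point until all n arrive
        results(i) = i * 2  // stand-in for one step of synchronized work
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
    results.toSeq
  }
}
```

The sketch also hints at why the scheduler must refuse to launch with too few slots, as in the exception quoted in the comment above: if fewer than `n` workers ever start, `barrier.await()` would block forever.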
[jira] [Commented] (SPARK-27741) Transitivity on predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-27741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841087#comment-16841087 ] Yuming Wang commented on SPARK-27741: - This should be supported since SPARK-21479. > Transitivity on predicate pushdown > --- > > Key: SPARK-27741 > URL: https://issues.apache.org/jira/browse/SPARK-27741 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.3 >Reporter: U Shaw >Priority: Major > > When using inner join, where conditions can be passed to join on, and when > using outer join, even if the conditions are the same, only the predicate is > pushed down to left or right. > As follows: > select * from t1 left join t2 on t1.id=t2.id where t1.id=1 > --> select * from t1 left join on t1.id=t2.id and t2.id=1 where t1.id=1 > Is Catalyst can support transitivity on predicate pushdown ? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
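The inference Yuming points to (SPARK-21479, deriving new filters from join constraints) can be modeled as a toy transitivity step: from `t1.id = t2.id` and `t1.id = 1`, conclude `t2.id = 1`. The sketch below is a hedged illustration of the idea only — the case classes and `infer` function are invented for this example and bear no relation to Catalyst's actual rule or API:

```scala
object InferFilters {
  // A join equality "a = b" between two column names.
  final case class Equal(a: String, b: String)
  // A literal equality filter "col = value".
  final case class Lit(col: String, value: Int)

  // Derive new literal filters by transitivity:
  // if a = b and one side is pinned to a value, the other side is too.
  def infer(equalities: Seq[Equal], filters: Seq[Lit]): Set[Lit] = {
    val derived = for {
      Equal(a, b) <- equalities
      Lit(col, v) <- filters
      if col == a || col == b
    } yield Lit(if (col == a) b else a, v)
    derived.toSet -- filters.toSet
  }
}
```

Applied to the reporter's query, the derived `t2.id = 1` is what lets the optimizer push a filter into the right side of the outer join.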
[jira] [Updated] (SPARK-27741) Transitivity on predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-27741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] U Shaw updated SPARK-27741: --- Description: When using inner join, where conditions can be passed to join on, and when using outer join, even if the conditions are the same, only the predicate is pushed down to left or right. As follows: select * from t1 left join t2 on t1.id=t2.id where t1.id=1 --> select * from t1 left join t2 on t1.id=t2.id and t2.id=1 where t1.id=1 Can Catalyst support transitivity in predicate pushdown? > Transitivity on predicate pushdown > --- > > Key: SPARK-27741 > URL: https://issues.apache.org/jira/browse/SPARK-27741 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.3 >Reporter: U Shaw >Priority: Major > > When using inner join, where conditions can be passed to join on, and when > using outer join, even if the conditions are the same, only the predicate is > pushed down to left or right. > As follows: > select * from t1 left join t2 on t1.id=t2.id where t1.id=1 > --> select * from t1 left join t2 on t1.id=t2.id and t2.id=1 where t1.id=1 > Can Catalyst support transitivity in predicate pushdown? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27741) Transitivity on predicate pushdown
U Shaw created SPARK-27741: -- Summary: Transitivity on predicate pushdown Key: SPARK-27741 URL: https://issues.apache.org/jira/browse/SPARK-27741 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.4.3 Reporter: U Shaw -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org