[jira] [Assigned] (SPARK-27893) Create an integrated test base for Python, Scalar Pandas, Scala UDF by sql files

2019-05-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27893:


Assignee: Apache Spark

> Create an integrated test base for Python, Scalar Pandas, Scala UDF by sql 
> files
> 
>
> Key: SPARK-27893
> URL: https://issues.apache.org/jira/browse/SPARK-27893
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Critical
>
> See https://github.com/apache/spark/pull/24675#issuecomment-495591310
> Due to the lack of Python tests, and the difficulty of writing PySpark 
> tests for query plans, we have faced multiple issues so far.
> This JIRA aims to add a test base that works with SQL files.






[jira] [Assigned] (SPARK-27893) Create an integrated test base for Python, Scalar Pandas, Scala UDF by sql files

2019-05-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27893:


Assignee: (was: Apache Spark)

> Create an integrated test base for Python, Scalar Pandas, Scala UDF by sql 
> files
> 
>
> Key: SPARK-27893
> URL: https://issues.apache.org/jira/browse/SPARK-27893
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Critical
>
> See https://github.com/apache/spark/pull/24675#issuecomment-495591310
> Due to the lack of Python tests, and the difficulty of writing PySpark 
> tests for query plans, we have faced multiple issues so far.
> This JIRA aims to add a test base that works with SQL files.






[jira] [Created] (SPARK-27893) Create an integrated test base for Python, Scalar Pandas, Scala UDF by sql files

2019-05-30 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-27893:


 Summary: Create an integrated test base for Python, Scalar Pandas, 
Scala UDF by sql files
 Key: SPARK-27893
 URL: https://issues.apache.org/jira/browse/SPARK-27893
 Project: Spark
  Issue Type: Test
  Components: PySpark, SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon


See https://github.com/apache/spark/pull/24675#issuecomment-495591310

Due to the lack of Python tests, and the difficulty of writing PySpark 
tests for query plans, we have faced multiple issues so far.

This JIRA aims to add a test base that works with SQL files.
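For illustration only, a minimal sketch of what an SQL-file-driven UDF test could look like; the object name, the {{udf}} function name, and the file layout are assumptions, not the actual test base added by the PR:

{code:java}
// Sketch, not the real test base: run the same SQL text once per UDF flavour.
import org.apache.spark.sql.SparkSession

object SqlFileUdfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("sql-file-udf").getOrCreate()

    // Queries that a .sql test file would contain, written against a generic udf().
    val queries = Seq("SELECT udf(id) AS v FROM range(10)")

    // Scala UDF flavour; the integrated test base would also bind equivalent
    // Python and Scalar Pandas UDFs (registered through the PySpark API).
    spark.udf.register("udf", (x: Long) => x)

    queries.foreach(q => spark.sql(q).show())
    spark.stop()
  }
}
{code}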






[jira] [Comment Edited] (SPARK-24815) Structured Streaming should support dynamic allocation

2019-05-30 Thread Karthik Palaniappan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852541#comment-16852541
 ] 

Karthik Palaniappan edited comment on SPARK-24815 at 5/31/19 2:26 AM:
--

I was starting to update the JIRA description with a problem statement, then 
realized I am unfamiliar with some of the challenges you guys mentioned in the 
comments, in particular how state is managed in structured streaming.

I was imagining that processing rate was the correct heuristic, assuming the 
goal is to just keep up with the input, even at the expense of processing time. 
Continuous processing seems to solve the separate case where you need ultra low 
latency processing.

[~skonto] [~kabhwan] [~gsomogyi] if you guys help with a design, I'd be happy 
to help with the implementation, but for now I will drop this JIRA.


was (Author: karthik palaniappan):
I was starting to update the JIRA description with a problem statement, then 
realized I am unfamiliar with some of the challenges you guys mentioned in the 
comments, in particular how state is managed in structured streaming.

I was imagining that processing rate was the correct heuristic, assuming the 
goal is to keep up with the input. Continuous processing seems to solve the 
separate case where you need ultra low latency.

[~skonto] [~kabhwan] [~gsomogyi] if you guys help with a design, I'd be happy 
to help with the implementation, but for now I will drop this JIRA.

> Structured Streaming should support dynamic allocation
> --
>
> Key: SPARK-24815
> URL: https://issues.apache.org/jira/browse/SPARK-24815
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Karthik Palaniappan
>Priority: Minor
>
> For batch jobs, dynamic allocation is very useful for adding and removing 
> containers to match the actual workload. On multi-tenant clusters, it ensures 
> that a Spark job is taking no more resources than necessary. In cloud 
> environments, it enables autoscaling.
> However, if you set spark.dynamicAllocation.enabled=true and run a structured 
> streaming job, the batch dynamic allocation algorithm kicks in. It requests 
> more executors if the task backlog is a certain size, and removes executors 
> if they idle for a certain period of time.
> Quick thoughts:
> 1) Dynamic allocation should be pluggable, rather than hardcoded to a 
> particular implementation in SparkContext.scala (this should be a separate 
> JIRA).
> 2) We should make a structured streaming algorithm that's separate from the 
> batch algorithm. Eventually, continuous processing might need its own 
> algorithm.
> 3) Spark should print a warning if you run a structured streaming job when 
> Core's dynamic allocation is enabled






[jira] [Commented] (SPARK-27891) Long running spark jobs fail because of HDFS delegation token expires

2019-05-30 Thread hemshankar sahu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852573#comment-16852573
 ] 

hemshankar sahu commented on SPARK-27891:
-

Sure, I'll provide logs for 2.3.1 in some time, as the job needs to run for a long time.

> Long running spark jobs fail because of HDFS delegation token expires
> -
>
> Key: SPARK-27891
> URL: https://issues.apache.org/jira/browse/SPARK-27891
> Project: Spark
>  Issue Type: Bug
>  Components: Security
>Affects Versions: 2.0.1, 2.1.0, 2.3.1, 2.4.1
>Reporter: hemshankar sahu
>Priority: Major
> Attachments: application_1559242207407_0001.log
>
>
> When the spark job runs on a secured cluster for longer than the time 
> mentioned in the dfs.namenode.delegation.token.renew-interval property of 
> hdfs-site.xml, the spark job fails.
> Following command was used to submit the spark job
> bin/spark-submit --principal acekrbuser --keytab ~/keytabs/acekrbuser.keytab 
> --master yarn --deploy-mode cluster examples/src/main/python/wordcount.py 
> /tmp/ff1.txt
>  
> Application Logs attached
>  






[jira] [Updated] (SPARK-27876) Split large shuffle partition to multi-segments to enable transfer oversize shuffle partition block.

2019-05-30 Thread feiwang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-27876:

Affects Version/s: (was: 3.1.0)
   (was: 2.4.3)
   2.3.2

> Split large shuffle partition to multi-segments to enable transfer oversize 
> shuffle partition block.
> 
>
> Key: SPARK-27876
> URL: https://issues.apache.org/jira/browse/SPARK-27876
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 2.3.2
>Reporter: feiwang
>Priority: Major
>
> There is a limit for shuffle read. 
> If a shuffle partition block's size is larger than Integer.MaxValue (2GB) and 
> the block is fetched from a remote node, an exception will be thrown.
> {code:java}
> 2019-05-24 06:46:30,333 [9935] - WARN 
> [shuffle-client-6-2:TransportChannelHandler@78] - Exception in connection 
> from hadoop3747.jd.163.org/10.196.76.172:7337
> java.lang.IllegalArgumentException: Too large frame: 2991947178
> at 
> org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:119)
> at 
> org.apache.spark.network.util.TransportFrameDecoder.decodeNext(TransportFrameDecoder.java:133)
> at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:81)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
> {code}
> The task then throws a FetchFailedException.
> The task will retry, and it can only succeed if it is rescheduled to an 
> executor on the same host as the oversize shuffle partition block.
> However, if there is more than one oversize (>2GB) shuffle partition block, 
> the task can never succeed, which may cause the whole application to fail.
> In this PR, I propose a new way to fetch shuffle blocks: fetch an oversize 
> shuffle partition block in multiple segments.
> A brief introduction (see the sketch after this list):
> 1. Set a shuffle fetch threshold (SHUFFLE_FETCH_THRESHOLD) to Int.MaxValue (2GB).
> 2. Add a parameter, spark.shuffle.fetch.split, to control whether fetching a 
> large partition in multiple segments is enabled.
> 3. When creating the MapStatus, calculate the number of segments of each 
> shuffle block (Math.ceil(size / SHUFFLE_FETCH_THRESHOLD)), and only record 
> segment counts larger than 1.
> 4. Define a new BlockId type, ShuffleBlockSegmentId, used to identify the 
> fetch method.
> 5. When spark.shuffle.fetch.split is enabled, send a ShuffleBlockSegmentId 
> message to the shuffle service instead of a ShuffleBlockId message.
> 6. For a ShuffleBlockId, use a sequence of ManagedBuffers to represent its 
> block instead of a single ManagedBuffer.
> 7. In ShuffleBlockFetcherIterator, create a PriorityBlockQueue per 
> ShuffleBlockId to store the fetched SegmentManagedBuffers; when all segments 
> of a ShuffleBlockId are fetched, take the corresponding sequence of 
> ManagedBuffers (ordered by segmentId) as the success result for that 
> ShuffleBlockId.
> 8. On the shuffle service side, if the blockId of openBlocks is a 
> ShuffleBlockSegmentId, respond with a single segment ManagedBuffer of the 
> block; if the blockId is a ShuffleBlockId, respond with the whole 
> ManagedBuffer as before.
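A minimal, self-contained sketch of the segment bookkeeping from points 1-4 above; SHUFFLE_FETCH_THRESHOLD and ShuffleBlockSegmentId come from the proposal itself, not from the current Spark code base, and the field names are assumptions:

{code:java}
// Sketch only: how an oversize shuffle block could be split into fetchable segments.
object ShuffleSegmentSketch {
  // Point 1: one segment may be at most Int.MaxValue bytes (~2GB).
  val SHUFFLE_FETCH_THRESHOLD: Long = Int.MaxValue.toLong

  // Point 3: segment count for a block; only counts > 1 would be recorded in the MapStatus.
  def segmentCount(blockSize: Long): Int =
    math.ceil(blockSize.toDouble / SHUFFLE_FETCH_THRESHOLD).toInt

  // Point 4: a block id that addresses a single segment of a shuffle block.
  case class ShuffleBlockSegmentId(shuffleId: Int, mapId: Int, reduceId: Int, segmentId: Int)

  def main(args: Array[String]): Unit = {
    val oversizeBlock = 2991947178L            // the frame size from the log excerpt above
    println(segmentCount(oversizeBlock))       // prints 2
    println(ShuffleBlockSegmentId(0, 5, 3, segmentId = 0))
  }
}
{code}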






[jira] [Commented] (SPARK-24815) Structured Streaming should support dynamic allocation

2019-05-30 Thread Karthik Palaniappan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852541#comment-16852541
 ] 

Karthik Palaniappan commented on SPARK-24815:
-

I was starting to update the JIRA description with a problem statement, then 
realized I am unfamiliar with some of the challenges you guys mentioned in the 
comments, in particular how state is managed in structured streaming.

I was imagining that processing rate was the correct heuristic, assuming the 
goal is to keep up with the input. Continuous processing seems to solve the 
separate case where you need ultra low latency.

[~skonto] [~kabhwan] [~gsomogyi] if you guys help with a design, I'd be happy 
to help with the implementation, but for now I will drop this JIRA.

> Structured Streaming should support dynamic allocation
> --
>
> Key: SPARK-24815
> URL: https://issues.apache.org/jira/browse/SPARK-24815
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Karthik Palaniappan
>Priority: Minor
>
> For batch jobs, dynamic allocation is very useful for adding and removing 
> containers to match the actual workload. On multi-tenant clusters, it ensures 
> that a Spark job is taking no more resources than necessary. In cloud 
> environments, it enables autoscaling.
> However, if you set spark.dynamicAllocation.enabled=true and run a structured 
> streaming job, the batch dynamic allocation algorithm kicks in. It requests 
> more executors if the task backlog is a certain size, and removes executors 
> if they idle for a certain period of time.
> Quick thoughts:
> 1) Dynamic allocation should be pluggable, rather than hardcoded to a 
> particular implementation in SparkContext.scala (this should be a separate 
> JIRA).
> 2) We should make a structured streaming algorithm that's separate from the 
> batch algorithm. Eventually, continuous processing might need its own 
> algorithm.
> 3) Spark should print a warning if you run a structured streaming job when 
> Core's dynamic allocation is enabled






[jira] [Updated] (SPARK-24815) Structured Streaming should support dynamic allocation

2019-05-30 Thread Karthik Palaniappan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Palaniappan updated SPARK-24815:

Description: 
For batch jobs, dynamic allocation is very useful for adding and removing 
containers to match the actual workload. On multi-tenant clusters, it ensures 
that a Spark job is taking no more resources than necessary. In cloud 
environments, it enables autoscaling.

However, if you set spark.dynamicAllocation.enabled=true and run a structured 
streaming job, the batch dynamic allocation algorithm kicks in. It requests 
more executors if the task backlog is a certain size, and removes executors if 
they idle for a certain period of time.

Quick thoughts:

1) Dynamic allocation should be pluggable, rather than hardcoded to a 
particular implementation in SparkContext.scala (this should be a separate 
JIRA); see the sketch after this list.

2) We should make a structured streaming algorithm that's separate from the 
batch algorithm. Eventually, continuous processing might need its own algorithm.

3) Spark should print a warning if you run a structured streaming job when 
Core's dynamic allocation is enabled.
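A minimal sketch of the pluggable idea from point 1; the trait and class names are purely hypothetical (today's ExecutorAllocationManager is a concrete class wired into SparkContext), and a real streaming strategy would use richer signals than shown here:

{code:java}
// Hypothetical sketch only: a strategy interface that SparkContext could delegate to.
trait ExecutorAllocationStrategy {
  /** Returns the desired number of executors given simple load signals. */
  def targetExecutors(pendingTasks: Int, busyExecutors: Int, totalExecutors: Int): Int
}

// A streaming-oriented strategy might track processing rate vs. input rate
// instead of task backlog, as discussed in the comments on this issue.
class BacklogStrategy(minExecutors: Int, maxExecutors: Int) extends ExecutorAllocationStrategy {
  override def targetExecutors(pendingTasks: Int, busyExecutors: Int, totalExecutors: Int): Int = {
    val desired = if (pendingTasks > 0) totalExecutors + 1 else totalExecutors - 1
    math.min(maxExecutors, math.max(minExecutors, desired))
  }
}
{code}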

  was:
Dynamic allocation is very useful for adding and removing containers to match 
the actual workload. On multi-tenant clusters, it ensures that a Spark job is 
taking no more resources than necessary. In cloud environments, it enables 
autoscaling.

However, if you set spark.dynamicAllocation.enabled=true and run a structured 
streaming job, Core's dynamic allocation algorithm kicks in. It requests 
executors if the task backlog is a certain size, and remove executors if they 
idle for a certain period of time.

This does not make sense for streaming jobs, as outlined in 
https://issues.apache.org/jira/browse/SPARK-12133, which introduced dynamic 
allocation for the old streaming API.

First, Spark should print a warning if you run a structured streaming job when 
Core's dynamic allocation is enabled

Second, structured streaming should have support for dynamic allocation. It 
would be convenient if it were the same set of properties as Core's dynamic 
allocation, but I don't have a strong opinion on that.

If somebody can give me pointers on how to add dynamic allocation support, I'd 
be happy to take a stab.


> Structured Streaming should support dynamic allocation
> --
>
> Key: SPARK-24815
> URL: https://issues.apache.org/jira/browse/SPARK-24815
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Karthik Palaniappan
>Priority: Minor
>
> For batch jobs, dynamic allocation is very useful for adding and removing 
> containers to match the actual workload. On multi-tenant clusters, it ensures 
> that a Spark job is taking no more resources than necessary. In cloud 
> environments, it enables autoscaling.
> However, if you set spark.dynamicAllocation.enabled=true and run a structured 
> streaming job, the batch dynamic allocation algorithm kicks in. It requests 
> more executors if the task backlog is a certain size, and removes executors 
> if they idle for a certain period of time.
> Quick thoughts:
> 1) Dynamic allocation should be pluggable, rather than hardcoded to a 
> particular implementation in SparkContext.scala (this should be a separate 
> JIRA).
> 2) We should make a structured streaming algorithm that's separate from the 
> batch algorithm. Eventually, continuous processing might need its own 
> algorithm.
> 3) Spark should print a warning if you run a structured streaming job when 
> Core's dynamic allocation is enabled






[jira] [Commented] (SPARK-27884) Deprecate Python 2 support in Spark 3.0

2019-05-30 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852508#comment-16852508
 ] 

Hyukjin Kwon commented on SPARK-27884:
--

+1

> Deprecate Python 2 support in Spark 3.0
> ---
>
> Key: SPARK-27884
> URL: https://issues.apache.org/jira/browse/SPARK-27884
> Project: Spark
>  Issue Type: Story
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> Officially deprecate Python 2 support in Spark 3.0.






[jira] [Assigned] (SPARK-27862) Upgrade json4s-jackson to 3.6.6

2019-05-30 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-27862:
-

Assignee: Izek Greenfield

> Upgrade json4s-jackson to 3.6.6
> ---
>
> Key: SPARK-27862
> URL: https://issues.apache.org/jira/browse/SPARK-27862
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Izek Greenfield
>Assignee: Izek Greenfield
>Priority: Minor
>
> It would be very good to upgrade to a newer version.






[jira] [Resolved] (SPARK-27862) Upgrade json4s-jackson to 3.6.6

2019-05-30 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-27862.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24736
[https://github.com/apache/spark/pull/24736]

> Upgrade json4s-jackson to 3.6.6
> ---
>
> Key: SPARK-27862
> URL: https://issues.apache.org/jira/browse/SPARK-27862
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Izek Greenfield
>Assignee: Izek Greenfield
>Priority: Minor
> Fix For: 3.0.0
>
>
> It would be very good to upgrade to a newer version.






[jira] [Updated] (SPARK-27892) Saving/loading stages in PipelineModel should be parallel

2019-05-30 Thread Jason Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Wang updated SPARK-27892:
---
Description: 
When a PipelineModel is saved/loaded, all the stages are saved/loaded 
sequentially. When dealing with a PipelineModel with many stages, although each 
stage's save/load takes sub-second, the total time taken for the PipelineModel 
could be several minutes. It should be trivial to parallelize the save/load of 
stages in the SharedReadWrite object.

 

To reproduce:
{code:java}
import org.apache.spark.ml._
import org.apache.spark.ml.feature.VectorAssembler
import spark.implicits._  // assumes a SparkSession named `spark`; spark-shell imports this automatically

val outputPath = "..."
val stages = (1 to 100).map { i =>
  new VectorAssembler().setInputCols(Array("input")).setOutputCol("o" + i)
}
val p = new Pipeline().setStages(stages.toArray)
val data = Seq(1, 1, 1).toDF("input")
val pm = p.fit(data)
pm.save(outputPath){code}

  was:When a PipelineModel is saved/loaded, all the stages are saved/loaded 
sequentially. When dealing with a PipelineModel with many stages, although each 
stage's save/load takes sub-second, the total time taken for the PipelineModel 
could be several minutes. It should be trivial to parallelize the save/load of 
stages in the SharedReadWrite object.


> Saving/loading stages in PipelineModel should be parallel
> -
>
> Key: SPARK-27892
> URL: https://issues.apache.org/jira/browse/SPARK-27892
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.3
>Reporter: Jason Wang
>Priority: Minor
>  Labels: easyfix, performance
>
> When a PipelineModel is saved/loaded, all the stages are saved/loaded 
> sequentially. When dealing with a PipelineModel with many stages, although 
> each stage's save/load takes sub-second, the total time taken for the 
> PipelineModel could be several minutes. It should be trivial to parallelize 
> the save/load of stages in the SharedReadWrite object.
>  
> To reproduce:
> {code:java}
> import org.apache.spark.ml._
> import org.apache.spark.ml.feature.VectorAssembler
> import spark.implicits._  // assumes a SparkSession named `spark`; spark-shell imports this automatically
> val outputPath = "..."
> val stages = (1 to 100).map { i =>
>   new VectorAssembler().setInputCols(Array("input")).setOutputCol("o" + i)
> }
> val p = new Pipeline().setStages(stages.toArray)
> val data = Seq(1, 1, 1).toDF("input")
> val pm = p.fit(data)
> pm.save(outputPath){code}






[jira] [Created] (SPARK-27892) Saving/loading stages in PipelineModel should be parallel

2019-05-30 Thread Jason Wang (JIRA)
Jason Wang created SPARK-27892:
--

 Summary: Saving/loading stages in PipelineModel should be parallel
 Key: SPARK-27892
 URL: https://issues.apache.org/jira/browse/SPARK-27892
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.4.3
Reporter: Jason Wang


When a PipelineModel is saved/loaded, all the stages are saved/loaded 
sequentially. When dealing with a PipelineModel with many stages, although each 
stage's save/load takes sub-second, the total time taken for the PipelineModel 
could be several minutes. It should be trivial to parallelize the save/load of 
stages in the SharedReadWrite object.
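A minimal sketch of the kind of change being suggested, assuming a hypothetical helper outside the real Pipeline.SharedReadWrite code (the path layout and names are illustrative only, not the eventual fix):

{code:java}
// Sketch only: write each stage in its own Future instead of a sequential loop.
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

import org.apache.spark.ml.PipelineStage
import org.apache.spark.ml.util.MLWritable

object ParallelStageSaveSketch {
  def saveStages(stages: Array[PipelineStage], stagesDir: String): Unit = {
    val futures = stages.zipWithIndex.map { case (stage, idx) =>
      Future {
        // Each stage still gets its own subdirectory, as in the sequential version.
        stage.asInstanceOf[MLWritable].write.save(s"$stagesDir/${idx}_${stage.uid}")
      }
    }
    futures.foreach(f => Await.result(f, Duration.Inf))
  }
}
{code}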






[jira] [Assigned] (SPARK-27684) Reduce ScalaUDF conversion overheads for primitives

2019-05-30 Thread Josh Rosen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-27684:
--

Assignee: Marco Gaido

> Reduce ScalaUDF conversion overheads for primitives
> ---
>
> Key: SPARK-27684
> URL: https://issues.apache.org/jira/browse/SPARK-27684
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Assignee: Marco Gaido
>Priority: Major
>
> I believe that we can reduce ScalaUDF overheads when operating over primitive 
> types.
> In [ScalaUDF's 
> doGenCode|https://github.com/apache/spark/blob/5a8aad01c2aaf0ceef8e9a3cfabbd2e88c8d9f0d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala#L991]
>  we have logic to convert UDF function input types from Catalyst internal 
> types to Scala types (for example, this is used to convert UTF8Strings to 
> Java Strings). Similarly, we convert UDF return types.
> However, UDF input argument conversion is effectively a no-op for primitive 
> types because {{CatalystTypeConverters.createToScalaConverter()}} returns 
> {{identity}} in those cases. UDF result conversion is a little trickier 
> because {{createToCatalystConverter()}} returns [a 
> function|https://github.com/apache/spark/blob/5a8aad01c2aaf0ceef8e9a3cfabbd2e88c8d9f0d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L413]
>  that handles {{Option[Primitive]}}, but it might be the case that the 
> Option-boxing is unusable via ScalaUDF (in which case the conversion truly is 
> an {{identity}} no-op).
> These unnecessary no-op conversions could be quite expensive because each 
> call involves an index into the {{references}} array to get the converters, a 
> second index into the converters array to get the correct converter for the 
> nth input argument, and, finally, the converter invocation itself:
> {code:java}
> Object project_arg_0 = false ? null : ((scala.Function1[]) references[1] /* 
> converters */)[0].apply(project_value_3);{code}
> In these cases, I believe that we can reduce lookup / invocation overheads by 
> modifying the ScalaUDF code generation to eliminate the conversion calls for 
> primitives and directly assign the unconverted result, e.g.
> {code:java}
> Object project_arg_0 = false ? null : project_value_3;{code}
> To cleanly handle the case where we have a multi-argument UDF accepting a 
> mixture of primitive and non-primitive types, we might be able to keep the 
> {{converters}} array the same size (so indexes stay the same) but omit the 
> invocation of the converters for the primitive arguments (e.g. {{converters}} 
> is sparse / contains unused entries in case of primitives).
> I spotted this optimization while trying to construct some quick benchmarks 
> to measure UDF invocation overheads. For example:
> {code:java}
> spark.udf.register("identity", (x: Int) => x)
> sql("select id, id * 2, id * 3 from range(1000 * 1000 * 1000)").rdd.count() 
> // ~ 52 seconds
> sql("select identity(id), identity(id * 2), identity(id * 3) from range(1000 
> * 1000 * 1000)").rdd.count() // ~84 seconds{code}
> I'm curious to see whether the optimization suggested here can close this 
> performance gap. It'd also be a good idea to construct more principled 
> microbenchmarks covering multi-argument UDFs, projections involving multiple 
> UDFs over different input and output types, etc.
>  






[jira] [Resolved] (SPARK-27684) Reduce ScalaUDF conversion overheads for primitives

2019-05-30 Thread Josh Rosen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-27684.

   Resolution: Fixed
Fix Version/s: 3.0.0

Fixed for 3.0 in https://github.com/apache/spark/pull/24636

> Reduce ScalaUDF conversion overheads for primitives
> ---
>
> Key: SPARK-27684
> URL: https://issues.apache.org/jira/browse/SPARK-27684
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 3.0.0
>
>
> I believe that we can reduce ScalaUDF overheads when operating over primitive 
> types.
> In [ScalaUDF's 
> doGenCode|https://github.com/apache/spark/blob/5a8aad01c2aaf0ceef8e9a3cfabbd2e88c8d9f0d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala#L991]
>  we have logic to convert UDF function input types from Catalyst internal 
> types to Scala types (for example, this is used to convert UTF8Strings to 
> Java Strings). Similarly, we convert UDF return types.
> However, UDF input argument conversion is effectively a no-op for primitive 
> types because {{CatalystTypeConverters.createToScalaConverter()}} returns 
> {{identity}} in those cases. UDF result conversion is a little trickier 
> because {{createToCatalystConverter()}} returns [a 
> function|https://github.com/apache/spark/blob/5a8aad01c2aaf0ceef8e9a3cfabbd2e88c8d9f0d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L413]
>  that handles {{Option[Primitive]}}, but it might be the case that the 
> Option-boxing is unusable via ScalaUDF (in which case the conversion truly is 
> an {{identity}} no-op).
> These unnecessary no-op conversions could be quite expensive because each 
> call involves an index into the {{references}} array to get the converters, a 
> second index into the converters array to get the correct converter for the 
> nth input argument, and, finally, the converter invocation itself:
> {code:java}
> Object project_arg_0 = false ? null : ((scala.Function1[]) references[1] /* 
> converters */)[0].apply(project_value_3);{code}
> In these cases, I believe that we can reduce lookup / invocation overheads by 
> modifying the ScalaUDF code generation to eliminate the conversion calls for 
> primitives and directly assign the unconverted result, e.g.
> {code:java}
> Object project_arg_0 = false ? null : project_value_3;{code}
> To cleanly handle the case where we have a multi-argument UDF accepting a 
> mixture of primitive and non-primitive types, we might be able to keep the 
> {{converters}} array the same size (so indexes stay the same) but omit the 
> invocation of the converters for the primitive arguments (e.g. {{converters}} 
> is sparse / contains unused entries in case of primitives).
> I spotted this optimization while trying to construct some quick benchmarks 
> to measure UDF invocation overheads. For example:
> {code:java}
> spark.udf.register("identity", (x: Int) => x)
> sql("select id, id * 2, id * 3 from range(1000 * 1000 * 1000)").rdd.count() 
> // ~ 52 seconds
> sql("select identity(id), identity(id * 2), identity(id * 3) from range(1000 
> * 1000 * 1000)").rdd.count() // ~84 seconds{code}
> I'm curious to see whether the optimization suggested here can close this 
> performance gap. It'd also be a good idea to construct more principled 
> microbenchmarks covering multi-argument UDFs, projections involving multiple 
> UDFs over different input and output types, etc.
>  






[jira] [Assigned] (SPARK-27890) Improve SQL parser error message when missing backquotes for identifiers with hyphens

2019-05-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27890:


Assignee: Apache Spark

> Improve SQL parser error message when missing backquotes for identifiers with 
> hyphens
> -
>
> Key: SPARK-27890
> URL: https://issues.apache.org/jira/browse/SPARK-27890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Yesheng Ma
>Assignee: Apache Spark
>Priority: Major
>
> The current SQL parser's error message for hyphen-connected identifiers without 
> surrounding backquotes (e.g. {{hyphen-table}}) is confusing for end users. A 
> possible approach is to explicitly capture these incorrect usages 
> in the SQL parser, so that end users can fix these errors more 
> quickly.






[jira] [Assigned] (SPARK-27890) Improve SQL parser error message when missing backquotes for identifiers with hyphens

2019-05-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27890:


Assignee: (was: Apache Spark)

> Improve SQL parser error message when missing backquotes for identifiers with 
> hyphens
> -
>
> Key: SPARK-27890
> URL: https://issues.apache.org/jira/browse/SPARK-27890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Yesheng Ma
>Priority: Major
>
> The current SQL parser's error message for hyphen-connected identifiers without 
> surrounding backquotes (e.g. {{hyphen-table}}) is confusing for end users. A 
> possible approach is to explicitly capture these incorrect usages 
> in the SQL parser, so that end users can fix these errors more 
> quickly.






[jira] [Updated] (SPARK-27772) SQLTestUtils Refactoring

2019-05-30 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27772:
--
Component/s: Tests

> SQLTestUtils Refactoring
> 
>
> Key: SPARK-27772
> URL: https://issues.apache.org/jira/browse/SPARK-27772
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: William Wong
>Priority: Minor
>
> The current `SQLTestUtils` created many `withXXX` utility functions to clean 
> up tables/views/caches created for testing purpose. Some of those `withXXX` 
> functions ignore certain exceptions, like `NoSuchTableException` in the clean 
> up block (ie, the finally block). 
>  
> {code:java}
> /**
>  * Drops temporary view `viewNames` after calling `f`.
>  */
> protected def withTempView(viewNames: String*)(f: => Unit): Unit = {
>   try f finally {
> // If the test failed part way, we don't want to mask the failure by 
> failing to remove
> // temp views that never got created.
> try viewNames.foreach(spark.catalog.dropTempView) catch {
>   case _: NoSuchTableException =>
> }
>   }
> }
> {code}
> I believe this is not the best approach, because it is hard to anticipate which 
> exceptions should or should not be ignored.
>  
> Java's `try-with-resources` statement does not mask an exception thrown in the 
> try block with an exception thrown in the 'close()' method. An exception 
> thrown in 'close()' is added as a suppressed exception 
> instead. That sounds like a better approach.
>  
> Therefore, I propose to standardise those 'withXXX' functions on the following 
> `tryWithFinally` function, which does something similar to Java's 
> try-with-resources statement.
> {code:java}
> /**
> * Drops temporary view `viewNames` after calling `f`.
> */
> protected def withTempView(viewNames: String*)(f: => Unit): Unit = {
>   tryWithFinally(f)(viewNames.foreach(spark.catalog.dropTempView))
> }
> /**
>  * Executes the given tryBlock and then the given finallyBlock, no matter whether tryBlock throws
>  * an exception. If both tryBlock and finallyBlock throw exceptions, the exception thrown
>  * from the finallyBlock will be added to the exception thrown from tryBlock as a
>  * suppressed exception. This helps to avoid masking the exception from tryBlock with the exception
>  * from finallyBlock.
>  */
> private def tryWithFinally(tryBlock: => Unit)(finallyBlock: => Unit): Unit = {
>   var fromTryBlock: Throwable = null
>   try tryBlock catch {
> case cause: Throwable =>
>   fromTryBlock = cause
>   throw cause
>   } finally {
> if (fromTryBlock != null) {
>   try finallyBlock catch {
> case fromFinallyBlock: Throwable =>
>   fromTryBlock.addSuppressed(fromFinallyBlock)
>   throw fromTryBlock
>   }
> } else {
>   finallyBlock
> }
>   }
> }
> {code}
> If a feature is well written, we should not hit any exception in those cleanup 
> methods in test cases. The purpose of this proposal is to help developers 
> identify what actually breaks their tests.






[jira] [Commented] (SPARK-27891) Long running spark jobs fail because of HDFS delegation token expires

2019-05-30 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852420#comment-16852420
 ] 

Marcelo Vanzin commented on SPARK-27891:


Ok, the updated logs show the issue. But they're from Spark 2.2.0, which is 
EOL. If you can provide logs from the latest 2.3 or 2.4 releases, that would 
be more helpful (since there have been a few changes in this area).

> Long running spark jobs fail because of HDFS delegation token expires
> -
>
> Key: SPARK-27891
> URL: https://issues.apache.org/jira/browse/SPARK-27891
> Project: Spark
>  Issue Type: Bug
>  Components: Security
>Affects Versions: 2.0.1, 2.1.0, 2.3.1, 2.4.1
>Reporter: hemshankar sahu
>Priority: Major
> Attachments: application_1559242207407_0001.log
>
>
> When the spark job runs on a secured cluster for longer than the time 
> mentioned in the dfs.namenode.delegation.token.renew-interval property of 
> hdfs-site.xml, the spark job fails.
> Following command was used to submit the spark job
> bin/spark-submit --principal acekrbuser --keytab ~/keytabs/acekrbuser.keytab 
> --master yarn --deploy-mode cluster examples/src/main/python/wordcount.py 
> /tmp/ff1.txt
>  
> Application Logs attached
>  






[jira] [Updated] (SPARK-27891) Long running spark jobs fail because of HDFS delegation token expires

2019-05-30 Thread hemshankar sahu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hemshankar sahu updated SPARK-27891:

Description: 
When the spark job runs on a secured cluster for longer than the time 
mentioned in the dfs.namenode.delegation.token.renew-interval property of 
hdfs-site.xml, the spark job fails.

Following command was used to submit the spark job

bin/spark-submit --principal acekrbuser --keytab ~/keytabs/acekrbuser.keytab 
--master yarn --deploy-mode cluster examples/src/main/python/wordcount.py 
/tmp/ff1.txt

 

Application Logs attached

 

  was:
When the spark job runs on a secured cluster for longer than the time 
mentioned in the dfs.namenode.delegation.token.renew-interval property of 
hdfs-site.xml, the spark job fails.

Following command was used to submit the spark job

bin/spark-submit --principal acekrbuser --keytab ~/keytabs/acekrbuser.keytab 
--master yarn --deploy-mode cluster examples/src/main/python/wordcount.py 
/tmp/ff1.txt

 

Application Logs pasted in Docs Text

 


> Long running spark jobs fail because of HDFS delegation token expires
> -
>
> Key: SPARK-27891
> URL: https://issues.apache.org/jira/browse/SPARK-27891
> Project: Spark
>  Issue Type: Bug
>  Components: Security
>Affects Versions: 2.0.1, 2.1.0, 2.3.1, 2.4.1
>Reporter: hemshankar sahu
>Priority: Major
> Attachments: application_1559242207407_0001.log
>
>
> When the spark job runs on a secured cluster for longer than the time 
> mentioned in the dfs.namenode.delegation.token.renew-interval property of 
> hdfs-site.xml, the spark job fails.
> Following command was used to submit the spark job
> bin/spark-submit --principal acekrbuser --keytab ~/keytabs/acekrbuser.keytab 
> --master yarn --deploy-mode cluster examples/src/main/python/wordcount.py 
> /tmp/ff1.txt
>  
> Application Logs attached
>  






[jira] [Updated] (SPARK-27891) Long running spark jobs fail because of HDFS delegation token expires

2019-05-30 Thread hemshankar sahu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hemshankar sahu updated SPARK-27891:

Attachment: application_1559242207407_0001.log

> Long running spark jobs fail because of HDFS delegation token expires
> -
>
> Key: SPARK-27891
> URL: https://issues.apache.org/jira/browse/SPARK-27891
> Project: Spark
>  Issue Type: Bug
>  Components: Security
>Affects Versions: 2.0.1, 2.1.0, 2.3.1, 2.4.1
>Reporter: hemshankar sahu
>Priority: Major
> Attachments: application_1559242207407_0001.log
>
>
> When the spark job runs on a secured cluster for longer than the time 
> mentioned in the dfs.namenode.delegation.token.renew-interval property of 
> hdfs-site.xml, the spark job fails.
> Following command was used to submit the spark job
> bin/spark-submit --principal acekrbuser --keytab ~/keytabs/acekrbuser.keytab 
> --master yarn --deploy-mode cluster examples/src/main/python/wordcount.py 
> /tmp/ff1.txt
>  
> Application Logs pasted in Docs Text
>  






[jira] [Commented] (SPARK-27891) Long running spark jobs fail because of HDFS delegation token expires

2019-05-30 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852408#comment-16852408
 ] 

Marcelo Vanzin commented on SPARK-27891:


{{container_e48_1559242207407_0001_02_01}} tells me that's the second 
attempt. That makes your problem the same as SPARK-23361.

> Long running spark jobs fail because of HDFS delegation token expires
> -
>
> Key: SPARK-27891
> URL: https://issues.apache.org/jira/browse/SPARK-27891
> Project: Spark
>  Issue Type: Bug
>  Components: Security
>Affects Versions: 2.0.1, 2.1.0, 2.3.1, 2.4.1
>Reporter: hemshankar sahu
>Priority: Major
>
> When the spark job runs on a secured cluster for longer than the time 
> mentioned in the dfs.namenode.delegation.token.renew-interval property of 
> hdfs-site.xml, the spark job fails.
> Following command was used to submit the spark job
> bin/spark-submit --principal acekrbuser --keytab ~/keytabs/acekrbuser.keytab 
> --master yarn --deploy-mode cluster examples/src/main/python/wordcount.py 
> /tmp/ff1.txt
>  
> Application Logs pasted in Docs Text
>  






[jira] [Updated] (SPARK-27891) Long running spark jobs fail because of HDFS delegation token expires

2019-05-30 Thread hemshankar sahu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hemshankar sahu updated SPARK-27891:

Description: 
When the spark job runs on a secured cluster for longer than the time 
mentioned in the dfs.namenode.delegation.token.renew-interval property of 
hdfs-site.xml, the spark job fails.

Following command was used to submit the spark job

bin/spark-submit --principal acekrbuser --keytab ~/keytabs/acekrbuser.keytab 
--master yarn --deploy-mode cluster examples/src/main/python/wordcount.py 
/tmp/ff1.txt

 

Application Logs pasted in Docs Text

 

  was:
When the spark job runs on a secured cluster for longer than the time 
mentioned in the dfs.namenode.delegation.token.renew-interval property of 
hdfs-site.xml, the spark job fails.

 

 

Application Logs pasted in Docs Text

 


> Long running spark jobs fail because of HDFS delegation token expires
> -
>
> Key: SPARK-27891
> URL: https://issues.apache.org/jira/browse/SPARK-27891
> Project: Spark
>  Issue Type: Bug
>  Components: Security
>Affects Versions: 2.0.1, 2.1.0, 2.3.1, 2.4.1
>Reporter: hemshankar sahu
>Priority: Major
>
> When the spark job runs on a secured cluster for longer than the time 
> mentioned in the dfs.namenode.delegation.token.renew-interval property of 
> hdfs-site.xml, the spark job fails.
> Following command was used to submit the spark job
> bin/spark-submit --principal acekrbuser --keytab ~/keytabs/acekrbuser.keytab 
> --master yarn --deploy-mode cluster examples/src/main/python/wordcount.py 
> /tmp/ff1.txt
>  
> Application Logs pasted in Docs Text
>  






[jira] [Reopened] (SPARK-27891) Long running spark jobs fail because of HDFS delegation token expires

2019-05-30 Thread hemshankar sahu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hemshankar sahu reopened SPARK-27891:
-

We used the following command to submit the job:

bin/spark-submit --principal acekrbuser --keytab ~/keytabs/acekrbuser.keytab 
--master yarn --deploy-mode cluster examples/src/main/python/wordcount.py 
/tmp/ff1.txt

> Long running spark jobs fail because of HDFS delegation token expires
> -
>
> Key: SPARK-27891
> URL: https://issues.apache.org/jira/browse/SPARK-27891
> Project: Spark
>  Issue Type: Bug
>  Components: Security
>Affects Versions: 2.0.1, 2.1.0, 2.3.1, 2.4.1
>Reporter: hemshankar sahu
>Priority: Major
>
> When the spark job runs on a secured cluster for longer than the time 
> mentioned in the dfs.namenode.delegation.token.renew-interval property of 
> hdfs-site.xml, the spark job fails.
>  
>  
> Application Logs pasted in Docs Text
>  






[jira] [Comment Edited] (SPARK-27812) kubernetes client import non-daemon thread which block jvm exit.

2019-05-30 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852402#comment-16852402
 ] 

Dongjoon Hyun edited comment on SPARK-27812 at 5/30/19 10:06 PM:
-

Thank you for reporting, [~Andrew HUALI] and thank you for the investigation, 
[~igor.calabria].

Can we move forward to resolve the issue by upgrading the libraries?


was (Author: dongjoon):
Thank you for reporting, [~Andrew HUALI] and thank you for the investigation, 
[~igor.calabria].

> kubernetes client import non-daemon thread which block jvm exit.
> 
>
> Key: SPARK-27812
> URL: https://issues.apache.org/jira/browse/SPARK-27812
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.3
>Reporter: Henry Yu
>Priority: Major
>
> I tried spark-submit to K8s in cluster mode. The driver pod failed to exit 
> because of an OkHttp WebSocket non-daemon thread.
>  






[jira] [Commented] (SPARK-27812) kubernetes client import non-daemon thread which block jvm exit.

2019-05-30 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852402#comment-16852402
 ] 

Dongjoon Hyun commented on SPARK-27812:
---

Thank you for reporting, [~Andrew HUALI] and thank you for the investigation, 
[~igor.calabria].

> kubernetes client import non-daemon thread which block jvm exit.
> 
>
> Key: SPARK-27812
> URL: https://issues.apache.org/jira/browse/SPARK-27812
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.3
>Reporter: Henry Yu
>Priority: Major
>
> I tried spark-submit to K8s in cluster mode. The driver pod failed to exit 
> because of an OkHttp WebSocket non-daemon thread.
>  






[jira] [Issue Comment Deleted] (SPARK-22151) PYTHONPATH not picked up from the spark.yarn.appMasterEnv properly

2019-05-30 Thread Ruslan Dautkhanov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov updated SPARK-22151:
--
Comment: was deleted

(was: Is there is a workaround for this in Apache Livy? 
We're still on Spark 2.3 ..)

> PYTHONPATH not picked up from the spark.yarn.appMasterEnv properly
> --
>
> Key: SPARK-22151
> URL: https://issues.apache.org/jira/browse/SPARK-22151
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.1
>Reporter: Thomas Graves
>Assignee: Parth Gandhi
>Priority: Major
> Fix For: 2.4.0
>
>
> Running in yarn cluster mode and trying to set pythonpath via 
> spark.yarn.appMasterEnv.PYTHONPATH doesn't work.
> The YARN Client code looks at the env variables:
> val pythonPathStr = (sys.env.get("PYTHONPATH") ++ pythonPath)
> But when you set spark.yarn.appMasterEnv, it puts the value into the local env, 
> so the Python path set in spark.yarn.appMasterEnv isn't properly set.
> You can work around it if you are running in cluster mode by setting it on the 
> client like:
> PYTHONPATH=./addon/python/ spark-submit






[jira] [Resolved] (SPARK-27891) Long running spark jobs fail because of HDFS delegation token expires

2019-05-30 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-27891.

Resolution: Not A Problem

You have to provide a keytab for Spark for this to work.

That's explained in the docs:
http://spark.apache.org/docs/latest/security.html#kerberos

> Long running spark jobs fail because of HDFS delegation token expires
> -
>
> Key: SPARK-27891
> URL: https://issues.apache.org/jira/browse/SPARK-27891
> Project: Spark
>  Issue Type: Bug
>  Components: Security
>Affects Versions: 2.0.1, 2.1.0, 2.3.1, 2.4.1
>Reporter: hemshankar sahu
>Priority: Major
>
> When the spark job runs on a secured cluster for longer than the time 
> mentioned in the dfs.namenode.delegation.token.renew-interval property of 
> hdfs-site.xml, the spark job fails.
>  
>  
> Application Logs pasted in Docs Text
>  






[jira] [Created] (SPARK-27891) Long running spark jobs fail because of HDFS delegation token expires

2019-05-30 Thread hemshankar sahu (JIRA)
hemshankar sahu created SPARK-27891:
---

 Summary: Long running spark jobs fail because of HDFS delegation 
token expires
 Key: SPARK-27891
 URL: https://issues.apache.org/jira/browse/SPARK-27891
 Project: Spark
  Issue Type: Bug
  Components: Security
Affects Versions: 2.4.1, 2.3.1, 2.1.0, 2.0.1
Reporter: hemshankar sahu


When the spark job runs on a secured cluster for longer than the time 
mentioned in the dfs.namenode.delegation.token.renew-interval property of 
hdfs-site.xml, the spark job fails.

 

 

Application Logs pasted in Docs Text

 






[jira] [Resolved] (SPARK-27361) YARN support for GPU-aware scheduling

2019-05-30 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-27361.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

All the subtasks are finished, and the parts of the Spark-on-Hadoop-3.x work that 
we need are also merged, so resolving this.

> YARN support for GPU-aware scheduling
> -
>
> Key: SPARK-27361
> URL: https://issues.apache.org/jira/browse/SPARK-27361
> Project: Spark
>  Issue Type: Story
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Thomas Graves
>Priority: Major
> Fix For: 3.0.0
>
>
> Design and implement YARN support for GPU-aware scheduling:
>  * User can request GPU resources at Spark application level.
>  * How the Spark executor discovers GPU's when run on YARN
>  * Integrate with YARN 3.2 GPU support.






[jira] [Commented] (SPARK-22151) PYTHONPATH not picked up from the spark.yarn.appMasterEnv properly

2019-05-30 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852381#comment-16852381
 ] 

Ruslan Dautkhanov commented on SPARK-22151:
---

Is there is a workaround for this in Apache Livy? 
We're still on Spark 2.3 ..

> PYTHONPATH not picked up from the spark.yarn.appMasterEnv properly
> --
>
> Key: SPARK-22151
> URL: https://issues.apache.org/jira/browse/SPARK-22151
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.1
>Reporter: Thomas Graves
>Assignee: Parth Gandhi
>Priority: Major
> Fix For: 2.4.0
>
>
> Running in yarn cluster mode and trying to set pythonpath via 
> spark.yarn.appMasterEnv.PYTHONPATH doesn't work.
> The YARN Client code looks at the env variables:
> val pythonPathStr = (sys.env.get("PYTHONPATH") ++ pythonPath)
> But when you set spark.yarn.appMasterEnv, it puts the value into the local env, 
> so the Python path set in spark.yarn.appMasterEnv isn't properly set.
> You can work around it if you are running in cluster mode by setting it on the 
> client like:
> PYTHONPATH=./addon/python/ spark-submit






[jira] [Created] (SPARK-27890) Improve SQL parser error message when missing backquotes for identifiers with hyphens

2019-05-30 Thread Yesheng Ma (JIRA)
Yesheng Ma created SPARK-27890:
--

 Summary: Improve SQL parser error message when missing backquotes 
for identifiers with hyphens
 Key: SPARK-27890
 URL: https://issues.apache.org/jira/browse/SPARK-27890
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.3
Reporter: Yesheng Ma


The current SQL parser's error message for hyphen-connected identifiers without 
surrounding backquotes (e.g. {{hyphen-table}}) is confusing for end users. A 
possible approach is to explicitly capture these incorrect usages in 
the SQL parser, so that end users can fix these errors more quickly.
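For illustration, a hedged example of the query shape that produces the confusing message today versus the correctly backquoted identifier, assuming a {{spark}} session is in scope (e.g. in spark-shell); the exact error text differs by version:

{code:java}
// Without backquotes the hyphen ends the identifier, so the parser error
// does not point at the real mistake.
spark.sql("SELECT * FROM hyphen-table")    // confusing parse error today

// The intended form, with the identifier quoted in backquotes:
spark.sql("SELECT * FROM `hyphen-table`")
{code}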






[jira] [Assigned] (SPARK-27872) Driver and executors use a different service account breaking pull secrets

2019-05-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27872:


Assignee: Apache Spark

> Driver and executors use a different service account breaking pull secrets
> --
>
> Key: SPARK-27872
> URL: https://issues.apache.org/jira/browse/SPARK-27872
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Stavros Kontopoulos
>Assignee: Apache Spark
>Priority: Major
>
> Driver and executors use different service accounts when the driver has one 
> set up that is different from the default: 
> [https://gist.github.com/skonto/9beb5afa2ec4659ba563cbb0a8b9c4dd]
> This makes the executor pods fail when the user links the driver service 
> account with a pull secret: 
> [https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#add-imagepullsecrets-to-a-service-account].
>  Executors will not use the driver's service account and will not be able to 
> get the secret in order to pull the related image. 
> I am not sure what the assumption is for using the default account for 
> executors; probably the fact that this account is limited (executors don't 
> create resources, after all)? This is an inconsistency that could be worked 
> around with the pod template feature in Spark 3.0.0, but it breaks pull 
> secrets, and in general I think it is a bug to have it. 
>  
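As a conf-level sketch (an alternative to the pod template workaround mentioned 
above, with placeholder names "spark-sa" and "registry-secret"): pass the pull 
secret to Spark directly instead of relying on the driver's service account to 
carry it, so executor pods can still pull the image even though they run under a 
different account.

{code:java}
import org.apache.spark.SparkConf

// Placeholder values; the point is that the pull secret is no longer tied to the
// driver-only service account.
val conf = new SparkConf()
  .set("spark.kubernetes.authenticate.driver.serviceAccountName", "spark-sa")
  .set("spark.kubernetes.container.image.pullSecrets", "registry-secret")
{code}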



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27872) Driver and executors use a different service account breaking pull secrets

2019-05-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27872:


Assignee: (was: Apache Spark)

> Driver and executors use a different service account breaking pull secrets
> --
>
> Key: SPARK-27872
> URL: https://issues.apache.org/jira/browse/SPARK-27872
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> Driver and executors use different service accounts when the driver has one 
> set up that is different from the default: 
> [https://gist.github.com/skonto/9beb5afa2ec4659ba563cbb0a8b9c4dd]
> This makes the executor pods fail when the user links the driver service 
> account with a pull secret: 
> [https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#add-imagepullsecrets-to-a-service-account].
>  Executors will not use the driver's service account and will not be able to 
> get the secret in order to pull the related image. 
> I am not sure what the assumption is for using the default account for 
> executors; probably the fact that this account is limited (executors don't 
> create resources, after all)? This is an inconsistency that could be worked 
> around with the pod template feature in Spark 3.0.0, but it breaks pull 
> secrets, and in general I think it is a bug to have it. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24149) Automatic namespaces discovery in HDFS federation

2019-05-30 Thread Dhruve Ashar (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852365#comment-16852365
 ] 

Dhruve Ashar commented on SPARK-24149:
--

IMHO having a consistent way to reason about the behavior is preferable to 
specifying an additional config.

We encountered an issue while trying to get the filesystem using a logical 
nameservice (that is one more reason why this is broken), based on 
[https://github.com/apache/spark/pull/21216/files#diff-f8659513cf91c15097428c3d8dfbcc35R213]

 

> Automatic namespaces discovery in HDFS federation
> -
>
> Key: SPARK-24149
> URL: https://issues.apache.org/jira/browse/SPARK-24149
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.4.0
>
>
> Hadoop 3 introduced HDFS federation.
> Spark fails to write on different namespaces when Hadoop federation is turned 
> on and the cluster is secure. This happens because Spark looks for the 
> delegation token only for the defaultFS configured and not for all the 
> available namespaces. A workaround is the usage of the property 
> {{spark.yarn.access.hadoopFileSystems}}.
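A minimal sketch of that workaround, with placeholder nameservice URIs: list 
every federated namespace explicitly so delegation tokens are fetched for all of 
them, not only for the defaultFS.

{code:java}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Placeholder nameservices; tokens are then requested for each listed filesystem.
  .set("spark.yarn.access.hadoopFileSystems",
       "hdfs://nameservice1,hdfs://nameservice2")
{code}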



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27773) Add shuffle service metric for number of exceptions caught in ExternalShuffleBlockHandler

2019-05-30 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-27773.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24645
[https://github.com/apache/spark/pull/24645]

> Add shuffle service metric for number of exceptions caught in 
> ExternalShuffleBlockHandler
> -
>
> Key: SPARK-27773
> URL: https://issues.apache.org/jira/browse/SPARK-27773
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.4.3
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Minor
> Fix For: 3.0.0
>
>
> The health of the external shuffle service is currently difficult to monitor. 
> At least for the YARN shuffle service, the only current indication of health 
> is whether or not the shuffle service threads are running in the NodeManager. 
> However, we've seen that clients can sometimes experience elevated failure 
> rates on requests to the shuffle service even when those threads are running. 
> It would be helpful to have some indication of how often requests to the 
> shuffle service are failing, as this could be monitored, alerted on, etc.
> One suggestion (implemented in the PR I'll attach to this ticket) is to add a 
> metric to {{ExternalShuffleBlockHandler.ShuffleMetrics}} which keeps track of 
> how many times we caught an exception in the shuffle service's RPC handler. I 
> think that this gives us the insight into request failure rates that we're 
> currently missing, but obviously I'm open to alternatives as well if people 
> have other ideas.
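A rough sketch of the kind of counter the description proposes (not the actual 
change from the PR), assuming the existing ShuffleMetrics stay on the codahale 
metrics library so a counter like this would sit alongside them:

{code:java}
import com.codahale.metrics.{Counter, MetricRegistry}

// Standalone sketch: count exceptions caught while handling shuffle RPCs so the
// failure rate can be monitored and alerted on.
class ShuffleMetricsSketch {
  val registry = new MetricRegistry()
  val caughtExceptions: Counter = registry.counter("caughtExceptions")

  def handle(request: => Unit): Unit = {
    try request
    catch {
      case e: Exception =>
        caughtExceptions.inc() // exported with the other shuffle service metrics
        throw e
    }
  }
}
{code}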



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27773) Add shuffle service metric for number of exceptions caught in ExternalShuffleBlockHandler

2019-05-30 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-27773:
--

Assignee: Steven Rand

> Add shuffle service metric for number of exceptions caught in 
> ExternalShuffleBlockHandler
> -
>
> Key: SPARK-27773
> URL: https://issues.apache.org/jira/browse/SPARK-27773
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.4.3
>Reporter: Steven Rand
>Assignee: Steven Rand
>Priority: Minor
>
> The health of the external shuffle service is currently difficult to monitor. 
> At least for the YARN shuffle service, the only current indication of health 
> is whether or not the shuffle service threads are running in the NodeManager. 
> However, we've seen that clients can sometimes experience elevated failure 
> rates on requests to the shuffle service even when those threads are running. 
> It would be helpful to have some indication of how often requests to the 
> shuffle service are failing, as this could be monitored, alerted on, etc.
> One suggestion (implemented in the PR I'll attach to this ticket) is to add a 
> metric to {{ExternalShuffleBlockHandler.ShuffleMetrics}} which keeps track of 
> how many times we caught an exception in the shuffle service's RPC handler. I 
> think that this gives us the insight into request failure rates that we're 
> currently missing, but obviously I'm open to alternatives as well if people 
> have other ideas.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27812) kubernetes client import non-daemon thread which block jvm exit.

2019-05-30 Thread Igor Calabria (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852342#comment-16852342
 ] 

Igor Calabria commented on SPARK-27812:
---

I believe this was introduced when `kubernetes-client` was updated in 
https://issues.apache.org/jira/browse/SPARK-26742. I can confirm that reverting 
[https://github.com/apache/spark/commit/2d4e9cf84b85a5f8278276e8d8ff59f6f4b11c4c]
 fixes this issue. This is funny because it was already discussed here (2017): 
[https://github.com/apache-spark-on-k8s/spark/pull/216#discussion_r112641765]

The previous solution was setting the `.withWebsocketPingInterval(0)` option 
when instantiating the client. This prevented the user thread from being 
created. Maybe something changed with okhttp or kubernetes-client and this is 
no longer the case.
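For reference, a sketch of that earlier workaround (placeholder master URL), 
assuming the fabric8 ConfigBuilder still exposes the websocket ping interval; 
whether this still prevents the extra non-daemon thread with current 
okhttp/kubernetes-client versions is exactly the open question above.

{code:java}
import io.fabric8.kubernetes.client.{ConfigBuilder, DefaultKubernetesClient}

// A ping interval of 0 disables the websocket ping task that used to spawn
// a user (non-daemon) thread.
val config = new ConfigBuilder()
  .withMasterUrl("https://kubernetes.default.svc") // placeholder
  .withWebsocketPingInterval(0L)
  .build()
val client = new DefaultKubernetesClient(config)
{code}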

> kubernetes client import non-daemon thread which block jvm exit.
> 
>
> Key: SPARK-27812
> URL: https://issues.apache.org/jira/browse/SPARK-27812
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.3
>Reporter: Henry Yu
>Priority: Major
>
> I tried spark-submit to k8s in cluster mode. The driver pod failed to exit 
> because of an OkHttp websocket non-daemon thread.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27378) spark-submit requests GPUs in YARN mode

2019-05-30 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-27378.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24634
[https://github.com/apache/spark/pull/24634]

> spark-submit requests GPUs in YARN mode
> ---
>
> Key: SPARK-27378
> URL: https://issues.apache.org/jira/browse/SPARK-27378
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Submit, YARN
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Thomas Graves
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27378) spark-submit requests GPUs in YARN mode

2019-05-30 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-27378:
--

Assignee: Thomas Graves

> spark-submit requests GPUs in YARN mode
> ---
>
> Key: SPARK-27378
> URL: https://issues.apache.org/jira/browse/SPARK-27378
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Submit, YARN
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Thomas Graves
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27886) Add Apache Spark project to https://python3statement.org/

2019-05-30 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852276#comment-16852276
 ] 

Xiangrui Meng commented on SPARK-27886:
---

By "at the end of year", you mean year 2020, right? The proposed timeline sets 
the switch to 2020/04, a rough estimate of Spark 3.1 release.

> Add Apache Spark project to https://python3statement.org/
> -
>
> Key: SPARK-27886
> URL: https://issues.apache.org/jira/browse/SPARK-27886
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> Add Spark to https://python3statement.org/ and indicate our timeline. I 
> reviewed the statement at https://python3statement.org/. Most projects listed 
> there will *drop* Python 2 before 2020/01/01 instead of deprecating it with 
> only [one 
> exception|https://github.com/python3statement/python3statement.github.io/blob/6ccacf8beb3cc49b1b3d572ab4f841e250853ca9/site.js#L206].
>  We certainly cannot drop Python 2 support in 2019 given we haven't 
> deprecated it yet.
> Maybe we can add the following time line:
> * 2019/10 - 2020/04: Python 2 & 3
> * 2020/04 - : Python 3 only
> The switching is the next release after Spark 3.0. If we want to hold another 
> release, it would be Sept or Oct.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27798) from_avro can modify variables in other rows in local mode

2019-05-30 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852268#comment-16852268
 ] 

Dongjoon Hyun commented on SPARK-27798:
---

Thank you so much for the investigation and update, [~Gengliang.Wang].

> from_avro can modify variables in other rows in local mode
> --
>
> Key: SPARK-27798
> URL: https://issues.apache.org/jira/browse/SPARK-27798
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Yosuke Mori
>Priority: Blocker
>  Labels: correctness
> Attachments: Screen Shot 2019-05-21 at 2.39.27 PM.png
>
>
> Steps to reproduce:
> Create a local Dataset (at least two distinct rows) with a binary Avro field. 
> Use the {{from_avro}} function to deserialize the binary into another column. 
> Verify that all of the rows incorrectly have the same value.
> Here's a concrete example (using Spark 2.4.3). All it does is convert a list 
> of TestPayload objects into binary using the defined Avro schema, then try 
> to deserialize using {{from_avro}} with that same schema:
> {code:java}
> import org.apache.avro.Schema
> import org.apache.avro.generic.{GenericDatumWriter, GenericRecord, 
> GenericRecordBuilder}
> import org.apache.avro.io.EncoderFactory
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.avro.from_avro
> import org.apache.spark.sql.functions.col
> import java.io.ByteArrayOutputStream
> object TestApp extends App {
>   // Payload container
>   case class TestEvent(payload: Array[Byte])
>   // Deserialized Payload
>   case class TestPayload(message: String)
>   // Schema for Payload
>   val simpleSchema =
> """
>   |{
>   |"type": "record",
>   |"name" : "Payload",
>   |"fields" : [ {"name" : "message", "type" : [ "string", "null" ] } ]
>   |}
> """.stripMargin
>   // Convert TestPayload into avro binary
>   def generateSimpleSchemaBinary(record: TestPayload, avsc: String): 
> Array[Byte] = {
> val schema = new Schema.Parser().parse(avsc)
> val out = new ByteArrayOutputStream()
> val writer = new GenericDatumWriter[GenericRecord](schema)
> val encoder = EncoderFactory.get().binaryEncoder(out, null)
> val rootRecord = new GenericRecordBuilder(schema).set("message", 
> record.message).build()
> writer.write(rootRecord, encoder)
> encoder.flush()
> out.toByteArray
>   }
>   val spark: SparkSession = 
> SparkSession.builder().master("local[*]").getOrCreate()
>   import spark.implicits._
>   List(
> TestPayload("one"),
> TestPayload("two"),
> TestPayload("three"),
> TestPayload("four")
>   ).map(payload => TestEvent(generateSimpleSchemaBinary(payload, 
> simpleSchema)))
> .toDS()
> .withColumn("deserializedPayload", from_avro(col("payload"), 
> simpleSchema))
> .show(truncate = false)
> }
> {code}
> And here is what this program outputs:
> {noformat}
> +--+---+
> |payload   |deserializedPayload|
> +--+---+
> |[00 06 6F 6E 65]  |[four] |
> |[00 06 74 77 6F]  |[four] |
> |[00 0A 74 68 72 65 65]|[four] |
> |[00 08 66 6F 75 72]   |[four] |
> +--+---+{noformat}
> Here, we can see that the Avro binary is correctly generated, but the 
> deserialized version is a copy of the last row. I have not yet verified that 
> this is an issue in cluster mode as well.
>  
> I dug a bit more into the code, and it seems like the reuse of {{result}} 
> in {{AvroDataToCatalyst}} is overwriting the decoded values of previous rows. 
> I set a breakpoint in {{LocalRelation}}, and the {{data}} sequence seems to 
> all point to the same address in memory - therefore a mutation in one 
> variable causes all of them to mutate.
> !Screen Shot 2019-05-21 at 2.39.27 PM.png!
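The reuse/aliasing failure mode described above can be reproduced with a few 
lines of plain Scala (no Spark involved), which is why every row ends up showing 
the last decoded value:

{code:java}
// One mutable record reused for every element: all elements alias the same object,
// so they all display whatever was written last.
case class MutablePayload(var message: String)

val shared = MutablePayload("")
val rows = Seq("one", "two", "three", "four").map { m =>
  shared.message = m
  shared // every element references the same instance
}
println(rows)
// List(MutablePayload(four), MutablePayload(four), MutablePayload(four), MutablePayload(four))
{code}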



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27884) Deprecate Python 2 support in Spark 3.0

2019-05-30 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852238#comment-16852238
 ] 

Dongjoon Hyun commented on SPARK-27884:
---

Thanks! +1 for this effort

> Deprecate Python 2 support in Spark 3.0
> ---
>
> Key: SPARK-27884
> URL: https://issues.apache.org/jira/browse/SPARK-27884
> Project: Spark
>  Issue Type: Story
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> Officially deprecate Python 2 support in Spark 3.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27886) Add Apache Spark project to https://python3statement.org/

2019-05-30 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852166#comment-16852166
 ] 

Sean Owen commented on SPARK-27886:
---

Maybe I misunderstand, but if it's likely that Python 2 goes away from Spark in 
3.1, and that is probably before the end of the year, don't we want to set the 
expectation that Python 2 support will go away at the end of the year? (If it 
happens to be later, that's better than going away earlier than advertised.)

> Add Apache Spark project to https://python3statement.org/
> -
>
> Key: SPARK-27886
> URL: https://issues.apache.org/jira/browse/SPARK-27886
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> Add Spark to https://python3statement.org/ and indicate our timeline. I 
> reviewed the statement at https://python3statement.org/. Most projects listed 
> there will *drop* Python 2 before 2020/01/01 instead of deprecating it with 
> only [one 
> exception|https://github.com/python3statement/python3statement.github.io/blob/6ccacf8beb3cc49b1b3d572ab4f841e250853ca9/site.js#L206].
>  We certainly cannot drop Python 2 support in 2019 given we haven't 
> deprecated it yet.
> Maybe we can add the following time line:
> * 2019/10 - 2020/04: Python 2 & 3
> * 2020/04 - : Python 3 only
> The switching is the next release after Spark 3.0. If we want to hold another 
> release, it would be Sept or Oct.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27884) Deprecate Python 2 support in Spark 3.0

2019-05-30 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852149#comment-16852149
 ] 

Sean Owen commented on SPARK-27884:
---

Yes that looks fine to me.

> Deprecate Python 2 support in Spark 3.0
> ---
>
> Key: SPARK-27884
> URL: https://issues.apache.org/jira/browse/SPARK-27884
> Project: Spark
>  Issue Type: Story
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> Officially deprecate Python 2 support in Spark 3.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27772) SQLTestUtils Refactoring

2019-05-30 Thread William Wong (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852148#comment-16852148
 ] 

William Wong commented on SPARK-27772:
--

Hi [~hyukjin.kwon], I submitted the PR and added a test case to demonstrate how 
the change behaves. Basically, one example is that if a test provides a 'null' 
table or cache to one of those withXXX methods, we hit a null pointer exception 
in the related closing block (the finally block), and that null pointer 
exception would mask the true exception. 
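A tiny standalone illustration of that masking problem (the withTempViewOld name 
and the require check are stand-ins, not the real SQLTestUtils code): when the 
finally block itself throws, the original test failure disappears.

{code:java}
// Stand-in for the current withTempView: the cleanup runs in a bare finally block.
def withTempViewOld(viewNames: Seq[String])(f: => Unit): Unit = {
  try f finally viewNames.foreach(name => require(name != null)) // stand-in for dropTempView
}

try {
  withTempViewOld(Seq(null)) { throw new AssertionError("the real test failure") }
} catch {
  // Prints the IllegalArgumentException from the finally block; the AssertionError is lost.
  case t: Throwable => println(t)
}
{code}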

> SQLTestUtils Refactoring
> 
>
> Key: SPARK-27772
> URL: https://issues.apache.org/jira/browse/SPARK-27772
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: William Wong
>Priority: Minor
>
> The current `SQLTestUtils` created many `withXXX` utility functions to clean 
> up tables/views/caches created for testing purpose. Some of those `withXXX` 
> functions ignore certain exceptions, like `NoSuchTableException` in the clean 
> up block (ie, the finally block). 
>  
> {code:java}
> /**
>  * Drops temporary view `viewNames` after calling `f`.
>  */
> protected def withTempView(viewNames: String*)(f: => Unit): Unit = {
>   try f finally {
> // If the test failed part way, we don't want to mask the failure by 
> failing to remove
> // temp views that never got created.
> try viewNames.foreach(spark.catalog.dropTempView) catch {
>   case _: NoSuchTableException =>
> }
>   }
> }
> {code}
> I believe it is not the best approach, because it is hard to anticipate which 
> exceptions should or should not be ignored.  
>  
> Java's `try-with-resources` statement does not mask exception throwing in the 
> try block with any exception caught in the 'close()' statement. Exception 
> caught in the 'close()' statement would add as a suppressed exception 
> instead. It sounds a better approach.
>  
> Therefore, I proposed to standardise those 'withXXX' function with following 
> `withFinallyBlock` function, which does something similar to Java's 
> try-with-resources statement. 
> {code:java}
> /**
> * Drops temporary view `viewNames` after calling `f`.
> */
> protected def withTempView(viewNames: String*)(f: => Unit): Unit = {
>   tryWithFinally(f)(viewNames.foreach(spark.catalog.dropTempView))
> }
> /**
>  * Executes the given tryBlock and then the given finallyBlock no matter 
> whether tryBlock throws
>  * an exception. If both tryBlock and finallyBlock throw exceptions, the 
> exception thrown
>  * from the finallyBlock will be added to the exception thrown from tryBlock 
> as a
>  * suppressed exception. It helps to avoid masking the exception from tryBlock 
> with the exception
>  * from finallyBlock.
>  */
> private def tryWithFinally(tryBlock: => Unit)(finallyBlock: => Unit): Unit = {
>   var fromTryBlock: Throwable = null
>   try tryBlock catch {
> case cause: Throwable =>
>   fromTryBlock = cause
>   throw cause
>   } finally {
> if (fromTryBlock != null) {
>   try finallyBlock catch {
> case fromFinallyBlock: Throwable =>
>   fromTryBlock.addSuppressed(fromFinallyBlock)
>   throw fromTryBlock
>   }
> } else {
>   finallyBlock
> }
>   }
> }
> {code}
> If a feature is well written, we should not hit any exception in those closing 
> methods in test cases. The purpose of this proposal is to help developers 
> identify what actually breaks their tests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27889) Make development scripts under dev/ support Python 3

2019-05-30 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-27889:
-

 Summary: Make development scripts under dev/ support Python 3
 Key: SPARK-27889
 URL: https://issues.apache.org/jira/browse/SPARK-27889
 Project: Spark
  Issue Type: Sub-task
  Components: Build, Deploy
Affects Versions: 3.0.0
Reporter: Xiangrui Meng


Some of our internal Python scripts under dev/ only support Python 2. With the 
deprecation of Python 2, we should make those scripts support Python 3, so 
developers have a way to avoid seeing the deprecation warning.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27648) In Spark2.4 Structured Streaming:The executor storage memory increasing over time

2019-05-30 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852136#comment-16852136
 ] 

Gabor Somogyi commented on SPARK-27648:
---

I've made my graphs based on the files you've provided, and both states look 
stable in terms of memory consumption.
I would suggest taking 2 memory dumps (one at the beginning and one at the end) 
and comparing them (maybe with the Memory Analyzer Tool). The problematic part 
could be nearly anything.

> In Spark2.4 Structured Streaming:The executor storage memory increasing over 
> time
> -
>
> Key: SPARK-27648
> URL: https://issues.apache.org/jira/browse/SPARK-27648
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: tommy duan
>Priority: Major
> Attachments: houragg(1).out, houragg_filter.csv, 
> houragg_with_state1_state2.csv, houragg_with_state1_state2.xlsx, 
> image-2019-05-09-17-51-14-036.png, image-2019-05-10-17-49-42-034.png, 
> image-2019-05-24-10-20-25-723.png, image-2019-05-27-10-10-30-460.png
>
>
> *Spark Program Code Business:*
>  Read a topic from Kafka, aggregate the streaming data sources, and then 
> output the result to another Kafka topic.
> *Problem Description:*
>  *1) Using Spark Structured Streaming in a CDH environment (Spark 2.2)*, memory 
> overflow problems often occur (because too many versions of state are stored 
> in memory; this bug has been fixed in Spark 2.4).
> {code:java}
> /spark-submit \
> --conf "spark.yarn.executor.memoryOverhead=4096M" \
> --num-executors 15 \
> --executor-memory 3G \
> --executor-cores 2 \
> --driver-memory 6G \{code}
> {code}
> Executor memory exceptions occur when running with this submit resource under 
> SPARK 2.2 and the normal running time does not exceed one day.
> The solution is to set the executor memory larger than before 
> {code:java}
>  My spark-submit script is as follows:
> /spark-submit \
> --conf "spark.yarn.executor.memoryOverhead=4096M" \
> --num-executors 15 \
> --executor-memory 46G \
> --executor-cores 3 \
> --driver-memory 6G \
> ...{code}
> In this case, the spark program can be guaranteed to run stably for a long 
> time, and the executor storage memory is less than 10M (it has been running 
> stably for more than 20 days).
> *2) From the upgrade information of Spark 2.4, we can see that the problem of 
> large memory consumption of state storage has been solved in Spark 2.4.* 
>  So we upgraded spark to SPARK 2.4 under CDH, tried to run the spark program, 
> and found that the use of memory was reduced.
>  But a problem arises: as the running time increases, the storage memory of 
> the executors keeps growing (see Executors -> Storage Memory in the Spark on 
> Yarn Resource Manager UI).
>  This program has been running for 14 days (under Spark 2.2, running with 
> this submit resource, the normal running time was not more than one day 
> before executor memory abnormalities occurred).
>  The script submitted by the program under Spark 2.4 is as follows:
> {code:java}
> /spark-submit \
>  --conf "spark.yarn.executor.memoryOverhead=4096M" \
>  --num-executors 15 \
>  --executor-memory 3G \
>  --executor-cores 2 \
>  --driver-memory 6G 
> {code}
> Under Spark 2.4, I recorded the executor storage memory size over time while 
> the Spark program was running:
> |Run-time (hours)|Storage Memory (used/total)|Memory growth rate (MB/hour)|
> |23.5H|41.6MB/1.5GB|1.770212766|
> |108.4H|460.2MB/1.5GB|4.245387454|
> |131.7H|559.1MB/1.5GB|4.245254366|
> |135.4H|575MB/1.5GB|4.246676514|
> |153.6H|641.2MB/1.5GB|4.174479167|
> |219H|888.1MB/1.5GB|4.055251142|
> |263H|1126.4MB/1.5GB|4.282889734|
> |309H|1228.8MB/1.5GB|3.976699029|



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27772) SQLTestUtils Refactoring

2019-05-30 Thread William Wong (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Wong updated SPARK-27772:
-
Description: 
The current `SQLTestUtils` created many `withXXX` utility functions to clean up 
tables/views/caches created for testing purpose. Some of those `withXXX` 
functions ignore certain exceptions, like `NoSuchTableException` in the clean 
up block (ie, the finally block). 

 
{code:java}
/**
 * Drops temporary view `viewNames` after calling `f`.
 */
protected def withTempView(viewNames: String*)(f: => Unit): Unit = {
  try f finally {
// If the test failed part way, we don't want to mask the failure by 
failing to remove
// temp views that never got created.
try viewNames.foreach(spark.catalog.dropTempView) catch {
  case _: NoSuchTableException =>
}
  }
}
{code}
I believe it is not the best approach, because it is hard to anticipate which 
exceptions should or should not be ignored.  

 

Java's `try-with-resources` statement does not mask exception throwing in the 
try block with any exception caught in the 'close()' statement. Exception 
caught in the 'close()' statement would add as a suppressed exception instead. 
It sounds a better approach.

 

Therefore, I proposed to standardise those 'withXXX' function with following 
`withFinallyBlock` function, which does something similar to Java's 
try-with-resources statement. 
{code:java}
/**
* Drops temporary view `viewNames` after calling `f`.
*/
protected def withTempView(viewNames: String*)(f: => Unit): Unit = {
  tryWithFinally(f)(viewNames.foreach(spark.catalog.dropTempView))
}


/**
 * Executes the given tryBlock and then the given finallyBlock no matter 
whether tryBlock throws
 * an exception. If both tryBlock and finallyBlock throw exceptions, the 
exception thrown
 * from the finallyBlock will be added to the exception thrown from tryBlock as 
a
 * suppressed exception. It helps to avoid masking the exception from tryBlock 
with the exception
 * from finallyBlock.
 */
private def tryWithFinally(tryBlock: => Unit)(finallyBlock: => Unit): Unit = {
  var fromTryBlock: Throwable = null
  try tryBlock catch {
case cause: Throwable =>
  fromTryBlock = cause
  throw cause
  } finally {
if (fromTryBlock != null) {
  try finallyBlock catch {
case fromFinallyBlock: Throwable =>
  fromTryBlock.addSuppressed(fromFinallyBlock)
  throw fromTryBlock
  }
} else {
  finallyBlock
}
  }
}
{code}
If a feature is well written, we should not hit any exception in those closing 
methods in test cases. The purpose of this proposal is to help developers 
identify what actually breaks their tests.

  was:
The current `SQLTestUtils` created many `withXXX` utility functions to clean up 
tables/views/caches created for testing purpose. Some of those `withXXX` 
functions ignore certain exceptions, like `NoSuchTableException` in the clean 
up block (ie, the finally block). 

 
{code:java}
/**
 * Drops temporary view `viewNames` after calling `f`.
 */
protected def withTempView(viewNames: String*)(f: => Unit): Unit = {
  try f finally {
// If the test failed part way, we don't want to mask the failure by 
failing to remove
// temp views that never got created.
try viewNames.foreach(spark.catalog.dropTempView) catch {
  case _: NoSuchTableException =>
}
  }
}
{code}
Maybe it is not the best approach. Because it is hard to anticipate what 
exception should or should not be ignored.  Java's `try-with-resources` 
statement does not mask exception throwing in the try block with any exception 
caught in the 'close()' statement. Exception caught in the 'close()' statement 
would add as a suppressed exception instead. IMHO, it is a better approach.  

 

Therefore, I proposed to standardise those 'withXXX' function with following 
`withFinallyBlock` function, which does something similar to Java's 
try-with-resources statement. 
{code:java}
/**
* Drops temporary view `viewNames` after calling `f`.
*/
protected def withTempView(viewNames: String*)(f: => Unit): Unit = {
  withFinallyBlock(f)(viewNames.foreach(spark.catalog.dropTempView))
}


/**
 * Executes the given tryBlock and then the given finallyBlock no matter 
whether tryBlock throws
 * an exception. If both tryBlock and finallyBlock throw exceptions, the 
exception thrown
 * from the finallyBlock with be added to the exception thrown from tryBlock as 
a
 * suppress exception. It helps to avoid masking the exception from tryBlock 
with exception
 * from finallyBlock
 */
private def withFinallyBlock(tryBlock: => Unit)(finallyBlock: => Unit): Unit = {
  var fromTryBlock : Throwable = null
  try tryBlock catch {
case cause : Throwable =>
  fromTryBlock = cause
  throw cause
  } finally {
if (fromTryBlock != null) {
  try finallyBlock catch {
case fromFinallyBlock : Throwable =>
  

[jira] [Commented] (SPARK-27772) SQLTestUtils Refactoring

2019-05-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852125#comment-16852125
 ] 

Apache Spark commented on SPARK-27772:
--

User 'William1104' has created a pull request for this issue:
https://github.com/apache/spark/pull/24746

> SQLTestUtils Refactoring
> 
>
> Key: SPARK-27772
> URL: https://issues.apache.org/jira/browse/SPARK-27772
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: William Wong
>Priority: Minor
>
> The current `SQLTestUtils` created many `withXXX` utility functions to clean 
> up tables/views/caches created for testing purpose. Some of those `withXXX` 
> functions ignore certain exceptions, like `NoSuchTableException` in the clean 
> up block (ie, the finally block). 
>  
> {code:java}
> /**
>  * Drops temporary view `viewNames` after calling `f`.
>  */
> protected def withTempView(viewNames: String*)(f: => Unit): Unit = {
>   try f finally {
> // If the test failed part way, we don't want to mask the failure by 
> failing to remove
> // temp views that never got created.
> try viewNames.foreach(spark.catalog.dropTempView) catch {
>   case _: NoSuchTableException =>
> }
>   }
> }
> {code}
> Maybe it is not the best approach. Because it is hard to anticipate what 
> exception should or should not be ignored.  Java's `try-with-resources` 
> statement does not mask exception throwing in the try block with any 
> exception caught in the 'close()' statement. Exception caught in the 
> 'close()' statement would add as a suppressed exception instead. IMHO, it is 
> a better approach.  
>  
> Therefore, I proposed to standardise those 'withXXX' function with following 
> `withFinallyBlock` function, which does something similar to Java's 
> try-with-resources statement. 
> {code:java}
> /**
> * Drops temporary view `viewNames` after calling `f`.
> */
> protected def withTempView(viewNames: String*)(f: => Unit): Unit = {
>   withFinallyBlock(f)(viewNames.foreach(spark.catalog.dropTempView))
> }
> /**
>  * Executes the given tryBlock and then the given finallyBlock no matter 
> whether tryBlock throws
>  * an exception. If both tryBlock and finallyBlock throw exceptions, the 
> exception thrown
>  * from the finallyBlock with be added to the exception thrown from tryBlock 
> as a
>  * suppress exception. It helps to avoid masking the exception from tryBlock 
> with exception
>  * from finallyBlock
>  */
> private def withFinallyBlock(tryBlock: => Unit)(finallyBlock: => Unit): Unit 
> = {
>   var fromTryBlock : Throwable = null
>   try tryBlock catch {
> case cause : Throwable =>
>   fromTryBlock = cause
>   throw cause
>   } finally {
> if (fromTryBlock != null) {
>   try finallyBlock catch {
> case fromFinallyBlock : Throwable =>
>   fromTryBlock.addSuppressed(fromFinallyBlock)
>   throw fromTryBlock
>   }
> } else {
>   finallyBlock
> }
>   }
> }
> {code}
>  
> If a feature is well written, we show not hit any exception in those closing 
> method in testcase. The purpose of this proposal is to help developer to 
> identify what may break in the test case. I believe masking the original 
> exception with any other exception is not the best approach.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27772) SQLTestUtils Refactoring

2019-05-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27772:


Assignee: Apache Spark

> SQLTestUtils Refactoring
> 
>
> Key: SPARK-27772
> URL: https://issues.apache.org/jira/browse/SPARK-27772
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: William Wong
>Assignee: Apache Spark
>Priority: Minor
>
> The current `SQLTestUtils` created many `withXXX` utility functions to clean 
> up tables/views/caches created for testing purpose. Some of those `withXXX` 
> functions ignore certain exceptions, like `NoSuchTableException` in the clean 
> up block (ie, the finally block). 
>  
> {code:java}
> /**
>  * Drops temporary view `viewNames` after calling `f`.
>  */
> protected def withTempView(viewNames: String*)(f: => Unit): Unit = {
>   try f finally {
> // If the test failed part way, we don't want to mask the failure by 
> failing to remove
> // temp views that never got created.
> try viewNames.foreach(spark.catalog.dropTempView) catch {
>   case _: NoSuchTableException =>
> }
>   }
> }
> {code}
> Maybe it is not the best approach. Because it is hard to anticipate what 
> exception should or should not be ignored.  Java's `try-with-resources` 
> statement does not mask exception throwing in the try block with any 
> exception caught in the 'close()' statement. Exception caught in the 
> 'close()' statement would add as a suppressed exception instead. IMHO, it is 
> a better approach.  
>  
> Therefore, I proposed to standardise those 'withXXX' function with following 
> `withFinallyBlock` function, which does something similar to Java's 
> try-with-resources statement. 
> {code:java}
> /**
> * Drops temporary view `viewNames` after calling `f`.
> */
> protected def withTempView(viewNames: String*)(f: => Unit): Unit = {
>   withFinallyBlock(f)(viewNames.foreach(spark.catalog.dropTempView))
> }
> /**
>  * Executes the given tryBlock and then the given finallyBlock no matter 
> whether tryBlock throws
>  * an exception. If both tryBlock and finallyBlock throw exceptions, the 
> exception thrown
>  * from the finallyBlock with be added to the exception thrown from tryBlock 
> as a
>  * suppress exception. It helps to avoid masking the exception from tryBlock 
> with exception
>  * from finallyBlock
>  */
> private def withFinallyBlock(tryBlock: => Unit)(finallyBlock: => Unit): Unit 
> = {
>   var fromTryBlock : Throwable = null
>   try tryBlock catch {
> case cause : Throwable =>
>   fromTryBlock = cause
>   throw cause
>   } finally {
> if (fromTryBlock != null) {
>   try finallyBlock catch {
> case fromFinallyBlock : Throwable =>
>   fromTryBlock.addSuppressed(fromFinallyBlock)
>   throw fromTryBlock
>   }
> } else {
>   finallyBlock
> }
>   }
> }
> {code}
>  
> If a feature is well written, we show not hit any exception in those closing 
> method in testcase. The purpose of this proposal is to help developer to 
> identify what may break in the test case. I believe masking the original 
> exception with any other exception is not the best approach.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27772) SQLTestUtils Refactoring

2019-05-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27772:


Assignee: (was: Apache Spark)

> SQLTestUtils Refactoring
> 
>
> Key: SPARK-27772
> URL: https://issues.apache.org/jira/browse/SPARK-27772
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: William Wong
>Priority: Minor
>
> The current `SQLTestUtils` created many `withXXX` utility functions to clean 
> up tables/views/caches created for testing purpose. Some of those `withXXX` 
> functions ignore certain exceptions, like `NoSuchTableException` in the clean 
> up block (ie, the finally block). 
>  
> {code:java}
> /**
>  * Drops temporary view `viewNames` after calling `f`.
>  */
> protected def withTempView(viewNames: String*)(f: => Unit): Unit = {
>   try f finally {
> // If the test failed part way, we don't want to mask the failure by 
> failing to remove
> // temp views that never got created.
> try viewNames.foreach(spark.catalog.dropTempView) catch {
>   case _: NoSuchTableException =>
> }
>   }
> }
> {code}
> Maybe it is not the best approach. Because it is hard to anticipate what 
> exception should or should not be ignored.  Java's `try-with-resources` 
> statement does not mask exception throwing in the try block with any 
> exception caught in the 'close()' statement. Exception caught in the 
> 'close()' statement would add as a suppressed exception instead. IMHO, it is 
> a better approach.  
>  
> Therefore, I proposed to standardise those 'withXXX' function with following 
> `withFinallyBlock` function, which does something similar to Java's 
> try-with-resources statement. 
> {code:java}
> /**
> * Drops temporary view `viewNames` after calling `f`.
> */
> protected def withTempView(viewNames: String*)(f: => Unit): Unit = {
>   withFinallyBlock(f)(viewNames.foreach(spark.catalog.dropTempView))
> }
> /**
>  * Executes the given tryBlock and then the given finallyBlock no matter 
> whether tryBlock throws
>  * an exception. If both tryBlock and finallyBlock throw exceptions, the 
> exception thrown
>  * from the finallyBlock with be added to the exception thrown from tryBlock 
> as a
>  * suppress exception. It helps to avoid masking the exception from tryBlock 
> with exception
>  * from finallyBlock
>  */
> private def withFinallyBlock(tryBlock: => Unit)(finallyBlock: => Unit): Unit 
> = {
>   var fromTryBlock : Throwable = null
>   try tryBlock catch {
> case cause : Throwable =>
>   fromTryBlock = cause
>   throw cause
>   } finally {
> if (fromTryBlock != null) {
>   try finallyBlock catch {
> case fromFinallyBlock : Throwable =>
>   fromTryBlock.addSuppressed(fromFinallyBlock)
>   throw fromTryBlock
>   }
> } else {
>   finallyBlock
> }
>   }
> }
> {code}
>  
> If a feature is well written, we show not hit any exception in those closing 
> method in testcase. The purpose of this proposal is to help developer to 
> identify what may break in the test case. I believe masking the original 
> exception with any other exception is not the best approach.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27772) SQLTestUtils Refactoring

2019-05-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852124#comment-16852124
 ] 

Apache Spark commented on SPARK-27772:
--

User 'William1104' has created a pull request for this issue:
https://github.com/apache/spark/pull/24746

> SQLTestUtils Refactoring
> 
>
> Key: SPARK-27772
> URL: https://issues.apache.org/jira/browse/SPARK-27772
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: William Wong
>Priority: Minor
>
> The current `SQLTestUtils` created many `withXXX` utility functions to clean 
> up tables/views/caches created for testing purpose. Some of those `withXXX` 
> functions ignore certain exceptions, like `NoSuchTableException` in the clean 
> up block (ie, the finally block). 
>  
> {code:java}
> /**
>  * Drops temporary view `viewNames` after calling `f`.
>  */
> protected def withTempView(viewNames: String*)(f: => Unit): Unit = {
>   try f finally {
> // If the test failed part way, we don't want to mask the failure by 
> failing to remove
> // temp views that never got created.
> try viewNames.foreach(spark.catalog.dropTempView) catch {
>   case _: NoSuchTableException =>
> }
>   }
> }
> {code}
> Maybe it is not the best approach. Because it is hard to anticipate what 
> exception should or should not be ignored.  Java's `try-with-resources` 
> statement does not mask exception throwing in the try block with any 
> exception caught in the 'close()' statement. Exception caught in the 
> 'close()' statement would add as a suppressed exception instead. IMHO, it is 
> a better approach.  
>  
> Therefore, I proposed to standardise those 'withXXX' function with following 
> `withFinallyBlock` function, which does something similar to Java's 
> try-with-resources statement. 
> {code:java}
> /**
> * Drops temporary view `viewNames` after calling `f`.
> */
> protected def withTempView(viewNames: String*)(f: => Unit): Unit = {
>   withFinallyBlock(f)(viewNames.foreach(spark.catalog.dropTempView))
> }
> /**
>  * Executes the given tryBlock and then the given finallyBlock no matter 
> whether tryBlock throws
>  * an exception. If both tryBlock and finallyBlock throw exceptions, the 
> exception thrown
>  * from the finallyBlock with be added to the exception thrown from tryBlock 
> as a
>  * suppress exception. It helps to avoid masking the exception from tryBlock 
> with exception
>  * from finallyBlock
>  */
> private def withFinallyBlock(tryBlock: => Unit)(finallyBlock: => Unit): Unit 
> = {
>   var fromTryBlock : Throwable = null
>   try tryBlock catch {
> case cause : Throwable =>
>   fromTryBlock = cause
>   throw cause
>   } finally {
> if (fromTryBlock != null) {
>   try finallyBlock catch {
> case fromFinallyBlock : Throwable =>
>   fromTryBlock.addSuppressed(fromFinallyBlock)
>   throw fromTryBlock
>   }
> } else {
>   finallyBlock
> }
>   }
> }
> {code}
>  
> If a feature is well written, we show not hit any exception in those closing 
> method in testcase. The purpose of this proposal is to help developer to 
> identify what may break in the test case. I believe masking the original 
> exception with any other exception is not the best approach.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27772) SQLTestUtils Refactoring

2019-05-30 Thread William Wong (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

William Wong updated SPARK-27772:
-
Description: 
The current `SQLTestUtils` created many `withXXX` utility functions to clean up 
tables/views/caches created for testing purpose. Some of those `withXXX` 
functions ignore certain exceptions, like `NoSuchTableException` in the clean 
up block (ie, the finally block). 

 
{code:java}
/**
 * Drops temporary view `viewNames` after calling `f`.
 */
protected def withTempView(viewNames: String*)(f: => Unit): Unit = {
  try f finally {
// If the test failed part way, we don't want to mask the failure by 
failing to remove
// temp views that never got created.
try viewNames.foreach(spark.catalog.dropTempView) catch {
  case _: NoSuchTableException =>
}
  }
}
{code}
Maybe it is not the best approach. Because it is hard to anticipate what 
exception should or should not be ignored.  Java's `try-with-resources` 
statement does not mask exception throwing in the try block with any exception 
caught in the 'close()' statement. Exception caught in the 'close()' statement 
would add as a suppressed exception instead. IMHO, it is a better approach.  

 

Therefore, I proposed to standardise those 'withXXX' function with following 
`withFinallyBlock` function, which does something similar to Java's 
try-with-resources statement. 
{code:java}
/**
* Drops temporary view `viewNames` after calling `f`.
*/
protected def withTempView(viewNames: String*)(f: => Unit): Unit = {
  withFinallyBlock(f)(viewNames.foreach(spark.catalog.dropTempView))
}


/**
 * Executes the given tryBlock and then the given finallyBlock no matter 
whether tryBlock throws
 * an exception. If both tryBlock and finallyBlock throw exceptions, the 
exception thrown
 * from the finallyBlock with be added to the exception thrown from tryBlock as 
a
 * suppress exception. It helps to avoid masking the exception from tryBlock 
with exception
 * from finallyBlock
 */
private def withFinallyBlock(tryBlock: => Unit)(finallyBlock: => Unit): Unit = {
  var fromTryBlock : Throwable = null
  try tryBlock catch {
case cause : Throwable =>
  fromTryBlock = cause
  throw cause
  } finally {
if (fromTryBlock != null) {
  try finallyBlock catch {
case fromFinallyBlock : Throwable =>
  fromTryBlock.addSuppressed(fromFinallyBlock)
  throw fromTryBlock
  }
} else {
  finallyBlock
}
  }
}
{code}
 

If a feature is well written, we show not hit any exception in those closing 
method in testcase. The purpose of this proposal is to help developer to 
identify what may break in the test case. I believe masking the original 
exception with any other exception is not the best approach.

  was:
The current `SQLTestUtils` created many `withXXX` utility functions to clean up 
tables/views/caches created for testing purpose. Some of those `withXXX` 
functions ignore certain exceptions, like `NoSuchTableException` in the clean 
up block (ie, the finally block). 

 
{code:java}
/**
 * Drops temporary view `viewNames` after calling `f`.
 */
protected def withTempView(viewNames: String*)(f: => Unit): Unit = {
  try f finally {
// If the test failed part way, we don't want to mask the failure by 
failing to remove
// temp views that never got created.
try viewNames.foreach(spark.catalog.dropTempView) catch {
  case _: NoSuchTableException =>
}
  }
}
{code}
Maybe it is not the best approach. Because it is hard to anticipate what 
exception should or should not be ignored.  Java's `try-with-resources` 
statement does not mask exception throwing in the try block with any exception 
caught in the 'close()' statement. Exception caught in the 'close()' statement 
would add as a suppressed exception instead. IMHO, it is a better approach.  

 

Therefore, I proposed to standardise those 'withXXX' function with following 
`withFinallyBlock` function, which does something similar to Java's 
try-with-resources statement. 
{code:java}
/**
* Drops temporary view `viewNames` after calling `f`.
*/
protected def withTempView(viewNames: String*)(f: => Unit): Unit = {
  withFinallyBlock(f)(viewNames.foreach(spark.catalog.dropTempView))
}


/**
 * Executes the given tryBlock and then the given finallyBlock no matter 
whether tryBlock throws
 * an exception. If both tryBlock and finallyBlock throw exceptions, the 
exception thrown
 * from the finallyBlock with be added to the exception thrown from tryBlock as 
a
 * suppress exception. It helps to avoid masking the exception from tryBlock 
with exception
 * from finallyBlock
 */
private def withFinallyBlock(tryBlock: => Unit)(finallyBlock: => Unit): Unit = {
  var fromTryBlock : Throwable = null
  try tryBlock catch {
case cause : Throwable =>
  fromTryBlock = cause
  throw cause
  } finally {
if (fromTryBlock != null) {
  try 

[jira] [Commented] (SPARK-27785) Introduce .joinWith() overloads for typed inner joins of 3 or more tables

2019-05-30 Thread Josh Rosen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852089#comment-16852089
 ] 

Josh Rosen commented on SPARK-27785:


I think this might require a little bit more design work before beginning 
implementation, especially along the following dimensions:
 # What arity do we max out at? At work I have some use-cases involving joins 
of 10+ tables.
 # Do we put this into {{Dataset}}? Or factor it out into a separate 
{{Multijoin}} helper object? Is there precedent for this in other frameworks 
(Scalding, Flink, etc.) that we could mirror?
 # Instead of writing out each case by hand, can we write some Scala code to 
generate the signatures / code for us (similar to how the {{ScalaUDF}} 
overloads were defined)?
 # In addition to saving some typing / projection, does this new API let us 
solve SPARK-19468 for a limited subset of cases?

> Introduce .joinWith() overloads for typed inner joins of 3 or more tables
> -
>
> Key: SPARK-27785
> URL: https://issues.apache.org/jira/browse/SPARK-27785
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Josh Rosen
>Priority: Major
>
> Today it's rather painful to do a typed dataset join of more than two tables: 
> {{Dataset[A].joinWith(Dataset[B])}} returns {{Dataset[(A, B)]}} so chaining 
> on a third inner join requires users to specify a complicated join condition 
> (referencing variables like {{_1}} or {{_2}} in the join condition, AFAIK), 
> resulting in a doubly-nested schema like {{Dataset[((A, B), C)]}}. Things become 
> even more painful if you want to layer on a fourth join. Using {{.map()}} to 
> flatten the data into {{Dataset[(A, B, C)]}} has a performance penalty, too.
> To simplify this use case, I propose to introduce a new set of overloads of 
> {{.joinWith}}, supporting joins of {{N > 2}} tables for {{N}} up to some 
> reasonable number (say, 6). For example:
> {code:java}
> Dataset[T].joinWith[T1, T2](
>   ds1: Dataset[T1],
>   ds2: Dataset[T2],
>   condition: Column
> ): Dataset[(T, T1, T2)]
> Dataset[T].joinWith[T1, T2, T3](
>   ds1: Dataset[T1],
>   ds2: Dataset[T2],
>   ds3: Dataset[T3],
>   condition: Column
> ): Dataset[(T, T1, T2, T3)]{code}
> I propose to do this only for inner joins (consistent with the default join 
> type for {{joinWith}} in case joins are not specified).
> I haven't thought about this too much yet and am not committed to the API 
> proposed above (it's just my initial idea), so I'm open to suggestions for 
> alternative typed APIs for this.
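A sketch of the pain point, with hypothetical case classes A, B and C joined on 
a placeholder id column: chaining {{joinWith}} today nests the tuples, and 
flattening them back out needs an extra {{map}}.

{code:java}
import org.apache.spark.sql.{Dataset, SparkSession}

case class A(id: Long)
case class B(id: Long)
case class C(id: Long)

def threeWayJoin(spark: SparkSession,
                 dsA: Dataset[A], dsB: Dataset[B], dsC: Dataset[C]): Dataset[(A, B, C)] = {
  import spark.implicits._
  // Today: the second joinWith sees Dataset[(A, B)], so the condition has to reach
  // into _1/_2 and the schema is doubly nested.
  val nested: Dataset[((A, B), C)] =
    dsA.joinWith(dsB, dsA("id") === dsB("id"))
       .joinWith(dsC, $"_1.id" === dsC("id"))
  // Flattening into (A, B, C) costs an extra serialization/deserialization pass.
  nested.map { case ((a, b), c) => (a, b, c) }
}
{code}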



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27885) Announce deprecation of Python 2 support

2019-05-30 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-27885:
--
Description: 
* Draft the message.
* Update Spark website and announce deprecation of Python 2 support in the next 
major release in 2019 and remove the support in a release after 2020/01/01. It 
should show up in the "Latest News" section.
* Announce it on users@ and dev@

  was:
* Draft the message.
* Update Spark website and announce deprecation of Python 2 support in the next 
major release in 2019 and remove the support in a release after 2020/01/01.
* Announce it on users@ and dev@


> Announce deprecation of Python 2 support
> 
>
> Key: SPARK-27885
> URL: https://issues.apache.org/jira/browse/SPARK-27885
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> * Draft the message.
> * Update Spark website and announce deprecation of Python 2 support in the 
> next major release in 2019 and remove the support in a release after 
> 2020/01/01. It should show up in the "Latest News" section.
> * Announce it on users@ and dev@



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27886) Add Apache Spark project to https://python3statement.org/

2019-05-30 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852088#comment-16852088
 ] 

Xiangrui Meng commented on SPARK-27886:
---

cc: [~srowen] [~smilegator]

> Add Apache Spark project to https://python3statement.org/
> -
>
> Key: SPARK-27886
> URL: https://issues.apache.org/jira/browse/SPARK-27886
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> Add Spark to https://python3statement.org/ and indicate our timeline. I 
> reviewed the statement at https://python3statement.org/. Most projects listed 
> there will *drop* Python 2 before 2020/01/01 instead of deprecating it, with 
> only [one 
> exception|https://github.com/python3statement/python3statement.github.io/blob/6ccacf8beb3cc49b1b3d572ab4f841e250853ca9/site.js#L206].
>  We certainly cannot drop Python 2 support in 2019 given we haven't 
> deprecated it yet.
> Maybe we can add the following time line:
> * 2019/10 - 2020/04: Python 2 & 3
> * 2020/04 - : Python 3 only
> The switch would come in the next release after Spark 3.0. If we want to hold another 
> release, it would be Sept or Oct.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27831) Move Hive test jars to maven dependency

2019-05-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27831:


Assignee: (was: Apache Spark)

> Move Hive test jars to maven dependency
> ---
>
> Key: SPARK-27831
> URL: https://issues.apache.org/jira/browse/SPARK-27831
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27886) Add Apache Spark project to https://python3statement.org/

2019-05-30 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-27886:
--
Description: 
Add Spark to https://python3statement.org/ and indicate our timeline. I 
reviewed the statement at https://python3statement.org/. Most projects listed 
there will *drop* Python 2 before 2020/01/01 instead of deprecating it, with 
only [one 
exception|https://github.com/python3statement/python3statement.github.io/blob/6ccacf8beb3cc49b1b3d572ab4f841e250853ca9/site.js#L206].
 We certainly cannot drop Python 2 support in 2019 given we haven't deprecated 
it yet.

Maybe we can add the following time line:
* 2019/10 - 2020/04: Python 2 & 3
* 2020/04 - : Python 3 only

The switch would come in the next release after Spark 3.0. If we want to hold another 
release, it would be Sept or Oct.

  was:
Add Spark to https://python3statement.org/ and indicate our timeline. I 
reviewed the statement at https://python3statement.org/. Most projects listed 
there will *drop* Python 2 before 2020/01/01 instead of deprecating it with 
only [one 
exception|https://github.com/python3statement/python3statement.github.io/blob/6ccacf8beb3cc49b1b3d572ab4f841e250853ca9/site.js#L206].
 We certainly cannot drop Python 2 support in 2019 given we haven't deprecated 
it yet.

Maybe we can add the following time line:
* 2019/10 - 2020/04: Python 2 & 3
* 2020/04 - : Python 3 only


> Add Apache Spark project to https://python3statement.org/
> -
>
> Key: SPARK-27886
> URL: https://issues.apache.org/jira/browse/SPARK-27886
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> Add Spark to https://python3statement.org/ and indicate our timeline. I 
> reviewed the statement at https://python3statement.org/. Most projects listed 
> there will *drop* Python 2 before 2020/01/01 instead of deprecating it, with 
> only [one 
> exception|https://github.com/python3statement/python3statement.github.io/blob/6ccacf8beb3cc49b1b3d572ab4f841e250853ca9/site.js#L206].
>  We certainly cannot drop Python 2 support in 2019 given we haven't 
> deprecated it yet.
> Maybe we can add the following time line:
> * 2019/10 - 2020/04: Python 2 & 3
> * 2020/04 - : Python 3 only
> The switch would come in the next release after Spark 3.0. If we want to hold another 
> release, it would be Sept or Oct.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27831) Move Hive test jars to maven dependency

2019-05-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27831:


Assignee: (was: Apache Spark)

> Move Hive test jars to maven dependency
> ---
>
> Key: SPARK-27831
> URL: https://issues.apache.org/jira/browse/SPARK-27831
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27831) Move Hive test jars to maven dependency

2019-05-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27831:


Assignee: Apache Spark

> Move Hive test jars to maven dependency
> ---
>
> Key: SPARK-27831
> URL: https://issues.apache.org/jira/browse/SPARK-27831
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27886) Add Apache Spark project to https://python3statement.org/

2019-05-30 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-27886:
--
Description: 
Add Spark to https://python3statement.org/ and indicate our timeline. I 
reviewed the statement at https://python3statement.org/. Most projects listed 
there will *drop* Python 2 before 2020/01/01 instead of deprecating it, with 
only [one 
exception|https://github.com/python3statement/python3statement.github.io/blob/6ccacf8beb3cc49b1b3d572ab4f841e250853ca9/site.js#L206].
 We certainly cannot drop Python 2 support in 2019 given we haven't deprecated 
it yet.

Maybe we can add the following time line:
* 2019/10 - 2020/04: Python 2 & 3
* 2020/04 - : Python 3 only

  was:Add Spark to https://python3statement.org/ and indicate our timeline.


> Add Apache Spark project to https://python3statement.org/
> -
>
> Key: SPARK-27886
> URL: https://issues.apache.org/jira/browse/SPARK-27886
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> Add Spark to https://python3statement.org/ and indicate our timeline. I 
> reviewed the statement at https://python3statement.org/. Most projects listed 
> there will *drop* Python 2 before 2020/01/01 instead of deprecating it, with 
> only [one 
> exception|https://github.com/python3statement/python3statement.github.io/blob/6ccacf8beb3cc49b1b3d572ab4f841e250853ca9/site.js#L206].
>  We certainly cannot drop Python 2 support in 2019 given we haven't 
> deprecated it yet.
> Maybe we can add the following time line:
> * 2019/10 - 2020/04: Python 2 & 3
> * 2020/04 - : Python 3 only



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-27831) Move Hive test jars to maven dependency

2019-05-30 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-27831:
---
  Assignee: (was: Yuming Wang)

> Move Hive test jars to maven dependency
> ---
>
> Key: SPARK-27831
> URL: https://issues.apache.org/jira/browse/SPARK-27831
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27884) Deprecate Python 2 support in Spark 3.0

2019-05-30 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852077#comment-16852077
 ] 

Xiangrui Meng commented on SPARK-27884:
---

[~srowen] Could you help review the draft? Feel free to edit.

> Deprecate Python 2 support in Spark 3.0
> ---
>
> Key: SPARK-27884
> URL: https://issues.apache.org/jira/browse/SPARK-27884
> Project: Spark
>  Issue Type: Story
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> Officially deprecate Python 2 support in Spark 3.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27813) DataSourceV2: Add DropTable logical operation

2019-05-30 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27813.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24686
[https://github.com/apache/spark/pull/24686]

> DataSourceV2: Add DropTable logical operation
> -
>
> Key: SPARK-27813
> URL: https://issues.apache.org/jira/browse/SPARK-27813
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.3
>Reporter: John Zhuge
>Assignee: John Zhuge
>Priority: Minor
> Fix For: 3.0.0
>
>
> Support DROP TABLE from V2 catalog, e.g., "DROP TABLE testcat.ns1.ns2.tbl"



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27884) Deprecate Python 2 support in Spark 3.0

2019-05-30 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852071#comment-16852071
 ] 

Xiangrui Meng commented on SPARK-27884:
---

Draft message:

*Apache Spark's plan for dropping Python 2 support*

As many of you already know, the Python core development team and many widely 
used Python packages such as pandas and NumPy will drop Python 2 support on or 
before 2020/01/01. Apache Spark has supported both Python 2 and 3 since the 
Spark 1.4 release in 2015. However, maintaining Python 2/3 compatibility is an 
increasing burden and it essentially limits the use of Python 3 features in 
Spark. Given that the end of life (EOL) of Python 2 is coming, we plan to 
eventually drop Python 2 support as well. The current plan is as follows:
 * In the next major release in 2019, we will deprecate Python 2 support. 
PySpark users will see a deprecation warning if Python 2 is used. We will 
publish a migration guide for PySpark users to migrate to Python 3.
 * We will drop Python 2 support in a future release in 2020, after Python 2 
EOL on 2020/01/01. PySpark users will see an error if Python 2 is used.
 * For releases that support Python 2, e.g., Spark 2.4, their patch releases 
will continue supporting Python 2. However, after Python 2 EOL, we might not 
take patches that are specific to Python 2.

> Deprecate Python 2 support in Spark 3.0
> ---
>
> Key: SPARK-27884
> URL: https://issues.apache.org/jira/browse/SPARK-27884
> Project: Spark
>  Issue Type: Story
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> Officially deprecate Python 2 support in Spark 3.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27813) DataSourceV2: Add DropTable logical operation

2019-05-30 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-27813:
---

Assignee: John Zhuge

> DataSourceV2: Add DropTable logical operation
> -
>
> Key: SPARK-27813
> URL: https://issues.apache.org/jira/browse/SPARK-27813
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.3
>Reporter: John Zhuge
>Assignee: John Zhuge
>Priority: Minor
>
> Support DROP TABLE from V2 catalog, e.g., "DROP TABLE testcat.ns1.ns2.tbl"



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27798) from_avro can modify variables in other rows in local mode

2019-05-30 Thread Gengliang Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852055#comment-16852055
 ] 

Gengliang Wang commented on SPARK-27798:


Also, the issue can be reproduced on the latest master if we use spark-shell.

> from_avro can modify variables in other rows in local mode
> --
>
> Key: SPARK-27798
> URL: https://issues.apache.org/jira/browse/SPARK-27798
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Yosuke Mori
>Priority: Blocker
>  Labels: correctness
> Attachments: Screen Shot 2019-05-21 at 2.39.27 PM.png
>
>
> Steps to reproduce:
> Create a local Dataset (at least two distinct rows) with a binary Avro field. 
> Use the {{from_avro}} function to deserialize the binary into another column. 
> Verify that all of the rows incorrectly have the same value.
> Here's a concrete example (using Spark 2.4.3). All it does is convert a list 
> of TestPayload objects into binary using the defined avro schema, then try 
> to deserialize using {{from_avro}} with that same schema:
> {code:java}
> import org.apache.avro.Schema
> import org.apache.avro.generic.{GenericDatumWriter, GenericRecord, 
> GenericRecordBuilder}
> import org.apache.avro.io.EncoderFactory
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.avro.from_avro
> import org.apache.spark.sql.functions.col
> import java.io.ByteArrayOutputStream
> object TestApp extends App {
>   // Payload container
>   case class TestEvent(payload: Array[Byte])
>   // Deserialized Payload
>   case class TestPayload(message: String)
>   // Schema for Payload
>   val simpleSchema =
> """
>   |{
>   |"type": "record",
>   |"name" : "Payload",
>   |"fields" : [ {"name" : "message", "type" : [ "string", "null" ] } ]
>   |}
> """.stripMargin
>   // Convert TestPayload into avro binary
>   def generateSimpleSchemaBinary(record: TestPayload, avsc: String): 
> Array[Byte] = {
> val schema = new Schema.Parser().parse(avsc)
> val out = new ByteArrayOutputStream()
> val writer = new GenericDatumWriter[GenericRecord](schema)
> val encoder = EncoderFactory.get().binaryEncoder(out, null)
> val rootRecord = new GenericRecordBuilder(schema).set("message", 
> record.message).build()
> writer.write(rootRecord, encoder)
> encoder.flush()
> out.toByteArray
>   }
>   val spark: SparkSession = 
> SparkSession.builder().master("local[*]").getOrCreate()
>   import spark.implicits._
>   List(
> TestPayload("one"),
> TestPayload("two"),
> TestPayload("three"),
> TestPayload("four")
>   ).map(payload => TestEvent(generateSimpleSchemaBinary(payload, 
> simpleSchema)))
> .toDS()
> .withColumn("deserializedPayload", from_avro(col("payload"), 
> simpleSchema))
> .show(truncate = false)
> }
> {code}
> And here is what this program outputs:
> {noformat}
> +----------------------+-------------------+
> |payload               |deserializedPayload|
> +----------------------+-------------------+
> |[00 06 6F 6E 65]      |[four]             |
> |[00 06 74 77 6F]      |[four]             |
> |[00 0A 74 68 72 65 65]|[four]             |
> |[00 08 66 6F 75 72]   |[four]             |
> +----------------------+-------------------+{noformat}
> Here, we can see that the avro binary is correctly generated, but the 
> deserialized version is a copy of the last row. I have not yet verified that 
> this is an issue in cluster mode as well.
>  
> I dug into the code a bit more and it seems like the reuse of {{result}} 
> in {{AvroDataToCatalyst}} is overwriting the decoded values of previous rows. 
> I set a breakpoint in {{LocalRelation}} and the {{data}} sequence seems to all 
> point to the same address in memory - and therefore a mutation in one 
> variable will cause all of them to mutate.
> !Screen Shot 2019-05-21 at 2.39.27 PM.png!
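
As an illustration of the object-reuse pattern described above, here is a minimal Scala sketch that is independent of Spark internals (all names below are hypothetical): when one mutable holder is reused for every record and the rows only keep a reference to it, every row ends up showing the last decoded value.
{code:java}
// Hypothetical decoder that reuses one mutable buffer across calls, so every
// caller receives a reference to the same object.
class ReusingDecoder {
  private val result = new StringBuilder
  def decode(payload: String): CharSequence = {
    result.clear()
    result.append(payload)
    result
  }
}

object ReuseDemo extends App {
  val decoder = new ReusingDecoder
  val rows = Seq("one", "two", "three", "four").map(decoder.decode)
  // Every element references the same buffer, so all four print "four".
  println(rows.map(_.toString)) // List(four, four, four, four)
}
{code}
Handing out a fresh copy per record removes the aliasing in this sketch; excluding the {{ConvertToLocalRelation}} optimizer rule, mentioned elsewhere in this thread, works around the symptom in Spark.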



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25557) ORC predicate pushdown for nested fields

2019-05-30 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852053#comment-16852053
 ] 

Dongjoon Hyun commented on SPARK-25557:
---

Nope. The nested schema pruning is independent of predicate pushdown. This 
JIRA is waiting for Parquet predicate pushdown.

> ORC predicate pushdown for nested fields
> 
>
> Key: SPARK-25557
> URL: https://issues.apache.org/jira/browse/SPARK-25557
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27862) Upgrade json4s-jackson to 3.6.6

2019-05-30 Thread Izek Greenfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Izek Greenfield updated SPARK-27862:

Summary: Upgrade json4s-jackson to 3.6.6  (was: Upgrade json4s-jackson to 
3.6.5)

> Upgrade json4s-jackson to 3.6.6
> ---
>
> Key: SPARK-27862
> URL: https://issues.apache.org/jira/browse/SPARK-27862
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Izek Greenfield
>Priority: Minor
>
> It would be very good to upgrade to a newer version.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27884) Deprecate Python 2 support in Spark 3.0

2019-05-30 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-27884:
--
Description: Officially deprecate Python 2 support in Spark 3.0.  (was: 
Officially deprecate Python 2 support in Spark 3.0 and discuss the timeline to 
drop Python 2 support in a future release.)

> Deprecate Python 2 support in Spark 3.0
> ---
>
> Key: SPARK-27884
> URL: https://issues.apache.org/jira/browse/SPARK-27884
> Project: Spark
>  Issue Type: Story
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> Officially deprecate Python 2 support in Spark 3.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27885) Announce deprecation of Python 2 support

2019-05-30 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-27885:
--
Description: 
* Draft the message.
* Update Spark website and announce deprecation of Python 2 support in the next 
major release in 2019 and remove the support in a release after 2020/01/01.
* Announce it on users@ and dev@

  was:Update Spark website and announce deprecation of Python 2 support in the 
next major release in 2019 and remove the support in a release after 2020/01/01.


> Announce deprecation of Python 2 support
> 
>
> Key: SPARK-27885
> URL: https://issues.apache.org/jira/browse/SPARK-27885
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> * Draft the message.
> * Update Spark website and announce deprecation of Python 2 support in the 
> next major release in 2019 and remove the support in a release after 
> 2020/01/01.
> * Announce it on users@ and dev@



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27888) Python 2->3 migration guide for PySpark users

2019-05-30 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-27888:
-

 Summary: Python 2->3 migration guide for PySpark users
 Key: SPARK-27888
 URL: https://issues.apache.org/jira/browse/SPARK-27888
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.0.0
Reporter: Xiangrui Meng


We might need a short Python 2->3 migration guide for PySpark users. It doesn't 
need to be comprehensive given the many Python 2->3 migration guides already 
available. We just need some pointers and list items that are specific to 
PySpark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27885) Announce deprecation of Python 2 support

2019-05-30 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-27885:
--
Summary: Announce deprecation of Python 2 support  (was: Update Spark 
website and put deprecation warning)

> Announce deprecation of Python 2 support
> 
>
> Key: SPARK-27885
> URL: https://issues.apache.org/jira/browse/SPARK-27885
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> Update Spark website and announce deprecation of Python 2 support in the next 
> major release in 2019 and remove the support in a release after 2020/01/01.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27887) Check python version and print deprecation warning if version < 3

2019-05-30 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-27887:
--
Description: In Spark 3.0, users should see a deprecation warning if they 
use PySpark with Python < 3.

> Check python version and print deprecation warning if version < 3
> -
>
> Key: SPARK-27887
> URL: https://issues.apache.org/jira/browse/SPARK-27887
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> In Spark 3.0, users should see a deprecation warning if they use PySpark with 
> Python < 3.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27887) Check python version and print deprecation warning if version < 3

2019-05-30 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-27887:
-

 Summary: Check python version and print deprecation warning if 
version < 3
 Key: SPARK-27887
 URL: https://issues.apache.org/jira/browse/SPARK-27887
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.0.0
Reporter: Xiangrui Meng






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27885) Update Spark website and put deprecation warning

2019-05-30 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-27885:
--
Description: Update Spark website and announce deprecation of Python 2 
support in the next major release in 2019 and remove the support in a release 
after 2020/01/01.

> Update Spark website and put deprecation warning
> 
>
> Key: SPARK-27885
> URL: https://issues.apache.org/jira/browse/SPARK-27885
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> Update Spark website and announce deprecation of Python 2 support in the next 
> major release in 2019 and remove the support in a release after 2020/01/01.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27886) Add Apache Spark project to https://python3statement.org/

2019-05-30 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-27886:
-

 Summary: Add Apache Spark project to https://python3statement.org/
 Key: SPARK-27886
 URL: https://issues.apache.org/jira/browse/SPARK-27886
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.0.0
Reporter: Xiangrui Meng


Add Spark to https://python3statement.org/ and indicate our timeline.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27885) Update Spark website and put deprecation warning

2019-05-30 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-27885:
-

 Summary: Update Spark website and put deprecation warning
 Key: SPARK-27885
 URL: https://issues.apache.org/jira/browse/SPARK-27885
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation
Affects Versions: 3.0.0
Reporter: Xiangrui Meng






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27884) Deprecate Python 2 support in Spark 3.0

2019-05-30 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-27884:
-

 Summary: Deprecate Python 2 support in Spark 3.0
 Key: SPARK-27884
 URL: https://issues.apache.org/jira/browse/SPARK-27884
 Project: Spark
  Issue Type: Story
  Components: PySpark
Affects Versions: 3.0.0
Reporter: Xiangrui Meng


Officially deprecate Python 2 support in Spark 3.0 and discuss the timeline to 
drop Python 2 support in a future release.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27742) Security Support in Sources and Sinks for SS and Batch

2019-05-30 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16851979#comment-16851979
 ] 

Gabor Somogyi commented on SPARK-27742:
---

{quote}For the user this means better flexibility per job to setup an upper 
limit.{quote}

* A new token will be obtained at "expiryTimestamp * 0.75".
* The old token will be invalidated at expiryTimestamp because it is not renewed.

I'm not sure the app developer has to control the MaxLifeTime.


> Security Support in Sources and Sinks for SS and Batch
> --
>
> Key: SPARK-27742
> URL: https://issues.apache.org/jira/browse/SPARK-27742
> Project: Spark
>  Issue Type: Brainstorming
>  Components: SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> As discussed with [~erikerlandson] on the [Big Data on K8s 
> UG|https://docs.google.com/document/d/1pnF38NF6N5eM8DlK088XUW85Vms4V2uTsGZvSp8MNIA]
>  it would be good to capture current status and identify work that needs to 
> be done for securing Spark when accessing sources and sinks. For example what 
> is the status of SSL, Kerberos support in different scenarios. The big 
> concern nowadays is how to secure data pipelines end-to-end. 
> Note: Not sure if this overlaps with some other ticket. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27798) from_avro can modify variables in other rows in local mode

2019-05-30 Thread Gengliang Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16851966#comment-16851966
 ] 

Gengliang Wang edited comment on SPARK-27798 at 5/30/19 3:07 PM:
-

Turning off the rule "ConvertToLocalRelation" should work around the problem:
spark.conf.set("spark.sql.optimizer.excludedRules", 
"org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation")



was (Author: gengliang.wang):
Turning off the rule "ConvertToLocalRelation" should fix the problem:
spark.conf.set("spark.sql.optimizer.excludedRules", 
"org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation")


> from_avro can modify variables in other rows in local mode
> --
>
> Key: SPARK-27798
> URL: https://issues.apache.org/jira/browse/SPARK-27798
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Yosuke Mori
>Priority: Blocker
>  Labels: correctness
> Attachments: Screen Shot 2019-05-21 at 2.39.27 PM.png
>
>
> Steps to reproduce:
> Create a local Dataset (at least two distinct rows) with a binary Avro field. 
> Use the {{from_avro}} function to deserialize the binary into another column. 
> Verify that all of the rows incorrectly have the same value.
> Here's a concrete example (using Spark 2.4.3). All it does is convert a list 
> of TestPayload objects into binary using the defined avro schema, then try 
> to deserialize using {{from_avro}} with that same schema:
> {code:java}
> import org.apache.avro.Schema
> import org.apache.avro.generic.{GenericDatumWriter, GenericRecord, 
> GenericRecordBuilder}
> import org.apache.avro.io.EncoderFactory
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.avro.from_avro
> import org.apache.spark.sql.functions.col
> import java.io.ByteArrayOutputStream
> object TestApp extends App {
>   // Payload container
>   case class TestEvent(payload: Array[Byte])
>   // Deserialized Payload
>   case class TestPayload(message: String)
>   // Schema for Payload
>   val simpleSchema =
> """
>   |{
>   |"type": "record",
>   |"name" : "Payload",
>   |"fields" : [ {"name" : "message", "type" : [ "string", "null" ] } ]
>   |}
> """.stripMargin
>   // Convert TestPayload into avro binary
>   def generateSimpleSchemaBinary(record: TestPayload, avsc: String): 
> Array[Byte] = {
> val schema = new Schema.Parser().parse(avsc)
> val out = new ByteArrayOutputStream()
> val writer = new GenericDatumWriter[GenericRecord](schema)
> val encoder = EncoderFactory.get().binaryEncoder(out, null)
> val rootRecord = new GenericRecordBuilder(schema).set("message", 
> record.message).build()
> writer.write(rootRecord, encoder)
> encoder.flush()
> out.toByteArray
>   }
>   val spark: SparkSession = 
> SparkSession.builder().master("local[*]").getOrCreate()
>   import spark.implicits._
>   List(
> TestPayload("one"),
> TestPayload("two"),
> TestPayload("three"),
> TestPayload("four")
>   ).map(payload => TestEvent(generateSimpleSchemaBinary(payload, 
> simpleSchema)))
> .toDS()
> .withColumn("deserializedPayload", from_avro(col("payload"), 
> simpleSchema))
> .show(truncate = false)
> }
> {code}
> And here is what this program outputs:
> {noformat}
> +----------------------+-------------------+
> |payload               |deserializedPayload|
> +----------------------+-------------------+
> |[00 06 6F 6E 65]      |[four]             |
> |[00 06 74 77 6F]      |[four]             |
> |[00 0A 74 68 72 65 65]|[four]             |
> |[00 08 66 6F 75 72]   |[four]             |
> +----------------------+-------------------+{noformat}
> Here, we can see that the avro binary is correctly generated, but the 
> deserialized version is a copy of the last row. I have not yet verified that 
> this is an issue in cluster mode as well.
>  
> I dug into the code a bit more and it seems like the reuse of {{result}} 
> in {{AvroDataToCatalyst}} is overwriting the decoded values of previous rows. 
> I set a breakpoint in {{LocalRelation}} and the {{data}} sequence seems to all 
> point to the same address in memory - and therefore a mutation in one 
> variable will cause all of them to mutate.
> !Screen Shot 2019-05-21 at 2.39.27 PM.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27798) from_avro can modify variables in other rows in local mode

2019-05-30 Thread Gengliang Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16851966#comment-16851966
 ] 

Gengliang Wang commented on SPARK-27798:


Turning off the rule "ConvertToLocalRelation" should fix the problem:
spark.conf.set("spark.sql.optimizer.excludedRules", 
"org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation")
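
For completeness, a small script-style sketch of the same rule exclusion applied when the session is first built; the configuration key and rule name are taken from the comment above, everything else is illustrative:
{code:java}
import org.apache.spark.sql.SparkSession

// Exclude ConvertToLocalRelation for the whole session (sketch only).
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.optimizer.excludedRules",
    "org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation")
  .getOrCreate()
{code}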


> from_avro can modify variables in other rows in local mode
> --
>
> Key: SPARK-27798
> URL: https://issues.apache.org/jira/browse/SPARK-27798
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Yosuke Mori
>Priority: Blocker
>  Labels: correctness
> Attachments: Screen Shot 2019-05-21 at 2.39.27 PM.png
>
>
> Steps to reproduce:
> Create a local Dataset (at least two distinct rows) with a binary Avro field. 
> Use the {{from_avro}} function to deserialize the binary into another column. 
> Verify that all of the rows incorrectly have the same value.
> Here's a concrete example (using Spark 2.4.3). All it does is convert a list 
> of TestPayload objects into binary using the defined avro schema, then try 
> to deserialize using {{from_avro}} with that same schema:
> {code:java}
> import org.apache.avro.Schema
> import org.apache.avro.generic.{GenericDatumWriter, GenericRecord, 
> GenericRecordBuilder}
> import org.apache.avro.io.EncoderFactory
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.avro.from_avro
> import org.apache.spark.sql.functions.col
> import java.io.ByteArrayOutputStream
> object TestApp extends App {
>   // Payload container
>   case class TestEvent(payload: Array[Byte])
>   // Deserialized Payload
>   case class TestPayload(message: String)
>   // Schema for Payload
>   val simpleSchema =
> """
>   |{
>   |"type": "record",
>   |"name" : "Payload",
>   |"fields" : [ {"name" : "message", "type" : [ "string", "null" ] } ]
>   |}
> """.stripMargin
>   // Convert TestPayload into avro binary
>   def generateSimpleSchemaBinary(record: TestPayload, avsc: String): 
> Array[Byte] = {
> val schema = new Schema.Parser().parse(avsc)
> val out = new ByteArrayOutputStream()
> val writer = new GenericDatumWriter[GenericRecord](schema)
> val encoder = EncoderFactory.get().binaryEncoder(out, null)
> val rootRecord = new GenericRecordBuilder(schema).set("message", 
> record.message).build()
> writer.write(rootRecord, encoder)
> encoder.flush()
> out.toByteArray
>   }
>   val spark: SparkSession = 
> SparkSession.builder().master("local[*]").getOrCreate()
>   import spark.implicits._
>   List(
> TestPayload("one"),
> TestPayload("two"),
> TestPayload("three"),
> TestPayload("four")
>   ).map(payload => TestEvent(generateSimpleSchemaBinary(payload, 
> simpleSchema)))
> .toDS()
> .withColumn("deserializedPayload", from_avro(col("payload"), 
> simpleSchema))
> .show(truncate = false)
> }
> {code}
> And here is what this program outputs:
> {noformat}
> +----------------------+-------------------+
> |payload               |deserializedPayload|
> +----------------------+-------------------+
> |[00 06 6F 6E 65]      |[four]             |
> |[00 06 74 77 6F]      |[four]             |
> |[00 0A 74 68 72 65 65]|[four]             |
> |[00 08 66 6F 75 72]   |[four]             |
> +----------------------+-------------------+{noformat}
> Here, we can see that the avro binary is correctly generated, but the 
> deserialized version is a copy of the last row. I have not yet verified that 
> this is an issue in cluster mode as well.
>  
> I dug into the code a bit more and it seems like the reuse of {{result}} 
> in {{AvroDataToCatalyst}} is overwriting the decoded values of previous rows. 
> I set a breakpoint in {{LocalRelation}} and the {{data}} sequence seems to all 
> point to the same address in memory - and therefore a mutation in one 
> variable will cause all of them to mutate.
> !Screen Shot 2019-05-21 at 2.39.27 PM.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-27706) Add SQL metrics of numOutputRows for BroadcastExchangeExec

2019-05-30 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-27706.
-

> Add SQL metrics of numOutputRows for BroadcastExchangeExec
> --
>
> Key: SPARK-27706
> URL: https://issues.apache.org/jira/browse/SPARK-27706
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: dzcxzl
>Priority: Trivial
>
> Add SQL metrics of numOutputRows for BroadcastExchangeExec



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27742) Security Support in Sources and Sinks for SS and Batch

2019-05-30 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16851924#comment-16851924
 ] 

Stavros Kontopoulos edited comment on SPARK-27742 at 5/30/19 2:58 PM:
--

{quote}From client side the max lifetime can be only decreased for security 
reasons + see my previous point.
{quote}
In general since Kafka allows me to set this value, I should be able to do so. 

"Max lifetime for the token in milliseconds. If the value is -1, then 
MaxLifeTime will default to a server side config value 
(delegation.token.max.lifetime.ms)."

For the user this means better flexibility per job to setup an upper limit.

Again I see the point repeatedly get new tokens, no max life time. From a 
security perspective this allows the user to have a ticket for ever, if not 
mistaken, which is less restrictive than setting a hard limit. Imagine you have 
multiple Spark batch apps and you want to setup limits for administration 
reasons eg. no user is allowed to have access more than 5 days (for streaming 
jobs you need no limits). Anyway my 2 cents.
{quote}There is no possibility to obtain token for anybody else (pls see the 
comment in the code).
{quote}
When proxy user will be supported I guess there will be.


was (Author: skonto):
{quote}From client side the max lifetime can be only decreased for security 
reasons + see my previous point.
{quote}
In general since Kafka allows me to set this value, I should be able to do so. 

"Max lifetime for the token in milliseconds. If the value is -1, then 
MaxLifeTime will default to a server side config value 
(delegation.token.max.lifetime.ms)."

For the user this means better flexibility per job to setup an upper limit.

Again I see the point repeatedly get new tokens, no max life time. From a 
security perspective this allows the user to have a ticket for ever, if not 
mistaken, which is less restrictive than setting a hard limit. Imagine you have 
multiple Spark apps and you want to setup limits for administration reasons eg. 
no user is allowed to have access more than 5 days (for streaming jobs you need 
no limits). Anyway my 2 cents.
{quote}There is no possibility to obtain token for anybody else (pls see the 
comment in the code).
{quote}
When proxy user will be supported I guess there will be.

> Security Support in Sources and Sinks for SS and Batch
> --
>
> Key: SPARK-27742
> URL: https://issues.apache.org/jira/browse/SPARK-27742
> Project: Spark
>  Issue Type: Brainstorming
>  Components: SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> As discussed with [~erikerlandson] on the [Big Data on K8s 
> UG|https://docs.google.com/document/d/1pnF38NF6N5eM8DlK088XUW85Vms4V2uTsGZvSp8MNIA]
>  it would be good to capture current status and identify work that needs to 
> be done for securing Spark when accessing sources and sinks. For example what 
> is the status of SSL, Kerberos support in different scenarios. The big 
> concern nowadays is how to secure data pipelines end-to-end. 
> Note: Not sure if this overlaps with some other ticket. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27742) Security Support in Sources and Sinks for SS and Batch

2019-05-30 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16851924#comment-16851924
 ] 

Stavros Kontopoulos edited comment on SPARK-27742 at 5/30/19 2:57 PM:
--

{quote}From client side the max lifetime can be only decreased for security 
reasons + see my previous point.
{quote}
In general since Kafka allows me to set this value, I should be able to do so. 

"Max lifetime for the token in milliseconds. If the value is -1, then 
MaxLifeTime will default to a server side config value 
(delegation.token.max.lifetime.ms)."

For the user this means better flexibility per job to setup an upper limit.

Again I see the point repeatedly get new tokens, no max life time. From a 
security perspective this allows the user to have a ticket for ever, if not 
mistaken, which is less restrictive than setting a hard limit. Imagine you have 
multiple Spark apps and you want to setup limits for administration reasons eg. 
no user is allowed to have access more than 5 days (for streaming jobs you need 
no limits). Anyway my 2 cents.
{quote}There is no possibility to obtain token for anybody else (pls see the 
comment in the code).
{quote}
When proxy user will be supported I guess there will be.


was (Author: skonto):
{quote}From client side the max lifetime can be only decreased for security 
reasons + see my previous point.
{quote}
In general since Kafka allows me to set this value, I should be able to do so. 

"Max lifetime for the token in milliseconds. If the value is -1, then 
MaxLifeTime will default to a server side config value 
(delegation.token.max.lifetime.ms)."

For the user this means better flexibility per job to setup an upper limit.

Again I see the point repeatedly get new tokens, no max life time. From a 
security perspective this allows the user to have a ticket for ever, if not 
mistaken, which is less restrictive than setting a hard limit. Imagine you have 
multiple Spark apps and you want to setup limits for administration reasons eg. 
no user is allowed to have access more than 5 days. Anyway my 2 cents.
{quote}There is no possibility to obtain token for anybody else (pls see the 
comment in the code).
{quote}
When proxy user will be supported I guess there will be.

> Security Support in Sources and Sinks for SS and Batch
> --
>
> Key: SPARK-27742
> URL: https://issues.apache.org/jira/browse/SPARK-27742
> Project: Spark
>  Issue Type: Brainstorming
>  Components: SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> As discussed with [~erikerlandson] on the [Big Data on K8s 
> UG|https://docs.google.com/document/d/1pnF38NF6N5eM8DlK088XUW85Vms4V2uTsGZvSp8MNIA]
>  it would be good to capture current status and identify work that needs to 
> be done for securing Spark when accessing sources and sinks. For example what 
> is the status of SSL, Kerberos support in different scenarios. The big 
> concern nowadays is how to secure data pipelines end-to-end. 
> Note: Not sure if this overlaps with some other ticket. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27742) Security Support in Sources and Sinks for SS and Batch

2019-05-30 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16851924#comment-16851924
 ] 

Stavros Kontopoulos edited comment on SPARK-27742 at 5/30/19 2:55 PM:
--

{quote}From client side the max lifetime can be only decreased for security 
reasons + see my previous point.
{quote}
In general since Kafka allows me to set this value, I should be able to do so. 

"Max lifetime for the token in milliseconds. If the value is -1, then 
MaxLifeTime will default to a server side config value 
(delegation.token.max.lifetime.ms)."

For the user this means better flexibility per job to setup an upper limit.

Again I see the point repeatedly get new tokens, no max life time. From a 
security perspective this allows the user to have a ticket for ever, if not 
mistaken, which is less restrictive than setting a hard limit. Imagine you have 
multiple Spark apps and you want to setup limits for administration reasons eg. 
no user is allowed to have access more than 5 days. Anyway my 2 cents.
{quote}There is no possibility to obtain token for anybody else (pls see the 
comment in the code).
{quote}
When proxy user will be supported I guess there will be.


was (Author: skonto):
{quote}From client side the max lifetime can be only decreased for security 
reasons + see my previous point.
{quote}
In general since Kafka allows me to set this value, I should be able to do so. 

"Max lifetime for the token in milliseconds. If the value is -1, then 
MaxLifeTime will default to a server side config value 
(delegation.token.max.lifetime.ms)."

For the user this means better flexibility per job to setup an upper limit.

Again I see the point repeatedly get new tokens, no max life time. From a 
security perspective this allows the user to have a ticket for ever, if not 
mistaken, which is less restrictive than setting a hard limit. Imagine you have 
multiple Spark apps and you want to setup limits. Anyway my 2 cents.
{quote}There is no possibility to obtain token for anybody else (pls see the 
comment in the code).
{quote}
When proxy user will be supported I guess there will be.

> Security Support in Sources and Sinks for SS and Batch
> --
>
> Key: SPARK-27742
> URL: https://issues.apache.org/jira/browse/SPARK-27742
> Project: Spark
>  Issue Type: Brainstorming
>  Components: SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> As discussed with [~erikerlandson] on the [Big Data on K8s 
> UG|https://docs.google.com/document/d/1pnF38NF6N5eM8DlK088XUW85Vms4V2uTsGZvSp8MNIA]
>  it would be good to capture current status and identify work that needs to 
> be done for securing Spark when accessing sources and sinks. For example what 
> is the status of SSL, Kerberos support in different scenarios. The big 
> concern nowadays is how to secure data pipelines end-to-end. 
> Note: Not sure if this overlaps with some other ticket. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27742) Security Support in Sources and Sinks for SS and Batch

2019-05-30 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16851924#comment-16851924
 ] 

Stavros Kontopoulos edited comment on SPARK-27742 at 5/30/19 2:55 PM:
--

{quote}From client side the max lifetime can be only decreased for security 
reasons + see my previous point.
{quote}
In general since Kafka allows me to set this value, I should be able to do so. 

"Max lifetime for the token in milliseconds. If the value is -1, then 
MaxLifeTime will default to a server side config value 
(delegation.token.max.lifetime.ms)."

For the user this means better flexibility per job to setup an upper limit.

Again I see the point repeatedly get new tokens, no max life time. From a 
security perspective this allows the user to have a ticket for ever, if not 
mistaken, which is less restrictive than setting a hard limit. Imagine you have 
multiple Spark apps and you want to setup limits. Anyway my 2 cents.
{quote}There is no possibility to obtain token for anybody else (pls see the 
comment in the code).
{quote}
When proxy user will be supported I guess there will be.


was (Author: skonto):
{quote}From client side the max lifetime can be only decreased for security 
reasons + see my previous point.
{quote}
In general since Kafka allows me to set this value, I should be able to do so. 

"Max lifetime for the token in milliseconds. If the value is -1, then 
MaxLifeTime will default to a server side config value 
(delegation.token.max.lifetime.ms)."

For the user this means better flexibility per job to setup an upper limit.

Again I see the point repeatedly get new tokens, no max life time. From a 
security perspective this allows the user renew a ticket for ever, if not 
mistaken, which is less restrictive than setting a hard limit. Imagine you have 
multiple Spark apps and you want to setup limits. Anyway my 2 cents.
{quote}There is no possibility to obtain token for anybody else (pls see the 
comment in the code).
{quote}
When proxy user will be supported I guess there will be.

> Security Support in Sources and Sinks for SS and Batch
> --
>
> Key: SPARK-27742
> URL: https://issues.apache.org/jira/browse/SPARK-27742
> Project: Spark
>  Issue Type: Brainstorming
>  Components: SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> As discussed with [~erikerlandson] on the [Big Data on K8s 
> UG|https://docs.google.com/document/d/1pnF38NF6N5eM8DlK088XUW85Vms4V2uTsGZvSp8MNIA]
>  it would be good to capture current status and identify work that needs to 
> be done for securing Spark when accessing sources and sinks. For example what 
> is the status of SSL, Kerberos support in different scenarios. The big 
> concern nowadays is how to secure data pipelines end-to-end. 
> Note: Not sure if this overlaps with some other ticket. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27742) Security Support in Sources and Sinks for SS and Batch

2019-05-30 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16851924#comment-16851924
 ] 

Stavros Kontopoulos edited comment on SPARK-27742 at 5/30/19 2:53 PM:
--

{quote}From client side the max lifetime can be only decreased for security 
reasons + see my previous point.
{quote}
In general since Kafka allows me to set this value, I should be able to do so. 

"Max lifetime for the token in milliseconds. If the value is -1, then 
MaxLifeTime will default to a server side config value 
(delegation.token.max.lifetime.ms)."

For the user this means better flexibility per job to setup an upper limit.

Again I see the point repeatedly get new tokens, no max life time. From a 
security perspective this allows the user renew a ticket for ever, if not 
mistaken, which is less restrictive than setting a hard limit. Imagine you have 
multiple Spark apps and you want to setup limits. Anyway my 2 cents.
{quote}There is no possibility to obtain token for anybody else (pls see the 
comment in the code).
{quote}
When proxy user will be supported I guess there will be.


was (Author: skonto):
{quote}From client side the max lifetime can be only decreased for security 
reasons + see my previous point.
{quote}
In general since Kafka allows me to set this value, I should be able to do so. 

"Max lifetime for the token in milliseconds. If the value is -1, then 
MaxLifeTime will default to a server side config value 
(delegation.token.max.lifetime.ms)."

For the user this means better flexibility per job to setup an upper limit.

Again I see the point repeatedly get new tokens, no max life time. From a 
security perspective this allows the user renew a ticket for ever, if not 
mistaken, which is less restrictive than the setting a hard limit. Imagine I 
have multiple Spark apps and I want to setup limits. Anyway some thoughts.
{quote}There is no possibility to obtain token for anybody else (pls see the 
comment in the code).
{quote}
When proxy user will be supported I guess there will be.

> Security Support in Sources and Sinks for SS and Batch
> --
>
> Key: SPARK-27742
> URL: https://issues.apache.org/jira/browse/SPARK-27742
> Project: Spark
>  Issue Type: Brainstorming
>  Components: SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> As discussed with [~erikerlandson] on the [Big Data on K8s 
> UG|https://docs.google.com/document/d/1pnF38NF6N5eM8DlK088XUW85Vms4V2uTsGZvSp8MNIA]
>  it would be good to capture current status and identify work that needs to 
> be done for securing Spark when accessing sources and sinks. For example what 
> is the status of SSL, Kerberos support in different scenarios. The big 
> concern nowadays is how to secure data pipelines end-to-end. 
> Note: Not sure if this overlaps with some other ticket. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27742) Security Support in Sources and Sinks for SS and Batch

2019-05-30 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16851924#comment-16851924
 ] 

Stavros Kontopoulos edited comment on SPARK-27742 at 5/30/19 2:53 PM:
--

{quote}From client side the max lifetime can be only decreased for security 
reasons + see my previous point.
{quote}
In general since Kafka allows me to set this value, I should be able to do so. 

"Max lifetime for the token in milliseconds. If the value is -1, then 
MaxLifeTime will default to a server side config value 
(delegation.token.max.lifetime.ms)."

For the user this means better flexibility per job to set up an upper limit.

Again, I see the point about repeatedly getting new tokens with no max lifetime. From a 
security perspective, though, this allows the user to renew a ticket forever, if I'm not 
mistaken, which is less restrictive than setting a hard limit. Imagine I 
have multiple Spark apps and I want to set up limits. Anyway, some thoughts.
{quote}There is no possibility to obtain token for anybody else (pls see the 
comment in the code).
{quote}
When proxy users are supported, I guess there will be.


was (Author: skonto):
{quote}From client side the max lifetime can be only decreased for security 
reasons + see my previous point.
{quote}
In general since Kafka allows me to set this value, I should be able to do so. 

"Max lifetime for the token in milliseconds. If the value is -1, then 
MaxLifeTime will default to a server side config value 
(delegation.token.max.lifetime.ms)."

For the user this means better flexibility per job to setup an upper limit.

Again I see the point repeatedly get new tokens, no max life time.
{quote}There is no possibility to obtain token for anybody else (pls see the 
comment in the code).
{quote}
When proxy user will be supported I guess there will be.

> Security Support in Sources and Sinks for SS and Batch
> --
>
> Key: SPARK-27742
> URL: https://issues.apache.org/jira/browse/SPARK-27742
> Project: Spark
>  Issue Type: Brainstorming
>  Components: SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> As discussed with [~erikerlandson] on the [Big Data on K8s 
> UG|https://docs.google.com/document/d/1pnF38NF6N5eM8DlK088XUW85Vms4V2uTsGZvSp8MNIA]
>  it would be good to capture current status and identify work that needs to 
> be done for securing Spark when accessing sources and sinks. For example what 
> is the status of SSL, Kerberos support in different scenarios. The big 
> concern nowadays is how to secure data pipelines end-to-end. 
> Note: Not sure if this overlaps with some other ticket. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27873) Csv reader, adding a corrupt record column causes error if enforceSchema=false

2019-05-30 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16851952#comment-16851952
 ] 

Liang-Chi Hsieh commented on SPARK-27873:
-

I can prepare a PR if Marcin or Hyukjin Kwon don't plan to.

> Csv reader, adding a corrupt record column causes error if enforceSchema=false
> --
>
> Key: SPARK-27873
> URL: https://issues.apache.org/jira/browse/SPARK-27873
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Marcin Mejran
>Priority: Major
>
> In the Spark CSV reader, if you're using permissive mode with a column for 
> storing corrupt records, then you need to add a new schema column 
> corresponding to columnNameOfCorruptRecord.
> However, if you have a header row and enforceSchema=false, the schema vs. 
> header validation fails because there is an extra column corresponding to 
> columnNameOfCorruptRecord.
> Since the FAILFAST mode doesn't print informative error messages about which 
> rows failed to parse, there is no way to track down broken rows other than by 
> setting a corrupt record column.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27873) Csv reader, adding a corrupt record column causes error if enforceSchema=false

2019-05-30 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16851948#comment-16851948
 ] 

Liang-Chi Hsieh commented on SPARK-27873:
-

I guess what Marcin meant is:

{code}
import org.apache.spark.sql.types.{StringType, StructType}

// testFile / valueMalformedWithHeaderFile are test helpers (resource path of a
// malformed CSV file that has a header row).
val schema = StructType.fromDDL("a int, b date")
val columnNameOfCorruptRecord = "_unparsed"
val schemaWithCorrField1 = schema.add(columnNameOfCorruptRecord, StringType)
val df = spark
  .read
  .option("mode", "Permissive")
  .option("header", "true")
  .option("enforceSchema", false)
  .option("columnNameOfCorruptRecord", columnNameOfCorruptRecord)
  .schema(schemaWithCorrField1)
  .csv(testFile(valueMalformedWithHeaderFile))
{code}

If we want to keep corrupt records, we add a new column to the schema. But 
this new column isn't in the CSV header. So if enforceSchema is disabled at the same 
time, CSVHeaderChecker throws an exception like:

{code}
[info]   Cause: java.lang.IllegalArgumentException: Number of column in CSV 
header is not equal to number of fields in the schema: 
[info]  Header length: 2, schema size: 3   
{code}

This happens because CSVHeaderChecker doesn't take columnNameOfCorruptRecord into 
account for now.
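
Until that is addressed, one possible workaround (just a sketch, assuming it is acceptable to force the declared schema) is to leave enforceSchema at its default of true, so the header row is skipped without being validated against the schema while the corrupt-record column is still populated:

{code}
import org.apache.spark.sql.types.{StringType, StructType}

// Hypothetical file path; schema matches the snippet above.
val schema = StructType.fromDDL("a int, b date").add("_unparsed", StringType)
val df = spark.read
  .option("mode", "PERMISSIVE")
  .option("header", "true")
  // enforceSchema defaults to true, so no header-vs-schema length check is done.
  .option("columnNameOfCorruptRecord", "_unparsed")
  .schema(schema)
  .csv("/path/to/value-malformed-with-header.csv")
{code}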

> Csv reader, adding a corrupt record column causes error if enforceSchema=false
> --
>
> Key: SPARK-27873
> URL: https://issues.apache.org/jira/browse/SPARK-27873
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Marcin Mejran
>Priority: Major
>
> In the Spark CSV reader, if you're using permissive mode with a column for 
> storing corrupt records, then you need to add a new schema column 
> corresponding to columnNameOfCorruptRecord.
> However, if you have a header row and enforceSchema=false, the schema vs. 
> header validation fails because there is an extra column corresponding to 
> columnNameOfCorruptRecord.
> Since the FAILFAST mode doesn't print informative error messages about which 
> rows failed to parse, there is no way to track down broken rows other than by 
> setting a corrupt record column.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27757) Bump Jackson to 2.9.9

2019-05-30 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-27757:
--
Priority: Minor  (was: Major)

> Bump Jackson to 2.9.9
> -
>
> Key: SPARK-27757
> URL: https://issues.apache.org/jira/browse/SPARK-27757
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Minor
> Fix For: 3.0.0
>
>
> This fixes CVE-2019-12086 on Databind: 
> https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.9.9
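
For downstream builds that want the patched artifact before picking up a Spark release with this change, a minimal sketch of pinning it in sbt (shown for jackson-databind only; other Jackson modules would need the same treatment):

{code}
// build.sbt -- force the jackson-databind version that contains the CVE-2019-12086 fix
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.9.9"
{code}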



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27757) Bump Jackson to 2.9.9

2019-05-30 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-27757.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24646
[https://github.com/apache/spark/pull/24646]

> Bump Jackson to 2.9.9
> -
>
> Key: SPARK-27757
> URL: https://issues.apache.org/jira/browse/SPARK-27757
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 3.0.0
>
>
> This fixes CVE-2019-12086 on Databind: 
> https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.9.9



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27757) Bump Jackson to 2.9.9

2019-05-30 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-27757:
-

Assignee: Fokko Driesprong

> Bump Jackson to 2.9.9
> -
>
> Key: SPARK-27757
> URL: https://issues.apache.org/jira/browse/SPARK-27757
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
>
> This fixes CVE-2019-12086 on Databind: 
> https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.9.9



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27883) Add aggregates.sql - Part2

2019-05-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27883:


Assignee: (was: Apache Spark)

> Add aggregates.sql - Part2
> --
>
> Key: SPARK-27883
> URL: https://issues.apache.org/jira/browse/SPARK-27883
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> In this ticket, we plan to add the regression test cases of 
> https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/aggregates.sql#L145-L350



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27883) Add aggregates.sql - Part2

2019-05-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27883:


Assignee: Apache Spark

> Add aggregates.sql - Part2
> --
>
> Key: SPARK-27883
> URL: https://issues.apache.org/jira/browse/SPARK-27883
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> In this ticket, we plan to add the regression test cases of 
> https://github.com/postgres/postgres/blob/REL_12_BETA1/src/test/regress/sql/aggregates.sql#L145-L350



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27742) Security Support in Sources and Sinks for SS and Batch

2019-05-30 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16851924#comment-16851924
 ] 

Stavros Kontopoulos edited comment on SPARK-27742 at 5/30/19 2:28 PM:
--

{quote}From client side the max lifetime can be only decreased for security 
reasons + see my previous point.
{quote}
In general since Kafka allows me to set this value, I should be able to do so. 

"Max lifetime for the token in milliseconds. If the value is -1, then 
MaxLifeTime will default to a server side config value 
(delegation.token.max.lifetime.ms)."

For the user this means better flexibility per job to set up an upper limit.

Again, I see the point about repeatedly getting new tokens with no max lifetime.
{quote}There is no possibility to obtain token for anybody else (pls see the 
comment in the code).
{quote}
When proxy users are supported, I guess there will be.


was (Author: skonto):
{quote}From client side the max lifetime can be only decreased for security 
reasons + see my previous point.
{quote}
In general since Kafka allows me to set this value, I should be able to do so. 

"Max lifetime for the token in milliseconds. If the value is -1, then 
MaxLifeTime will default to a server side config value 
(delegation.token.max.lifetime.ms)."

For the user this means better flexibility per job to setup an upper limit.

Again I see the point repeatedly get new tokens.
{quote}There is no possibility to obtain token for anybody else (pls see the 
comment in the code).
{quote}
When proxy user will be supported I guess there will be.

> Security Support in Sources and Sinks for SS and Batch
> --
>
> Key: SPARK-27742
> URL: https://issues.apache.org/jira/browse/SPARK-27742
> Project: Spark
>  Issue Type: Brainstorming
>  Components: SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> As discussed with [~erikerlandson] on the [Big Data on K8s 
> UG|https://docs.google.com/document/d/1pnF38NF6N5eM8DlK088XUW85Vms4V2uTsGZvSp8MNIA]
>  it would be good to capture current status and identify work that needs to 
> be done for securing Spark when accessing sources and sinks. For example what 
> is the status of SSL, Kerberos support in different scenarios. The big 
> concern nowadays is how to secure data pipelines end-to-end. 
> Note: Not sure if this overlaps with some other ticket. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27742) Security Support in Sources and Sinks for SS and Batch

2019-05-30 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16851924#comment-16851924
 ] 

Stavros Kontopoulos edited comment on SPARK-27742 at 5/30/19 2:27 PM:
--

{quote}From client side the max lifetime can be only decreased for security 
reasons + see my previous point.
{quote}
In general since Kafka allows me to set this value, I should be able to do so. 

"Max lifetime for the token in milliseconds. If the value is -1, then 
MaxLifeTime will default to a server side config value 
(delegation.token.max.lifetime.ms)."

For the user this means better flexibility per job to set up an upper limit.

Again, I see the point about repeatedly getting new tokens.
{quote}There is no possibility to obtain token for anybody else (pls see the 
comment in the code).
{quote}
When proxy users are supported, I guess there will be.


was (Author: skonto):
{quote}From client side the max lifetime can be only decreased for security 
reasons + see my previous point.
{quote}
In general since Kafka allows me to set this value, I should be able to do so. 

"Max lifetime for the token in milliseconds. If the value is -1, then 
MaxLifeTime will default to a server side config value 
(delegation.token.max.lifetime.ms)."

For the user this means better flexibility per job.
{quote}There is no possibility to obtain token for anybody else (pls see the 
comment in the code).
{quote}
When proxy user will be supported I guess there will be.

> Security Support in Sources and Sinks for SS and Batch
> --
>
> Key: SPARK-27742
> URL: https://issues.apache.org/jira/browse/SPARK-27742
> Project: Spark
>  Issue Type: Brainstorming
>  Components: SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> As discussed with [~erikerlandson] on the [Big Data on K8s 
> UG|https://docs.google.com/document/d/1pnF38NF6N5eM8DlK088XUW85Vms4V2uTsGZvSp8MNIA]
>  it would be good to capture current status and identify work that needs to 
> be done for securing Spark when accessing sources and sinks. For example what 
> is the status of SSL, Kerberos support in different scenarios. The big 
> concern nowadays is how to secure data pipelines end-to-end. 
> Note: Not sure if this overlaps with some other ticket. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27742) Security Support in Sources and Sinks for SS and Batch

2019-05-30 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16851924#comment-16851924
 ] 

Stavros Kontopoulos edited comment on SPARK-27742 at 5/30/19 2:24 PM:
--

{quote}From client side the max lifetime can be only decreased for security 
reasons + see my previous point.
{quote}
In general since Kafka allows me to set this value, I should be able to do so. 

"Max lifetime for the token in milliseconds. If the value is -1, then 
MaxLifeTime will default to a server side config value 
(delegation.token.max.lifetime.ms)."

For the user this means better flexibility per job.
{quote}There is no possibility to obtain token for anybody else (pls see the 
comment in the code).
{quote}
When proxy users are supported, I guess there will be.


was (Author: skonto):
{quote}From client side the max lifetime can be only decreased for security 
reasons + see my previous point.
{quote}
In general since Kafka allows me to set this value, I should be able to do so. 

"Max lifetime for the token in milliseconds. If the value is -1, then 
MaxLifeTime will default to a server side config value 
(delegation.token.max.lifetime.ms)."
{quote}There is no possibility to obtain token for anybody else (pls see the 
comment in the code).
{quote}
When proxy user will be supported I guess there will be.

> Security Support in Sources and Sinks for SS and Batch
> --
>
> Key: SPARK-27742
> URL: https://issues.apache.org/jira/browse/SPARK-27742
> Project: Spark
>  Issue Type: Brainstorming
>  Components: SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> As discussed with [~erikerlandson] on the [Big Data on K8s 
> UG|https://docs.google.com/document/d/1pnF38NF6N5eM8DlK088XUW85Vms4V2uTsGZvSp8MNIA]
>  it would be good to capture current status and identify work that needs to 
> be done for securing Spark when accessing sources and sinks. For example what 
> is the status of SSL, Kerberos support in different scenarios. The big 
> concern nowadays is how to secure data pipelines end-to-end. 
> Note: Not sure if this overlaps with some other ticket. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


