[jira] [Commented] (SPARK-31891) `ALTER TABLE multipartIdentifier RECOVER PARTITIONS` should drop a partition if its partition-specific location no longer exists

2021-02-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289745#comment-17289745
 ] 

Apache Spark commented on SPARK-31891:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31633

> `ALTER TABLE multipartIdentifier RECOVER PARTITIONS` should drop a partition if 
> its partition-specific location no longer exists
> ---
>
> Key: SPARK-31891
> URL: https://issues.apache.org/jira/browse/SPARK-31891
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Zhu, Lipeng
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently, when executing
> {code:sql}
> ALTER TABLE multipartIdentifier RECOVER PARTITIONS
> {code}
> Spark automatically adds partitions based on the directory structure under the 
> table root location.
> Spark should add one more step: check whether each existing partition's 
> specific location still exists and, if not, drop that partition.
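>
> A minimal sketch (Scala) of what this could look like; the table name `t` and 
> partition column `dt` are hypothetical:
> {code:scala}
> // create a partitioned table and add one partition (illustrative names only)
> spark.sql("CREATE TABLE t (id INT, dt STRING) USING parquet PARTITIONED BY (dt)")
> spark.sql("INSERT INTO t PARTITION (dt = '2021-02-23') VALUES (1)")
> // if the directory <table root>/dt=2021-02-23 is later deleted externally,
> // RECOVER PARTITIONS currently only adds partitions found under the table root;
> // the proposal is to also drop partitions whose location no longer exists
> spark.sql("ALTER TABLE t RECOVER PARTITIONS")
> {code}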
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31891) `ALTER TABLE multipartIdentifier RECOVER PARTITIONS` should drop a partition if its partition-specific location no longer exists

2021-02-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289744#comment-17289744
 ] 

Apache Spark commented on SPARK-31891:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31633

> `ALTER TABLE multipartIdentifier RECOVER PARTITIONS` should drop a partition if 
> its partition-specific location no longer exists
> ---
>
> Key: SPARK-31891
> URL: https://issues.apache.org/jira/browse/SPARK-31891
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Zhu, Lipeng
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently, when executing
> {code:sql}
> ALTER TABLE multipartIdentifier RECOVER PARTITIONS
> {code}
> Spark automatically adds partitions based on the directory structure under the 
> table root location.
> Spark should add one more step: check whether each existing partition's 
> specific location still exists and, if not, drop that partition.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile

2021-02-23 Thread Xiaochen Ouyang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289728#comment-17289728
 ] 

Xiaochen Ouyang commented on SPARK-33212:
-

Thanks for your reply, [~csun]!

Submit command: spark-submit --master yarn --deploy-mode client --class 
org.apache.spark.examples.SparkPi /opt/spark/examples/jars/spark-examples*.jar

In ApplicationMaster.scala:

{code:scala}
/** Add the Yarn IP filter that is required for properly securing the UI. */
private def addAmIpFilter(driver: Option[RpcEndpointRef]) = {
  val proxyBase = System.getenv(ApplicationConstants.APPLICATION_WEB_PROXY_BASE_ENV)
  // this is the class that must be loadable by the driver
  val amFilter = "org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter"
  val params = client.getAmIpFilterParams(yarnConf, proxyBase)
  driver match {
    case Some(d) =>
      d.send(AddWebUIFilter(amFilter, params.toMap, proxyBase))

    case None =>
      System.setProperty("spark.ui.filters", amFilter)
      params.foreach { case (k, v) => System.setProperty(s"spark.$amFilter.param.$k", v) }
  }
}
{code}

We need to load hadoop-yarn-server-web-proxy.jar into the driver classloader when 
submitting a Spark on YARN application. Do you mean that we should copy 
hadoop-yarn-server-web-proxy.jar into spark/jars?
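
A quick diagnostic sketch (Scala, illustrative only, not from the original report) 
that can be run in the driver, e.g. in spark-shell with --master yarn --deploy-mode 
client, to check whether the filter class is visible to the driver classloader:

{code:scala}
// check whether AmIpFilter can be loaded by the driver's classloader
try {
  Class.forName("org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter")
  println("AmIpFilter is on the driver classpath")
} catch {
  case _: ClassNotFoundException =>
    // assumption: adding the hadoop-yarn-server-web-proxy jar via --jars or
    // spark.driver.extraClassPath (the exact jar path depends on the Hadoop install)
    // would make the filter loadable
    println("AmIpFilter is NOT on the driver classpath")
}
{code}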

 

1. AMIpFilter  ClassNotFoundException:

2021-02-24 14:52:56,617 INFO org.apache.spark.storage.BlockManager: Initialized 
BlockManager: BlockManagerId(driver, spark-worker-2, 38399, None)
2021-02-24 14:52:56,704 INFO 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend: Add WebUI 
Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, 
Map(PROXY_HOSTS -> spark-worker-1,spark-worker-2, PROXY_URI_BASES -> 
http://spark-worker-1:8088/proxy/application_1613961532167_0098,http://spark-worker-2:8088/proxy/application_1613961532167_0098,
 RM_HA_URLS -> spark-worker-1:8088,spark-worker-2:8088), 
/proxy/application_1613961532167_0098
2021-02-24 14:52:56,708 INFO org.apache.spark.ui.JettyUtils: Adding filter 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /jobs, 
/jobs/json, /jobs/job, /jobs/job/json, /stages, /stages/json, /stages/stage, 
/stages/stage/json, /stages/pool, /stages/pool/json, /storage, /storage/json, 
/storage/rdd, /storage/rdd/json, /environment, /environment/json, /executors, 
/executors/json, /executors/threadDump, /executors/threadDump/json, /logLevel, 
/static, /, /api, /jobs/job/kill, /stages/stage/kill.
2021-02-24 14:52:56,722 WARN org.spark_project.jetty.servlet.BaseHolder:
java.lang.ClassNotFoundException: 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
 at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 at org.spark_project.jetty.util.Loader.loadClass(Loader.java:86)
 at org.spark_project.jetty.servlet.BaseHolder.doStart(BaseHolder.java:95)
 at org.spark_project.jetty.servlet.FilterHolder.doStart(FilterHolder.java:92)
 at 
org.spark_project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
 at 
org.spark_project.jetty.servlet.ServletHandler.initialize(ServletHandler.java:872)
 at 
org.spark_project.jetty.servlet.ServletHandler.updateMappings(ServletHandler.java:1596)
 at 
org.spark_project.jetty.servlet.ServletHandler.setFilterMappings(ServletHandler.java:1659)
 at 
org.spark_project.jetty.servlet.ServletHandler.addFilterMapping(ServletHandler.java:1297)
 at 
org.spark_project.jetty.servlet.ServletHandler.addFilterWithMapping(ServletHandler.java:1145)
 at 
org.spark_project.jetty.servlet.ServletContextHandler.addFilter(ServletContextHandler.java:448)
 at 
org.apache.spark.ui.JettyUtils$$anonfun$addFilters$1$$anonfun$apply$1.apply(JettyUtils.scala:325)
 at 
org.apache.spark.ui.JettyUtils$$anonfun$addFilters$1$$anonfun$apply$1.apply(JettyUtils.scala:294)
 at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
 at 
org.apache.spark.ui.JettyUtils$$anonfun$addFilters$1.apply(JettyUtils.scala:294)
 at 
org.apache.spark.ui.JettyUtils$$anonfun$addFilters$1.apply(JettyUtils.scala:293)
 at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
 at org.apache.spark.ui.JettyUtils$.addFilters(JettyUtils.scala:293)
 at 
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$$anonfun$org$apache$spark$scheduler$cluster$YarnSchedulerBackend$$addWebUIFilter$3.apply(YarnSchedulerBackend.scala:176)
 at 
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$$anonfun$org$apache$spark$scheduler$cluster$YarnSchedulerBackend$$addWebUIFilter$3.apply(YarnSchedulerBackend.scala:176)
 at scala.Option.foreach(Option.scala:257)
 at 
org.apache.spark.scheduler.cluster.YarnSchedulerBackend.org$apache$spark$scheduler$cluster$YarnSchedulerB

[jira] [Updated] (SPARK-34515) Fix NPE if InSet contains null value during getPartitionsByFilter

2021-02-23 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-34515:
-
Affects Version/s: 3.1.2

> Fix NPE if InSet contains null value during getPartitionsByFilter
> -
>
> Key: SPARK-34515
> URL: https://issues.apache.org/jira/browse/SPARK-34515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.1.2
>Reporter: ulysses you
>Priority: Minor
>
> Spark will convert an InSet filter to `>= and <=` if its number of values 
> exceeds `spark.sql.hive.metastorePartitionPruningInSetThreshold` during 
> partition pruning. In this case, if the values contain a null, we get the 
> following exception:
>  
> {code:java}
> java.lang.NullPointerException
>  at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:1389)
>  at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:50)
>  at 
> scala.math.LowPriorityOrderingImplicits$$anon$3.compare(Ordering.scala:153)
>  at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
>  at java.util.TimSort.sort(TimSort.java:220)
>  at java.util.Arrays.sort(Arrays.java:1438)
>  at scala.collection.SeqLike.sorted(SeqLike.scala:659)
>  at scala.collection.SeqLike.sorted$(SeqLike.scala:647)
>  at scala.collection.AbstractSeq.sorted(Seq.scala:45)
>  at org.apache.spark.sql.hive.client.Shim_v0_13.convert$1(HiveShim.scala:772)
>  at 
> org.apache.spark.sql.hive.client.Shim_v0_13.$anonfun$convertFilters$4(HiveShim.scala:826)
>  at scala.collection.immutable.Stream.flatMap(Stream.scala:489)
>  at 
> org.apache.spark.sql.hive.client.Shim_v0_13.convertFilters(HiveShim.scala:826)
>  at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:848)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitionsByFilter$1(HiveClientImpl.scala:750)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34152) CreateViewStatement.child should be a real child

2021-02-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-34152.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31273
[https://github.com/apache/spark/pull/31273]

> CreateViewStatement.child should be a real child
> 
>
> Key: SPARK-34152
> URL: https://issues.apache.org/jira/browse/SPARK-34152
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.2.0
>
>
> Similar to `CreateTableAsSelectStatement`, the input query of 
> `CreateViewStatement` should be a child and get analyzed during the analysis 
> phase.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34414) OptimizeMetadataOnlyQuery should only apply to deterministic filters

2021-02-23 Thread Yesheng Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yesheng Ma resolved SPARK-34414.

Resolution: Invalid

> OptimizeMetadataOnlyQuery should only apply to deterministic filters
> -
>
> Key: SPARK-34414
> URL: https://issues.apache.org/jira/browse/SPARK-34414
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Yesheng Ma
>Priority: Major
>
> Similar to FileSourcePartitionPruning, OptimizeMetadataOnlyQuery should only 
> apply to deterministic filters. If the filters are non-deterministic, they have 
> to be evaluated against partitions separately.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34516) Spark 3.0.1 encounters parquet PageHeader IO issue

2021-02-23 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289680#comment-17289680
 ] 

angerszhu edited comment on SPARK-34516 at 2/24/21, 6:34 AM:
-

For this error, I found some related issues:

[https://github.com/trinodb/trino/issues/2256] (not so clear)

https://issues.apache.org/jira/browse/DRILL-3871 (seems to be the same issue, 
caused by the parquet reader's logic)

https://issues.apache.org/jira/browse/PARQUET-400 (looks like it has been fixed 
in the parquet version used by Spark 3.0.1)

 

Checking parquet's code for this part, it just decodes the PageHeader from a data 
stream.

Gentle ping [~lian cheng] [~viirya] [~maxgekk] [~dongjoon]. I am not sure whether 
it is related to Spark's vectorized parquet reader. Can you take a look and give 
some advice?


was (Author: angerszhuuu):
For this error, I found some related issues:

[https://github.com/trinodb/trino/issues/2256] (not so clear)

https://issues.apache.org/jira/browse/DRILL-3871 (seems to be the same issue, 
caused by the parquet reader's logic)

https://issues.apache.org/jira/browse/PARQUET-400 (looks like it has been fixed 
in the parquet version used by Spark 3.0.1)

Gentle ping [~lian cheng] [~viirya] [~maxgekk] [~dongjoon]. I am not sure whether 
it is related to Spark's vectorized parquet reader. Can you take a look and give 
some advice?

> Spark 3.0.1 encounters parquet PageHeader IO issue
> -
>
> Key: SPARK-34516
> URL: https://issues.apache.org/jira/browse/SPARK-34516
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: angerszhu
>Priority: Major
>
> {code:java}
> Caused by: java.io.IOException: can not read class 
> org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' 
> was not found in serialized data! Struct: 
> org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@42a9002d
>   at org.apache.parquet.format.Util.read(Util.java:216)
>   at org.apache.parquet.format.Util.readPageHeader(Util.java:65)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$WorkaroundChunk.readPageHeader(ParquetFileReader.java:1064)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:950)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:807)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:313)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:268)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:491)
>   at 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34516) Spark 3.0.1 encounters parquet PageHeader IO issue

2021-02-23 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289684#comment-17289684
 ] 

angerszhu edited comment on SPARK-34516 at 2/24/21, 6:32 AM:
-

[~maropu] I will sort it out after desensitization and update it later.


was (Author: angerszhuuu):
[~maropu] I will sort it out after desensitization and update it later.

> Spark 3.0.1 encounters parquet PageHeader IO issue
> -
>
> Key: SPARK-34516
> URL: https://issues.apache.org/jira/browse/SPARK-34516
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: angerszhu
>Priority: Major
>
> {code:java}
> Caused by: java.io.IOException: can not read class 
> org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' 
> was not found in serialized data! Struct: 
> org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@42a9002d
>   at org.apache.parquet.format.Util.read(Util.java:216)
>   at org.apache.parquet.format.Util.readPageHeader(Util.java:65)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$WorkaroundChunk.readPageHeader(ParquetFileReader.java:1064)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:950)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:807)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:313)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:268)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:491)
>   at 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34516) Spark 3.0.1 encounters parquet PageHeader IO issue

2021-02-23 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289684#comment-17289684
 ] 

angerszhu commented on SPARK-34516:
---

[~maropu] I will sort it out after desensitization and update it later.

> Spark 3.0.1 encounters parquet PageHeader IO issue
> -
>
> Key: SPARK-34516
> URL: https://issues.apache.org/jira/browse/SPARK-34516
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: angerszhu
>Priority: Major
>
> {code:java}
> Caused by: java.io.IOException: can not read class 
> org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' 
> was not found in serialized data! Struct: 
> org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@42a9002d
>   at org.apache.parquet.format.Util.read(Util.java:216)
>   at org.apache.parquet.format.Util.readPageHeader(Util.java:65)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$WorkaroundChunk.readPageHeader(ParquetFileReader.java:1064)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:950)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:807)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:313)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:268)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:491)
>   at 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34516) Spark 3.0.1 encounters parquet PageHeader IO issue

2021-02-23 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289680#comment-17289680
 ] 

angerszhu commented on SPARK-34516:
---

For this error, I found some related issues:

[https://github.com/trinodb/trino/issues/2256] (not so clear)

https://issues.apache.org/jira/browse/DRILL-3871 (seems to be the same issue, 
caused by the parquet reader's logic)

https://issues.apache.org/jira/browse/PARQUET-400 (looks like it has been fixed 
in the parquet version used by Spark 3.0.1)

Gentle ping [~lian cheng] [~viirya] [~maxgekk] [~dongjoon]. I am not sure whether 
it is related to Spark's vectorized parquet reader. Can you take a look and give 
some advice?

> Spark 3.0.1 encounters parquet PageHeader IO issue
> -
>
> Key: SPARK-34516
> URL: https://issues.apache.org/jira/browse/SPARK-34516
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: angerszhu
>Priority: Major
>
> {code:java}
> Caused by: java.io.IOException: can not read class 
> org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' 
> was not found in serialized data! Struct: 
> org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@42a9002d
>   at org.apache.parquet.format.Util.read(Util.java:216)
>   at org.apache.parquet.format.Util.readPageHeader(Util.java:65)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$WorkaroundChunk.readPageHeader(ParquetFileReader.java:1064)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:950)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:807)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:313)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:268)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:491)
>   at 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34516) Spark 3.0.1 encounters parquet PageHeader IO issue

2021-02-23 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289677#comment-17289677
 ] 

Takeshi Yamamuro commented on SPARK-34516:
--

What's a query to reproduce this issue?

> Spark 3.0.1 encounters parquet PageHeader IO issue
> -
>
> Key: SPARK-34516
> URL: https://issues.apache.org/jira/browse/SPARK-34516
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: angerszhu
>Priority: Major
>
> {code:java}
> Caused by: java.io.IOException: can not read class 
> org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' 
> was not found in serialized data! Struct: 
> org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@42a9002d
>   at org.apache.parquet.format.Util.read(Util.java:216)
>   at org.apache.parquet.format.Util.readPageHeader(Util.java:65)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$WorkaroundChunk.readPageHeader(ParquetFileReader.java:1064)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:950)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:807)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:313)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:268)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:491)
>   at 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34516) Spark 3.0.1 encounters parquet PageHeader IO issue

2021-02-23 Thread angerszhu (Jira)
angerszhu created SPARK-34516:
-

 Summary: Spark 3.0.1 encounters parquet PageHeader IO issue
 Key: SPARK-34516
 URL: https://issues.apache.org/jira/browse/SPARK-34516
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.1
Reporter: angerszhu


{code:java}
Caused by: java.io.IOException: can not read class 
org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' 
was not found in serialized data! Struct: 
org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@42a9002d
at org.apache.parquet.format.Util.read(Util.java:216)
at org.apache.parquet.format.Util.readPageHeader(Util.java:65)
at 
org.apache.parquet.hadoop.ParquetFileReader$WorkaroundChunk.readPageHeader(ParquetFileReader.java:1064)
at 
org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:950)
at 
org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:807)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:313)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:268)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171)
at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
at 
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:491)
at 
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34494) Move data source options from Python and Scala into a single page.

2021-02-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34494:
-
Affects Version/s: (was: 3.0.2)
   3.2.0

> Move data source options from Python and Scala into a single page.
> --
>
> Key: SPARK-34494
> URL: https://issues.apache.org/jira/browse/SPARK-34494
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Refer to https://issues.apache.org/jira/browse/SPARK-34491



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34493) Create "TEXT Files" page for Data Source documents.

2021-02-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34493:
-
Affects Version/s: (was: 3.0.2)
   3.2.0

> Create "TEXT Files" page for Data Source documents.
> ---
>
> Key: SPARK-34493
> URL: https://issues.apache.org/jira/browse/SPARK-34493
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Adding "TEXT Files" page to [Data Sources 
> documents|https://spark.apache.org/docs/latest/sql-data-sources.html#data-sources]
>  which is missing now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34492) Create "CSV Files" page for Data Source documents.

2021-02-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34492:
-
Affects Version/s: (was: 3.0.2)
   3.2.0

> Create "CSV Files" page for Data Source documents.
> --
>
> Key: SPARK-34492
> URL: https://issues.apache.org/jira/browse/SPARK-34492
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Adding "CSV Files" page to [Data Sources 
> documents|https://spark.apache.org/docs/latest/sql-data-sources.html#data-sources]
>  which is missing now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34246) New type coercion syntax rules in ANSI mode

2021-02-23 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-34246.

Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31349
[https://github.com/apache/spark/pull/31349]

> New type coercion syntax rules in ANSI mode
> ---
>
> Key: SPARK-34246
> URL: https://issues.apache.org/jira/browse/SPARK-34246
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> Add new implicit cast syntax rules in ANSI mode.
> In Spark ANSI mode, the type coercion rules are based on the type precedence 
> lists of the input data types. 
> As per the section "Type precedence list determination" of "ISO/IEC 
> 9075-2:2011
> Information technology — Database languages - SQL — Part 2: Foundation 
> (SQL/Foundation)", the type precedence lists of primitive
>  data types are as follows:
> * Byte: Byte, Short, Int, Long, Decimal, Float, Double
> * Short: Short, Int, Long, Decimal, Float, Double
> * Int: Int, Long, Decimal, Float, Double
> * Long: Long, Decimal, Float, Double
> * Decimal: Any wider Numeric type
> * Float: Float, Double
> * Double: Double
> * String: String
> * Date: Date, Timestamp
> * Timestamp: Timestamp
> * Binary: Binary
> * Boolean: Boolean
> * Interval: Interval
> As for complex data types, Spark will determine the precedence list 
> recursively based on their sub-types.
> With the definition of type precedence lists, the general type coercion rules 
> are as follows:
> * Data type S is allowed to be implicitly cast to type T iff T is in the 
> precedence list of S.
> * Comparison is allowed iff the data type precedence lists of both sides have 
> at least one common element. When evaluating the comparison, Spark casts both 
> sides to the tightest common data type of their precedence lists.
> * There should be at least one common data type among all the children's 
> precedence lists for the following operators. The data type of the operator 
> is the tightest common data type of those lists.
> {code:java}
> In
> Except(odd)
> Intersect
> Greatest
> Least
> Union
> If
> CaseWhen
> CreateArray
> Array Concat
> Sequence
> MapConcat
> CreateMap
> {code}
> * For complex types (struct, array, map), Spark recursively looks into the 
> element type and applies the rules above. If the element nullability is 
> converted from true to false, a runtime null check is added to the elements.
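>
> An illustrative sketch (Scala) of the rules above; it assumes ANSI mode is 
> enabled via `spark.sql.ansi.enabled`, and the literal values are arbitrary:
> {code:scala}
> spark.conf.set("spark.sql.ansi.enabled", "true")
> // Long is in the precedence list of Int, so the Int operand is implicitly widened to Long
> spark.sql("SELECT CAST(1 AS INT) + CAST(1 AS BIGINT)").show()
> // Date's precedence list contains Timestamp, so the two sides can be compared;
> // the Date side is cast to Timestamp, the tightest common type of both lists
> spark.sql("SELECT DATE'2021-02-23' < TIMESTAMP'2021-02-23 12:00:00'").show()
> {code}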



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34497) JDBC connection provider is not removing kerberos credentials from JVM security context

2021-02-23 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-34497:
-
Affects Version/s: 3.2.0

> JDBC connection provider is not removing kerberos credentials from JVM 
> security context
> ---
>
> Key: SPARK-34497
> URL: https://issues.apache.org/jira/browse/SPARK-34497
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.0, 3.2.0
>Reporter: Gabor Somogyi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34497) JDBC connection provider is not removing kerberos credentials from JVM security context

2021-02-23 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-34497:
-
Affects Version/s: (was: 3.1.0)
   3.1.2

> JDBC connection provider is not removing kerberos credentials from JVM 
> security context
> ---
>
> Key: SPARK-34497
> URL: https://issues.apache.org/jira/browse/SPARK-34497
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.2.0, 3.1.2
>Reporter: Gabor Somogyi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34497) JDBC connection provider is not removing kerberos credentials from JVM security context

2021-02-23 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289669#comment-17289669
 ] 

Takeshi Yamamuro commented on SPARK-34497:
--

Please fill in the description.

> JDBC connection provider is not removing kerberos credentials from JVM 
> security context
> ---
>
> Key: SPARK-34497
> URL: https://issues.apache.org/jira/browse/SPARK-34497
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34515) Fix NPE if InSet contains null value during getPartitionsByFilter

2021-02-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289667#comment-17289667
 ] 

Apache Spark commented on SPARK-34515:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/31632

> Fix NPE if InSet contains null value during getPartitionsByFilter
> -
>
> Key: SPARK-34515
> URL: https://issues.apache.org/jira/browse/SPARK-34515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Priority: Minor
>
> Spark will convert an InSet filter to `>= and <=` if its number of values 
> exceeds `spark.sql.hive.metastorePartitionPruningInSetThreshold` during 
> partition pruning. In this case, if the values contain a null, we get the 
> following exception:
>  
> {code:java}
> java.lang.NullPointerException
>  at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:1389)
>  at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:50)
>  at 
> scala.math.LowPriorityOrderingImplicits$$anon$3.compare(Ordering.scala:153)
>  at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
>  at java.util.TimSort.sort(TimSort.java:220)
>  at java.util.Arrays.sort(Arrays.java:1438)
>  at scala.collection.SeqLike.sorted(SeqLike.scala:659)
>  at scala.collection.SeqLike.sorted$(SeqLike.scala:647)
>  at scala.collection.AbstractSeq.sorted(Seq.scala:45)
>  at org.apache.spark.sql.hive.client.Shim_v0_13.convert$1(HiveShim.scala:772)
>  at 
> org.apache.spark.sql.hive.client.Shim_v0_13.$anonfun$convertFilters$4(HiveShim.scala:826)
>  at scala.collection.immutable.Stream.flatMap(Stream.scala:489)
>  at 
> org.apache.spark.sql.hive.client.Shim_v0_13.convertFilters(HiveShim.scala:826)
>  at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:848)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitionsByFilter$1(HiveClientImpl.scala:750)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34515) Fix NPE if InSet contains null value during getPartitionsByFilter

2021-02-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34515:


Assignee: (was: Apache Spark)

> Fix NPE if InSet contains null value during getPartitionsByFilter
> -
>
> Key: SPARK-34515
> URL: https://issues.apache.org/jira/browse/SPARK-34515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Priority: Minor
>
> Spark will convert an InSet filter to `>= and <=` if its number of values 
> exceeds `spark.sql.hive.metastorePartitionPruningInSetThreshold` during 
> partition pruning. In this case, if the values contain a null, we get the 
> following exception:
>  
> {code:java}
> java.lang.NullPointerException
>  at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:1389)
>  at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:50)
>  at 
> scala.math.LowPriorityOrderingImplicits$$anon$3.compare(Ordering.scala:153)
>  at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
>  at java.util.TimSort.sort(TimSort.java:220)
>  at java.util.Arrays.sort(Arrays.java:1438)
>  at scala.collection.SeqLike.sorted(SeqLike.scala:659)
>  at scala.collection.SeqLike.sorted$(SeqLike.scala:647)
>  at scala.collection.AbstractSeq.sorted(Seq.scala:45)
>  at org.apache.spark.sql.hive.client.Shim_v0_13.convert$1(HiveShim.scala:772)
>  at 
> org.apache.spark.sql.hive.client.Shim_v0_13.$anonfun$convertFilters$4(HiveShim.scala:826)
>  at scala.collection.immutable.Stream.flatMap(Stream.scala:489)
>  at 
> org.apache.spark.sql.hive.client.Shim_v0_13.convertFilters(HiveShim.scala:826)
>  at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:848)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitionsByFilter$1(HiveClientImpl.scala:750)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34515) Fix NPE if InSet contains null value during getPartitionsByFilter

2021-02-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34515:


Assignee: Apache Spark

> Fix NPE if InSet contains null value during getPartitionsByFilter
> -
>
> Key: SPARK-34515
> URL: https://issues.apache.org/jira/browse/SPARK-34515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Assignee: Apache Spark
>Priority: Minor
>
> Spark will convert an InSet filter to `>= and <=` if its number of values 
> exceeds `spark.sql.hive.metastorePartitionPruningInSetThreshold` during 
> partition pruning. In this case, if the values contain a null, we get the 
> following exception:
>  
> {code:java}
> java.lang.NullPointerException
>  at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:1389)
>  at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:50)
>  at 
> scala.math.LowPriorityOrderingImplicits$$anon$3.compare(Ordering.scala:153)
>  at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
>  at java.util.TimSort.sort(TimSort.java:220)
>  at java.util.Arrays.sort(Arrays.java:1438)
>  at scala.collection.SeqLike.sorted(SeqLike.scala:659)
>  at scala.collection.SeqLike.sorted$(SeqLike.scala:647)
>  at scala.collection.AbstractSeq.sorted(Seq.scala:45)
>  at org.apache.spark.sql.hive.client.Shim_v0_13.convert$1(HiveShim.scala:772)
>  at 
> org.apache.spark.sql.hive.client.Shim_v0_13.$anonfun$convertFilters$4(HiveShim.scala:826)
>  at scala.collection.immutable.Stream.flatMap(Stream.scala:489)
>  at 
> org.apache.spark.sql.hive.client.Shim_v0_13.convertFilters(HiveShim.scala:826)
>  at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:848)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitionsByFilter$1(HiveClientImpl.scala:750)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34515) Fix NPE if InSet contains null value during getPartitionsByFilter

2021-02-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289665#comment-17289665
 ] 

Apache Spark commented on SPARK-34515:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/31632

> Fix NPE if InSet contains null value during getPartitionsByFilter
> -
>
> Key: SPARK-34515
> URL: https://issues.apache.org/jira/browse/SPARK-34515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Priority: Minor
>
> Spark will convert an InSet filter to `>= and <=` if its number of values 
> exceeds `spark.sql.hive.metastorePartitionPruningInSetThreshold` during 
> partition pruning. In this case, if the values contain a null, we get the 
> following exception:
>  
> {code:java}
> java.lang.NullPointerException
>  at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:1389)
>  at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:50)
>  at 
> scala.math.LowPriorityOrderingImplicits$$anon$3.compare(Ordering.scala:153)
>  at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
>  at java.util.TimSort.sort(TimSort.java:220)
>  at java.util.Arrays.sort(Arrays.java:1438)
>  at scala.collection.SeqLike.sorted(SeqLike.scala:659)
>  at scala.collection.SeqLike.sorted$(SeqLike.scala:647)
>  at scala.collection.AbstractSeq.sorted(Seq.scala:45)
>  at org.apache.spark.sql.hive.client.Shim_v0_13.convert$1(HiveShim.scala:772)
>  at 
> org.apache.spark.sql.hive.client.Shim_v0_13.$anonfun$convertFilters$4(HiveShim.scala:826)
>  at scala.collection.immutable.Stream.flatMap(Stream.scala:489)
>  at 
> org.apache.spark.sql.hive.client.Shim_v0_13.convertFilters(HiveShim.scala:826)
>  at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:848)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitionsByFilter$1(HiveClientImpl.scala:750)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34290) Support v2 TRUNCATE TABLE

2021-02-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-34290.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31605
[https://github.com/apache/spark/pull/31605]

> Support v2 TRUNCATE TABLE
> -
>
> Key: SPARK-34290
> URL: https://issues.apache.org/jira/browse/SPARK-34290
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> We need to implement TRUNCATE TABLE for DSv2 tables, similarly to the v1 
> implementation.
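>
> A usage sketch of the target syntax (Scala); `testcat` is a hypothetical v2 
> catalog, assumed to be registered via `spark.sql.catalog.testcat`:
> {code:scala}
> // truncate a DSv2 table addressed by a multi-part identifier (catalog.namespace.table)
> spark.sql("TRUNCATE TABLE testcat.ns.tbl")
> {code}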



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34290) Support v2 TRUNCATE TABLE

2021-02-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-34290:
---

Assignee: Maxim Gekk

> Support v2 TRUNCATE TABLE
> -
>
> Key: SPARK-34290
> URL: https://issues.apache.org/jira/browse/SPARK-34290
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> We need to implement TRUNCATE TABLE for DSv2 tables, similarly to the v1 
> implementation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34245) Master may not remove the finished executor when Worker fails to send ExecutorStateChanged

2021-02-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-34245.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31348
[https://github.com/apache/spark/pull/31348]

> Master may not remove the finished executor when Worker fails to send 
> ExecutorStateChanged
> --
>
> Key: SPARK-34245
> URL: https://issues.apache.org/jira/browse/SPARK-34245
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Core
>Affects Versions: 2.4.7, 3.0.1, 3.2.0, 3.1.1
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.2.0
>
>
> If the Worker fails to send ExecutorStateChanged to the Master due to some 
> error, e.g., a temporary network error, then the Master can't remove the 
> finished executor normally and thinks the executor is still alive. In the 
> worst case, if that executor is the only executor for the application, the 
> application can hang.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34245) Master may not remove the finished executor when Worker fails to send ExecutorStateChanged

2021-02-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-34245:
---

Assignee: wuyi

> Master may not remove the finished executor when Worker fails to send 
> ExecutorStateChanged
> --
>
> Key: SPARK-34245
> URL: https://issues.apache.org/jira/browse/SPARK-34245
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Core
>Affects Versions: 2.4.7, 3.0.1, 3.2.0, 3.1.1
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>
> If the Worker fails to send ExecutorStateChanged to the Master due to some 
> error, e.g., a temporary network error, then the Master can't remove the 
> finished executor normally and thinks the executor is still alive. In the 
> worst case, if that executor is the only executor for the application, the 
> application can hang.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34515) Fix NPE if InSet contains null value during getPartitionsByFilter

2021-02-23 Thread ulysses you (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ulysses you updated SPARK-34515:

Description: 
Spark will convert an InSet filter to `>= and <=` if its number of values exceeds 
`spark.sql.hive.metastorePartitionPruningInSetThreshold` during partition pruning. 
In this case, if the values contain a null, we get the following exception:

 
{code:java}
java.lang.NullPointerException
 at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:1389)
 at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:50)
 at scala.math.LowPriorityOrderingImplicits$$anon$3.compare(Ordering.scala:153)
 at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
 at java.util.TimSort.sort(TimSort.java:220)
 at java.util.Arrays.sort(Arrays.java:1438)
 at scala.collection.SeqLike.sorted(SeqLike.scala:659)
 at scala.collection.SeqLike.sorted$(SeqLike.scala:647)
 at scala.collection.AbstractSeq.sorted(Seq.scala:45)
 at org.apache.spark.sql.hive.client.Shim_v0_13.convert$1(HiveShim.scala:772)
 at 
org.apache.spark.sql.hive.client.Shim_v0_13.$anonfun$convertFilters$4(HiveShim.scala:826)
 at scala.collection.immutable.Stream.flatMap(Stream.scala:489)
 at 
org.apache.spark.sql.hive.client.Shim_v0_13.convertFilters(HiveShim.scala:826)
 at 
org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:848)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitionsByFilter$1(HiveClientImpl.scala:750)
{code}

  was:
Spark will convert an InSet filter to `>= and <=` if its number of values exceeds 
`spark.sql.hive.metastorePartitionPruningInSetThreshold`. In this case, if the 
values contain a null, we get the following exception:

 
{code:java}
java.lang.NullPointerException
 at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:1389)
 at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:50)
 at scala.math.LowPriorityOrderingImplicits$$anon$3.compare(Ordering.scala:153)
 at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
 at java.util.TimSort.sort(TimSort.java:220)
 at java.util.Arrays.sort(Arrays.java:1438)
 at scala.collection.SeqLike.sorted(SeqLike.scala:659)
 at scala.collection.SeqLike.sorted$(SeqLike.scala:647)
 at scala.collection.AbstractSeq.sorted(Seq.scala:45)
 at org.apache.spark.sql.hive.client.Shim_v0_13.convert$1(HiveShim.scala:772)
 at 
org.apache.spark.sql.hive.client.Shim_v0_13.$anonfun$convertFilters$4(HiveShim.scala:826)
 at scala.collection.immutable.Stream.flatMap(Stream.scala:489)
 at 
org.apache.spark.sql.hive.client.Shim_v0_13.convertFilters(HiveShim.scala:826)
 at 
org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:848)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitionsByFilter$1(HiveClientImpl.scala:750)
{code}


> Fix NPE if InSet contains null value during getPartitionsByFilter
> -
>
> Key: SPARK-34515
> URL: https://issues.apache.org/jira/browse/SPARK-34515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Priority: Minor
>
> Spark will convert an InSet filter to `>= and <=` if its number of values 
> exceeds `spark.sql.hive.metastorePartitionPruningInSetThreshold` during 
> partition pruning. In this case, if the values contain a null, we get the 
> following exception:
>  
> {code:java}
> java.lang.NullPointerException
>  at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:1389)
>  at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:50)
>  at 
> scala.math.LowPriorityOrderingImplicits$$anon$3.compare(Ordering.scala:153)
>  at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
>  at java.util.TimSort.sort(TimSort.java:220)
>  at java.util.Arrays.sort(Arrays.java:1438)
>  at scala.collection.SeqLike.sorted(SeqLike.scala:659)
>  at scala.collection.SeqLike.sorted$(SeqLike.scala:647)
>  at scala.collection.AbstractSeq.sorted(Seq.scala:45)
>  at org.apache.spark.sql.hive.client.Shim_v0_13.convert$1(HiveShim.scala:772)
>  at 
> org.apache.spark.sql.hive.client.Shim_v0_13.$anonfun$convertFilters$4(HiveShim.scala:826)
>  at scala.collection.immutable.Stream.flatMap(Stream.scala:489)
>  at 
> org.apache.spark.sql.hive.client.Shim_v0_13.convertFilters(HiveShim.scala:826)
>  at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:848)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitionsByFilter$1(HiveClientImpl.scala:750)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33504) The application log in the Spark history server contains sensitive attributes such as passwords that should be redacted instead of shown in plain text

2021-02-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289653#comment-17289653
 ] 

Apache Spark commented on SPARK-33504:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/31631

> The application log in the Spark history server contains sensitive attributes 
> such as passwords that should be redacted instead of shown in plain text
> ---
>
> Key: SPARK-33504
> URL: https://issues.apache.org/jira/browse/SPARK-33504
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
> Environment: Spark 3.0.1
>Reporter: akiyamaneko
>Assignee: akiyamaneko
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: SparkListenerEnvironmentUpdate log shows ok.png, 
> SparkListenerStageSubmitted-log-wrong.png, SparkListernerJobStart-wrong.png
>
>
> We found that the sensitive attributes in SparkListenerJobStart and 
> SparkListenerStageSubmitted events are not redacted, so sensitive 
> attributes can be viewed directly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33504) The application log in the Spark history server contains sensitive attributes such as passwords that should be redacted instead of shown in plain text

2021-02-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289654#comment-17289654
 ] 

Apache Spark commented on SPARK-33504:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/31631

> The application log in the Spark history server contains sensitive attributes 
> such as password that should be redacted instead of plain text
> ---
>
> Key: SPARK-33504
> URL: https://issues.apache.org/jira/browse/SPARK-33504
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
> Environment: Spark 3.0.1
>Reporter: akiyamaneko
>Assignee: akiyamaneko
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: SparkListenerEnvironmentUpdate log shows ok.png, 
> SparkListenerStageSubmitted-log-wrong.png, SparkListernerJobStart-wrong.png
>
>
> We found that the secure attributes in SparkListenerJobStart and 
> SparkListenerStageSubmitted events were not redacted, so sensitive 
> attributes could be viewed directly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34515) Fix NPE if InSet contains null value during getPartitionsByFilter

2021-02-23 Thread ulysses you (Jira)
ulysses you created SPARK-34515:
---

 Summary: Fix NPE if InSet contains null value during 
getPartitionsByFilter
 Key: SPARK-34515
 URL: https://issues.apache.org/jira/browse/SPARK-34515
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: ulysses you


Spark will convert an InSet to a `>= and <=` range filter if its number of values 
exceeds `spark.sql.hive.metastorePartitionPruningInSetThreshold`. In this case, if 
the values contain a null, we will get the following exception 

 
{code:java}
java.lang.NullPointerException
 at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:1389)
 at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:50)
 at scala.math.LowPriorityOrderingImplicits$$anon$3.compare(Ordering.scala:153)
 at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
 at java.util.TimSort.sort(TimSort.java:220)
 at java.util.Arrays.sort(Arrays.java:1438)
 at scala.collection.SeqLike.sorted(SeqLike.scala:659)
 at scala.collection.SeqLike.sorted$(SeqLike.scala:647)
 at scala.collection.AbstractSeq.sorted(Seq.scala:45)
 at org.apache.spark.sql.hive.client.Shim_v0_13.convert$1(HiveShim.scala:772)
 at 
org.apache.spark.sql.hive.client.Shim_v0_13.$anonfun$convertFilters$4(HiveShim.scala:826)
 at scala.collection.immutable.Stream.flatMap(Stream.scala:489)
 at 
org.apache.spark.sql.hive.client.Shim_v0_13.convertFilters(HiveShim.scala:826)
 at 
org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:848)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitionsByFilter$1(HiveClientImpl.scala:750)
{code}
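For illustration, a hedged sketch in plain Scala (not the actual HiveShim code; the values and the partition column name are made up) of why sorting InSet values that contain a null fails, and of the simple guard of dropping nulls before deriving the `>= and <=` bounds:
{code:scala}
// Sorting a value list that contains null throws an NPE inside String.compareTo,
// which is what the stack trace above shows.
val values: Seq[String] = Seq("2021-01-01", null, "2021-01-03")

// values.sorted would fail here with a NullPointerException.

// Guard: drop nulls first (a null value can never match an equality-based range filter anyway),
// then it is safe to sort and derive the pushed-down bounds.
val nonNull = values.filter(_ != null).sorted
val lower = nonNull.head   // "2021-01-01"
val upper = nonNull.last   // "2021-01-03"
val pushedFilter = s"(part >= '$lower' and part <= '$upper')"   // 'part' is a hypothetical partition column
{code}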



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile

2021-02-23 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289652#comment-17289652
 ] 

Chao Sun commented on SPARK-33212:
--

Thanks for the details [~ouyangxc.zte]!

{quote}
Get AMIpFilter ClassNotFoundException , because there is no 
'hadoop-client-minicluster.jar' in classpath
{quote}
This is interesting. The {{hadoop-client-minicluster.jar}} should only be used 
in tests - curious why it is needed here. Could you share the stack trace for the 
{{ClassNotFoundException}}?

{quote}
2021-02-24 08:36:54,391 ERROR org.apache.spark.SparkContext: Error initializing 
SparkContext.
java.lang.IllegalStateException: class 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter is not a 
javax.servlet.Filter
{quote}
Could you also share the stacktraces for this exception?

And to confirm, you are using {{client}} as the deploy mode, is that correct? 
I'll try to reproduce this in my local environment.
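As a side note, here is a small diagnostic sketch (illustration only, not from either report) that prints which jars `javax.servlet.Filter` and the YARN `AmIpFilter` are loaded from, and whether the assignability check passes; this usually exposes shaded-vs-unshaded servlet API conflicts:
{code:scala}
// Run on the driver classpath in question. Both class names appear in the reports above;
// the only assumption is that they are resolvable on this classpath.
val filterCls  = Class.forName("javax.servlet.Filter")
val amIpFilter = Class.forName("org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter")

def jarOf(c: Class[_]): String =
  Option(c.getProtectionDomain.getCodeSource).map(_.getLocation.toString).getOrElse("<bootstrap/unknown>")

println(s"javax.servlet.Filter loaded from: ${jarOf(filterCls)}")
println(s"AmIpFilter loaded from:           ${jarOf(amIpFilter)}")
// If AmIpFilter only implements the relocated org.apache.hadoop.shaded.javax.servlet.Filter,
// this prints false, matching the "is not a javax.servlet.Filter" error discussed here.
println(s"AmIpFilter is a javax.servlet.Filter: ${filterCls.isAssignableFrom(amIpFilter)}")
{code}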


> Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
> -
>
> Key: SPARK-33212
> URL: https://issues.apache.org/jira/browse/SPARK-33212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit, SQL, YARN
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.2.0
>
>
> Hadoop 3.x+ offers shaded client jars: hadoop-client-api and 
> hadoop-client-runtime, which shade 3rd party dependencies such as Guava, 
> protobuf, jetty etc. This Jira switches Spark to use these jars instead of 
> hadoop-common, hadoop-client etc. Benefits include:
>  * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer 
> versions of Hadoop have migrated to Guava 27.0+ and in order to resolve Guava 
> conflicts, Spark depends on Hadoop not leaking dependencies.
>  * It makes Spark/Hadoop dependency cleaner. Currently Spark uses both 
> client-side and server-side Hadoop APIs from modules such as hadoop-common, 
> hadoop-yarn-server-common etc. Moving to hadoop-client-api allows us to only 
> use the public/client API from the Hadoop side.
>  * Provides a better isolation from Hadoop dependencies. In future Spark can 
> better evolve without worrying about dependencies pulled from Hadoop side 
> (which used to be a lot).
> *There are some behavior changes introduced with this JIRA, when people use 
> Spark compiled with Hadoop 3.x:*
> - Users now need to make sure class path contains `hadoop-client-api` and 
> `hadoop-client-runtime` jars when they deploy Spark with the 
> `hadoop-provided` option. In addition, it is highly recommended that they put 
> these two jars before other Hadoop jars in the class path. Otherwise, 
> conflicts such as from Guava could happen if classes are loaded from the 
> other non-shaded Hadoop jars.
> - Since the new shaded Hadoop clients no longer include 3rd party 
> dependencies, users who used to depend on these now need to explicitly put 
> the jars in their class path.
> Ideally the above should go to release notes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34488) Support task Metrics Distributions and executor Metrics Distributions in the REST API call for a specified stage

2021-02-23 Thread Ron Hu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289646#comment-17289646
 ] 

Ron Hu commented on SPARK-34488:


It should be noted that this Jira  addresses query parameter withSummaries for 
a specific stage.  In another Jira 
https://issues.apache.org/jira/browse/SPARK-26399, we address query parameter 
withSummaries for overall stages.

> Support task Metrics Distributions and executor Metrics Distributions in the 
> REST API call for a specified stage
> 
>
> Key: SPARK-34488
> URL: https://issues.apache.org/jira/browse/SPARK-34488
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.2
>Reporter: Ron Hu
>Priority: Major
> Attachments: executorMetricsDistributions.json, 
> taskMetricsDistributions.json
>
>
> For a specific stage, it is useful to show the task metrics in percentile 
> distribution.  This information can help users know whether or not there is a 
> skew/bottleneck among tasks in a given stage.  We list an example in 
> [^taskMetricsDistributions.json]
> Similarly, it is useful to show the executor metrics in percentile 
> distribution for a specific stage. This information can show whether or not 
> there is a skewed load on some executors.  We list an example in 
> [^executorMetricsDistributions.json]
>  
> We define withSummaries query parameter in the REST API for a specific stage 
> as:
> applications///?withSummaries=[true|false]
> When withSummaries=true, both task metrics in percentile distribution and 
> executor metrics in percentile distribution are included in the REST API 
> output.  The default value of withSummaries is false, i.e. no metrics 
> percentile distribution will be included in the REST API output.
>  
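A hedged usage sketch of the proposed parameter follows. The history-server address, application id, and stage id below are placeholders, not values from this ticket:
{code:scala}
// Fetch the per-stage metric distributions once withSummaries is available.
import scala.io.Source

val historyServer = "http://localhost:18080"            // assumption: default history server address
val appId   = "app-20210223000000-0001"                 // hypothetical application id
val stageId = 3                                         // hypothetical stage id

val url  = s"$historyServer/api/v1/applications/$appId/stages/$stageId?withSummaries=true"
val json = Source.fromURL(url).mkString                 // raw JSON, including the percentile blocks
println(json.take(500))
{code}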



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34092) support filtering by task status in REST API call for a specific stage

2021-02-23 Thread Ron Hu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289644#comment-17289644
 ] 

Ron Hu commented on SPARK-34092:


It should be noted that this Jira  addresses query parameter taskStatus for a 
specific stage.  In another Jira 
https://issues.apache.org/jira/browse/SPARK-26399, we address query parameter 
taskStatus for overall stages.

> support filtering by task status in REST API call for a specific stage
> --
>
> Key: SPARK-34092
> URL: https://issues.apache.org/jira/browse/SPARK-34092
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> Query parameter taskStatus can be used to filter the tasks meeting a specific 
> status in the REST API call for a given stage.   We want to support the 
> following REST API calls: 
> applications///stages/?details=true&taskStatus=[RUNNING|SUCCESS|FAILED|KILLED|PENDING]
> applications///stages//?details=true&taskStatus=[RUNNING|SUCCESS|FAILED|KILLED|PENDING]
> Need to set details=true in order to drill down to the task level within a 
> specified stage.
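A hedged usage sketch of the proposed filter (the ids and the history-server address are again placeholders):
{code:scala}
// List only the FAILED tasks of one stage; details=true is required to include task-level data.
import scala.io.Source

val historyServer = "http://localhost:18080"            // assumption: default history server address
val appId = "app-20210223000000-0001"                   // hypothetical application id
val url = s"$historyServer/api/v1/applications/$appId/stages/3?details=true&taskStatus=FAILED"
println(Source.fromURL(url).mkString.take(500))
{code}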



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34507) Spark artefacts built against Scala 2.13 incorrectly depend on Scala 2.12

2021-02-23 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289638#comment-17289638
 ] 

Yang Jie commented on SPARK-34507:
--

It seems that the 
[https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-parent_2.13/3.2.0-SNAPSHOT/spark-parent_2.13-3.2.0-20210223.010629-29.pom]
 is the original pom.xml, not the effective pom.xml.

> Spark artefacts built against Scala 2.13 incorrectly depend on Scala 2.12
> -
>
> Key: SPARK-34507
> URL: https://issues.apache.org/jira/browse/SPARK-34507
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Guillaume Martres
>Priority: Major
>
> Snapshots of Spark 3.2 built against Scala 2.13 are available at 
> [https://repository.apache.org/content/repositories/snapshots/org/apache/spark/,]
>  but they seem to depend on Scala 2.12. Specifically if I look at 
> [https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-parent_2.13/3.2.0-SNAPSHOT/spark-parent_2.13-3.2.0-20210223.010629-29.pom]
>  I see:
> {code:java}
> <scala.version>2.12.10</scala.version>
> <scala.binary.version>2.13</scala.binary.version>
> {code}
> It looks like 
> [https://github.com/apache/spark/blob/8f994cbb4a18558c2e81516ef1e339d9c8fa0d41/dev/change-scala-version.sh#L65]
>  needs to be updated to also change the `scala.version` and not just the 
> `scala.binary.version`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34507) Spark artefacts built against Scala 2.13 incorrectly depend on Scala 2.12

2021-02-23 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289637#comment-17289637
 ] 

Yang Jie commented on SPARK-34507:
--

I think the scala-2.13 profile should override this property:
{code:java}
<profile>
  <id>scala-2.13</id>
  <properties>
    <scala.version>2.13.4</scala.version>
    <scala.binary.version>2.13</scala.binary.version>
  </properties>
  ...
</profile>
{code}

> Spark artefacts built against Scala 2.13 incorrectly depend on Scala 2.12
> -
>
> Key: SPARK-34507
> URL: https://issues.apache.org/jira/browse/SPARK-34507
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Guillaume Martres
>Priority: Major
>
> Snapshots of Spark 3.2 built against Scala 2.13 are available at 
> [https://repository.apache.org/content/repositories/snapshots/org/apache/spark/,]
>  but they seem to depend on Scala 2.12. Specifically if I look at 
> [https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-parent_2.13/3.2.0-SNAPSHOT/spark-parent_2.13-3.2.0-20210223.010629-29.pom]
>  I see:
> {code:java}
> <scala.version>2.12.10</scala.version>
> <scala.binary.version>2.13</scala.binary.version>
> {code}
> It looks like 
> [https://github.com/apache/spark/blob/8f994cbb4a18558c2e81516ef1e339d9c8fa0d41/dev/change-scala-version.sh#L65]
>  needs to be updated to also change the `scala.version` and not just the 
> `scala.binary.version`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34514) Push down limit for LEFT SEMI and LEFT ANTI join

2021-02-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34514:


Assignee: Apache Spark

> Push down limit for LEFT SEMI and LEFT ANTI join
> 
>
> Key: SPARK-34514
> URL: https://issues.apache.org/jira/browse/SPARK-34514
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Apache Spark
>Priority: Trivial
>
> I found out during code review of [https://github.com/apache/spark/pull/31567] 
> (see [https://github.com/apache/spark/pull/31567#discussion_r577379572]) that we 
> can push down the limit to the left side of a LEFT SEMI or LEFT ANTI join if the 
> join condition is empty.
> Why it's safe to push down limit:
> The semantics of LEFT SEMI join without condition:
> (1). if right side is non-empty, output all rows from left side.
> (2). if right side is empty, output nothing.
>  
> The semantics of LEFT ANTI join without condition:
> (1). if right side is non-empty, output nothing.
> (2). if right side is empty, output all rows from left side.
>  
> With the semantics of output all rows from left side or nothing (all or 
> nothing), it's safe to push down limit to left side.
> NOTE: LEFT SEMI / LEFT ANTI join with non-empty condition is not safe for 
> limit push down, because output can be a portion of left side rows.
>  
> Physical operator for LEFT SEMI / LEFT ANTI join without condition - 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala#L200-L204]
>  .



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34514) Push down limit for LEFT SEMI and LEFT ANTI join

2021-02-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289625#comment-17289625
 ] 

Apache Spark commented on SPARK-34514:
--

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/31630

> Push down limit for LEFT SEMI and LEFT ANTI join
> 
>
> Key: SPARK-34514
> URL: https://issues.apache.org/jira/browse/SPARK-34514
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Trivial
>
> I found out during code review of [https://github.com/apache/spark/pull/31567] 
> (see [https://github.com/apache/spark/pull/31567#discussion_r577379572]) that we 
> can push down the limit to the left side of a LEFT SEMI or LEFT ANTI join if the 
> join condition is empty.
> Why it's safe to push down limit:
> The semantics of LEFT SEMI join without condition:
> (1). if right side is non-empty, output all rows from left side.
> (2). if right side is empty, output nothing.
>  
> The semantics of LEFT ANTI join without condition:
> (1). if right side is non-empty, output nothing.
> (2). if right side is empty, output all rows from left side.
>  
> With the semantics of output all rows from left side or nothing (all or 
> nothing), it's safe to push down limit to left side.
> NOTE: LEFT SEMI / LEFT ANTI join with non-empty condition is not safe for 
> limit push down, because output can be a portion of left side rows.
>  
> Physical operator for LEFT SEMI / LEFT ANTI join without condition - 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala#L200-L204]
>  .



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34514) Push down limit for LEFT SEMI and LEFT ANTI join

2021-02-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34514:


Assignee: Apache Spark

> Push down limit for LEFT SEMI and LEFT ANTI join
> 
>
> Key: SPARK-34514
> URL: https://issues.apache.org/jira/browse/SPARK-34514
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Apache Spark
>Priority: Trivial
>
> I found out during code review of [https://github.com/apache/spark/pull/31567] 
> (see [https://github.com/apache/spark/pull/31567#discussion_r577379572]) that we 
> can push down the limit to the left side of a LEFT SEMI or LEFT ANTI join if the 
> join condition is empty.
> Why it's safe to push down limit:
> The semantics of LEFT SEMI join without condition:
> (1). if right side is non-empty, output all rows from left side.
> (2). if right side is empty, output nothing.
>  
> The semantics of LEFT ANTI join without condition:
> (1). if right side is non-empty, output nothing.
> (2). if right side is empty, output all rows from left side.
>  
> With the semantics of output all rows from left side or nothing (all or 
> nothing), it's safe to push down limit to left side.
> NOTE: LEFT SEMI / LEFT ANTI join with non-empty condition is not safe for 
> limit push down, because output can be a portion of left side rows.
>  
> Physical operator for LEFT SEMI / LEFT ANTI join without condition - 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala#L200-L204]
>  .



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34514) Push down limit for LEFT SEMI and LEFT ANTI join

2021-02-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34514:


Assignee: (was: Apache Spark)

> Push down limit for LEFT SEMI and LEFT ANTI join
> 
>
> Key: SPARK-34514
> URL: https://issues.apache.org/jira/browse/SPARK-34514
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Trivial
>
> I found out during code review of [https://github.com/apache/spark/pull/31567] 
> (see [https://github.com/apache/spark/pull/31567#discussion_r577379572]) that we 
> can push down the limit to the left side of a LEFT SEMI or LEFT ANTI join if the 
> join condition is empty.
> Why it's safe to push down limit:
> The semantics of LEFT SEMI join without condition:
> (1). if right side is non-empty, output all rows from left side.
> (2). if right side is empty, output nothing.
>  
> The semantics of LEFT ANTI join without condition:
> (1). if right side is non-empty, output nothing.
> (2). if right side is empty, output all rows from left side.
>  
> With the semantics of output all rows from left side or nothing (all or 
> nothing), it's safe to push down limit to left side.
> NOTE: LEFT SEMI / LEFT ANTI join with non-empty condition is not safe for 
> limit push down, because output can be a portion of left side rows.
>  
> Physical operator for LEFT SEMI / LEFT ANTI join without condition - 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala#L200-L204]
>  .



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile

2021-02-23 Thread Xiaochen Ouyang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289618#comment-17289618
 ] 

Xiaochen Ouyang edited comment on SPARK-33212 at 2/24/21, 3:33 AM:
---

Hi [~csun], we submit a spark application with command `spark-submit  --master 
yarn --class org.apache.spark.examples.SparkPi  
/opt/spark/examples/jars/spark*.jar`.

1. Got an AMIpFilter ClassNotFoundException, because there is no 
'hadoop-client-minicluster.jar' in the classpath. So we removed the 
{color:#de350b}_'test'_ line in the parent pom.xml and 
resource-manager/yarn/pom.xml.{color}

2. Rebuilt the Spark project, deployed the binary jars, and submitted the application

3. Got a new exception as follows:

+2021-02-24 08:36:54,391 ERROR org.apache.spark.SparkContext: Error 
initializing SparkContext.
 java.lang.IllegalStateException: class 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter is not a 
javax.servlet.Filter+

 

The key reason is that the Spark driver classloader loads the `AmIpFilter` class 
expecting it to implement javax.servlet.Filter, but in the shaded jar the `Filter` 
class is imported as 'import 
+{color:#de350b}org.apache.hadoop.shaded.javax.servlet.Filter{color}+'. So 
AmIpFilter can't be used via reflection in the Spark driver process.

 

 


was (Author: ouyangxc.zte):
Hi [~csun], we submit a spark application with command `spark-submit  --master 
yarn --class org.apache.spark.examples.SparkPi  
/opt/spark/examples/jars/spark*.jar`.

1、Get AMIpFilter ClassNotFoundException , because there is no 
'hadoop-client-minicluster.jar' in classpath. So we remove the line 
{color:#de350b}_'test'_ in parent pom.xml and 
resource-manager/yarn/pom.xml.{color}

2、Rebuild spark project 、depoly binary jars and submit application

3、Get a new Exception as follows:

+2021-02-24 08:36:54,391 ERROR org.apache.spark.SparkContext: Error 
initializing SparkContext.
 java.lang.IllegalStateException: class 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter is not a 
javax.servlet.Filter+

 

The key reason is that spark dirver classloader load class `AmIpFilter` 
implements javax.servlet.Filter, but in shaded jar the class `Filter` imported 
like 'import 
+{color:#de350b}org.apache.hadoop.shaded.javax.servlet.Filter{color}+'. So, 
AmIpFilter can't be reflected in spark dirver process.

 

 

> Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
> -
>
> Key: SPARK-33212
> URL: https://issues.apache.org/jira/browse/SPARK-33212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit, SQL, YARN
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.2.0
>
>
> Hadoop 3.x+ offers shaded client jars: hadoop-client-api and 
> hadoop-client-runtime, which shade 3rd party dependencies such as Guava, 
> protobuf, jetty etc. This Jira switches Spark to use these jars instead of 
> hadoop-common, hadoop-client etc. Benefits include:
>  * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer 
> versions of Hadoop have migrated to Guava 27.0+ and in order to resolve Guava 
> conflicts, Spark depends on Hadoop not leaking dependencies.
>  * It makes Spark/Hadoop dependency cleaner. Currently Spark uses both 
> client-side and server-side Hadoop APIs from modules such as hadoop-common, 
> hadoop-yarn-server-common etc. Moving to hadoop-client-api allows us to only 
> use the public/client API from the Hadoop side.
>  * Provides a better isolation from Hadoop dependencies. In future Spark can 
> better evolve without worrying about dependencies pulled from Hadoop side 
> (which used to be a lot).
> *There are some behavior changes introduced with this JIRA, when people use 
> Spark compiled with Hadoop 3.x:*
> - Users now need to make sure class path contains `hadoop-client-api` and 
> `hadoop-client-runtime` jars when they deploy Spark with the 
> `hadoop-provided` option. In addition, it is highly recommended that they put 
> these two jars before other Hadoop jars in the class path. Otherwise, 
> conflicts such as from Guava could happen if classes are loaded from the 
> other non-shaded Hadoop jars.
> - Since the new shaded Hadoop clients no longer include 3rd party 
> dependencies. Users who used to depend on these now need to explicitly put 
> the jars in their class path.
> Ideally the above should go to release notes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile

2021-02-23 Thread Xiaochen Ouyang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289618#comment-17289618
 ] 

Xiaochen Ouyang edited comment on SPARK-33212 at 2/24/21, 3:29 AM:
---

Hi [~csun], we submit a spark application with command `spark-submit  --master 
yarn --class org.apache.spark.examples.SparkPi  
/opt/spark/examples/jars/spark*.jar`.

1、Get AMIpFilter ClassNotFoundException , because there is no 
'hadoop-client-minicluster.jar' in classpath. So we remove the line 
{color:#de350b}_'test'_ in parent pom.xml and 
resource-manager/yarn/pom.xml.{color}

2、Rebuild spark project 、depoly binary jars and submit application

3、Get a new Exception as follows:

+2021-02-24 08:36:54,391 ERROR org.apache.spark.SparkContext: Error 
initializing SparkContext.
 java.lang.IllegalStateException: class 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter is not a 
javax.servlet.Filter+

 

The key reason is that spark dirver classloader load class `AmIpFilter` 
implements javax.servlet.Filter, but in shaded jar the class `Filter` imported 
like 'import 
+{color:#de350b}org.apache.hadoop.shaded.javax.servlet.Filter{color}+'. So, 
AmIpFilter can't be reflected in spark dirver process.

 

 


was (Author: ouyangxc.zte):
Hi [~csun], we submit a spark application with command `spark-submit  --master 
yarn --class org.apache.spark.examples.SparkPi  
/opt/spark/examples/jars/spark*.jar`.

1、Get AMIpFilter ClassNotFoundException , because there is no 
'hadoop-client-minicluster.jar' in classpath. So we remove the line 
{color:#de350b}_'test'_ in parent pom.xml and 
resource-manager/yarn/pom.xml.{color}

2、rebuild spark project 、depoly binary jars and submit application

3、Get a new Exception as follows:

+2021-02-24 08:36:54,391 ERROR org.apache.spark.SparkContext: Error 
initializing SparkContext.
 java.lang.IllegalStateException: class 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter is not a 
javax.servlet.Filter+

 

The key reason is that spark dirver classloader load class `AmIpFilter` 
implements javax.servlet.Filter, but in shaded jar the class `Filter` imported 
like 'import 
+{color:#de350b}org.apache.hadoop.shaded.javax.servlet.Filter{color}+'. So, 
AmIpFilter can't be reflected in spark dirver process.

 

 

> Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
> -
>
> Key: SPARK-33212
> URL: https://issues.apache.org/jira/browse/SPARK-33212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit, SQL, YARN
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.2.0
>
>
> Hadoop 3.x+ offers shaded client jars: hadoop-client-api and 
> hadoop-client-runtime, which shade 3rd party dependencies such as Guava, 
> protobuf, jetty etc. This Jira switches Spark to use these jars instead of 
> hadoop-common, hadoop-client etc. Benefits include:
>  * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer 
> versions of Hadoop have migrated to Guava 27.0+ and in order to resolve Guava 
> conflicts, Spark depends on Hadoop not leaking dependencies.
>  * It makes Spark/Hadoop dependency cleaner. Currently Spark uses both 
> client-side and server-side Hadoop APIs from modules such as hadoop-common, 
> hadoop-yarn-server-common etc. Moving to hadoop-client-api allows us to only 
> use the public/client API from the Hadoop side.
>  * Provides a better isolation from Hadoop dependencies. In future Spark can 
> better evolve without worrying about dependencies pulled from Hadoop side 
> (which used to be a lot).
> *There are some behavior changes introduced with this JIRA, when people use 
> Spark compiled with Hadoop 3.x:*
> - Users now need to make sure class path contains `hadoop-client-api` and 
> `hadoop-client-runtime` jars when they deploy Spark with the 
> `hadoop-provided` option. In addition, it is highly recommended that they put 
> these two jars before other Hadoop jars in the class path. Otherwise, 
> conflicts such as from Guava could happen if classes are loaded from the 
> other non-shaded Hadoop jars.
> - Since the new shaded Hadoop clients no longer include 3rd party 
> dependencies. Users who used to depend on these now need to explicitly put 
> the jars in their class path.
> Ideally the above should go to release notes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34514) Push down limit for LEFT SEMI and LEFT ANTI join

2021-02-23 Thread Cheng Su (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Su updated SPARK-34514:
-
Description: 
I found out during code review of [https://github.com/apache/spark/pull/31567] 
(see [https://github.com/apache/spark/pull/31567#discussion_r577379572]) that we 
can push down the limit to the left side of a LEFT SEMI or LEFT ANTI join if the 
join condition is empty.

Why it's safe to push down limit:

The semantics of LEFT SEMI join without condition:

(1). if right side is non-empty, output all rows from left side.

(2). if right side is empty, output nothing.

 

The semantics of LEFT ANTI join without condition:

(1). if right side is non-empty, output nothing.

(2). if right side is empty, output all rows from left side.

 

With the semantics of output all rows from left side or nothing (all or 
nothing), it's safe to push down limit to left side.

NOTE: LEFT SEMI / LEFT ANTI join with non-empty condition is not safe for limit 
push down, because output can be a portion of left side rows.

 

Physical operator for LEFT SEMI / LEFT ANTI join without condition - 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala#L200-L204]
 .

  was:
I found out during code review of 
[https://github.com/apache/spark/pull/31567|https://github.com/apache/spark/pull/31567,](
 [https://github.com/apache/spark/pull/31567#discussion_r577379572] ), where we 
can push down limit to the left side of LEFT SEMI and LEFT ANTI join, if the 
join condition is empty.

Why it's safe to push down limit:

The semantics of LEFT SEMI join without condition:

(1). if right side is non-empty, output all rows from left side.

(2). if right side is empty, output nothing.

 

The semantics of LEFT ANTI join without condition:

(1). if right side is non-empty, output nothing.

(2). if right side is empty, output all rows from left side.

 

With the semantics of output all rows from left side or nothing (all or 
nothing), it's safe to push down limit to left side.

NOTE: LEFT SEMI / LEFT ANTI join with non-empty condition is not safe for limit 
push down, because output can be a portion of left side rows.


> Push down limit for LEFT SEMI and LEFT ANTI join
> 
>
> Key: SPARK-34514
> URL: https://issues.apache.org/jira/browse/SPARK-34514
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Trivial
>
> I found out during code review of [https://github.com/apache/spark/pull/31567] 
> (see [https://github.com/apache/spark/pull/31567#discussion_r577379572]) that we 
> can push down the limit to the left side of a LEFT SEMI or LEFT ANTI join if the 
> join condition is empty.
> Why it's safe to push down limit:
> The semantics of LEFT SEMI join without condition:
> (1). if right side is non-empty, output all rows from left side.
> (2). if right side is empty, output nothing.
>  
> The semantics of LEFT ANTI join without condition:
> (1). if right side is non-empty, output nothing.
> (2). if right side is empty, output all rows from left side.
>  
> With the semantics of output all rows from left side or nothing (all or 
> nothing), it's safe to push down limit to left side.
> NOTE: LEFT SEMI / LEFT ANTI join with non-empty condition is not safe for 
> limit push down, because output can be a portion of left side rows.
>  
> Physical operator for LEFT SEMI / LEFT ANTI join without condition - 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala#L200-L204]
>  .



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile

2021-02-23 Thread Xiaochen Ouyang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289618#comment-17289618
 ] 

Xiaochen Ouyang edited comment on SPARK-33212 at 2/24/21, 3:28 AM:
---

Hi [~csun], we submit a spark application with command `spark-submit  --master 
yarn --class org.apache.spark.examples.SparkPi  
/opt/spark/examples/jars/spark*.jar`.

1、Get AMIpFilter ClassNotFoundException , because there is no 
'hadoop-client-minicluster.jar' in classpath. So we remove the line 
{color:#de350b}_'test'_ in parent pom.xml and 
resource-manager/yarn/pom.xml.{color}

2、rebuild spark project 、depoly binary jars and submit application

3、Get a new Exception as follows:

+2021-02-24 08:36:54,391 ERROR org.apache.spark.SparkContext: Error 
initializing SparkContext.
 java.lang.IllegalStateException: class 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter is not a 
javax.servlet.Filter+

 

The key reason is that spark dirver classloader load class `AmIpFilter` 
implements javax.servlet.Filter, but in shaded jar the class `Filter` imported 
like 'import 
+{color:#de350b}org.apache.hadoop.shaded.javax.servlet.Filter{color}+'. So, 
AmIpFilter can't be reflected in spark dirver process.

 

 


was (Author: ouyangxc.zte):
Hi [~csun], we submit a spark application with command `spark-submit  --master 
yarn --class org.apache.spark.examples.SparkPi  
/opt/spark/examples/jars/spark*.jar`.

1、Get AMIpFilter ClassNotFoundException , because there is no 
'hadoop-client-minicluster.jar' in classpath. So we remove the line 
{color:#de350b}_'test'_ in parent pom.xml and 
resource-manager/yarn/pom.xml.{color}
 
 2、rebuild spark project 、depoly binary jars and submit application
 
 3、Get a new Exception as follows:
 
 +2021-02-24 08:36:54,391 ERROR org.apache.spark.SparkContext: Error 
initializing SparkContext.
 java.lang.IllegalStateException: class 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter is not a 
javax.servlet.Filter+

 

The key reason is that spark dirver classloader load class `AmIpFilter` 
implements javax.servlet.Filter, but in shaded jar the class `Filter` imported 
like 'import 
+{color:#de350b}org.apache.hadoop.shaded.javax.servlet.Filter{color}+'. So, 
AmIpFilter can't be reflected in spark dirver proceess.

 

 

> Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
> -
>
> Key: SPARK-33212
> URL: https://issues.apache.org/jira/browse/SPARK-33212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit, SQL, YARN
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.2.0
>
>
> Hadoop 3.x+ offers shaded client jars: hadoop-client-api and 
> hadoop-client-runtime, which shade 3rd party dependencies such as Guava, 
> protobuf, jetty etc. This Jira switches Spark to use these jars instead of 
> hadoop-common, hadoop-client etc. Benefits include:
>  * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer 
> versions of Hadoop have migrated to Guava 27.0+ and in order to resolve Guava 
> conflicts, Spark depends on Hadoop not leaking dependencies.
>  * It makes Spark/Hadoop dependency cleaner. Currently Spark uses both 
> client-side and server-side Hadoop APIs from modules such as hadoop-common, 
> hadoop-yarn-server-common etc. Moving to hadoop-client-api allows us to only 
> use the public/client API from the Hadoop side.
>  * Provides a better isolation from Hadoop dependencies. In future Spark can 
> better evolve without worrying about dependencies pulled from Hadoop side 
> (which used to be a lot).
> *There are some behavior changes introduced with this JIRA, when people use 
> Spark compiled with Hadoop 3.x:*
> - Users now need to make sure class path contains `hadoop-client-api` and 
> `hadoop-client-runtime` jars when they deploy Spark with the 
> `hadoop-provided` option. In addition, it is highly recommended that they put 
> these two jars before other Hadoop jars in the class path. Otherwise, 
> conflicts such as from Guava could happen if classes are loaded from the 
> other non-shaded Hadoop jars.
> - Since the new shaded Hadoop clients no longer include 3rd party 
> dependencies, users who used to depend on these now need to explicitly put 
> the jars in their class path.
> Ideally the above should go to release notes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile

2021-02-23 Thread Xiaochen Ouyang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289618#comment-17289618
 ] 

Xiaochen Ouyang edited comment on SPARK-33212 at 2/24/21, 3:28 AM:
---

Hi [~csun], we submit a spark application with command `spark-submit  --master 
yarn --class org.apache.spark.examples.SparkPi  
/opt/spark/examples/jars/spark*.jar`.

1、Get AMIpFilter ClassNotFoundException , because there is no 
'hadoop-client-minicluster.jar' in classpath. So we remove the line 
{color:#de350b}_'test'_ in parent pom.xml and 
resource-manager/yarn/pom.xml.{color}
 
 2、rebuild spark project 、depoly binary jars and submit application
 
 3、Get a new Exception as follows:
 
 +2021-02-24 08:36:54,391 ERROR org.apache.spark.SparkContext: Error 
initializing SparkContext.
 java.lang.IllegalStateException: class 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter is not a 
javax.servlet.Filter+

 

The key reason is that spark dirver classloader load class `AmIpFilter` 
implements javax.servlet.Filter, but in shaded jar the class `Filter` imported 
like 'import 
+{color:#de350b}org.apache.hadoop.shaded.javax.servlet.Filter{color}+'. So, 
AmIpFilter can't be reflected in spark dirver proceess.

 

 


was (Author: ouyangxc.zte):
Hi [~csun], we submit a spark application with command `spark-submit  --master 
yarn --class org.apache.spark.examples.SparkPi  
/opt/spark/examples/jars/spark*.jar`.

1、Get AMIpFilter ClassNotFoundException , because there is no 
'hadoop-client-minicluster.jar' in classpath. So we remove the line 
{color:#de350b}_'test'_ {color:#172b4d}in parent pom.xml and 
resource-manager/yarn/pom.xml.{color}{color}

{color:#de350b}{color:#172b4d}2、rebuild spark project 、depoly binary jars and 
submit application{color}{color}

{color:#de350b}{color:#172b4d}3、Get a new Exception as follows:{color}{color}

{color:#de350b}+2021-02-24 08:36:54,391 ERROR org.apache.spark.SparkContext: 
Error initializing SparkContext.
java.lang.IllegalStateException: class 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter is not a 
javax.servlet.Filter+{color}

 

The key reason is that spark dirver classloader load class `AmIpFilter` 
implements javax.servlet.Filter, but in shaded jar the class `Filter` imported 
like 'import org.apache.hadoop.shaded.javax.servlet.Filter'. So, AmIpFilter 
can't be reflected in spark dirver proceess.

 

 

> Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
> -
>
> Key: SPARK-33212
> URL: https://issues.apache.org/jira/browse/SPARK-33212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit, SQL, YARN
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.2.0
>
>
> Hadoop 3.x+ offers shaded client jars: hadoop-client-api and 
> hadoop-client-runtime, which shade 3rd party dependencies such as Guava, 
> protobuf, jetty etc. This Jira switches Spark to use these jars instead of 
> hadoop-common, hadoop-client etc. Benefits include:
>  * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer 
> versions of Hadoop have migrated to Guava 27.0+ and in order to resolve Guava 
> conflicts, Spark depends on Hadoop not leaking dependencies.
>  * It makes Spark/Hadoop dependency cleaner. Currently Spark uses both 
> client-side and server-side Hadoop APIs from modules such as hadoop-common, 
> hadoop-yarn-server-common etc. Moving to hadoop-client-api allows us to only 
> use the public/client API from the Hadoop side.
>  * Provides a better isolation from Hadoop dependencies. In future Spark can 
> better evolve without worrying about dependencies pulled from Hadoop side 
> (which used to be a lot).
> *There are some behavior changes introduced with this JIRA, when people use 
> Spark compiled with Hadoop 3.x:*
> - Users now need to make sure class path contains `hadoop-client-api` and 
> `hadoop-client-runtime` jars when they deploy Spark with the 
> `hadoop-provided` option. In addition, it is highly recommended that they put 
> these two jars before other Hadoop jars in the class path. Otherwise, 
> conflicts such as from Guava could happen if classes are loaded from the 
> other non-shaded Hadoop jars.
> - Since the new shaded Hadoop clients no longer include 3rd party 
> dependencies, users who used to depend on these now need to explicitly put 
> the jars in their class path.
> Ideally the above should go to release notes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands,

[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile

2021-02-23 Thread Xiaochen Ouyang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289618#comment-17289618
 ] 

Xiaochen Ouyang commented on SPARK-33212:
-

Hi [~csun], we submit a spark application with command `spark-submit  --master 
yarn --class org.apache.spark.examples.SparkPi  
/opt/spark/examples/jars/spark*.jar`.

1、Get AMIpFilter ClassNotFoundException , because there is no 
'hadoop-client-minicluster.jar' in classpath. So we remove the line 
{color:#de350b}_'test'_ {color:#172b4d}in parent pom.xml and 
resource-manager/yarn/pom.xml.{color}{color}

{color:#de350b}{color:#172b4d}2、rebuild spark project 、depoly binary jars and 
submit application{color}{color}

{color:#de350b}{color:#172b4d}3、Get a new Exception as follows:{color}{color}

{color:#de350b}+2021-02-24 08:36:54,391 ERROR org.apache.spark.SparkContext: 
Error initializing SparkContext.
java.lang.IllegalStateException: class 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter is not a 
javax.servlet.Filter+{color}

 

The key reason is that the Spark driver classloader loads the `AmIpFilter` class 
expecting it to implement javax.servlet.Filter, but in the shaded jar the `Filter` 
class is imported as 'import org.apache.hadoop.shaded.javax.servlet.Filter'. So 
AmIpFilter can't be used via reflection in the Spark driver process.

 

 

> Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
> -
>
> Key: SPARK-33212
> URL: https://issues.apache.org/jira/browse/SPARK-33212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit, SQL, YARN
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.2.0
>
>
> Hadoop 3.x+ offers shaded client jars: hadoop-client-api and 
> hadoop-client-runtime, which shade 3rd party dependencies such as Guava, 
> protobuf, jetty etc. This Jira switches Spark to use these jars instead of 
> hadoop-common, hadoop-client etc. Benefits include:
>  * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer 
> versions of Hadoop have migrated to Guava 27.0+ and in order to resolve Guava 
> conflicts, Spark depends on Hadoop not leaking dependencies.
>  * It makes Spark/Hadoop dependency cleaner. Currently Spark uses both 
> client-side and server-side Hadoop APIs from modules such as hadoop-common, 
> hadoop-yarn-server-common etc. Moving to hadoop-client-api allows us to only 
> use the public/client API from the Hadoop side.
>  * Provides a better isolation from Hadoop dependencies. In future Spark can 
> better evolve without worrying about dependencies pulled from Hadoop side 
> (which used to be a lot).
> *There are some behavior changes introduced with this JIRA, when people use 
> Spark compiled with Hadoop 3.x:*
> - Users now need to make sure class path contains `hadoop-client-api` and 
> `hadoop-client-runtime` jars when they deploy Spark with the 
> `hadoop-provided` option. In addition, it is highly recommended that they put 
> these two jars before other Hadoop jars in the class path. Otherwise, 
> conflicts such as from Guava could happen if classes are loaded from the 
> other non-shaded Hadoop jars.
> - Since the new shaded Hadoop clients no longer include 3rd party 
> dependencies, users who used to depend on these now need to explicitly put 
> the jars in their class path.
> Ideally the above should go to release notes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34514) Push down limit for LEFT SEMI and LEFT ANTI join

2021-02-23 Thread Cheng Su (Jira)
Cheng Su created SPARK-34514:


 Summary: Push down limit for LEFT SEMI and LEFT ANTI join
 Key: SPARK-34514
 URL: https://issues.apache.org/jira/browse/SPARK-34514
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Cheng Su


I found out during code review of [https://github.com/apache/spark/pull/31567] 
(see [https://github.com/apache/spark/pull/31567#discussion_r577379572]) that we 
can push down the limit to the left side of a LEFT SEMI or LEFT ANTI join if the 
join condition is empty.

Why it's safe to push down limit:

The semantics of LEFT SEMI join without condition:

(1). if right side is non-empty, output all rows from left side.

(2). if right side is empty, output nothing.

 

The semantics of LEFT ANTI join without condition:

(1). if right side is non-empty, output nothing.

(2). if right side is empty, output all rows from left side.

 

With the semantics of output all rows from left side or nothing (all or 
nothing), it's safe to push down limit to left side.

NOTE: LEFT SEMI / LEFT ANTI join with non-empty condition is not safe for limit 
push down, because output can be a portion of left side rows.
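For intuition, a hedged, runnable sketch of the all-or-nothing argument on toy data (the table names and values are made up; this is not the optimizer rule itself):
{code:scala}
// With no join condition, a LEFT SEMI join returns either every left row (right side non-empty)
// or no rows at all (right side empty); LEFT ANTI is the mirror image. Either way, applying
// LIMIT before or after the join draws from the same candidate set, so the pushdown is safe.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("semi-anti-limit-sketch").getOrCreate()
import spark.implicits._

Seq(1, 2, 3, 4, 5).toDF("id").createOrReplaceTempView("l")
Seq(10).toDF("x").createOrReplaceTempView("r")

spark.sql("SELECT * FROM l LEFT SEMI JOIN r LIMIT 2").show()   // 2 rows: right side is non-empty
spark.sql("SELECT * FROM l LEFT ANTI JOIN r LIMIT 2").show()   // 0 rows: right side is non-empty
{code}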



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32703) Enable dictionary filtering for Parquet vectorized reader

2021-02-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32703:


Assignee: Apache Spark

> Enable dictionary filtering for Parquet vectorized reader
> -
>
> Key: SPARK-32703
> URL: https://issues.apache.org/jira/browse/SPARK-32703
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Chao Sun
>Assignee: Apache Spark
>Priority: Minor
>
> Parquet vectorized reader still uses the old API for {{filterRowGroups}} and 
> only filters on statistics. It should switch to the new API and do dictionary 
> filtering as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32703) Enable dictionary filtering for Parquet vectorized reader

2021-02-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32703:


Assignee: (was: Apache Spark)

> Enable dictionary filtering for Parquet vectorized reader
> -
>
> Key: SPARK-32703
> URL: https://issues.apache.org/jira/browse/SPARK-32703
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Chao Sun
>Priority: Minor
>
> Parquet vectorized reader still uses the old API for {{filterRowGroups}} and 
> only filters on statistics. It should switch to the new API and do dictionary 
> filtering as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32703) Enable dictionary filtering for Parquet vectorized reader

2021-02-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32703:
-
Fix Version/s: (was: 3.2.0)

> Enable dictionary filtering for Parquet vectorized reader
> -
>
> Key: SPARK-32703
> URL: https://issues.apache.org/jira/browse/SPARK-32703
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Chao Sun
>Priority: Minor
>
> Parquet vectorized reader still uses the old API for {{filterRowGroups}} and 
> only filters on statistics. It should switch to the new API and do dictionary 
> filtering as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-32703) Enable dictionary filtering for Parquet vectorized reader

2021-02-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-32703:
--
  Assignee: (was: Chao Sun)

Reverted at 
https://github.com/apache/spark/commit/80bad086c806fd507b1fb197b171f87333f2fb08

> Enable dictionary filtering for Parquet vectorized reader
> -
>
> Key: SPARK-32703
> URL: https://issues.apache.org/jira/browse/SPARK-32703
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Chao Sun
>Priority: Minor
> Fix For: 3.2.0
>
>
> Parquet vectorized reader still uses the old API for {{filterRowGroups}} and 
> only filters on statistics. It should switch to the new API and do dictionary 
> filtering as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34513) Kubernetes Spark Driver Pod Name Length Limitation

2021-02-23 Thread John (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John updated SPARK-34513:
-
Description: 
Hi,

We are using Spark in Airflow with the k8s-master. Airflow is attaching to our 
spark-driver pod a unique id utilizing the k8s-subdomain convention '.'

This creates rather long pod-names. 

We noticed an issue with pod names in total (pod name + airflow attached uuid) 
exceeding 63 chars. Usually pod names can be up to 253 chars long. However 
Spark seems to have an issue with driver pod names which are longer than 63 
characters.

In our case the driver pod name is exactly 65 chars long, but Spark is omitting 
the last 2 chars in its error message. I assume internally Spark is losing 
those two characters. Reducing our driver pod name to just 63 chars fixed the 
issue.

Here you can see the actual pod name (row 1) and the pod name from the Spark 
Error log (row 2)
{code:java}
ab-aa--cc-dd.3s092032c69f4639adff835a826e0120
ab-aa--cc-dd.3s092032c69f4639adff835a826e01{code}
{code:java}
[2021-02-20 00:30:06,289] {pod_launcher.py:136} INFO - Exception in thread 
"main" org.apache.spark.SparkException: No pod was found named 
Some(ab-aa--cc-dd.3s092032c69f4639adff835a826e01) in the 
cluster in the namespace airflow-ns (this was supposed to be the driver 
pod.).{code}
 

  was:
Hi,

We are using Spark in Airflow with the k8s-master. Airflow is attaching to our 
spark-driver pod a unique id utilizing the k8s-subdomain convention '.'

This creates rather long pod-names. 

We noticed an issue with pod names in total (pod name + airflow attached uuid) 
exceeding 63 chars. Usually pod names can be up to 253 chars long. However 
Spark seems to have an issue with driver pod names which are longer than 63 
characters.

In our case the driver pod name is exactly 65 chars long, but Spark is omitting 
the last 2 chars in its error message. I assume internally Spark is loosing 
those two characters. Reducing our Driver Pod Name to just 63 charts fixed the 
issue.

Here you can see the actual pod name (row 1) and the pod name from the Spark 
Error log (row 2)
ab-aa--cc-dd.3s092032c69f4639adff835a826e0120
ab-aa--cc-dd.3s092032c69f4639adff835a826e01
[2021-02-20 00:30:06,289] \{pod_launcher.py:136} INFO - Exception in thread 
"main" org.apache.spark.SparkException: No pod was found named 
Some(ab-aa--cc-dd.3s092032c69f4639adff835a826e01) in the 
cluster in the namespace airflow-ns (this was supposed to be the driver pod.).
 


> Kubernetes Spark Driver Pod Name Length Limitation
> --
>
> Key: SPARK-34513
> URL: https://issues.apache.org/jira/browse/SPARK-34513
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0, 3.0.1
>Reporter: John
>Priority: Major
>
> Hi,
> We are using Spark in Airflow with the k8s-master. Airflow is attaching to 
> our spark-driver pod a unique id utilizing the k8s-subdomain convention '.'
> This creates rather long pod-names. 
> We noticed an issue with pod names in total (pod name + airflow attached 
> uuid) exceeding 63 chars. Usually pod names can be up to 253 chars long. 
> However Spark seems to have an issue with driver pod names which are longer 
> than 63 characters.
> In our case the driver pod name is exactly 65 chars long, but Spark omits the 
> last 2 chars in its error message. I assume Spark is internally losing those 
> two characters. Reducing our driver pod name to just 63 chars fixed the issue.
> Here you can see the actual pod name (row 1) and the pod name from the Spark 
> Error log (row 2)
> {code:java}
> ab-aa--cc-dd.3s092032c69f4639adff835a826e0120
> ab-aa--cc-dd.3s092032c69f4639adff835a826e01{code}
> {code:java}
> [2021-02-20 00:30:06,289] {pod_launcher.py:136} INFO - Exception in thread 
> "main" org.apache.spark.SparkException: No pod was found named 
> Some(ab-aa--cc-dd.3s092032c69f4639adff835a826e01) in the 
> cluster in the namespace airflow-ns (this was supposed to be the driver 
> pod.).{code}
>  
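
As a side note, the 63-character figure matches the Kubernetes DNS-1123 label limit:
full object names may be up to 253 characters, but each dot-separated segment is
limited to 63. Below is a minimal standalone Scala sketch of those rules (not Spark's
actual validation code) with a made-up pod name in the same subdomain style.

{code:scala}
// DNS-1123 label rule: lowercase alphanumerics and '-', 1..63 chars,
// starting and ending with an alphanumeric character.
val Dns1123Label = "[a-z0-9]([-a-z0-9]*[a-z0-9])?".r

def isValidLabel(s: String): Boolean =
  s.nonEmpty && s.length <= 63 && Dns1123Label.pattern.matcher(s).matches()

// Hypothetical Airflow-style driver pod name: "<pod-name>.<uuid>".
val podName = "my-spark-driver-pod-name-from-airflow.3s092032c69f4639adff835a826e01"

println(s"total length ${podName.length} (limit 253 for the full name)")
println("all dot-separated segments are valid labels: " +
  podName.split('.').forall(isValidLabel))
{code}

A name like the one reported is therefore legal for Kubernetes even though it exceeds
63 characters in total, which is consistent with the truncation being on Spark's side.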



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34513) Kubernetes Spark Driver Pod Name Length Limitation

2021-02-23 Thread John (Jira)
John created SPARK-34513:


 Summary: Kubernetes Spark Driver Pod Name Length Limitation
 Key: SPARK-34513
 URL: https://issues.apache.org/jira/browse/SPARK-34513
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 3.0.1, 3.0.0
Reporter: John


Hi,

We are using Spark in Airflow with the k8s-master. Airflow is attaching to our 
spark-driver pod a unique id utilizing the k8s-subdomain convention '.'

This creates rather long pod-names. 

We noticed an issue with pod names in total (pod name + airflow attached uuid) 
exceeding 63 chars. Usually pod names can be up to 253 chars long. However 
Spark seems to have an issue with driver pod names which are longer than 63 
characters.

In our case the driver pod name is exactly 65 chars long, but Spark omits the 
last 2 chars in its error message. I assume Spark is internally losing those two 
characters. Reducing our driver pod name to just 63 chars fixed the issue.

Here you can see the actual pod name (row 1) and the pod name from the Spark 
Error log (row 2)
ab-aa--cc-dd.3s092032c69f4639adff835a826e0120
ab-aa--cc-dd.3s092032c69f4639adff835a826e01
[2021-02-20 00:30:06,289] \{pod_launcher.py:136} INFO - Exception in thread 
"main" org.apache.spark.SparkException: No pod was found named 
Some(ab-aa--cc-dd.3s092032c69f4639adff835a826e01) in the 
cluster in the namespace airflow-ns (this was supposed to be the driver pod.).
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25390) Data source V2 API refactoring

2021-02-23 Thread Rafael (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289577#comment-17289577
 ] 

Rafael commented on SPARK-25390:


Sorry for the late response.
I was able to migrate my project to Spark 3.0.0.
Here are some hints on what I did:
https://gist.github.com/rafaelkyrdan/2bea8385aadd71be5bf67cddeec59581



> Data source V2 API refactoring
> --
>
> Key: SPARK-25390
> URL: https://issues.apache.org/jira/browse/SPARK-25390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently it's not very clear how we should abstract the data source v2 API. 
> The abstraction should either be unified between batch and streaming, or be 
> similar with a well-defined difference between batch and streaming. The 
> abstraction should also include catalog/table.
> An example of the abstraction:
> {code}
> batch: catalog -> table -> scan
> streaming: catalog -> table -> stream -> scan
> {code}
> We should refactor the data source v2 API according to this abstraction.
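
For anyone landing here from a migration like the one above: in the released Spark 3.x
connector API (org.apache.spark.sql.connector.*), the batch read path follows roughly
this catalog -> table -> scan chain. Below is a minimal, hedged sketch of a read-only
source against the Spark 3.0 interfaces; all Demo* names are made up for illustration
and error handling is omitted.

{code:scala}
import java.util

import scala.collection.JavaConverters._

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.connector.read._
import org.apache.spark.sql.types.{IntegerType, StructType}
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// table layer: provider -> table
class DemoProvider extends TableProvider {
  override def inferSchema(options: CaseInsensitiveStringMap): StructType = DemoTable.schema
  override def getTable(
      schema: StructType,
      partitioning: Array[Transform],
      properties: util.Map[String, String]): Table = new DemoTable
}

object DemoTable {
  val schema: StructType = new StructType().add("i", IntegerType)
}

class DemoTable extends Table with SupportsRead {
  override def name(): String = "demo"
  override def schema(): StructType = DemoTable.schema
  override def capabilities(): util.Set[TableCapability] = Set(TableCapability.BATCH_READ).asJava
  override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder = new DemoScanBuilder
}

// scan layer: scan builder -> scan -> batch -> partitions and readers
class DemoScanBuilder extends ScanBuilder {
  override def build(): Scan = new DemoScan
}

class DemoScan extends Scan with Batch {
  override def readSchema(): StructType = DemoTable.schema
  override def toBatch: Batch = this
  override def planInputPartitions(): Array[InputPartition] = Array(DemoPartition(0, 5))
  override def createReaderFactory(): PartitionReaderFactory = new DemoReaderFactory
}

case class DemoPartition(start: Int, end: Int) extends InputPartition

class DemoReaderFactory extends PartitionReaderFactory {
  override def createReader(partition: InputPartition): PartitionReader[InternalRow] = {
    val p = partition.asInstanceOf[DemoPartition]
    new PartitionReader[InternalRow] {
      private var current = p.start - 1
      override def next(): Boolean = { current += 1; current < p.end }
      override def get(): InternalRow = InternalRow(current)
      override def close(): Unit = ()
    }
  }
}

// usage, once the classes are on the classpath:
//   spark.read.format(classOf[DemoProvider].getName).load().show()
{code}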



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33044) Add a Jenkins build and test job for Scala 2.13

2021-02-23 Thread shane knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289542#comment-17289542
 ] 

shane knapp commented on SPARK-33044:
-

hey!  sorry, i've been pretty slammed these past few weeks.  i should be
able to get this done by EOW.

On Mon, Oct 19, 2020 at 1:36 PM Dongjoon Hyun (Jira) 



-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


> Add a Jenkins build and test job for Scala 2.13
> ---
>
> Key: SPARK-33044
> URL: https://issues.apache.org/jira/browse/SPARK-33044
> Project: Spark
>  Issue Type: Sub-task
>  Components: jenkins
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Assignee: Shane Knapp
>Priority: Major
> Attachments: Screen Shot 2020-12-08 at 1.56.59 PM.png, Screen Shot 
> 2020-12-08 at 1.58.07 PM.png
>
>
> The {{Master}} branch seems to be almost ready for Scala 2.13 now; we need a 
> Jenkins test job to verify the current work and CI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33044) Add a Jenkins build and test job for Scala 2.13

2021-02-23 Thread shane knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289527#comment-17289527
 ] 

shane knapp commented on SPARK-33044:
-

1) logins are temporarily disabled due to new campus network security
standards. i need to find a non-manual way of dealing with this asap.

2) i will get to this tomorrow.




-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


> Add a Jenkins build and test job for Scala 2.13
> ---
>
> Key: SPARK-33044
> URL: https://issues.apache.org/jira/browse/SPARK-33044
> Project: Spark
>  Issue Type: Sub-task
>  Components: jenkins
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Assignee: Shane Knapp
>Priority: Major
> Attachments: Screen Shot 2020-12-08 at 1.56.59 PM.png, Screen Shot 
> 2020-12-08 at 1.58.07 PM.png
>
>
> The {{Master}} branch seems to be almost ready for Scala 2.13 now; we need a 
> Jenkins test job to verify the current work and CI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26138) Pushdown limit through InnerLike when condition is empty

2021-02-23 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-26138.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31567
[https://github.com/apache/spark/pull/31567]

> Pushdown limit through InnerLike when condition is empty
> 
>
> Key: SPARK-26138
> URL: https://issues.apache.org/jira/browse/SPARK-26138
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: guoxiaolong
>Assignee: Yuming Wang
>Priority: Minor
> Fix For: 3.2.0
>
>
> In the LimitPushDown batch, a cross join (an InnerLike join with an empty join 
> condition) can have the limit pushed down to both sides.
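
For reference, a small sketch of the query shape this targets, runnable in spark-shell
(view names are made up); with the change, the optimized plan should show a LocalLimit
on both join children rather than only above the cross product.

{code:scala}
spark.range(1000).toDF("a").createOrReplaceTempView("t1")
spark.range(1000).toDF("b").createOrReplaceTempView("t2")

// A LIMIT over a condition-less (cross) join: limiting each side before the
// join cannot lose needed rows, because there is no join condition to satisfy.
spark.sql("SELECT * FROM t1 CROSS JOIN t2 LIMIT 10").explain(true)
{code}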



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32195) Standardize warning types and messages

2021-02-23 Thread RISHAV DUTTA (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289505#comment-17289505
 ] 

RISHAV DUTTA commented on SPARK-32195:
--

I am working on it. Can you take up another issue?

Thanks and regards

Rishav Dutta





> Standardize warning types and messages
> --
>
> Key: SPARK-32195
> URL: https://issues.apache.org/jira/browse/SPARK-32195
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Maciej Szymkiewicz
>Priority: Major
>
> Currently PySpark uses a somewhat inconsistent warning type and message such 
> as UserWarning. We should standardize it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26138) Pushdown limit through InnerLike when condition is empty

2021-02-23 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-26138:
---

Assignee: Yuming Wang

> Pushdown limit through InnerLike when condition is empty
> 
>
> Key: SPARK-26138
> URL: https://issues.apache.org/jira/browse/SPARK-26138
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: guoxiaolong
>Assignee: Yuming Wang
>Priority: Minor
>
> In the LimitPushDown batch, a cross join (an InnerLike join with an empty join 
> condition) can have the limit pushed down to both sides.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33044) Add a Jenkins build and test job for Scala 2.13

2021-02-23 Thread shane knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289470#comment-17289470
 ] 

shane knapp commented on SPARK-33044:
-

welp, i manually created a test job and it failed pretty early on:
https://amplab.cs.berkeley.edu/jenkins/view/All/job/spark-master-test-maven-hadoop-3.2-hive-2.3-scala-2.13/1/

On Tue, Dec 8, 2020 at 10:59 AM Dongjoon Hyun (Jira) 



-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


> Add a Jenkins build and test job for Scala 2.13
> ---
>
> Key: SPARK-33044
> URL: https://issues.apache.org/jira/browse/SPARK-33044
> Project: Spark
>  Issue Type: Sub-task
>  Components: jenkins
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Assignee: Shane Knapp
>Priority: Major
> Attachments: Screen Shot 2020-12-08 at 1.56.59 PM.png, Screen Shot 
> 2020-12-08 at 1.58.07 PM.png
>
>
> The {{Master}} branch seems to be almost ready for Scala 2.13 now; we need a 
> Jenkins test job to verify the current work and CI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34503) Use zstd for spark.eventLog.compression.codec by default

2021-02-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-34503.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31618
[https://github.com/apache/spark/pull/31618]

> Use zstd for spark.eventLog.compression.codec by default
> 
>
> Key: SPARK-34503
> URL: https://issues.apache.org/jira/browse/SPARK-34503
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.2.0
>
>
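
For reference, a small configuration sketch (the event log directory below is a
placeholder) showing where this codec applies; after this change zstd is the default,
but the codec can still be set explicitly, for example back to lz4, the previous
effective default.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("event-log-compression-demo")
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.dir", "/tmp/spark-events")      // placeholder directory
  .config("spark.eventLog.compress", "true")
  .config("spark.eventLog.compression.codec", "zstd")     // or "lz4" to keep the old codec
  .getOrCreate()
{code}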




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34503) Use zstd for spark.eventLog.compression.codec by default

2021-02-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-34503:
-

Assignee: Dongjoon Hyun

> Use zstd for spark.eventLog.compression.codec by default
> 
>
> Key: SPARK-34503
> URL: https://issues.apache.org/jira/browse/SPARK-34503
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: releasenotes
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34512) Disable validate default values when parsing Avro schemas

2021-02-23 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-34512:

Description: 
This is a regression problem. How to reproduce this issue:
{code:scala}
  // Add this test to HiveSerDeReadWriteSuite
  test("SPARK-34512") {
withTable("t1") {
  hiveClient.runSqlHive(
"""
  |CREATE TABLE t1
  |  ROW FORMAT SERDE
  |  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  |  STORED AS INPUTFORMAT
  |  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  |  OUTPUTFORMAT
  |  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  |  TBLPROPERTIES (
  |'avro.schema.literal'='{
  |  "namespace": "org.apache.spark.sql.hive.test",
  |  "name": "schema_with_default_value",
  |  "type": "record",
  |  "fields": [
  | {
  |   "name": "ARRAY_WITH_DEFAULT",
  |   "type": {"type": "array", "items": "string"},
  |   "default": null
  | }
  |   ]
  |}')
  |""".stripMargin)

  spark.sql("select * from t1").show
}
  }
{code}


{noformat}
org.apache.avro.AvroTypeException: Invalid default for field 
ARRAY_WITH_DEFAULT: null not a {"type":"array","items":"string"}
at org.apache.avro.Schema.validateDefault(Schema.java:1571)
at org.apache.avro.Schema.access$500(Schema.java:87)
at org.apache.avro.Schema$Field.<init>(Schema.java:544)
at org.apache.avro.Schema.parse(Schema.java:1678)
at org.apache.avro.Schema$Parser.parse(Schema.java:1425)
at org.apache.avro.Schema$Parser.parse(Schema.java:1413)
at 
org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.getSchemaFor(AvroSerdeUtils.java:268)
at 
org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrThrowException(AvroSerdeUtils.java:111)
at 
org.apache.hadoop.hive.serde2.avro.AvroSerDe.determineSchemaOrReturnErrorSchema(AvroSerDe.java:187)
at 
org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:107)
at 
org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:83)
at 
org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:533)
at 
org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:450)
at 
org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:437)
at 
org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:281)
at 
org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:263)
at 
org.apache.hadoop.hive.ql.metadata.Table.getColsInternal(Table.java:641)
at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:624)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:831)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:867)
at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:4356)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:354)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:199)
at 
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2183)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1839)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1526)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$runHive$1(HiveClientImpl.scala:820)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:291)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:224)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:223)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:273)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:800)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:787)

{noformat}
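
A sketch of the direction the summary suggests, assuming Avro's
Schema.Parser#setValidateDefaults (available in recent Avro releases): parsing the same
literal with default-value validation disabled succeeds, whereas the strict parser
raises the AvroTypeException shown above.

{code:scala}
import org.apache.avro.Schema

val literal =
  """{
    |  "namespace": "org.apache.spark.sql.hive.test",
    |  "name": "schema_with_default_value",
    |  "type": "record",
    |  "fields": [
    |    {"name": "ARRAY_WITH_DEFAULT",
    |     "type": {"type": "array", "items": "string"},
    |     "default": null}
    |  ]
    |}""".stripMargin

// Strict parsing rejects the invalid array default:
// new Schema.Parser().parse(literal)  // throws AvroTypeException, as in the trace above

// With default validation disabled, the schema parses:
val lenient: Schema = new Schema.Parser().setValidateDefaults(false).parse(literal)
println(lenient.getField("ARRAY_WITH_DEFAULT").schema())
{code}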


  was:
How to reproduce this issue:
{code:scala}
  // Add this test to HiveSerDeReadWriteSuite
  test("SPARK-34512") {
withTable("t1") {
  hiveClient.runSqlHive(
"""
  |CREATE TABLE t1
  |  ROW FORMAT SERDE
  |  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  |  STORED AS INPUTFORMAT
  |  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  |  OUTPUTFORMAT
  |  'org.apache.hadoop.hive.ql

[jira] [Updated] (SPARK-34512) Disable validate default values when parsing Avro schemas

2021-02-23 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-34512:

Description: 
How to reproduce this issue:
{code:scala}
  // Add this test to HiveSerDeReadWriteSuite
  test("SPARK-34512") {
withTable("t1") {
  hiveClient.runSqlHive(
"""
  |CREATE TABLE t1
  |  ROW FORMAT SERDE
  |  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  |  STORED AS INPUTFORMAT
  |  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  |  OUTPUTFORMAT
  |  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  |  TBLPROPERTIES (
  |'avro.schema.literal'='{
  |  "namespace": "org.apache.spark.sql.hive.test",
  |  "name": "schema_with_default_value",
  |  "type": "record",
  |  "fields": [
  | {
  |   "name": "ARRAY_WITH_DEFAULT",
  |   "type": {"type": "array", "items": "string"},
  |   "default": null
  | }
  |   ]
  |}')
  |""".stripMargin)

  spark.sql("select * from t1").show
}
  }
{code}


{noformat}
org.apache.avro.AvroTypeException: Invalid default for field 
ARRAY_WITH_DEFAULT: null not a {"type":"array","items":"string"}
at org.apache.avro.Schema.validateDefault(Schema.java:1571)
at org.apache.avro.Schema.access$500(Schema.java:87)
at org.apache.avro.Schema$Field.<init>(Schema.java:544)
at org.apache.avro.Schema.parse(Schema.java:1678)
at org.apache.avro.Schema$Parser.parse(Schema.java:1425)
at org.apache.avro.Schema$Parser.parse(Schema.java:1413)
at 
org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.getSchemaFor(AvroSerdeUtils.java:268)
at 
org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrThrowException(AvroSerdeUtils.java:111)
at 
org.apache.hadoop.hive.serde2.avro.AvroSerDe.determineSchemaOrReturnErrorSchema(AvroSerDe.java:187)
at 
org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:107)
at 
org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:83)
at 
org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:533)
at 
org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:450)
at 
org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:437)
at 
org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:281)
at 
org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:263)
at 
org.apache.hadoop.hive.ql.metadata.Table.getColsInternal(Table.java:641)
at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:624)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:831)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:867)
at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:4356)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:354)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:199)
at 
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2183)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1839)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1526)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$runHive$1(HiveClientImpl.scala:820)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:291)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:224)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:223)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:273)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:800)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:787)

{noformat}

It worked before.


  was:
How to reproduce this issue:
{code:scala}

{code}


> Disable validate default values when parsing Avro schemas
> -
>
> Key: SPARK-34512
> URL: https://issues.apache.org/jira/browse/SPARK-34512
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce

[jira] [Created] (SPARK-34512) Disable validate default values when parsing Avro schemas

2021-02-23 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-34512:
---

 Summary: Disable validate default values when parsing Avro schemas
 Key: SPARK-34512
 URL: https://issues.apache.org/jira/browse/SPARK-34512
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yuming Wang


How to reproduce this issue:
{code:scala}

{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31891) `ALTER TABLE multipartIdentifier RECOVER PARTITIONS` should drop partition if partition specific location is not exist any more

2021-02-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31891:
-

Assignee: Maxim Gekk

> `ALTER TABLE multipartIdentifier RECOVER PARTITIONS` should drop partition if 
> partition specific location is not exist any more
> ---
>
> Key: SPARK-31891
> URL: https://issues.apache.org/jira/browse/SPARK-31891
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Zhu, Lipeng
>Assignee: Maxim Gekk
>Priority: Major
>
> Currently, when executing
> {code:sql}
> ALTER TABLE multipartIdentifier RECOVER PARTITIONS
> {code}
> it will automatically add partitions according to the table's root location 
> structure. Spark needs one more step: check whether each existing partition's 
> specific location still exists, and drop the partition if it does not.
>  
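
A short sketch of the intended behaviour, runnable in spark-shell with a hypothetical
table and location (the second partition's directory is removed on the file system,
outside Spark, before recovering).

{code:scala}
spark.sql("CREATE TABLE t (id INT, p INT) USING parquet PARTITIONED BY (p) LOCATION '/tmp/recover_demo'")
spark.sql("INSERT INTO t PARTITION (p = 1) VALUES (1)")
spark.sql("INSERT INTO t PARTITION (p = 2) VALUES (2)")

// ... delete /tmp/recover_demo/p=2 directly on the file system ...

spark.sql("ALTER TABLE t RECOVER PARTITIONS")
// With this change, the dangling partition is dropped from the metastore as well:
spark.sql("SHOW PARTITIONS t").show()   // expected to list only p=1
{code}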



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31891) `ALTER TABLE multipartIdentifier RECOVER PARTITIONS` should drop partition if partition specific location is not exist any more

2021-02-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31891.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31499
[https://github.com/apache/spark/pull/31499]

> `ALTER TABLE multipartIdentifier RECOVER PARTITIONS` should drop partition if 
> partition specific location is not exist any more
> ---
>
> Key: SPARK-31891
> URL: https://issues.apache.org/jira/browse/SPARK-31891
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Zhu, Lipeng
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently, when executing
> {code:sql}
> ALTER TABLE multipartIdentifier RECOVER PARTITIONS
> {code}
> it will automatically add partitions according to the table's root location 
> structure. Spark needs one more step: check whether each existing partition's 
> specific location still exists, and drop the partition if it does not.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34508) skip HiveExternalCatalogVersionsSuite if network is down

2021-02-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-34508.
---
Fix Version/s: 3.2.0
   3.1.1
   Resolution: Fixed

Issue resolved by pull request 31627
[https://github.com/apache/spark/pull/31627]

> skip HiveExternalCatalogVersionsSuite if network is down
> 
>
> Key: SPARK-34508
> URL: https://issues.apache.org/jira/browse/SPARK-34508
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.1.1, 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34508) skip HiveExternalCatalogVersionsSuite if network is down

2021-02-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-34508:
-

Assignee: Wenchen Fan

> skip HiveExternalCatalogVersionsSuite if network is down
> 
>
> Key: SPARK-34508
> URL: https://issues.apache.org/jira/browse/SPARK-34508
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34511) Current Security vulnerabilities in spark libraries

2021-02-23 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289339#comment-17289339
 ] 

Dongjoon Hyun commented on SPARK-34511:
---

Hi, [~emac3060]. This is already outdated; please see the 3.0.2 release notes. For 
example, we are using commons-compress 1.20 and Jetty 9.4.34.

- [https://spark.apache.org/releases/spark-release-3-0-2.html]

Could you update this report based on 3.0.2 or 3.1.0 RC3?

> Current Security vulnerabilities in spark libraries
> ---
>
> Key: SPARK-34511
> URL: https://issues.apache.org/jira/browse/SPARK-34511
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.0.1
>Reporter: eoin
>Priority: Major
>  Labels: security
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The following libraries have the following vulnerabilities that will fail 
> Nexus security scans. They are deemed as threats of level 7 and higher on the 
> Sonatype/Nexus scale. Many of them can be fixed by upgrading the dependencies 
> as they are fixed in subsequent releases.
>  
> com.fasterxml.woodstox : woodstox-core : 5.0.3 * 
> [https://github.com/FasterXML/woodstox/issues/50]
>  * [https://github.com/FasterXML/woodstox/issues/51]
>  * [https://github.com/FasterXML/woodstox/issues/61]
> com.nimbusds : nimbus-jose-jwt : 4.41.1 * 
> [https://bitbucket.org/connect2id/nimbus-jose-jwt/src/master/SECURITY-CHANGELOG.txt]
>  * [https://connect2id.com/blog/nimbus-jose-jwt-7-9]
> Log4j : log4j : 1.2.17
> SocketServer class that is vulnerable to deserialization of untrusted data: * 
> https://issues.apache.org/jira/browse/LOG4J2-1863
>  * 
> [https://lists.apache.org/thread.html/84cc4266238e057b95eb95dfd8b29d46a2592e7672c12c92f68b2917%40%3Cannounce.apache.org%3E]
>  * [https://bugzilla.redhat.com/show_bug.cgi?id=1785616]
> Dynamic-link Library (DLL) Preloading:
>  * [https://bz.apache.org/bugzilla/show_bug.cgi?id=50323]
>  
> apache-xerces : xercesImpl : 2.9.1 * hash table collisions -> 
> https://issues.apache.org/jira/browse/XERCESJ-1685
>  * 
> [https://mail-archives.apache.org/mod_mbox/xerces-j-dev/201410.mbox/%3cof3b40f5f7.e6552a8b-on85257d73.00699ed7-85257d73.006a9...@ca.ibm.com%3E]
>  * [https://bugzilla.redhat.com/show_bug.cgi?id=1019176]
>  
> com.fasterxml.jackson.core : jackson-databind : 2.10.0 * 
> [https://github.com/FasterXML/jackson-databind/issues/2589]
>  
> commons-beanutils : commons-beanutils : 1.9.3 * 
> [http://www.rapid7.com/db/modules/exploit/multi/http/struts_code_exec_classloader]
>  * https://issues.apache.org/jira/browse/BEANUTILS-463
>  
> commons-io : commons-io : 2.5 * [https://github.com/apache/commons-io/pull/52]
>  * https://issues.apache.org/jira/browse/IO-556
>  * https://issues.apache.org/jira/browse/IO-559
>  
> io.netty : netty-all : 4.1.47.Final * 
> [https://github.com/netty/netty/issues/10351]
>  * [https://github.com/netty/netty/pull/10560]
>  
> org.apache.commons : commons-compress : 1.18 * 
> [https://commons.apache.org/proper/commons-compress/security-reports.html#Apache_Commons_Compress_Security_Vulnerabilities]
>  
> org.apache.hadoop : hadoop-hdfs : 2.7.4 * 
> [https://lists.apache.org/thread.html/rca4516b00b55b347905df45e5d0432186248223f30497db87aba8710@%3Cannounce.apache.org%3E]
>  * 
> [https://lists.apache.org/thread.html/caacbbba2dcc1105163f76f3dfee5fbd22e0417e0783212787086378@%3Cgeneral.hadoop.apache.org%3E]
>  * [https://hadoop.apache.org/cve_list.html]
>  * [https://www.openwall.com/lists/oss-security/2019/01/24/3]
>  
> org.apache.hadoop : hadoop-mapreduce-client-core : 2.7.4 * 
> [https://bugzilla.redhat.com/show_bug.cgi?id=1516399]
>  * 
> [https://lists.apache.org/thread.html/2e16689b44bdd1976b6368c143a4017fc7159d1f2d02a5d54fe9310f@%3Cgeneral.hadoop.apache.org%3E]
>  
> org.codehaus.jackson : jackson-mapper-asl : 1.9.13 * 
> [https://github.com/FasterXML/jackson-databind/issues/1599]
>  * [https://blog.sonatype.com/jackson-databind-remote-code-execution]
>  * [https://blog.sonatype.com/jackson-databind-the-end-of-the-blacklist]
>  * [https://bugzilla.redhat.com/show_bug.cgi?id=CVE-2017-7525]
>  * [https://access.redhat.com/security/cve/cve-2019-10172]
>  * [https://bugzilla.redhat.com/show_bug.cgi?id=1715075]
>  * [https://nvd.nist.gov/vuln/detail/CVE-2019-10172]
>  
> org.eclipse.jetty : jetty-http : 9.3.24.v20180605: * 
> [https://bugs.eclipse.org/bugs/show_bug.cgi?id=538096]
>  
> org.eclipse.jetty : jetty-webapp : 9.3.24.v20180605 * 
> [https://bugs.eclipse.org/bugs/show_bug.cgi?id=567921]
>  * [https://github.com/eclipse/jetty.project/issues/5451]
>  * 
> [https://github.com/eclipse/jetty.project/security/advisories/GHSA-g3wg-6mcf-8jj6]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (SPARK-34511) Current Security vulnerabilities in spark libraries

2021-02-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34511:
--
Flags:   (was: Patch)

> Current Security vulnerabilities in spark libraries
> ---
>
> Key: SPARK-34511
> URL: https://issues.apache.org/jira/browse/SPARK-34511
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: eoin
>Priority: Major
>  Labels: security
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The following libraries have the following vulnerabilities that will fail 
> Nexus security scans. They are deemed as threats of level 7 and higher on the 
> Sonatype/Nexus scale. Many of them can be fixed by upgrading the dependencies 
> as they are fixed in subsequent releases.
>  
> com.fasterxml.woodstox : woodstox-core : 5.0.3 * 
> [https://github.com/FasterXML/woodstox/issues/50]
>  * [https://github.com/FasterXML/woodstox/issues/51]
>  * [https://github.com/FasterXML/woodstox/issues/61]
> com.nimbusds : nimbus-jose-jwt : 4.41.1 * 
> [https://bitbucket.org/connect2id/nimbus-jose-jwt/src/master/SECURITY-CHANGELOG.txt]
>  * [https://connect2id.com/blog/nimbus-jose-jwt-7-9]
> Log4j : log4j : 1.2.17
> SocketServer class that is vulnerable to deserialization of untrusted data: * 
> https://issues.apache.org/jira/browse/LOG4J2-1863
>  * 
> [https://lists.apache.org/thread.html/84cc4266238e057b95eb95dfd8b29d46a2592e7672c12c92f68b2917%40%3Cannounce.apache.org%3E]
>  * [https://bugzilla.redhat.com/show_bug.cgi?id=1785616]
> Dynamic-link Library (DLL) Preloading:
>  * [https://bz.apache.org/bugzilla/show_bug.cgi?id=50323]
>  
> apache-xerces : xercesImpl : 2.9.1 * hash table collisions -> 
> https://issues.apache.org/jira/browse/XERCESJ-1685
>  * 
> [https://mail-archives.apache.org/mod_mbox/xerces-j-dev/201410.mbox/%3cof3b40f5f7.e6552a8b-on85257d73.00699ed7-85257d73.006a9...@ca.ibm.com%3E]
>  * [https://bugzilla.redhat.com/show_bug.cgi?id=1019176]
>  
> com.fasterxml.jackson.core : jackson-databind : 2.10.0 * 
> [https://github.com/FasterXML/jackson-databind/issues/2589]
>  
> commons-beanutils : commons-beanutils : 1.9.3 * 
> [http://www.rapid7.com/db/modules/exploit/multi/http/struts_code_exec_classloader]
>  * https://issues.apache.org/jira/browse/BEANUTILS-463
>  
> commons-io : commons-io : 2.5 * [https://github.com/apache/commons-io/pull/52]
>  * https://issues.apache.org/jira/browse/IO-556
>  * https://issues.apache.org/jira/browse/IO-559
>  
> io.netty : netty-all : 4.1.47.Final * 
> [https://github.com/netty/netty/issues/10351]
>  * [https://github.com/netty/netty/pull/10560]
>  
> org.apache.commons : commons-compress : 1.18 * 
> [https://commons.apache.org/proper/commons-compress/security-reports.html#Apache_Commons_Compress_Security_Vulnerabilities]
>  
> org.apache.hadoop : hadoop-hdfs : 2.7.4 * 
> [https://lists.apache.org/thread.html/rca4516b00b55b347905df45e5d0432186248223f30497db87aba8710@%3Cannounce.apache.org%3E]
>  * 
> [https://lists.apache.org/thread.html/caacbbba2dcc1105163f76f3dfee5fbd22e0417e0783212787086378@%3Cgeneral.hadoop.apache.org%3E]
>  * [https://hadoop.apache.org/cve_list.html]
>  * [https://www.openwall.com/lists/oss-security/2019/01/24/3]
>  
> org.apache.hadoop : hadoop-mapreduce-client-core : 2.7.4 * 
> [https://bugzilla.redhat.com/show_bug.cgi?id=1516399]
>  * 
> [https://lists.apache.org/thread.html/2e16689b44bdd1976b6368c143a4017fc7159d1f2d02a5d54fe9310f@%3Cgeneral.hadoop.apache.org%3E]
>  
> org.codehaus.jackson : jackson-mapper-asl : 1.9.13 * 
> [https://github.com/FasterXML/jackson-databind/issues/1599]
>  * [https://blog.sonatype.com/jackson-databind-remote-code-execution]
>  * [https://blog.sonatype.com/jackson-databind-the-end-of-the-blacklist]
>  * [https://bugzilla.redhat.com/show_bug.cgi?id=CVE-2017-7525]
>  * [https://access.redhat.com/security/cve/cve-2019-10172]
>  * [https://bugzilla.redhat.com/show_bug.cgi?id=1715075]
>  * [https://nvd.nist.gov/vuln/detail/CVE-2019-10172]
>  
> org.eclipse.jetty : jetty-http : 9.3.24.v20180605: * 
> [https://bugs.eclipse.org/bugs/show_bug.cgi?id=538096]
>  
> org.eclipse.jetty : jetty-webapp : 9.3.24.v20180605 * 
> [https://bugs.eclipse.org/bugs/show_bug.cgi?id=567921]
>  * [https://github.com/eclipse/jetty.project/issues/5451]
>  * 
> [https://github.com/eclipse/jetty.project/security/advisories/GHSA-g3wg-6mcf-8jj6]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34511) Current Security vulnerabilities in spark libraries

2021-02-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34511:
--
Component/s: (was: Spark Core)
 Build

> Current Security vulnerabilities in spark libraries
> ---
>
> Key: SPARK-34511
> URL: https://issues.apache.org/jira/browse/SPARK-34511
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.0.1
>Reporter: eoin
>Priority: Major
>  Labels: security
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The following libraries have the following vulnerabilities that will fail 
> Nexus security scans. They are deemed as threats of level 7 and higher on the 
> Sonatype/Nexus scale. Many of them can be fixed by upgrading the dependencies 
> as they are fixed in subsequent releases.
>  
> com.fasterxml.woodstox : woodstox-core : 5.0.3 * 
> [https://github.com/FasterXML/woodstox/issues/50]
>  * [https://github.com/FasterXML/woodstox/issues/51]
>  * [https://github.com/FasterXML/woodstox/issues/61]
> com.nimbusds : nimbus-jose-jwt : 4.41.1 * 
> [https://bitbucket.org/connect2id/nimbus-jose-jwt/src/master/SECURITY-CHANGELOG.txt]
>  * [https://connect2id.com/blog/nimbus-jose-jwt-7-9]
> Log4j : log4j : 1.2.17
> SocketServer class that is vulnerable to deserialization of untrusted data: * 
> https://issues.apache.org/jira/browse/LOG4J2-1863
>  * 
> [https://lists.apache.org/thread.html/84cc4266238e057b95eb95dfd8b29d46a2592e7672c12c92f68b2917%40%3Cannounce.apache.org%3E]
>  * [https://bugzilla.redhat.com/show_bug.cgi?id=1785616]
> Dynamic-link Library (DLL) Preloading:
>  * [https://bz.apache.org/bugzilla/show_bug.cgi?id=50323]
>  
> apache-xerces : xercesImpl : 2.9.1 * hash table collisions -> 
> https://issues.apache.org/jira/browse/XERCESJ-1685
>  * 
> [https://mail-archives.apache.org/mod_mbox/xerces-j-dev/201410.mbox/%3cof3b40f5f7.e6552a8b-on85257d73.00699ed7-85257d73.006a9...@ca.ibm.com%3E]
>  * [https://bugzilla.redhat.com/show_bug.cgi?id=1019176]
>  
> com.fasterxml.jackson.core : jackson-databind : 2.10.0 * 
> [https://github.com/FasterXML/jackson-databind/issues/2589]
>  
> commons-beanutils : commons-beanutils : 1.9.3 * 
> [http://www.rapid7.com/db/modules/exploit/multi/http/struts_code_exec_classloader]
>  * https://issues.apache.org/jira/browse/BEANUTILS-463
>  
> commons-io : commons-io : 2.5 * [https://github.com/apache/commons-io/pull/52]
>  * https://issues.apache.org/jira/browse/IO-556
>  * https://issues.apache.org/jira/browse/IO-559
>  
> io.netty : netty-all : 4.1.47.Final * 
> [https://github.com/netty/netty/issues/10351]
>  * [https://github.com/netty/netty/pull/10560]
>  
> org.apache.commons : commons-compress : 1.18 * 
> [https://commons.apache.org/proper/commons-compress/security-reports.html#Apache_Commons_Compress_Security_Vulnerabilities]
>  
> org.apache.hadoop : hadoop-hdfs : 2.7.4 * 
> [https://lists.apache.org/thread.html/rca4516b00b55b347905df45e5d0432186248223f30497db87aba8710@%3Cannounce.apache.org%3E]
>  * 
> [https://lists.apache.org/thread.html/caacbbba2dcc1105163f76f3dfee5fbd22e0417e0783212787086378@%3Cgeneral.hadoop.apache.org%3E]
>  * [https://hadoop.apache.org/cve_list.html]
>  * [https://www.openwall.com/lists/oss-security/2019/01/24/3]
>  
> org.apache.hadoop : hadoop-mapreduce-client-core : 2.7.4 * 
> [https://bugzilla.redhat.com/show_bug.cgi?id=1516399]
>  * 
> [https://lists.apache.org/thread.html/2e16689b44bdd1976b6368c143a4017fc7159d1f2d02a5d54fe9310f@%3Cgeneral.hadoop.apache.org%3E]
>  
> org.codehaus.jackson : jackson-mapper-asl : 1.9.13 * 
> [https://github.com/FasterXML/jackson-databind/issues/1599]
>  * [https://blog.sonatype.com/jackson-databind-remote-code-execution]
>  * [https://blog.sonatype.com/jackson-databind-the-end-of-the-blacklist]
>  * [https://bugzilla.redhat.com/show_bug.cgi?id=CVE-2017-7525]
>  * [https://access.redhat.com/security/cve/cve-2019-10172]
>  * [https://bugzilla.redhat.com/show_bug.cgi?id=1715075]
>  * [https://nvd.nist.gov/vuln/detail/CVE-2019-10172]
>  
> org.eclipse.jetty : jetty-http : 9.3.24.v20180605: * 
> [https://bugs.eclipse.org/bugs/show_bug.cgi?id=538096]
>  
> org.eclipse.jetty : jetty-webapp : 9.3.24.v20180605 * 
> [https://bugs.eclipse.org/bugs/show_bug.cgi?id=567921]
>  * [https://github.com/eclipse/jetty.project/issues/5451]
>  * 
> [https://github.com/eclipse/jetty.project/security/advisories/GHSA-g3wg-6mcf-8jj6]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34510) .foreachPartition command hangs when ran inside Python package but works when ran from Python file outside the package on EMR

2021-02-23 Thread Yuriy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuriy updated SPARK-34510:
--
Description: 
I'm running on EMR PySpark 3.0.0 with the project structure below; process.py 
controls the flow of the application and calls code inside the 
_file_processor_ package. The command hangs when the .foreachPartition code 
located inside _s3_repo.py_ is called by _process.py_. When the same 
.foreachPartition code is moved from _s3_repo.py_ into _process.py_, it runs 
just fine.
{code:java}
process.py
file_processor
  config
spark.py
  repository
s3_repo.py
  structure
table_creator.py

{code}
*process.py*
{code:java}
from file_processor.structure import table_creator
from file_processor.repository import s3_repo

def process():
table_creator.create_table()
s3_repo.save_to_s3()

if __name__ == '__main__':
process()
{code}
*spark.py*
{code:java}
from pyspark.sql import SparkSession
spark_session = SparkSession.builder.appName("Test").getOrCreate()
{code}
*s3_repo.py* 
{code:java}
from file_processor.config.spark import spark_session

def save_to_s3():
spark_session.sql('SELECT * FROM 
rawFileData').toJSON().foreachPartition(_save_to_s3)

def _save_to_s3(iterator):   
for record in iterator:
print(record)
{code}
 *table_creator.py*
{code:java}
from file_processor.config.spark import spark_session
from pyspark.sql import Row

def create_table():
file_contents = [
{'line_num': 1, 'contents': 'line 1'},
{'line_num': 2, 'contents': 'line 2'},
{'line_num': 3, 'contents': 'line 3'}
]
spark_session.createDataFrame(Row(**row) for row in 
file_contents).cache().createOrReplaceTempView("rawFileData")
{code}

  was:
I'm running on EMR Pyspark 3.0.0. with a project structure below, process.py is 
what controls the flow of the application and calls code inside the 
_file_processor_ package. The command hangs when the .foreachPartition code 
that is located inside _s3_repo.py_ is called by _process.py_. When the same 
.foreachPartition code is moved from _s3_repo.py_ and placed inside the 
_process.py_ it runs just fine.
{code:java}
process.py
file_processor
  config
spark.py
  repository
s3_repo.py
  structure
table_creator.py

{code}
*process.py*
{code:java}
from file_processor.structure import table_creator
from file_processor.repository import s3_repo

def process():
table_creator.create_table()
s3_repo.save_to_s3()

if __name__ == '__main__':
process()
{code}
*spark.py*
{code:java}
from pyspark.sql import SparkSession
spark_session = SparkSession.builder.appName("Test").getOrCreate()
{code}
*s3_repo.py* 
{code:java}
from file_processor.config.spark import spark_session

def save_to_s3():
spark_session.sql('SELECT * FROM 
rawFileData').toJSON().foreachPartition(_save_to_s3)

def _save_to_s3(iterator):   
for record in iterator:
print(record)
{code}
 *table_creator.py*
{code:java}
from file_processor.config.spark import spark_session
from pyspark.sql import Row

def create_table():
file_contents = [
{'line_num': 1, 'contents': 'line 1'},
{'line_num': 2, 'contents': 'line 2'},
{'line_num': 3, 'contents': 'line 3'}
]
spark_session.createDataFrame(Row(**row) for row in 
file_contents).cache().createOrReplaceTempView("rawFileData")
{code}


> .foreachPartition command hangs when ran inside Python package but works when 
> ran from Python file outside the package on EMR
> -
>
> Key: SPARK-34510
> URL: https://issues.apache.org/jira/browse/SPARK-34510
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, PySpark
>Affects Versions: 3.0.0
>Reporter: Yuriy
>Priority: Minor
> Attachments: Code.zip
>
>
> I'm running on EMR PySpark 3.0.0 with the project structure below; process.py 
> controls the flow of the application and calls code inside the 
> _file_processor_ package. The command hangs when the .foreachPartition code 
> located inside _s3_repo.py_ is called by _process.py_. When the same 
> .foreachPartition code is moved from _s3_repo.py_ into _process.py_, it runs 
> just fine.
> {code:java}
> process.py
> file_processor
>   config
> spark.py
>   repository
> s3_repo.py
>   structure
> table_creator.py
> {code}
> *process.py*
> {code:java}
> from file_processor.structure import table_creator
> from file_processor.repository import s3_repo
> def process():
> table_creator.create_table()
> s3_repo.save_to_s3()
> if __name__ == '__main__':
> process()
> {code}
> *spark.py*
> {code:java}
> from py

[jira] [Updated] (SPARK-34510) .foreachPartition command hangs when ran inside Python package but works when ran from Python file outside the package on EMR

2021-02-23 Thread Yuriy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuriy updated SPARK-34510:
--
Description: 
I'm running on EMR Pyspark 3.0.0. with a project structure below, process.py is 
what controls the flow of the application and calls code inside the 
_file_processor_ package. The command hangs when the .foreachPartition code 
that is located inside _s3_repo.py_ is called by _process.py_. When the same 
.foreachPartition code is moved from _s3_repo.py_ and placed inside the 
_process.py_ it runs just fine.
{code:java}
process.py
file_processor
  config
spark.py
  repository
s3_repo.py
  structure
table_creator.py

{code}
*process.py*
{code:java}
from file_processor.structure import table_creator
from file_processor.repository import s3_repo

def process():
table_creator.create_table()
s3_repo.save_to_s3()

if __name__ == '__main__':
process()
{code}
*spark.py*
{code:java}
from pyspark.sql import SparkSession
spark_session = SparkSession.builder.appName("Test").getOrCreate()
{code}
*s3_repo.py* 
{code:java}
from file_processor.config.spark import spark_session

def save_to_s3():
spark_session.sql('SELECT * FROM 
rawFileData').toJSON().foreachPartition(_save_to_s3)

def _save_to_s3(iterator):   
for record in iterator:
print(record)
{code}
 *table_creator.py*
{code:java}
from file_processor.config.spark import spark_session
from pyspark.sql import Row

def create_table():
file_contents = [
{'line_num': 1, 'contents': 'line 1'},
{'line_num': 2, 'contents': 'line 2'},
{'line_num': 3, 'contents': 'line 3'}
]
spark_session.createDataFrame(Row(**row) for row in 
file_contents).cache().createOrReplaceTempView("rawFileData")
{code}

  was:
I'm running on EMR Pyspark 3.0.0. with a project structure below, process.py is 
what controls the flow of the application and calls code inside the 
file_processor package. The command hangs when the .foreachPartition code that 
is located inside _s3_repo.py_ is called by _process.py_. When the same 
.foreachPartition code is moved from _s3_repo.py_ and placed inside the 
_process.py_ it runs just fine.
{code:java}
process.py
file_processor
  config
spark.py
  repository
s3_repo.py
  structure
table_creator.py

{code}
*process.py*
{code:java}
from file_processor.structure import table_creator
from file_processor.repository import s3_repo

def process():
table_creator.create_table()
s3_repo.save_to_s3()

if __name__ == '__main__':
process()
{code}
*spark.py*
{code:java}
from pyspark.sql import SparkSession
spark_session = SparkSession.builder.appName("Test").getOrCreate()
{code}
*s3_repo.py* 
{code:java}
from file_processor.config.spark import spark_session

def save_to_s3():
spark_session.sql('SELECT * FROM 
rawFileData').toJSON().foreachPartition(_save_to_s3)

def _save_to_s3(iterator):   
for record in iterator:
print(record)
{code}
 *table_creator.py*
{code:java}
from file_processor.config.spark import spark_session
from pyspark.sql import Row

def create_table():
file_contents = [
{'line_num': 1, 'contents': 'line 1'},
{'line_num': 2, 'contents': 'line 2'},
{'line_num': 3, 'contents': 'line 3'}
]
spark_session.createDataFrame(Row(**row) for row in 
file_contents).cache().createOrReplaceTempView("rawFileData")
{code}


> .foreachPartition command hangs when ran inside Python package but works when 
> ran from Python file outside the package on EMR
> -
>
> Key: SPARK-34510
> URL: https://issues.apache.org/jira/browse/SPARK-34510
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, PySpark
>Affects Versions: 3.0.0
>Reporter: Yuriy
>Priority: Minor
> Attachments: Code.zip
>
>
> I'm running on EMR Pyspark 3.0.0. with a project structure below, process.py 
> is what controls the flow of the application and calls code inside the 
> _file_processor_ package. The command hangs when the .foreachPartition code 
> that is located inside _s3_repo.py_ is called by _process.py_. When the same 
> .foreachPartition code is moved from _s3_repo.py_ and placed inside the 
> _process.py_ it runs just fine.
> {code:java}
> process.py
> file_processor
>   config
> spark.py
>   repository
> s3_repo.py
>   structure
> table_creator.py
> {code}
> *process.py*
> {code:java}
> from file_processor.structure import table_creator
> from file_processor.repository import s3_repo
> def process():
> table_creator.create_table()
> s3_repo.save_to_s3()
> if __name__ == '__main__':
> process()
> {code}
> *spark.py*
> {code:java}
> from 

[jira] [Updated] (SPARK-34510) .foreachPartition command hangs when ran inside Python package but works when ran from Python file outside the package on EMR

2021-02-23 Thread Yuriy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuriy updated SPARK-34510:
--
Description: 
I'm running on EMR Pyspark 3.0.0. with a project structure below, process.py is 
what controls the flow of the application and calls code inside the 
file_processor package. The command hangs when the .foreachPartition code that 
is located inside _s3_repo.py_ is called by _process.py_. When the same 
.foreachPartition code is moved from _s3_repo.py_ and placed inside the 
_process.py_ it runs just fine.
{code:java}
process.py
file_processor
  config
spark.py
  repository
s3_repo.py
  structure
table_creator.py

{code}
*process.py*
{code:java}
from file_processor.structure import table_creator
from file_processor.repository import s3_repo

def process():
table_creator.create_table()
s3_repo.save_to_s3()

if __name__ == '__main__':
process()
{code}
*spark.py*
{code:java}
from pyspark.sql import SparkSession
spark_session = SparkSession.builder.appName("Test").getOrCreate()
{code}
*s3_repo.py* 
{code:java}
from file_processor.config.spark import spark_session

def save_to_s3():
spark_session.sql('SELECT * FROM 
rawFileData').toJSON().foreachPartition(_save_to_s3)

def _save_to_s3(iterator):   
for record in iterator:
print(record)
{code}
 *table_creator.py*
{code:java}
from file_processor.config.spark import spark_session
from pyspark.sql import Row

def create_table():
file_contents = [
{'line_num': 1, 'contents': 'line 1'},
{'line_num': 2, 'contents': 'line 2'},
{'line_num': 3, 'contents': 'line 3'}
]
spark_session.createDataFrame(Row(**row) for row in 
file_contents).cache().createOrReplaceTempView("rawFileData")
{code}

  was:
I'm running on EMR Pyspark 3.0.0. with a project structure below, process.py is 
what controls the flow of the application and calls code inside the 
file_processor package. The command hangs when the .foreachPartition code that 
is located inside _s3_repo.py_ is called by _process.py_. When the same 
.foreachPartition code is moved from _s3_repo.py_ and placed inside the 
_process.py_ it runs just fine.
{code:java}
process.py
file_processor
  config
spark.py
  repository
s3_repo.py
  structure
table_creator.py

{code}
*process.py*
{code:java}
from file_processor.structure import table_creator
from file_processor.repository import s3_repo

def process():
table_creator.create_table()
s3_repo.save_to_s3()

if __name__ == '__main__':
process()
{code}
*spark.py*
{code:java}
from pyspark.sql import SparkSession
spark_session = SparkSession.builder.appName("Test").getOrCreate()
{code}
*s3_repo.py* 
{code:java}
from file_processor.config.spark import spark_session

def save_to_s3():
spark_session.sql('SELECT * FROM 
rawFileData').toJSON().foreachPartition(_save_to_s3)

def _save_to_s3(iterator):   
for record in iterator:
print(record)
{code}
 

*table_creator.py*
{code:java}
from file_processor.config.spark import spark_session
from pyspark.sql import Row

def create_table():
file_contents = [
{'line_num': 1, 'contents': 'line 1'},
{'line_num': 2, 'contents': 'line 2'},
{'line_num': 3, 'contents': 'line 3'}
]
spark_session.createDataFrame(Row(**row) for row in 
file_contents).cache().createOrReplaceTempView("rawFileData")
{code}


> .foreachPartition command hangs when ran inside Python package but works when 
> ran from Python file outside the package on EMR
> -
>
> Key: SPARK-34510
> URL: https://issues.apache.org/jira/browse/SPARK-34510
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, PySpark
>Affects Versions: 3.0.0
>Reporter: Yuriy
>Priority: Minor
> Attachments: Code.zip
>
>
> I'm running on EMR Pyspark 3.0.0. with a project structure below, process.py 
> is what controls the flow of the application and calls code inside the 
> file_processor package. The command hangs when the .foreachPartition code 
> that is located inside _s3_repo.py_ is called by _process.py_. When the same 
> .foreachPartition code is moved from _s3_repo.py_ and placed inside the 
> _process.py_ it runs just fine.
> {code:java}
> process.py
> file_processor
>   config
> spark.py
>   repository
> s3_repo.py
>   structure
> table_creator.py
> {code}
> *process.py*
> {code:java}
> from file_processor.structure import table_creator
> from file_processor.repository import s3_repo
> def process():
> table_creator.create_table()
> s3_repo.save_to_s3()
> if __name__ == '__main__':
> process()
> {code}
> *spark.py*
> {code:java}
> from py

[jira] [Updated] (SPARK-34510) .foreachPartition command hangs when ran inside Python package but works when ran from Python file outside the package on EMR

2021-02-23 Thread Yuriy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuriy updated SPARK-34510:
--
Description: 
I'm running on EMR Pyspark 3.0.0. with a project structure below, process.py is 
what controls the flow of the application and calls code inside the 
file_processor package. The command hangs when the .foreachPartition code that 
is located inside _s3_repo.py_ is called by _process.py_. When the same 
.foreachPartition code is moved from _s3_repo.py_ and placed inside the 
_process.py_ it runs just fine.
{code:java}
process.py
file_processor
  config
spark.py
  repository
s3_repo.py
  structure
table_creator.py

{code}
*process.py*
{code:java}
from file_processor.structure import table_creator
from file_processor.repository import s3_repo

def process():
table_creator.create_table()
s3_repo.save_to_s3()

if __name__ == '__main__':
process()
{code}
*spark.py*
{code:java}
from pyspark.sql import SparkSession
spark_session = SparkSession.builder.appName("Test").getOrCreate()
{code}
*s3_repo.py* 
{code:java}
from file_processor.config.spark import spark_session

def save_to_s3():
    spark_session.sql('SELECT * FROM rawFileData').toJSON().foreachPartition(_save_to_s3)

def _save_to_s3(iterator):
    for record in iterator:
        print(record)
{code}
 

*table_creator.py*
{code:java}
from file_processor.config.spark import spark_session
from pyspark.sql import Row

def create_table():
    file_contents = [
        {'line_num': 1, 'contents': 'line 1'},
        {'line_num': 2, 'contents': 'line 2'},
        {'line_num': 3, 'contents': 'line 3'}
    ]
    spark_session.createDataFrame(Row(**row) for row in file_contents).cache().createOrReplaceTempView("rawFileData")
{code}

  was:
I'm running on EMR Pyspark 3.0.0. with a project structure below, process.py is 
what controls the flow of the application and calls code inside the 
file_processor package.
{code:java}
process.py
file_processor
  config
spark.py
  repository
s3_repo.py
  structure
table_creator.py

{code}
The command hangs when the .foreachPartition code that is located inside 
s3_repo.py is called by process.py. When the same .foreachPartition code is 
moved from s3_repo.py and placed inside the process.py it runs just fine.

 

process.py
{code:java}
from file_processor.structure import table_creator
from file_processor.repository import s3_repo

def process():
table_creator.create_table()
s3_repo.save_to_s3()

if __name__ == '__main__':
process()
{code}
spark.py
{code:java}
from pyspark.sql import SparkSession
spark_session = SparkSession.builder.appName("Test").getOrCreate()
{code}
s3_repo.py

 

 


> .foreachPartition command hangs when ran inside Python package but works when 
> ran from Python file outside the package on EMR
> -
>
> Key: SPARK-34510
> URL: https://issues.apache.org/jira/browse/SPARK-34510
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, PySpark
>Affects Versions: 3.0.0
>Reporter: Yuriy
>Priority: Minor
> Attachments: Code.zip
>
>
> I'm running PySpark 3.0.0 on EMR with the project structure below; process.py 
> controls the flow of the application and calls code inside the 
> file_processor package. The command hangs when the .foreachPartition code 
> located inside _s3_repo.py_ is called by _process.py_. When the same 
> .foreachPartition code is moved from _s3_repo.py_ into 
> _process.py_, it runs just fine.
> {code:java}
> process.py
> file_processor
>   config
> spark.py
>   repository
> s3_repo.py
>   structure
> table_creator.py
> {code}
> *process.py*
> {code:java}
> from file_processor.structure import table_creator
> from file_processor.repository import s3_repo
> def process():
> table_creator.create_table()
> s3_repo.save_to_s3()
> if __name__ == '__main__':
> process()
> {code}
> *spark.py*
> {code:java}
> from pyspark.sql import SparkSession
> spark_session = SparkSession.builder.appName("Test").getOrCreate()
> {code}
> *s3_repo.py* 
> {code:java}
> from file_processor.config.spark import spark_session
> def save_to_s3():
> spark_session.sql('SELECT * FROM 
> rawFileData').toJSON().foreachPartition(_save_to_s3)
> def _save_to_s3(iterator):   
> for record in iterator:
> print(record)
> {code}
>  
> *table_creator.py*
> {code:java}
> from file_processor.config.spark import spark_session
> from pyspark.sql import Row
> def create_table():
> file_contents = [
> {'line_num': 1, 'contents': 'line 1'},
> {'line_num': 2, 'contents': 'line 2'},
> {'line_num': 3, 'content

[jira] [Updated] (SPARK-34510) .foreachPartition command hangs when ran inside Python package but works when ran from Python file outside the package on EMR

2021-02-23 Thread Yuriy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuriy updated SPARK-34510:
--
Description: 
I'm running on EMR Pyspark 3.0.0. with a project structure below, process.py is 
what controls the flow of the application and calls code inside the 
file_processor package.
{code:java}
process.py
file_processor
  config
spark.py
  repository
s3_repo.py
  structure
table_creator.py

{code}
The command hangs when the .foreachPartition code that is located inside 
s3_repo.py is called by process.py. When the same .foreachPartition code is 
moved from s3_repo.py and placed inside the process.py it runs just fine.

 

process.py
{code:java}
from file_processor.structure import table_creator
from file_processor.repository import s3_repo

def process():
    table_creator.create_table()
    s3_repo.save_to_s3()

if __name__ == '__main__':
    process()
{code}
spark.py
{code:java}
from pyspark.sql import SparkSession
spark_session = SparkSession.builder.appName("Test").getOrCreate()
{code}
s3_repo.py

 

 

  was:I provided full description of the issue on Stack Overflow via the 
following link https://stackoverflow.com/questions/66300313


> .foreachPartition command hangs when ran inside Python package but works when 
> ran from Python file outside the package on EMR
> -
>
> Key: SPARK-34510
> URL: https://issues.apache.org/jira/browse/SPARK-34510
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, PySpark
>Affects Versions: 3.0.0
>Reporter: Yuriy
>Priority: Minor
> Attachments: Code.zip
>
>
> I'm running on EMR Pyspark 3.0.0. with a project structure below, process.py 
> is what controls the flow of the application and calls code inside the 
> file_processor package.
> {code:java}
> process.py
> file_processor
>   config
> spark.py
>   repository
> s3_repo.py
>   structure
> table_creator.py
> {code}
> The command hangs when the .foreachPartition code that is located inside 
> s3_repo.py is called by process.py. When the same .foreachPartition code is 
> moved from s3_repo.py and placed inside the process.py it runs just fine.
>  
> process.py
> {code:java}
> from file_processor.structure import table_creator
> from file_processor.repository import s3_repo
> def process():
> table_creator.create_table()
> s3_repo.save_to_s3()
> if __name__ == '__main__':
> process()
> {code}
> spark.py
> {code:java}
> from pyspark.sql import SparkSession
> spark_session = SparkSession.builder.appName("Test").getOrCreate()
> {code}
> s3_repo.py
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34502) Remove unused parameters in join methods

2021-02-23 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh reassigned SPARK-34502:
---

Assignee: Huaxin Gao

> Remove unused parameters in join methods
> 
>
> Key: SPARK-34502
> URL: https://issues.apache.org/jira/browse/SPARK-34502
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Trivial
>
> Remove unused parameters in some join methods



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34502) Remove unused parameters in join methods

2021-02-23 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh resolved SPARK-34502.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31617
[https://github.com/apache/spark/pull/31617]

> Remove unused parameters in join methods
> 
>
> Key: SPARK-34502
> URL: https://issues.apache.org/jira/browse/SPARK-34502
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Trivial
> Fix For: 3.2.0
>
>
> Remove unused parameters in some join methods



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34380) Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES

2021-02-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34380:


Assignee: Terry Kim  (was: Apache Spark)

> Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES
> 
>
> Key: SPARK-34380
> URL: https://issues.apache.org/jira/browse/SPARK-34380
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
>
> Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES
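
For context, a hedged sketch of the syntax this ticket targets; the exact behavior for v2 tables is defined by the PR, not by this example:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unset-tblproperties-demo").getOrCreate()
spark.sql("CREATE TABLE demo (id INT) USING parquet TBLPROPERTIES ('owner' = 'me')")

# Without IF EXISTS, unsetting a key that is not set raises an error;
# with IF EXISTS the missing key should be ignored silently.
spark.sql("ALTER TABLE demo UNSET TBLPROPERTIES IF EXISTS ('owner', 'no_such_key')")
{code}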



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34380) Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES

2021-02-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34380:


Assignee: Apache Spark  (was: Terry Kim)

> Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES
> 
>
> Key: SPARK-34380
> URL: https://issues.apache.org/jira/browse/SPARK-34380
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Terry Kim
>Assignee: Apache Spark
>Priority: Major
>
> Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-34380) Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES

2021-02-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reopened SPARK-34380:
-

> Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES
> 
>
> Key: SPARK-34380
> URL: https://issues.apache.org/jira/browse/SPARK-34380
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.2.0
>
>
> Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34380) Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES

2021-02-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-34380:

Fix Version/s: (was: 3.2.0)

> Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES
> 
>
> Key: SPARK-34380
> URL: https://issues.apache.org/jira/browse/SPARK-34380
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
>
> Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34472) SparkContext.addJar with an ivy path fails in cluster mode with a custom ivySettings file

2021-02-23 Thread Shardul Mahadik (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289258#comment-17289258
 ] 

Shardul Mahadik commented on SPARK-34472:
-

[~xkrogen] raised a good point at 
https://github.com/apache/spark/pull/31591#discussion_r579324686 that we should 
refactor YarnClusterSuite to extract common parameter handling code to be 
shared across tests. Will do this as a followup.

> SparkContext.addJar with an ivy path fails in cluster mode with a custom 
> ivySettings file
> -
>
> Key: SPARK-34472
> URL: https://issues.apache.org/jira/browse/SPARK-34472
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Shardul Mahadik
>Priority: Major
>
> SPARK-33084 introduced support for Ivy paths in {{sc.addJar}} or Spark SQL 
> {{ADD JAR}}. If we use a custom ivySettings file using 
> {{spark.jars.ivySettings}}, it is loaded at 
> [https://github.com/apache/spark/blob/b26e7b510bbaee63c4095ab47e75ff2a70e377d7/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L1280.]
>  However, this file is only accessible on the client machine. In cluster 
> mode, this file is not available on the driver and so {{addJar}} fails.
> {code:sh}
> spark-submit --master yarn --deploy-mode cluster --class IvyAddJarExample 
> --conf spark.jars.ivySettings=/path/to/ivySettings.xml example.jar
> {code}
> {code}
> java.lang.IllegalArgumentException: requirement failed: Ivy settings file 
> /path/to/ivySettings.xml does not exist
>   at scala.Predef$.require(Predef.scala:281)
>   at 
> org.apache.spark.deploy.SparkSubmitUtils$.loadIvySettings(SparkSubmit.scala:1331)
>   at 
> org.apache.spark.util.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:176)
>   at 
> org.apache.spark.util.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:156)
>   at 
> org.apache.spark.sql.internal.SessionResourceLoader.resolveJars(SessionState.scala:166)
>   at 
> org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:133)
>   at 
> org.apache.spark.sql.execution.command.AddJarCommand.run(resources.scala:40)
>  {code}
> We should ship the ivySettings file to the driver so that {{addJar}} is able 
> to find it.
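
To make the failure mode concrete, a hedged sketch of the usage path described above; the coordinates and paths are placeholders, and the ivySettings file is assumed to exist only on the submitting machine:
{code:python}
# Submitted with something like:
#   spark-submit --master yarn --deploy-mode cluster \
#     --conf spark.jars.ivySettings=/path/to/ivySettings.xml app.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ivy-addjar-demo").getOrCreate()

# SPARK-33084 syntax: resolve the jar from Ivy/Maven coordinates.
# In cluster mode the driver runs remotely, so the client-local
# /path/to/ivySettings.xml is not on its filesystem and the call fails
# with "Ivy settings file ... does not exist".
spark.sql("ADD JAR ivy://com.example:example-lib:1.0.0")
{code}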



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34511) Current Security vulnerabilities in spark libraries

2021-02-23 Thread eoin (Jira)
eoin created SPARK-34511:


 Summary: Current Security vulnerabilities in spark libraries
 Key: SPARK-34511
 URL: https://issues.apache.org/jira/browse/SPARK-34511
 Project: Spark
  Issue Type: Dependency upgrade
  Components: Spark Core
Affects Versions: 3.0.1
Reporter: eoin


The following libraries have vulnerabilities that will fail Nexus 
security scans. They are deemed threats of level 7 or higher on the 
Sonatype/Nexus scale. Many of them can be fixed by upgrading the dependencies, 
as the issues are fixed in subsequent releases.
 
com.fasterxml.woodstox : woodstox-core : 5.0.3
 * [https://github.com/FasterXML/woodstox/issues/50]
 * [https://github.com/FasterXML/woodstox/issues/51]
 * [https://github.com/FasterXML/woodstox/issues/61]

com.nimbusds : nimbus-jose-jwt : 4.41.1
 * [https://bitbucket.org/connect2id/nimbus-jose-jwt/src/master/SECURITY-CHANGELOG.txt]
 * [https://connect2id.com/blog/nimbus-jose-jwt-7-9]

Log4j : log4j : 1.2.17
SocketServer class that is vulnerable to deserialization of untrusted data:
 * https://issues.apache.org/jira/browse/LOG4J2-1863
 * [https://lists.apache.org/thread.html/84cc4266238e057b95eb95dfd8b29d46a2592e7672c12c92f68b2917%40%3Cannounce.apache.org%3E]
 * [https://bugzilla.redhat.com/show_bug.cgi?id=1785616]

Dynamic-link Library (DLL) Preloading:
 * [https://bz.apache.org/bugzilla/show_bug.cgi?id=50323]

apache-xerces : xercesImpl : 2.9.1
 * hash table collisions -> https://issues.apache.org/jira/browse/XERCESJ-1685
 * [https://mail-archives.apache.org/mod_mbox/xerces-j-dev/201410.mbox/%3cof3b40f5f7.e6552a8b-on85257d73.00699ed7-85257d73.006a9...@ca.ibm.com%3E]
 * [https://bugzilla.redhat.com/show_bug.cgi?id=1019176]

com.fasterxml.jackson.core : jackson-databind : 2.10.0
 * [https://github.com/FasterXML/jackson-databind/issues/2589]

commons-beanutils : commons-beanutils : 1.9.3
 * [http://www.rapid7.com/db/modules/exploit/multi/http/struts_code_exec_classloader]
 * https://issues.apache.org/jira/browse/BEANUTILS-463

commons-io : commons-io : 2.5
 * [https://github.com/apache/commons-io/pull/52]
 * https://issues.apache.org/jira/browse/IO-556
 * https://issues.apache.org/jira/browse/IO-559

io.netty : netty-all : 4.1.47.Final
 * [https://github.com/netty/netty/issues/10351]
 * [https://github.com/netty/netty/pull/10560]

org.apache.commons : commons-compress : 1.18
 * [https://commons.apache.org/proper/commons-compress/security-reports.html#Apache_Commons_Compress_Security_Vulnerabilities]

org.apache.hadoop : hadoop-hdfs : 2.7.4
 * [https://lists.apache.org/thread.html/rca4516b00b55b347905df45e5d0432186248223f30497db87aba8710@%3Cannounce.apache.org%3E]
 * [https://lists.apache.org/thread.html/caacbbba2dcc1105163f76f3dfee5fbd22e0417e0783212787086378@%3Cgeneral.hadoop.apache.org%3E]
 * [https://hadoop.apache.org/cve_list.html]
 * [https://www.openwall.com/lists/oss-security/2019/01/24/3]

org.apache.hadoop : hadoop-mapreduce-client-core : 2.7.4
 * [https://bugzilla.redhat.com/show_bug.cgi?id=1516399]
 * [https://lists.apache.org/thread.html/2e16689b44bdd1976b6368c143a4017fc7159d1f2d02a5d54fe9310f@%3Cgeneral.hadoop.apache.org%3E]

org.codehaus.jackson : jackson-mapper-asl : 1.9.13
 * [https://github.com/FasterXML/jackson-databind/issues/1599]
 * [https://blog.sonatype.com/jackson-databind-remote-code-execution]
 * [https://blog.sonatype.com/jackson-databind-the-end-of-the-blacklist]
 * [https://bugzilla.redhat.com/show_bug.cgi?id=CVE-2017-7525]
 * [https://access.redhat.com/security/cve/cve-2019-10172]
 * [https://bugzilla.redhat.com/show_bug.cgi?id=1715075]
 * [https://nvd.nist.gov/vuln/detail/CVE-2019-10172]

org.eclipse.jetty : jetty-http : 9.3.24.v20180605
 * [https://bugs.eclipse.org/bugs/show_bug.cgi?id=538096]

org.eclipse.jetty : jetty-webapp : 9.3.24.v20180605
 * [https://bugs.eclipse.org/bugs/show_bug.cgi?id=567921]
 * [https://github.com/eclipse/jetty.project/issues/5451]
 * [https://github.com/eclipse/jetty.project/security/advisories/GHSA-g3wg-6mcf-8jj6]

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34510) .foreachPartition command hangs when ran inside Python package but works when ran from Python file outside the package on EMR

2021-02-23 Thread Yuriy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuriy updated SPARK-34510:
--
Summary: .foreachPartition command hangs when ran inside Python package but 
works when ran from Python file outside the package on EMR  (was: 
.foreachPartition command hangs when ran inside Python package but works when 
ran from Python file outside the package)

> .foreachPartition command hangs when ran inside Python package but works when 
> ran from Python file outside the package on EMR
> -
>
> Key: SPARK-34510
> URL: https://issues.apache.org/jira/browse/SPARK-34510
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, PySpark
>Affects Versions: 3.0.0
>Reporter: Yuriy
>Priority: Minor
> Attachments: Code.zip
>
>
> I provided full description of the issue on Stack Overflow via the following 
> link https://stackoverflow.com/questions/66300313



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34510) .foreachPartition command hangs when ran inside Python package but works when ran from Python file outside the package

2021-02-23 Thread Yuriy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuriy updated SPARK-34510:
--
Attachment: Code.zip

> .foreachPartition command hangs when ran inside Python package but works when 
> ran from Python file outside the package
> --
>
> Key: SPARK-34510
> URL: https://issues.apache.org/jira/browse/SPARK-34510
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, PySpark
>Affects Versions: 3.0.0
>Reporter: Yuriy
>Priority: Minor
> Attachments: Code.zip
>
>
> I provided full description of the issue on Stack Overflow via the following 
> link https://stackoverflow.com/questions/66300313



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile

2021-02-23 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289200#comment-17289200
 ] 

Chao Sun commented on SPARK-33212:
--

Thanks for the report [~ouyangxc.zte]. Can you provide more details, such as 
error messages, stack traces, and steps to reproduce the issue?

> Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
> -
>
> Key: SPARK-33212
> URL: https://issues.apache.org/jira/browse/SPARK-33212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit, SQL, YARN
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.2.0
>
>
> Hadoop 3.x+ offers shaded client jars: hadoop-client-api and 
> hadoop-client-runtime, which shade 3rd party dependencies such as Guava, 
> protobuf, jetty etc. This Jira switches Spark to use these jars instead of 
> hadoop-common, hadoop-client etc. Benefits include:
>  * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer 
> versions of Hadoop have migrated to Guava 27.0+, and in order to resolve Guava 
> conflicts, Spark depends on Hadoop not leaking those dependencies.
>  * It makes the Spark/Hadoop dependency cleaner. Currently Spark uses both 
> client-side and server-side Hadoop APIs from modules such as hadoop-common, 
> hadoop-yarn-server-common etc. Moving to hadoop-client-api allows us to use only 
> the public/client API from the Hadoop side.
>  * It provides better isolation from Hadoop dependencies. In the future Spark can 
> evolve more easily without worrying about dependencies pulled in from the Hadoop 
> side (which used to be a lot).
> *There are some behavior changes introduced with this JIRA when people use 
> Spark compiled with Hadoop 3.x:*
> - Users now need to make sure the class path contains the `hadoop-client-api` and 
> `hadoop-client-runtime` jars when they deploy Spark with the 
> `hadoop-provided` option. In addition, it is highly recommended that they put 
> these two jars before other Hadoop jars on the class path. Otherwise, 
> conflicts such as those from Guava could happen if classes are loaded from the 
> other, non-shaded Hadoop jars.
> - Since the new shaded Hadoop clients no longer include 3rd-party 
> dependencies, users who used to depend on these now need to explicitly put 
> the jars on their class path.
> Ideally the above should go to release notes.
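
As a concrete illustration of the class-path guidance above, a hedged sketch; the jar locations are placeholders for wherever a given distribution keeps them, and in practice these properties are usually passed to spark-submit rather than set in code (the driver class path has to be fixed before the driver JVM starts):
{code:python}
from pyspark.sql import SparkSession

# Shaded client jars first, other Hadoop jars after them.
shaded_first = (
    "/opt/hadoop/client/hadoop-client-api-3.2.2.jar:"
    "/opt/hadoop/client/hadoop-client-runtime-3.2.2.jar:"
    "/opt/hadoop/share/hadoop/common/*"
)

spark = (
    SparkSession.builder
    .appName("shaded-hadoop-clients-demo")
    .config("spark.driver.extraClassPath", shaded_first)
    .config("spark.executor.extraClassPath", shaded_first)
    .getOrCreate()
)
{code}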



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34510) .foreachPartition command hangs when ran inside Python package but works when ran from Python file outside the package

2021-02-23 Thread Yuriy (Jira)
Yuriy created SPARK-34510:
-

 Summary: .foreachPartition command hangs when ran inside Python 
package but works when ran from Python file outside the package
 Key: SPARK-34510
 URL: https://issues.apache.org/jira/browse/SPARK-34510
 Project: Spark
  Issue Type: Bug
  Components: EC2, PySpark
Affects Versions: 3.0.0
Reporter: Yuriy


I provided full description of the issue on Stack Overflow via the following 
link https://stackoverflow.com/questions/66300313



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34509) Make dynamic allocation upscaling more progressive on K8S

2021-02-23 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289158#comment-17289158
 ] 

Attila Zsolt Piros commented on SPARK-34509:


I am working on this.

> Make dynamic allocation upscaling more progressive on K8S
> -
>
> Key: SPARK-34509
> URL: https://issues.apache.org/jira/browse/SPARK-34509
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.4, 2.4.7, 3.0.2, 3.2.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Currently, even a single late pod request stops upscaling. Since we have an 
> allocation batch size, it would be better to go up to that limit as soon as 
> possible (if serving pod requests is slow, we have to issue them as early as 
> possible).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34509) Make dynamic allocation upscaling more progressive on K8S

2021-02-23 Thread Attila Zsolt Piros (Jira)
Attila Zsolt Piros created SPARK-34509:
--

 Summary: Make dynamic allocation upscaling more progressive on K8S
 Key: SPARK-34509
 URL: https://issues.apache.org/jira/browse/SPARK-34509
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.0.2, 2.4.7, 2.3.4, 3.2.0
Reporter: Attila Zsolt Piros


Currently, even a single late pod request stops upscaling. Since we have an allocation 
batch size, it would be better to go up to that limit as soon as possible (if 
serving pod requests is slow, we have to issue them as early as possible).
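
For reference, a hedged configuration sketch of the setup this improvement is about; the values are illustrative only, not a recommendation:
{code:python}
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("k8s-dynamic-allocation-demo")
    .config("spark.dynamicAllocation.enabled", "true")
    # shuffle tracking instead of an external shuffle service on K8S
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .config("spark.dynamicAllocation.maxExecutors", "50")
    # pods requested per allocation round, and the delay between rounds;
    # this issue is about letting upscaling exhaust the batch size even
    # when some earlier pod requests are still pending
    .config("spark.kubernetes.allocation.batch.size", "5")
    .config("spark.kubernetes.allocation.batch.delay", "1s")
    .getOrCreate()
)
{code}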



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34508) skip HiveExternalCatalogVersionsSuite if network is down

2021-02-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289140#comment-17289140
 ] 

Apache Spark commented on SPARK-34508:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/31627

> skip HiveExternalCatalogVersionsSuite if network is down
> 
>
> Key: SPARK-34508
> URL: https://issues.apache.org/jira/browse/SPARK-34508
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34508) skip HiveExternalCatalogVersionsSuite if network is down

2021-02-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34508:


Assignee: (was: Apache Spark)

> skip HiveExternalCatalogVersionsSuite if network is down
> 
>
> Key: SPARK-34508
> URL: https://issues.apache.org/jira/browse/SPARK-34508
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34508) skip HiveExternalCatalogVersionsSuite if network is down

2021-02-23 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34508:


Assignee: Apache Spark

> skip HiveExternalCatalogVersionsSuite if network is down
> 
>
> Key: SPARK-34508
> URL: https://issues.apache.org/jira/browse/SPARK-34508
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34508) skip HiveExternalCatalogVersionsSuite if network is down

2021-02-23 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-34508:
---

 Summary: skip HiveExternalCatalogVersionsSuite if network is down
 Key: SPARK-34508
 URL: https://issues.apache.org/jira/browse/SPARK-34508
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 3.2.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34448) Binary logistic regression incorrectly computes the intercept and coefficients when data is not centered

2021-02-23 Thread Yakov Kerzhner (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289113#comment-17289113
 ] 

Yakov Kerzhner edited comment on SPARK-34448 at 2/23/21, 2:54 PM:
--

As I said in the description, I do not believe that the starting point should 
cause this bug; the minimizer should still drift to the proper minimum.  I said 
the fact that the log(odds) was made the starting point seems to suggest that 
whoever wrote the code believed that the intercept should be close to the 
log(odds), which is only true if the data is centered.  If I had to guess, I 
would guess that there is something in the objective function that pulls the 
intercept towards the log(odds).  This would be a bug, as the log(odds) is a 
good approximation for the intercept if and only if the data is centered.  For 
non-centered data, it is completely wrong to have the intercept equal (or be 
close to) the log(odds).  My test shows precisely this, that when the data is 
not centered, spark still returns an intercept equal to the log(odds) (test 
2.b, Intercept: -3.5428941035683303, log(odds): -3.542495168380248, correct 
intercept: -4).  Indeed, even for centered data, (test 1.b), it returns an 
intercept almost equal to the log(odds), (test 1.b. log(odds): 
-3.9876303002978997 Intercept: -3.987260922443554, correct intercept: -4).  So 
we need to dig into the objective function, and whether somewhere in there is a 
term that penalizes the intercept moving away from the log(odds).   If there is 
nothing there of this sort, then a step through of the minimization process 
should shed some clues as to why the intercept isnt budging from the initial 
value given.


was (Author: ykerzhner):
As I said in the description, I do not believe that the starting point should 
cause this bug; the minimizer should still drift to the proper minimum.  I said 
the fact that the log(odds) was made the starting point seems to suggest that 
whoever wrote the code believed that the intercept should be close to the 
log(odds), which is only true if the data is centered.  If I had to guess, I 
would guess that there is something in the objective function that pulls the 
intercept towards the log(odds).  This would be a bug, as the log(odds) is a 
good approximation for the intercept if and only if the data is centered.  For 
non-centered data, it is completely wrong to have the intercept equal (or be 
close to) the log(odds).  My test shows precisely this, that when the data is 
not centered, spark still returns an intercept equal to the log(odds) (test 
2.b, Intercept: -3.5428941035683303, log(odds): -3.542495168380248, correct 
intercept: -4).  Indeed, even for centered data, (test 1.b), it returns an 
intercept almost equal to the log(odds), (test 1.b. log(odds): 
-3.9876303002978997 Intercept: -3.987260922443554, correct intercept: -4).  So 
we need to dig into the objective function, and whether somewhere in there is a 
term that penalizes the intercept moving away from the log(odds). 

> Binary logistic regression incorrectly computes the intercept and 
> coefficients when data is not centered
> 
>
> Key: SPARK-34448
> URL: https://issues.apache.org/jira/browse/SPARK-34448
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Yakov Kerzhner
>Priority: Major
>  Labels: correctness
>
> I have written up a fairly detailed gist that includes code to reproduce the 
> bug, as well as the output of the code and some commentary:
> [https://gist.github.com/ykerzhner/51358780a6a4cc33266515f17bf98a96]
> To summarize: under certain conditions, the minimization that fits a binary 
> logistic regression contains a bug that pulls the intercept value towards the 
> log(odds) of the target data.  This is mathematically only correct when the 
> data comes from distributions with zero means.  In general, this gives 
> incorrect intercept values, and consequently incorrect coefficients as well.
> As I am not so familiar with the spark code base, I have not been able to 
> find this bug within the spark code itself.  A hint to this bug is here: 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L894-L904]
> based on the code, I don't believe that the features have zero means at this 
> point, and so this heuristic is incorrect.  But an incorrect starting point 
> does not explain this bug.  The minimizer should drift to the correct place.  
> I was not able to find the code of the actual objective function that is 
> being minimized.
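
For readers who want a quick check without the gist, a hedged, self-contained sketch of the experiment described above (this is not the gist's code): non-centered features with a known intercept, then the fitted intercept compared against the log(odds) of the labels:
{code:python}
import math
import numpy as np
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("lr-intercept-check").getOrCreate()

rng = np.random.default_rng(0)
n = 100000
x = rng.normal(loc=5.0, scale=1.0, size=n)      # non-zero mean feature
true_intercept, true_coef = -4.0, 0.5
p = 1.0 / (1.0 + np.exp(-(true_intercept + true_coef * x)))
y = (rng.uniform(size=n) < p).astype(float)

df = spark.createDataFrame(list(zip(x.tolist(), y.tolist())), ["x", "label"])
df = VectorAssembler(inputCols=["x"], outputCol="features").transform(df)

model = LogisticRegression(regParam=0.0, standardization=False).fit(df)
log_odds = math.log(y.mean() / (1.0 - y.mean()))
print("fitted intercept:", model.intercept,
      "log(odds):", log_odds,
      "true intercept:", true_intercept)
{code}
If the reported behavior holds, the fitted intercept lands near the log(odds) rather than near -4.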



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-

[jira] [Commented] (SPARK-34168) Support DPP in AQE When the join is Broadcast hash join before applying the AQE rules

2021-02-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289116#comment-17289116
 ] 

Apache Spark commented on SPARK-34168:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/31625

> Support DPP in AQE When the join is Broadcast hash join before applying the 
> AQE rules
> -
>
> Key: SPARK-34168
> URL: https://issues.apache.org/jira/browse/SPARK-34168
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Ke Jia
>Assignee: Ke Jia
>Priority: Major
> Fix For: 3.2.0
>
>
> Both AQE and DPP cannot be applied at the same time. This PR will enable AQE 
> and DPP when the join is Broadcast hash join at the beginning.
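
A hedged sketch of a query shape where the combination above matters (table names are placeholders; the configs shown are the standard AQE and DPP switches, with the actual interplay defined by the PR):
{code:python}
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("aqe-dpp-demo")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
    .getOrCreate()
)

# fact_sales is partitioned by date_id; dim_date is small enough to broadcast,
# so the join is planned as a broadcast hash join and the partition filter on
# the fact side can be pruned dynamically from the dimension filter.
result = spark.sql("""
    SELECT f.*
    FROM fact_sales f
    JOIN dim_date d
      ON f.date_id = d.date_id
    WHERE d.year = 2021
""")
result.explain()
{code}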



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34448) Binary logistic regression incorrectly computes the intercept and coefficients when data is not centered

2021-02-23 Thread Yakov Kerzhner (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289113#comment-17289113
 ] 

Yakov Kerzhner commented on SPARK-34448:


As I said in the description, I do not believe that the starting point should 
cause this bug; the minimizer should still drift to the proper minimum.  I said 
the fact that the log(odds) was made the starting point seems to suggest that 
whoever wrote the code believed that the intercept should be close to the 
log(odds), which is only true if the data is centered.  If I had to guess, I 
would guess that there is something in the objective function that pulls the 
intercept towards the log(odds).  This would be a bug, as the log(odds) is a 
good approximation for the intercept if and only if the data is centered.  For 
non-centered data, it is completely wrong to have the intercept equal (or be 
close to) the log(odds).  My test shows precisely this, that when the data is 
not centered, spark still returns an intercept equal to the log(odds) (test 
2.b, Intercept: -3.5428941035683303, log(odds): -3.542495168380248, correct 
intercept: -4).  Indeed, even for centered data, (test 1.b), it returns an 
intercept almost equal to the log(odds), (test 1.b. log(odds): 
-3.9876303002978997 Intercept: -3.987260922443554, correct intercept: -4).  So 
we need to dig into the objective function, and whether somewhere in there is a 
term that penalizes the intercept moving away from the log(odds). 

> Binary logistic regression incorrectly computes the intercept and 
> coefficients when data is not centered
> 
>
> Key: SPARK-34448
> URL: https://issues.apache.org/jira/browse/SPARK-34448
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Yakov Kerzhner
>Priority: Major
>  Labels: correctness
>
> I have written up a fairly detailed gist that includes code to reproduce the 
> bug, as well as the output of the code and some commentary:
> [https://gist.github.com/ykerzhner/51358780a6a4cc33266515f17bf98a96]
> To summarize: under certain conditions, the minimization that fits a binary 
> logistic regression contains a bug that pulls the intercept value towards the 
> log(odds) of the target data.  This is mathematically only correct when the 
> data comes from distributions with zero means.  In general, this gives 
> incorrect intercept values, and consequently incorrect coefficients as well.
> As I am not so familiar with the spark code base, I have not been able to 
> find this bug within the spark code itself.  A hint to this bug is here: 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L894-L904]
> based on the code, I don't believe that the features have zero means at this 
> point, and so this heuristic is incorrect.  But an incorrect starting point 
> does not explain this bug.  The minimizer should drift to the correct place.  
> I was not able to find the code of the actual objective function that is 
> being minimized.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34507) Spark artefacts built against Scala 2.13 incorrectly depend on Scala 2.12

2021-02-23 Thread Guillaume Martres (Jira)
Guillaume Martres created SPARK-34507:
-

 Summary: Spark artefacts built against Scala 2.13 incorrectly 
depend on Scala 2.12
 Key: SPARK-34507
 URL: https://issues.apache.org/jira/browse/SPARK-34507
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Affects Versions: 3.2.0
Reporter: Guillaume Martres


Snapshots of Spark 3.2 built against Scala 2.13 are available at 
[https://repository.apache.org/content/repositories/snapshots/org/apache/spark/,]
 but they seem to depend on Scala 2.12. Specifically if I look at 
[https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-parent_2.13/3.2.0-SNAPSHOT/spark-parent_2.13-3.2.0-20210223.010629-29.pom]
 I see:
{code:java}
<scala.version>2.12.10</scala.version>
<scala.binary.version>2.13</scala.binary.version>
{code}
It looks like 
[https://github.com/apache/spark/blob/8f994cbb4a18558c2e81516ef1e339d9c8fa0d41/dev/change-scala-version.sh#L65]
 needs to be updated to also change the `scala.version` and not just the 
`scala.binary.version`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13

2021-02-23 Thread Guillaume Martres (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289104#comment-17289104
 ] 

Guillaume Martres commented on SPARK-25075:
---

I've opened https://issues.apache.org/jira/browse/SPARK-34507.

> Build and test Spark against Scala 2.13
> ---
>
> Key: SPARK-25075
> URL: https://issues.apache.org/jira/browse/SPARK-25075
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, MLlib, Project Infra, Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Guillaume Massé
>Priority: Major
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.13 milestone.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


