[jira] [Commented] (SPARK-31891) `ALTER TABLE multipartIdentifier RECOVER PARTITIONS` should drop partition if partition specific location is not exist any more
[ https://issues.apache.org/jira/browse/SPARK-31891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289745#comment-17289745 ] Apache Spark commented on SPARK-31891: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/31633 > `ALTER TABLE multipartIdentifier RECOVER PARTITIONS` should drop partition if > partition specific location is not exist any more > --- > > Key: SPARK-31891 > URL: https://issues.apache.org/jira/browse/SPARK-31891 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Zhu, Lipeng >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.2.0 > > > Currently, when executing > {code:sql} > ALTER TABLE multipartIdentifier RECOVER PARTITIONS > {code} > Spark automatically adds partitions based on the directory structure under > the table root location. > Spark should add one more step: check whether each existing partition's > specific location still exists and, if not, drop that partition. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
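To make the proposed behavior concrete, here is a small standalone sketch in Python (the name `reconcile_partitions` and the partition-spec strings are hypothetical illustrations, not Spark's actual implementation): `RECOVER PARTITIONS` today effectively computes only the `to_add` set from the directories under the table root, and the issue proposes also computing a `to_drop` set for partitions whose specific location no longer exists.

```python
def reconcile_partitions(in_catalog, on_disk):
    """Return (to_add, to_drop) given the partition specs registered in the
    metastore and the partition directories actually found under the table root."""
    to_add = set(on_disk) - set(in_catalog)   # on disk but not registered -> ADD PARTITION
    to_drop = set(in_catalog) - set(on_disk)  # registered but location gone -> the proposed DROP PARTITION step
    return to_add, to_drop

# Example: dt=2021-01-01 was deleted from storage, dt=2021-01-03 is new on disk.
to_add, to_drop = reconcile_partitions(
    in_catalog={"dt=2021-01-01", "dt=2021-01-02"},
    on_disk={"dt=2021-01-02", "dt=2021-01-03"},
)
```

The first set is what recovery already handles; the second is the extra reconciliation step this ticket asks for.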
[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
[ https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289728#comment-17289728 ] Xiaochen Ouyang commented on SPARK-33212: - Thanks for your reply [~csun]! Submit command: spark-submit --master yarn --deploy-mode client --class org.apache.spark.examples.SparkPi /opt/spark/examples/jars/spark-examples*.jar In ApplicationMaster.scala /** Add the Yarn IP filter that is required for properly securing the UI. */ private def addAmIpFilter(driver: Option[RpcEndpointRef]) = { val proxyBase = System.getenv(ApplicationConstants.APPLICATION_WEB_PROXY_BASE_ENV) {color:#de350b}val amFilter = "org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter"{color} val params = client.getAmIpFilterParams(yarnConf, proxyBase) driver match { case Some(d) => d.send(AddWebUIFilter(amFilter, params.toMap, proxyBase)) case None => System.setProperty("spark.ui.filters", amFilter) params.foreach \{ case (k, v) => System.setProperty(s"spark.$amFilter.param.$k", v) } } } We need to load hadoop-yarn-server-web-proxy.jar into the driver classloader when submitting a Spark-on-YARN application. Do you mean that we should copy hadoop-yarn-server-web-proxy.jar to spark/jars? 1. AMIpFilter ClassNotFoundException: 2021-02-24 14:52:56,617 INFO org.apache.spark.storage.BlockManager: Initialized BlockManager: BlockManagerId(driver, spark-worker-2, 38399, None) 2021-02-24 14:52:56,704 INFO org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend: Add WebUI Filter. 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> spark-worker-1,spark-worker-2, PROXY_URI_BASES -> http://spark-worker-1:8088/proxy/application_1613961532167_0098,http://spark-worker-2:8088/proxy/application_1613961532167_0098, RM_HA_URLS -> spark-worker-1:8088,spark-worker-2:8088), /proxy/application_1613961532167_0098 2021-02-24 14:52:56,708 INFO org.apache.spark.ui.JettyUtils: Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /jobs, /jobs/json, /jobs/job, /jobs/job/json, /stages, /stages/json, /stages/stage, /stages/stage/json, /stages/pool, /stages/pool/json, /storage, /storage/json, /storage/rdd, /storage/rdd/json, /environment, /environment/json, /executors, /executors/json, /executors/threadDump, /executors/threadDump/json, /logLevel, /static, /, /api, /jobs/job/kill, /stages/stage/kill. 2021-02-24 14:52:56,722 WARN org.spark_project.jetty.servlet.BaseHolder: java.lang.ClassNotFoundException: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter at java.net.URLClassLoader.findClass(URLClassLoader.java:382) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at org.spark_project.jetty.util.Loader.loadClass(Loader.java:86) at org.spark_project.jetty.servlet.BaseHolder.doStart(BaseHolder.java:95) at org.spark_project.jetty.servlet.FilterHolder.doStart(FilterHolder.java:92) at org.spark_project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68) at org.spark_project.jetty.servlet.ServletHandler.initialize(ServletHandler.java:872) at org.spark_project.jetty.servlet.ServletHandler.updateMappings(ServletHandler.java:1596) at org.spark_project.jetty.servlet.ServletHandler.setFilterMappings(ServletHandler.java:1659) at org.spark_project.jetty.servlet.ServletHandler.addFilterMapping(ServletHandler.java:1297) at org.spark_project.jetty.servlet.ServletHandler.addFilterWithMapping(ServletHandler.java:1145) at 
org.spark_project.jetty.servlet.ServletContextHandler.addFilter(ServletContextHandler.java:448) at org.apache.spark.ui.JettyUtils$$anonfun$addFilters$1$$anonfun$apply$1.apply(JettyUtils.scala:325) at org.apache.spark.ui.JettyUtils$$anonfun$addFilters$1$$anonfun$apply$1.apply(JettyUtils.scala:294) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at org.apache.spark.ui.JettyUtils$$anonfun$addFilters$1.apply(JettyUtils.scala:294) at org.apache.spark.ui.JettyUtils$$anonfun$addFilters$1.apply(JettyUtils.scala:293) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.ui.JettyUtils$.addFilters(JettyUtils.scala:293) at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$$anonfun$org$apache$spark$scheduler$cluster$YarnSchedulerBackend$$addWebUIFilter$3.apply(YarnSchedulerBackend.scala:176) at org.apache.spark.scheduler.cluster.YarnSchedulerBackend$$anonfun$org$apache$spark$scheduler$cluster$YarnSchedulerBackend$$addWebUIFilter$3.apply(YarnSchedulerBackend.scala:176) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.cluster.YarnSchedulerBackend.org$apache$spark$scheduler$cluster$YarnSchedulerB
[jira] [Updated] (SPARK-34515) Fix NPE if InSet contains null value during getPartitionsByFilter
[ https://issues.apache.org/jira/browse/SPARK-34515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-34515: - Affects Version/s: 3.1.2 > Fix NPE if InSet contains null value during getPartitionsByFilter > - > > Key: SPARK-34515 > URL: https://issues.apache.org/jira/browse/SPARK-34515 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.1.2 >Reporter: ulysses you >Priority: Minor > > Spark will convert InSet to `>= and <=` if its number of values exceeds > `spark.sql.hive.metastorePartitionPruningInSetThreshold` during partition > pruning. In this case, if the values contain a null, we get the following > exception: > > {code:java} > java.lang.NullPointerException > at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:1389) > at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:50) > at > scala.math.LowPriorityOrderingImplicits$$anon$3.compare(Ordering.scala:153) > at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355) > at java.util.TimSort.sort(TimSort.java:220) > at java.util.Arrays.sort(Arrays.java:1438) > at scala.collection.SeqLike.sorted(SeqLike.scala:659) > at scala.collection.SeqLike.sorted$(SeqLike.scala:647) > at scala.collection.AbstractSeq.sorted(Seq.scala:45) > at org.apache.spark.sql.hive.client.Shim_v0_13.convert$1(HiveShim.scala:772) > at > org.apache.spark.sql.hive.client.Shim_v0_13.$anonfun$convertFilters$4(HiveShim.scala:826) > at scala.collection.immutable.Stream.flatMap(Stream.scala:489) > at > org.apache.spark.sql.hive.client.Shim_v0_13.convertFilters(HiveShim.scala:826) > at > org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:848) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitionsByFilter$1(HiveClientImpl.scala:750) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: 
issues-h...@spark.apache.org
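The stack trace above bottoms out in TimSort: the Hive shim sorts the InSet values before rewriting them as a `>= and <=` range, and a null among the values breaks the comparison. A standalone Python analog (a hypothetical sketch, not Spark code; Python raises TypeError where the JVM raises NullPointerException) shows the failure mode and a null-filtering fix:

```python
# Values being rewritten into a range filter; one of them is null (None).
partition_values = ["b", None, "a"]

# Sorting with a null present fails, mirroring the NPE raised inside TimSort above.
try:
    sorted(partition_values)
    comparison_failed = False
except TypeError:  # None is not orderable against str in Python
    comparison_failed = True

# Defensive fix along the lines the issue implies: drop nulls before sorting,
# since a null can never satisfy a >= / <= range predicate anyway.
sorted_non_null = sorted(v for v in partition_values if v is not None)
```

Filtering nulls before building the range is the shape of fix the ticket title suggests.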
[jira] [Resolved] (SPARK-34152) CreateViewStatement.child should be a real child
[ https://issues.apache.org/jira/browse/SPARK-34152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-34152. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31273 [https://github.com/apache/spark/pull/31273] > CreateViewStatement.child should be a real child > > > Key: SPARK-34152 > URL: https://issues.apache.org/jira/browse/SPARK-34152 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > Fix For: 3.2.0 > > > Similar to `CreateTableAsSelectStatement`, the input query of > `CreateViewStatement` should be a child and get analyzed during the analysis > phase. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34414) OptimizeMetadataOnlyQuery should only apply for deterministic filters
[ https://issues.apache.org/jira/browse/SPARK-34414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yesheng Ma resolved SPARK-34414. Resolution: Invalid > OptimizeMetadataOnlyQuery should only apply for deterministic filters > - > > Key: SPARK-34414 > URL: https://issues.apache.org/jira/browse/SPARK-34414 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Yesheng Ma >Priority: Major > > Similar to FileSourcePartitionPruning, OptimizeMetadataOnlyQuery should only > apply for deterministic filters. If filters are non-deterministic, they have > to be evaluated against partitions separately. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-34516) Spark 3.0.1 encounter parquet PageHerder IO issue
[ https://issues.apache.org/jira/browse/SPARK-34516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289680#comment-17289680 ] angerszhu edited comment on SPARK-34516 at 2/24/21, 6:34 AM: - For this error, I found some related issues: [https://github.com/trinodb/trino/issues/2256] (not so clear) https://issues.apache.org/jira/browse/DRILL-3871 (seems to be the same issue, caused by the Parquet reader's logic) https://issues.apache.org/jira/browse/PARQUET-400 (looks like it has been fixed in the Parquet version used by Spark 3.0.1) Checking the Parquet code for this part, it just decodes the PageHeader from a data stream. Gentle ping [~lian cheng] [~viirya] [~maxgekk] [~dongjoon] I am not sure if it is related to Spark's vectorized Parquet reader. Can you take a look and give some advice? was (Author: angerszhuuu): For this error, I found some related issues: [https://github.com/trinodb/trino/issues/2256] (not so clear) https://issues.apache.org/jira/browse/DRILL-3871 (seems to be the same issue, caused by the Parquet reader's logic) https://issues.apache.org/jira/browse/PARQUET-400 (looks like it has been fixed in the Parquet version used by Spark 3.0.1) Gentle ping [~lian cheng] [~viirya] [~maxgekk] [~dongjoon] I am not sure if it is related to Spark's vectorized Parquet reader. Can you take a look and give some advice? > Spark 3.0.1 encounter parquet PageHerder IO issue > - > > Key: SPARK-34516 > URL: https://issues.apache.org/jira/browse/SPARK-34516 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1 >Reporter: angerszhu >Priority: Major > > {code:java} > Caused by: java.io.IOException: can not read class > org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' > was not found in serialized data! 
Struct: > org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@42a9002d > at org.apache.parquet.format.Util.read(Util.java:216) > at org.apache.parquet.format.Util.readPageHeader(Util.java:65) > at > org.apache.parquet.hadoop.ParquetFileReader$WorkaroundChunk.readPageHeader(ParquetFileReader.java:1064) > at > org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:950) > at > org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:807) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:313) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:268) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:491) > at > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-34516) Spark 3.0.1 encounter parquet PageHerder IO issue
[ https://issues.apache.org/jira/browse/SPARK-34516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289684#comment-17289684 ] angerszhu edited comment on SPARK-34516 at 2/24/21, 6:32 AM: - [~maropu] I will sort it out after desensitization and update it later. was (Author: angerszhuuu): [~maropu] I will sort it out after desensitization and update it later. > Spark 3.0.1 encounter parquet PageHerder IO issue > - > > Key: SPARK-34516 > URL: https://issues.apache.org/jira/browse/SPARK-34516 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1 >Reporter: angerszhu >Priority: Major > > {code:java} > Caused by: java.io.IOException: can not read class > org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' > was not found in serialized data! Struct: > org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@42a9002d > at org.apache.parquet.format.Util.read(Util.java:216) > at org.apache.parquet.format.Util.readPageHeader(Util.java:65) > at > org.apache.parquet.hadoop.ParquetFileReader$WorkaroundChunk.readPageHeader(ParquetFileReader.java:1064) > at > org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:950) > at > org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:807) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:313) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:268) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:491) > at > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34516) Spark 3.0.1 encounter parquet PageHerder IO issue
[ https://issues.apache.org/jira/browse/SPARK-34516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289684#comment-17289684 ] angerszhu commented on SPARK-34516: --- [~maropu] I will sort it out after desensitization and update it later. > Spark 3.0.1 encounter parquet PageHerder IO issue > - > > Key: SPARK-34516 > URL: https://issues.apache.org/jira/browse/SPARK-34516 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1 >Reporter: angerszhu >Priority: Major > > {code:java} > Caused by: java.io.IOException: can not read class > org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' > was not found in serialized data! Struct: > org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@42a9002d > at org.apache.parquet.format.Util.read(Util.java:216) > at org.apache.parquet.format.Util.readPageHeader(Util.java:65) > at > org.apache.parquet.hadoop.ParquetFileReader$WorkaroundChunk.readPageHeader(ParquetFileReader.java:1064) > at > org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:950) > at > org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:807) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:313) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:268) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173) > at > 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:491) > at > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34516) Spark 3.0.1 encounter parquet PageHerder IO issue
[ https://issues.apache.org/jira/browse/SPARK-34516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289680#comment-17289680 ] angerszhu commented on SPARK-34516: --- For this error, I found some related issues: [https://github.com/trinodb/trino/issues/2256] (not so clear) https://issues.apache.org/jira/browse/DRILL-3871 (seems to be the same issue, caused by the Parquet reader's logic) https://issues.apache.org/jira/browse/PARQUET-400 (looks like it has been fixed in the Parquet version used by Spark 3.0.1) Gentle ping [~lian cheng] [~viirya] [~maxgekk] [~dongjoon] I am not sure if it is related to Spark's vectorized Parquet reader. Can you take a look and give some advice? > Spark 3.0.1 encounter parquet PageHerder IO issue > - > > Key: SPARK-34516 > URL: https://issues.apache.org/jira/browse/SPARK-34516 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1 >Reporter: angerszhu >Priority: Major > > {code:java} > Caused by: java.io.IOException: can not read class > org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' > was not found in serialized data! 
Struct: > org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@42a9002d > at org.apache.parquet.format.Util.read(Util.java:216) > at org.apache.parquet.format.Util.readPageHeader(Util.java:65) > at > org.apache.parquet.hadoop.ParquetFileReader$WorkaroundChunk.readPageHeader(ParquetFileReader.java:1064) > at > org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:950) > at > org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:807) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:313) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:268) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:491) > at > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34516) Spark 3.0.1 encounter parquet PageHerder IO issue
[ https://issues.apache.org/jira/browse/SPARK-34516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289677#comment-17289677 ] Takeshi Yamamuro commented on SPARK-34516: -- What's a query to reproduce this issue? > Spark 3.0.1 encounter parquet PageHerder IO issue > - > > Key: SPARK-34516 > URL: https://issues.apache.org/jira/browse/SPARK-34516 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1 >Reporter: angerszhu >Priority: Major > > {code:java} > Caused by: java.io.IOException: can not read class > org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' > was not found in serialized data! Struct: > org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@42a9002d > at org.apache.parquet.format.Util.read(Util.java:216) > at org.apache.parquet.format.Util.readPageHeader(Util.java:65) > at > org.apache.parquet.hadoop.ParquetFileReader$WorkaroundChunk.readPageHeader(ParquetFileReader.java:1064) > at > org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:950) > at > org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:807) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:313) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:268) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173) > at > 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:491) > at > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34516) Spark 3.0.1 encounter parquet PageHerder IO issue
angerszhu created SPARK-34516: - Summary: Spark 3.0.1 encounter parquet PageHerder IO issue Key: SPARK-34516 URL: https://issues.apache.org/jira/browse/SPARK-34516 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.1 Reporter: angerszhu {code:java} Caused by: java.io.IOException: can not read class org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' was not found in serialized data! Struct: org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@42a9002d at org.apache.parquet.format.Util.read(Util.java:216) at org.apache.parquet.format.Util.readPageHeader(Util.java:65) at org.apache.parquet.hadoop.ParquetFileReader$WorkaroundChunk.readPageHeader(ParquetFileReader.java:1064) at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:950) at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:807) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:313) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:268) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171) at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:491) at {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, 
e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34494) Move data source options from Python and Scala into a single page.
[ https://issues.apache.org/jira/browse/SPARK-34494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-34494: - Affects Version/s: (was: 3.0.2) 3.2.0 > Move data source options from Python and Scala into a single page. > -- > > Key: SPARK-34494 > URL: https://issues.apache.org/jira/browse/SPARK-34494 > Project: Spark > Issue Type: Sub-task > Components: docs >Affects Versions: 3.2.0 >Reporter: Haejoon Lee >Priority: Major > > Refer to https://issues.apache.org/jira/browse/SPARK-34491 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34493) Create "TEXT Files" page for Data Source documents.
[ https://issues.apache.org/jira/browse/SPARK-34493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-34493: - Affects Version/s: (was: 3.0.2) 3.2.0 > Create "TEXT Files" page for Data Source documents. > --- > > Key: SPARK-34493 > URL: https://issues.apache.org/jira/browse/SPARK-34493 > Project: Spark > Issue Type: Sub-task > Components: docs >Affects Versions: 3.2.0 >Reporter: Haejoon Lee >Priority: Major > > Adding "TEXT Files" page to [Data Sources > documents|https://spark.apache.org/docs/latest/sql-data-sources.html#data-sources] > which is missing now. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34492) Create "CSV Files" page for Data Source documents.
[ https://issues.apache.org/jira/browse/SPARK-34492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-34492: - Affects Version/s: (was: 3.0.2) 3.2.0 > Create "CSV Files" page for Data Source documents. > -- > > Key: SPARK-34492 > URL: https://issues.apache.org/jira/browse/SPARK-34492 > Project: Spark > Issue Type: Sub-task > Components: docs >Affects Versions: 3.2.0 >Reporter: Haejoon Lee >Priority: Major > > Adding "CSV Files" page to [Data Sources > documents|https://spark.apache.org/docs/latest/sql-data-sources.html#data-sources] > which is missing now. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34246) New type coercion syntax rules in ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-34246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-34246. Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31349 [https://github.com/apache/spark/pull/31349] > New type coercion syntax rules in ANSI mode > --- > > Key: SPARK-34246 > URL: https://issues.apache.org/jira/browse/SPARK-34246 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.2 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.2.0 > > > Add new implicit cast syntax rules in ANSI mode. > In Spark ANSI mode, the type coercion rules are based on the type precedence > lists of the input data types. > As per the section "Type precedence list determination" of "ISO/IEC > 9075-2:2011 > Information technology — Database languages - SQL — Part 2: Foundation > (SQL/Foundation)", the type precedence lists of primitive > data types are as follows: > * Byte: Byte, Short, Int, Long, Decimal, Float, Double > * Short: Short, Int, Long, Decimal, Float, Double > * Int: Int, Long, Decimal, Float, Double > * Long: Long, Decimal, Float, Double > * Decimal: Any wider Numeric type > * Float: Float, Double > * Double: Double > * String: String > * Date: Date, Timestamp > * Timestamp: Timestamp > * Binary: Binary > * Boolean: Boolean > * Interval: Interval > As for complex data types, Spark will determine the precedence list > recursively based on their sub-types. > With the definition of the type precedence list, the general type coercion > rules are as follows: > * Data type S is allowed to be implicitly cast as type T iff T is in the > precedence list of S > * Comparison is allowed iff the data type precedence list of both sides has > at least one common element. When evaluating the comparison, Spark casts both > sides as the tightest common data type of their precedence lists. 
> * There should be at least one common data type among all the children's > precedence lists for the following operators. The data type of the operator > is the tightest common data type among those precedence lists. > {code:java} > In > Except(odd) > Intersect > Greatest > Least > Union > If > CaseWhen > CreateArray > Array Concat > Sequence > MapConcat > CreateMap > {code} > * For complex types (struct, array, map), Spark recursively looks into the > element type and applies the rules above. If the element nullability is > converted from true to false, a runtime null check is added to the elements. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
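The precedence-list rules above can be modeled in a few lines. The following is an illustrative Python sketch, not Spark's actual Scala implementation: each type maps to its precedence list, implicit casting is a membership check, and the tightest common type is the first entry of one side's list that the other side can also reach. The type names are transcribed from the description; `Decimal`'s open-ended "any wider numeric type" entry is omitted for simplicity.

```python
# Illustrative model of ANSI-mode type coercion via precedence lists.
# Not Spark's actual code; type names follow the issue description.
PRECEDENCE = {
    "byte":      ["byte", "short", "int", "long", "decimal", "float", "double"],
    "short":     ["short", "int", "long", "decimal", "float", "double"],
    "int":       ["int", "long", "decimal", "float", "double"],
    "long":      ["long", "decimal", "float", "double"],
    "float":     ["float", "double"],
    "double":    ["double"],
    "string":    ["string"],
    "date":      ["date", "timestamp"],
    "timestamp": ["timestamp"],
}

def can_implicitly_cast(s: str, t: str) -> bool:
    """S may be implicitly cast to T iff T is in S's precedence list."""
    return t in PRECEDENCE[s]

def tightest_common_type(a: str, b: str):
    """First type in A's precedence list reachable from B, or None.

    Models both the comparison rule (some common element must exist)
    and the result-type rule (the tightest such element is chosen).
    """
    for t in PRECEDENCE[a]:
        if t in PRECEDENCE[b]:
            return t
    return None
```

For example, comparing an `int` column with a `float` column coerces both sides to `float`, while `byte` and `date` share no common element, so the comparison is rejected.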
[jira] [Updated] (SPARK-34497) JDBC connection provider is not removing kerberos credentials from JVM security context
[ https://issues.apache.org/jira/browse/SPARK-34497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-34497: - Affects Version/s: 3.2.0 > JDBC connection provider is not removing kerberos credentials from JVM > security context > --- > > Key: SPARK-34497 > URL: https://issues.apache.org/jira/browse/SPARK-34497 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.0, 3.2.0 >Reporter: Gabor Somogyi >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34497) JDBC connection provider is not removing kerberos credentials from JVM security context
[ https://issues.apache.org/jira/browse/SPARK-34497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-34497: - Affects Version/s: (was: 3.1.0) 3.1.2 > JDBC connection provider is not removing kerberos credentials from JVM > security context > --- > > Key: SPARK-34497 > URL: https://issues.apache.org/jira/browse/SPARK-34497 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.2.0, 3.1.2 >Reporter: Gabor Somogyi >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34497) JDBC connection provider is not removing kerberos credentials from JVM security context
[ https://issues.apache.org/jira/browse/SPARK-34497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289669#comment-17289669 ] Takeshi Yamamuro commented on SPARK-34497: -- Please fill the description. > JDBC connection provider is not removing kerberos credentials from JVM > security context > --- > > Key: SPARK-34497 > URL: https://issues.apache.org/jira/browse/SPARK-34497 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.0 >Reporter: Gabor Somogyi >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34515) Fix NPE if InSet contains null value during getPartitionsByFilter
[ https://issues.apache.org/jira/browse/SPARK-34515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289667#comment-17289667 ] Apache Spark commented on SPARK-34515: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/31632 > Fix NPE if InSet contains null value during getPartitionsByFilter > - > > Key: SPARK-34515 > URL: https://issues.apache.org/jira/browse/SPARK-34515 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: ulysses you >Priority: Minor > > Spark will convert InSet to `>= and <=` if its number of values exceeds > `spark.sql.hive.metastorePartitionPruningInSetThreshold` during partition > pruning. In that case, if the values contain a null, we get the following > exception > > {code:java} > java.lang.NullPointerException > at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:1389) > at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:50) > at > scala.math.LowPriorityOrderingImplicits$$anon$3.compare(Ordering.scala:153) > at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355) > at java.util.TimSort.sort(TimSort.java:220) > at java.util.Arrays.sort(Arrays.java:1438) > at scala.collection.SeqLike.sorted(SeqLike.scala:659) > at scala.collection.SeqLike.sorted$(SeqLike.scala:647) > at scala.collection.AbstractSeq.sorted(Seq.scala:45) > at org.apache.spark.sql.hive.client.Shim_v0_13.convert$1(HiveShim.scala:772) > at > org.apache.spark.sql.hive.client.Shim_v0_13.$anonfun$convertFilters$4(HiveShim.scala:826) > at scala.collection.immutable.Stream.flatMap(Stream.scala:489) > at > org.apache.spark.sql.hive.client.Shim_v0_13.convertFilters(HiveShim.scala:826) > at > org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:848) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitionsByFilter$1(HiveClientImpl.scala:750) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
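The failure is easy to model outside Spark: deriving the `>= and <=` bounds requires sorting the IN-set values, and a null in the set breaks the comparison, with Python's TypeError standing in for the NullPointerException in the stack trace above. The sketch below illustrates the fix direction under that analogy; the helper name is hypothetical and this is not Spark's actual code. Nulls are dropped before computing the bounds, which is safe because a null in an IN list can never match a partition value.

```python
def inset_to_range_filter(values):
    """Convert a large IN-set into (min, max) range-filter bounds.

    Sorting a mixed list that contains None raises TypeError in
    Python, analogous to the NPE from UTF8String.compareTo during
    sorting in HiveShim, so nulls are filtered out first.
    """
    non_null = [v for v in values if v is not None]
    if not non_null:
        return None  # nothing left to prune on
    ordered = sorted(non_null)
    return (ordered[0], ordered[-1])  # lower and upper bounds
```

Without the filtering step, `sorted(["a", None, "c"])` fails exactly where the stack trace shows Spark failing.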
[jira] [Assigned] (SPARK-34515) Fix NPE if InSet contains null value during getPartitionsByFilter
[ https://issues.apache.org/jira/browse/SPARK-34515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34515: Assignee: (was: Apache Spark) > Fix NPE if InSet contains null value during getPartitionsByFilter > - > > Key: SPARK-34515 > URL: https://issues.apache.org/jira/browse/SPARK-34515 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: ulysses you >Priority: Minor > > Spark will convert InSet to `>= and <=` if its number of values exceeds > `spark.sql.hive.metastorePartitionPruningInSetThreshold` during partition > pruning. In that case, if the values contain a null, we get the following > exception > > {code:java} > java.lang.NullPointerException > at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:1389) > at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:50) > at > scala.math.LowPriorityOrderingImplicits$$anon$3.compare(Ordering.scala:153) > at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355) > at java.util.TimSort.sort(TimSort.java:220) > at java.util.Arrays.sort(Arrays.java:1438) > at scala.collection.SeqLike.sorted(SeqLike.scala:659) > at scala.collection.SeqLike.sorted$(SeqLike.scala:647) > at scala.collection.AbstractSeq.sorted(Seq.scala:45) > at org.apache.spark.sql.hive.client.Shim_v0_13.convert$1(HiveShim.scala:772) > at > org.apache.spark.sql.hive.client.Shim_v0_13.$anonfun$convertFilters$4(HiveShim.scala:826) > at scala.collection.immutable.Stream.flatMap(Stream.scala:489) > at > org.apache.spark.sql.hive.client.Shim_v0_13.convertFilters(HiveShim.scala:826) > at > org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:848) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitionsByFilter$1(HiveClientImpl.scala:750) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34515) Fix NPE if InSet contains null value during getPartitionsByFilter
[ https://issues.apache.org/jira/browse/SPARK-34515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34515: Assignee: Apache Spark > Fix NPE if InSet contains null value during getPartitionsByFilter > - > > Key: SPARK-34515 > URL: https://issues.apache.org/jira/browse/SPARK-34515 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: ulysses you >Assignee: Apache Spark >Priority: Minor > > Spark will convert InSet to `>= and <=` if its number of values exceeds > `spark.sql.hive.metastorePartitionPruningInSetThreshold` during partition > pruning. In that case, if the values contain a null, we get the following > exception > > {code:java} > java.lang.NullPointerException > at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:1389) > at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:50) > at > scala.math.LowPriorityOrderingImplicits$$anon$3.compare(Ordering.scala:153) > at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355) > at java.util.TimSort.sort(TimSort.java:220) > at java.util.Arrays.sort(Arrays.java:1438) > at scala.collection.SeqLike.sorted(SeqLike.scala:659) > at scala.collection.SeqLike.sorted$(SeqLike.scala:647) > at scala.collection.AbstractSeq.sorted(Seq.scala:45) > at org.apache.spark.sql.hive.client.Shim_v0_13.convert$1(HiveShim.scala:772) > at > org.apache.spark.sql.hive.client.Shim_v0_13.$anonfun$convertFilters$4(HiveShim.scala:826) > at scala.collection.immutable.Stream.flatMap(Stream.scala:489) > at > org.apache.spark.sql.hive.client.Shim_v0_13.convertFilters(HiveShim.scala:826) > at > org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:848) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitionsByFilter$1(HiveClientImpl.scala:750) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34290) Support v2 TRUNCATE TABLE
[ https://issues.apache.org/jira/browse/SPARK-34290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-34290. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31605 [https://github.com/apache/spark/pull/31605] > Support v2 TRUNCATE TABLE > - > > Key: SPARK-34290 > URL: https://issues.apache.org/jira/browse/SPARK-34290 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.2.0 > > > Need to implement TRUNCATE TABLE for DSv2 tables similarly to v1 > implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34290) Support v2 TRUNCATE TABLE
[ https://issues.apache.org/jira/browse/SPARK-34290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-34290: --- Assignee: Maxim Gekk > Support v2 TRUNCATE TABLE > - > > Key: SPARK-34290 > URL: https://issues.apache.org/jira/browse/SPARK-34290 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > > Need to implement TRUNCATE TABLE for DSv2 tables similarly to v1 > implementation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34245) Master may not remove the finished executor when Worker fails to send ExecutorStateChanged
[ https://issues.apache.org/jira/browse/SPARK-34245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-34245. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31348 [https://github.com/apache/spark/pull/31348] > Master may not remove the finished executor when Worker fails to send > ExecutorStateChanged > -- > > Key: SPARK-34245 > URL: https://issues.apache.org/jira/browse/SPARK-34245 > Project: Spark > Issue Type: Improvement > Components: Deploy, Spark Core >Affects Versions: 2.4.7, 3.0.1, 3.2.0, 3.1.1 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Fix For: 3.2.0 > > > If the Worker fails to send ExecutorStateChanged to the Master due to some > error, e.g., a temporary network error, then the Master can't remove the > finished executor normally and thinks the executor is still alive. In the > worst case, if that executor is the only executor for the application, the > application can hang. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34245) Master may not remove the finished executor when Worker fails to send ExecutorStateChanged
[ https://issues.apache.org/jira/browse/SPARK-34245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-34245: --- Assignee: wuyi > Master may not remove the finished executor when Worker fails to send > ExecutorStateChanged > -- > > Key: SPARK-34245 > URL: https://issues.apache.org/jira/browse/SPARK-34245 > Project: Spark > Issue Type: Improvement > Components: Deploy, Spark Core >Affects Versions: 2.4.7, 3.0.1, 3.2.0, 3.1.1 >Reporter: wuyi >Assignee: wuyi >Priority: Major > > If the Worker fails to send ExecutorStateChanged to the Master due to some > error, e.g., a temporary network error, then the Master can't remove the > finished executor normally and thinks the executor is still alive. In the > worst case, if that executor is the only executor for the application, the > application can hang. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34515) Fix NPE if InSet contains null value during getPartitionsByFilter
[ https://issues.apache.org/jira/browse/SPARK-34515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ulysses you updated SPARK-34515: Description: Spark will convert InSet to `>= and <=` if its number of values exceeds `spark.sql.hive.metastorePartitionPruningInSetThreshold` during partition pruning. In that case, if the values contain a null, we get the following exception {code:java} java.lang.NullPointerException at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:1389) at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:50) at scala.math.LowPriorityOrderingImplicits$$anon$3.compare(Ordering.scala:153) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355) at java.util.TimSort.sort(TimSort.java:220) at java.util.Arrays.sort(Arrays.java:1438) at scala.collection.SeqLike.sorted(SeqLike.scala:659) at scala.collection.SeqLike.sorted$(SeqLike.scala:647) at scala.collection.AbstractSeq.sorted(Seq.scala:45) at org.apache.spark.sql.hive.client.Shim_v0_13.convert$1(HiveShim.scala:772) at org.apache.spark.sql.hive.client.Shim_v0_13.$anonfun$convertFilters$4(HiveShim.scala:826) at scala.collection.immutable.Stream.flatMap(Stream.scala:489) at org.apache.spark.sql.hive.client.Shim_v0_13.convertFilters(HiveShim.scala:826) at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:848) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitionsByFilter$1(HiveClientImpl.scala:750) {code} was: Spark will convert InSet to `>= and <=` if its number of values exceeds `spark.sql.hive.metastorePartitionPruningInSetThreshold`. In that case, if the values contain a null, we get the following exception {code:java} java.lang.NullPointerException at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:1389) at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:50) at scala.math.LowPriorityOrderingImplicits$$anon$3.compare(Ordering.scala:153) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355) at java.util.TimSort.sort(TimSort.java:220) at java.util.Arrays.sort(Arrays.java:1438) at scala.collection.SeqLike.sorted(SeqLike.scala:659) at scala.collection.SeqLike.sorted$(SeqLike.scala:647) at scala.collection.AbstractSeq.sorted(Seq.scala:45) at org.apache.spark.sql.hive.client.Shim_v0_13.convert$1(HiveShim.scala:772) at org.apache.spark.sql.hive.client.Shim_v0_13.$anonfun$convertFilters$4(HiveShim.scala:826) at scala.collection.immutable.Stream.flatMap(Stream.scala:489) at org.apache.spark.sql.hive.client.Shim_v0_13.convertFilters(HiveShim.scala:826) at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:848) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitionsByFilter$1(HiveClientImpl.scala:750) {code} > Fix NPE if InSet contains null value during getPartitionsByFilter > - > > Key: SPARK-34515 > URL: https://issues.apache.org/jira/browse/SPARK-34515 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: ulysses you >Priority: Minor > > Spark will convert InSet to `>= and <=` if its number of values exceeds > `spark.sql.hive.metastorePartitionPruningInSetThreshold` during partition > pruning. In that case, if the values contain a null, we get the following > exception > > {code:java} > java.lang.NullPointerException > at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:1389) > at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:50) > at > scala.math.LowPriorityOrderingImplicits$$anon$3.compare(Ordering.scala:153) > at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355) > at java.util.TimSort.sort(TimSort.java:220) > at java.util.Arrays.sort(Arrays.java:1438) > at scala.collection.SeqLike.sorted(SeqLike.scala:659) > at scala.collection.SeqLike.sorted$(SeqLike.scala:647) > at scala.collection.AbstractSeq.sorted(Seq.scala:45) > at org.apache.spark.sql.hive.client.Shim_v0_13.convert$1(HiveShim.scala:772) > at > org.apache.spark.sql.hive.client.Shim_v0_13.$anonfun$convertFilters$4(HiveShim.scala:826) > at scala.collection.immutable.Stream.flatMap(Stream.scala:489) > at > org.apache.spark.sql.hive.client.Shim_v0_13.convertFilters(HiveShim.scala:826) > at > org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:848) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitionsByFilter$1(HiveClientImpl.scala:750) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33504) The application log in the Spark history server contains sensitive attributes such as password that should be redacted instead of shown in plain text
[ https://issues.apache.org/jira/browse/SPARK-33504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289653#comment-17289653 ] Apache Spark commented on SPARK-33504: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/31631 > The application log in the Spark history server contains sensitive attributes > such as password that should be redacted instead of shown in plain text > --- > > Key: SPARK-33504 > URL: https://issues.apache.org/jira/browse/SPARK-33504 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1 > Environment: Spark 3.0.1 >Reporter: akiyamaneko >Assignee: akiyamaneko >Priority: Major > Fix For: 3.1.0 > > Attachments: SparkListenerEnvironmentUpdate log shows ok.png, > SparkListenerStageSubmitted-log-wrong.png, SparkListernerJobStart-wrong.png > > > We found that the sensitive attributes in SparkListenerJobStart and > SparkListenerStageSubmitted events are not redacted, so sensitive attributes > can be viewed directly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34515) Fix NPE if InSet contains null value during getPartitionsByFilter
ulysses you created SPARK-34515: --- Summary: Fix NPE if InSet contains null value during getPartitionsByFilter Key: SPARK-34515 URL: https://issues.apache.org/jira/browse/SPARK-34515 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0 Reporter: ulysses you Spark will convert InSet to `>= and <=` if its number of values exceeds `spark.sql.hive.metastorePartitionPruningInSetThreshold`. In that case, if the values contain a null, we get the following exception {code:java} java.lang.NullPointerException at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:1389) at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:50) at scala.math.LowPriorityOrderingImplicits$$anon$3.compare(Ordering.scala:153) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355) at java.util.TimSort.sort(TimSort.java:220) at java.util.Arrays.sort(Arrays.java:1438) at scala.collection.SeqLike.sorted(SeqLike.scala:659) at scala.collection.SeqLike.sorted$(SeqLike.scala:647) at scala.collection.AbstractSeq.sorted(Seq.scala:45) at org.apache.spark.sql.hive.client.Shim_v0_13.convert$1(HiveShim.scala:772) at org.apache.spark.sql.hive.client.Shim_v0_13.$anonfun$convertFilters$4(HiveShim.scala:826) at scala.collection.immutable.Stream.flatMap(Stream.scala:489) at org.apache.spark.sql.hive.client.Shim_v0_13.convertFilters(HiveShim.scala:826) at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:848) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitionsByFilter$1(HiveClientImpl.scala:750) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
[ https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289652#comment-17289652 ] Chao Sun commented on SPARK-33212: -- Thanks for the details [~ouyangxc.zte]! {quote} Get AMIpFilter ClassNotFoundException , because there is no 'hadoop-client-minicluster.jar' in classpath {quote} This is interesting: the {{hadoop-client-minicluster.jar}} should only be used in tests, so I'm curious why it is needed here. Could you share stacktraces for the {{ClassNotFoundException}}? {quote} 2021-02-24 08:36:54,391 ERROR org.apache.spark.SparkContext: Error initializing SparkContext. java.lang.IllegalStateException: class org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter is not a javax.servlet.Filter {quote} Could you also share the stacktraces for this exception? And to confirm, you are using {{client}} as the deploy mode, is that correct? I'll try to reproduce this in my local environment. > Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile > - > > Key: SPARK-33212 > URL: https://issues.apache.org/jira/browse/SPARK-33212 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Spark Submit, SQL, YARN >Affects Versions: 3.0.1 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Labels: releasenotes > Fix For: 3.2.0 > > > Hadoop 3.x+ offers shaded client jars: hadoop-client-api and > hadoop-client-runtime, which shade 3rd-party dependencies such as Guava, > protobuf, jetty etc. This Jira switches Spark to use these jars instead of > hadoop-common, hadoop-client etc. Benefits include: > * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer > versions of Hadoop have migrated to Guava 27.0+, and in order to resolve Guava > conflicts, Spark depends on Hadoop not leaking these dependencies. > * It makes the Spark/Hadoop dependency cleaner. 
Currently Spark uses both > client-side and server-side Hadoop APIs from modules such as hadoop-common, > hadoop-yarn-server-common etc. Moving to hadoop-client-api allows Spark to only > use the public/client API from the Hadoop side. > * Provides a better isolation from Hadoop dependencies. In the future, Spark can > better evolve without worrying about dependencies pulled from the Hadoop side > (which used to be a lot). > *There are some behavior changes introduced with this JIRA, when people use > Spark compiled with Hadoop 3.x:* > - Users now need to make sure the class path contains `hadoop-client-api` and > `hadoop-client-runtime` jars when they deploy Spark with the > `hadoop-provided` option. In addition, it is highly recommended that they put > these two jars before other Hadoop jars in the class path. Otherwise, > conflicts such as from Guava could happen if classes are loaded from the > other non-shaded Hadoop jars. > - Since the new shaded Hadoop clients no longer include 3rd-party > dependencies, users who used to depend on these now need to explicitly put > the jars in their class path. > Ideally the above should go to the release notes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34488) Support task Metrics Distributions and executor Metrics Distributions in the REST API call for a specified stage
[ https://issues.apache.org/jira/browse/SPARK-34488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289646#comment-17289646 ] Ron Hu commented on SPARK-34488: It should be noted that this Jira addresses query parameter withSummaries for a specific stage. In another Jira https://issues.apache.org/jira/browse/SPARK-26399, we address query parameter withSummaries for overall stages. > Support task Metrics Distributions and executor Metrics Distributions in the > REST API call for a specified stage > > > Key: SPARK-34488 > URL: https://issues.apache.org/jira/browse/SPARK-34488 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.2 >Reporter: Ron Hu >Priority: Major > Attachments: executorMetricsDistributions.json, > taskMetricsDistributions.json > > > For a specific stage, it is useful to show the task metrics in percentile > distribution. This information can help users know whether or not there is a > skew/bottleneck among tasks in a given stage. We list an example in > [^taskMetricsDistributions.json] > Similarly, it is useful to show the executor metrics in percentile > distribution for a specific stage. This information can show whether or not > there is a skewed load on some executors. We list an example in > [^executorMetricsDistributions.json] > > We define withSummaries query parameter in the REST API for a specific stage > as: > applications///?withSummaries=[true|false] > When withSummaries=true, both task metrics in percentile distribution and > executor metrics in percentile distribution are included in the REST API > output. The default value of withSummaries is false, i.e. no metrics > percentile distribution will be included in the REST API output. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
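The skew signal these percentile distributions provide can be illustrated with a short, self-contained Python sketch; the metric values below are hypothetical and this is not the REST API's actual output schema. When the maximum of a task metric sits far above its 75th percentile, the stage has a straggler.

```python
# Illustrative nearest-rank percentile summary of one task metric
# across a stage, the kind of summary withSummaries=true exposes.
def percentiles(samples, quantiles=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Nearest-rank percentiles of a non-empty list of numbers."""
    ordered = sorted(samples)
    n = len(ordered)
    # Round q*(n-1) to the nearest index, clamped to the last element.
    return [ordered[min(int(q * (n - 1) + 0.5), n - 1)] for q in quantiles]

durations_ms = [40, 42, 41, 39, 43, 40, 900]  # one straggler task
print(percentiles(durations_ms))  # the max dwarfs the 75th percentile
```

Here the distribution comes out as [39, 40, 41, 43, 900]: six tasks finish around 40 ms while one takes 900 ms, which is exactly the skew pattern the per-stage summaries are meant to surface.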
[jira] [Commented] (SPARK-34092) support filtering by task status in REST API call for a specific stage
[ https://issues.apache.org/jira/browse/SPARK-34092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289644#comment-17289644 ] Ron Hu commented on SPARK-34092: It should be noted that this Jira addresses query parameter taskStatus for a specific stage. In another Jira https://issues.apache.org/jira/browse/SPARK-26399, we address query parameter taskStatus for overall stages. > support filtering by task status in REST API call for a specific stage > -- > > Key: SPARK-34092 > URL: https://issues.apache.org/jira/browse/SPARK-34092 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > > Query parameter taskStatus can be used to filter the tasks meeting a specific > status in the REST API call for a given stage. We want to support the > following REST API calls: > applications///stages/?details=true&taskStatus=[RUNNING|SUCCESS|FAILED|KILLED|PENDING] > applications///stages//?details=true&taskStatus=[RUNNING|SUCCESS|FAILED|KILLED|PENDING] > Need to set details=true in order to drill down to the task level within a > specified stage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
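Server-side, the taskStatus parameter amounts to a filter over a stage's task list. An illustrative Python sketch with a hypothetical task-record shape (not the API's real data model):

```python
# Statuses accepted by the taskStatus query parameter per the description.
VALID_STATUSES = {"RUNNING", "SUCCESS", "FAILED", "KILLED", "PENDING"}

def filter_tasks(tasks, task_status=None):
    """Return tasks matching task_status, or all tasks when it is unset."""
    if task_status is None:
        return list(tasks)
    if task_status not in VALID_STATUSES:
        raise ValueError(f"unknown taskStatus: {task_status}")
    return [t for t in tasks if t["status"] == task_status]

tasks = [{"id": 0, "status": "SUCCESS"},
         {"id": 1, "status": "FAILED"},
         {"id": 2, "status": "SUCCESS"}]
print([t["id"] for t in filter_tasks(tasks, "FAILED")])  # → [1]
```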
[jira] [Commented] (SPARK-34507) Spark artefacts built against Scala 2.13 incorrectly depend on Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-34507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289638#comment-17289638 ] Yang Jie commented on SPARK-34507: -- It seems that [https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-parent_2.13/3.2.0-SNAPSHOT/spark-parent_2.13-3.2.0-20210223.010629-29.pom] is the original pom.xml, not the effective pom.xml. > Spark artefacts built against Scala 2.13 incorrectly depend on Scala 2.12 > - > > Key: SPARK-34507 > URL: https://issues.apache.org/jira/browse/SPARK-34507 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.2.0 >Reporter: Guillaume Martres >Priority: Major > > Snapshots of Spark 3.2 built against Scala 2.13 are available at > [https://repository.apache.org/content/repositories/snapshots/org/apache/spark/], > but they seem to depend on Scala 2.12. Specifically, if I look at > [https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-parent_2.13/3.2.0-SNAPSHOT/spark-parent_2.13-3.2.0-20210223.010629-29.pom] > I see: > {code:xml} > <scala.version>2.12.10</scala.version> > <scala.binary.version>2.13</scala.binary.version> > {code} > It looks like > [https://github.com/apache/spark/blob/8f994cbb4a18558c2e81516ef1e339d9c8fa0d41/dev/change-scala-version.sh#L65] > needs to be updated to also change the `scala.version` and not just the > `scala.binary.version`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34507) Spark artefacts built against Scala 2.13 incorrectly depend on Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-34507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289637#comment-17289637 ] Yang Jie commented on SPARK-34507: -- I think the scala-2.13 profile should override this property: {code:xml} <profile> <id>scala-2.13</id> <properties> <scala.version>2.13.4</scala.version> <scala.binary.version>2.13</scala.binary.version> </properties> ... </profile> {code} > Spark artefacts built against Scala 2.13 incorrectly depend on Scala 2.12 > - > > Key: SPARK-34507 > URL: https://issues.apache.org/jira/browse/SPARK-34507 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.2.0 >Reporter: Guillaume Martres >Priority: Major > > Snapshots of Spark 3.2 built against Scala 2.13 are available at > [https://repository.apache.org/content/repositories/snapshots/org/apache/spark/], > but they seem to depend on Scala 2.12. Specifically, if I look at > [https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-parent_2.13/3.2.0-SNAPSHOT/spark-parent_2.13-3.2.0-20210223.010629-29.pom] > I see: > {code:xml} > <scala.version>2.12.10</scala.version> > <scala.binary.version>2.13</scala.binary.version> > {code} > It looks like > [https://github.com/apache/spark/blob/8f994cbb4a18558c2e81516ef1e339d9c8fa0d41/dev/change-scala-version.sh#L65] > needs to be updated to also change the `scala.version` and not just the > `scala.binary.version`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
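The fix discussed here (change-scala-version.sh rewriting `scala.version` as well as `scala.binary.version`) can be sketched as a simple substitution. This is shown in Python rather than the actual sed-based script, and the pom fragment is illustrative, not the real file:

```python
import re

# Illustrative pom fragment: after change-scala-version.sh runs today,
# scala.binary.version is updated but scala.version is left at 2.12.x.
POM = """<properties>
  <scala.version>2.12.10</scala.version>
  <scala.binary.version>2.12</scala.binary.version>
</properties>"""

def set_scala_version(pom_xml, version, binary_version):
    """Rewrite both scala.version and scala.binary.version properties."""
    pom_xml = re.sub(r"<scala\.version>[^<]+</scala\.version>",
                     f"<scala.version>{version}</scala.version>", pom_xml)
    pom_xml = re.sub(r"<scala\.binary\.version>[^<]+</scala\.binary\.version>",
                     f"<scala.binary.version>{binary_version}</scala.binary.version>",
                     pom_xml)
    return pom_xml

patched = set_scala_version(POM, "2.13.4", "2.13")
```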
[jira] [Assigned] (SPARK-34514) Push down limit for LEFT SEMI and LEFT ANTI join
[ https://issues.apache.org/jira/browse/SPARK-34514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34514: Assignee: Apache Spark > Push down limit for LEFT SEMI and LEFT ANTI join > > > Key: SPARK-34514 > URL: https://issues.apache.org/jira/browse/SPARK-34514 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Assignee: Apache Spark >Priority: Trivial > > I found out during code review of > [https://github.com/apache/spark/pull/31567] > ([https://github.com/apache/spark/pull/31567#discussion_r577379572]) that > we can push down a limit to the left side of a LEFT SEMI or LEFT ANTI join if > the join condition is empty. > Why it's safe to push down the limit: > The semantics of a LEFT SEMI join without a condition: > (1) if the right side is non-empty, output all rows from the left side. > (2) if the right side is empty, output nothing. > > The semantics of a LEFT ANTI join without a condition: > (1) if the right side is non-empty, output nothing. > (2) if the right side is empty, output all rows from the left side. > > With these all-or-nothing semantics (output all rows from the left side, or > nothing), it's safe to push the limit down to the left side. > NOTE: a LEFT SEMI / LEFT ANTI join with a non-empty condition is not safe for > limit push-down, because the output can be a subset of the left side's rows. > > Physical operator for LEFT SEMI / LEFT ANTI join without condition - > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala#L200-L204] > . -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
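The all-or-nothing semantics described in the issue can be modeled in a few lines of Python. This is a sketch of the reasoning, not Spark's implementation; the function names are made up for illustration:

```python
# Model of LEFT SEMI / LEFT ANTI join with an empty join condition.
def left_semi_no_cond(left, right):
    # All rows of `left` if `right` is non-empty, otherwise nothing.
    return list(left) if right else []

def left_anti_no_cond(left, right):
    # Nothing if `right` is non-empty, otherwise all rows of `left`.
    return [] if right else list(left)

left = [1, 2, 3, 4, 5]
# Because the output is "all of left, or nothing", taking a limit after
# the join equals joining an already-limited left side, so the limit can
# safely be pushed down:
for right in ([9], []):
    assert left_semi_no_cond(left, right)[:2] == left_semi_no_cond(left[:2], right)
    assert left_anti_no_cond(left, right)[:2] == left_anti_no_cond(left[:2], right)
```

With a non-empty join condition this equivalence breaks, since the join may keep only some left rows and a pushed-down limit could discard rows that would have survived.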
[jira] [Commented] (SPARK-34514) Push down limit for LEFT SEMI and LEFT ANTI join
[ https://issues.apache.org/jira/browse/SPARK-34514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289625#comment-17289625 ] Apache Spark commented on SPARK-34514: -- User 'c21' has created a pull request for this issue: https://github.com/apache/spark/pull/31630 > Push down limit for LEFT SEMI and LEFT ANTI join > > > Key: SPARK-34514 > URL: https://issues.apache.org/jira/browse/SPARK-34514 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Priority: Trivial > > I found out during code review of > [https://github.com/apache/spark/pull/31567|https://github.com/apache/spark/pull/31567,]( > [https://github.com/apache/spark/pull/31567#discussion_r577379572] ), where > we can push down limit to the left side of LEFT SEMI and LEFT ANTI join, if > the join condition is empty. > Why it's safe to push down limit: > The semantics of LEFT SEMI join without condition: > (1). if right side is non-empty, output all rows from left side. > (2). if right side is empty, output nothing. > > The semantics of LEFT ANTI join without condition: > (1). if right side is non-empty, output nothing. > (2). if right side is empty, output all rows from left side. > > With the semantics of output all rows from left side or nothing (all or > nothing), it's safe to push down limit to left side. > NOTE: LEFT SEMI / LEFT ANTI join with non-empty condition is not safe for > limit push down, because output can be a portion of left side rows. > > Physical operator for LEFT SEMI / LEFT ANTI join without condition - > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala#L200-L204] > . -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34514) Push down limit for LEFT SEMI and LEFT ANTI join
[ https://issues.apache.org/jira/browse/SPARK-34514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34514: Assignee: (was: Apache Spark) > Push down limit for LEFT SEMI and LEFT ANTI join > > > Key: SPARK-34514 > URL: https://issues.apache.org/jira/browse/SPARK-34514 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Priority: Trivial > > I found out during code review of > [https://github.com/apache/spark/pull/31567|https://github.com/apache/spark/pull/31567,]( > [https://github.com/apache/spark/pull/31567#discussion_r577379572] ), where > we can push down limit to the left side of LEFT SEMI and LEFT ANTI join, if > the join condition is empty. > Why it's safe to push down limit: > The semantics of LEFT SEMI join without condition: > (1). if right side is non-empty, output all rows from left side. > (2). if right side is empty, output nothing. > > The semantics of LEFT ANTI join without condition: > (1). if right side is non-empty, output nothing. > (2). if right side is empty, output all rows from left side. > > With the semantics of output all rows from left side or nothing (all or > nothing), it's safe to push down limit to left side. > NOTE: LEFT SEMI / LEFT ANTI join with non-empty condition is not safe for > limit push down, because output can be a portion of left side rows. > > Physical operator for LEFT SEMI / LEFT ANTI join without condition - > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala#L200-L204] > . -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
[ https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289618#comment-17289618 ] Xiaochen Ouyang edited comment on SPARK-33212 at 2/24/21, 3:33 AM: --- Hi [~csun], we submit a Spark application with the command `spark-submit --master yarn --class org.apache.spark.examples.SparkPi /opt/spark/examples/jars/spark*.jar`. 1. We get an AMIpFilter ClassNotFoundException because 'hadoop-client-minicluster.jar' is not on the classpath, so we remove the {color:#de350b}_'test'_{color} scope line in the parent pom.xml and resource-manager/yarn/pom.xml. 2. We rebuild the Spark project, deploy the binary jars, and submit the application. 3. We then get a new exception: +2021-02-24 08:36:54,391 ERROR org.apache.spark.SparkContext: Error initializing SparkContext. java.lang.IllegalStateException: class org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter is not a javax.servlet.Filter+ The root cause is that the Spark driver classloader loads the class `AmIpFilter`, which implements javax.servlet.Filter, but in the shaded jar the `Filter` class is imported as 'import +{color:#de350b}org.apache.hadoop.shaded.javax.servlet.Filter{color}+', so `AmIpFilter` cannot be instantiated via reflection in the Spark driver process. > Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile > - > > Key: SPARK-33212 > URL: https://issues.apache.org/jira/browse/SPARK-33212 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Spark Submit, SQL, YARN >Affects Versions: 3.0.1 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Labels: releasenotes > Fix For: 3.2.0 > > > Hadoop 3.x+ offers shaded client jars: hadoop-client-api and > hadoop-client-runtime, which shade 3rd-party dependencies such as Guava, > protobuf, and jetty. This Jira switches Spark to use these jars instead of > hadoop-common, hadoop-client, etc. Benefits include: > * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer > versions of Hadoop have migrated to Guava 27.0+, and in order to resolve Guava > conflicts, Spark depends on Hadoop not to leak its dependencies. > * It makes the Spark/Hadoop dependency graph cleaner. Currently Spark uses both > client-side and server-side Hadoop APIs from modules such as hadoop-common and > hadoop-yarn-server-common. Moving to hadoop-client-api allows us to use only > the public/client API from the Hadoop side. > * It provides better isolation from Hadoop dependencies. In the future Spark can > evolve without worrying about dependencies pulled in from the Hadoop side > (which used to be a lot). > *There are some behavior changes introduced with this JIRA, when people use > Spark compiled with Hadoop 3.x:* > - Users now need to make sure the class path contains the `hadoop-client-api` and > `hadoop-client-runtime` jars when they deploy Spark with the > `hadoop-provided` option. In addition, it is highly recommended that they put > these two jars before other Hadoop jars in the class path. Otherwise, > conflicts such as those from Guava could happen if classes are loaded from the > other, non-shaded Hadoop jars. > - Since the new shaded Hadoop clients no longer include 3rd-party > dependencies, users who used to depend on these now need to explicitly put > the jars in their class path. > Ideally the above should go in the release notes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
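The classpath-ordering recommendation above (shaded client jars before other Hadoop jars) can be expressed as a small helper. This is a sketch; the jar names are illustrative and the function is not part of Spark:

```python
def order_classpath(jars):
    """Put the shaded Hadoop client jars before any other jars, so that
    classes resolve from the shaded artifacts first (avoiding, e.g.,
    Guava conflicts with non-shaded Hadoop jars)."""
    shaded = [j for j in jars
              if j.startswith(("hadoop-client-api", "hadoop-client-runtime"))]
    rest = [j for j in jars if j not in shaded]
    return shaded + rest

cp = order_classpath(["hadoop-common-3.2.2.jar", "hadoop-client-api-3.2.2.jar",
                      "guava-14.0.1.jar", "hadoop-client-runtime-3.2.2.jar"])
# The shaded client jars now come first on the class path.
```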
[jira] [Updated] (SPARK-34514) Push down limit for LEFT SEMI and LEFT ANTI join
[ https://issues.apache.org/jira/browse/SPARK-34514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Su updated SPARK-34514: - Description: I found out during code review of [https://github.com/apache/spark/pull/31567|https://github.com/apache/spark/pull/31567,]( [https://github.com/apache/spark/pull/31567#discussion_r577379572] ), where we can push down limit to the left side of LEFT SEMI and LEFT ANTI join, if the join condition is empty. Why it's safe to push down limit: The semantics of LEFT SEMI join without condition: (1). if right side is non-empty, output all rows from left side. (2). if right side is empty, output nothing. The semantics of LEFT ANTI join without condition: (1). if right side is non-empty, output nothing. (2). if right side is empty, output all rows from left side. With the semantics of output all rows from left side or nothing (all or nothing), it's safe to push down limit to left side. NOTE: LEFT SEMI / LEFT ANTI join with non-empty condition is not safe for limit push down, because output can be a portion of left side rows. Physical operator for LEFT SEMI / LEFT ANTI join without condition - [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala#L200-L204] . was: I found out during code review of [https://github.com/apache/spark/pull/31567|https://github.com/apache/spark/pull/31567,]( [https://github.com/apache/spark/pull/31567#discussion_r577379572] ), where we can push down limit to the left side of LEFT SEMI and LEFT ANTI join, if the join condition is empty. Why it's safe to push down limit: The semantics of LEFT SEMI join without condition: (1). if right side is non-empty, output all rows from left side. (2). if right side is empty, output nothing. The semantics of LEFT ANTI join without condition: (1). if right side is non-empty, output nothing. (2). if right side is empty, output all rows from left side. 
With the semantics of output all rows from left side or nothing (all or nothing), it's safe to push down limit to left side. NOTE: LEFT SEMI / LEFT ANTI join with non-empty condition is not safe for limit push down, because output can be a portion of left side rows. > Push down limit for LEFT SEMI and LEFT ANTI join > > > Key: SPARK-34514 > URL: https://issues.apache.org/jira/browse/SPARK-34514 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Priority: Trivial > > I found out during code review of > [https://github.com/apache/spark/pull/31567|https://github.com/apache/spark/pull/31567,]( > [https://github.com/apache/spark/pull/31567#discussion_r577379572] ), where > we can push down limit to the left side of LEFT SEMI and LEFT ANTI join, if > the join condition is empty. > Why it's safe to push down limit: > The semantics of LEFT SEMI join without condition: > (1). if right side is non-empty, output all rows from left side. > (2). if right side is empty, output nothing. > > The semantics of LEFT ANTI join without condition: > (1). if right side is non-empty, output nothing. > (2). if right side is empty, output all rows from left side. > > With the semantics of output all rows from left side or nothing (all or > nothing), it's safe to push down limit to left side. > NOTE: LEFT SEMI / LEFT ANTI join with non-empty condition is not safe for > limit push down, because output can be a portion of left side rows. > > Physical operator for LEFT SEMI / LEFT ANTI join without condition - > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala#L200-L204] > . -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
[ https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289618#comment-17289618 ] Xiaochen Ouyang edited comment on SPARK-33212 at 2/24/21, 3:28 AM: --- Hi [~csun], we submit a spark application with command `spark-submit --master yarn --class org.apache.spark.examples.SparkPi /opt/spark/examples/jars/spark*.jar`. 1、Get AMIpFilter ClassNotFoundException , because there is no 'hadoop-client-minicluster.jar' in classpath. So we remove the line {color:#de350b}_'test'_ in parent pom.xml and resource-manager/yarn/pom.xml.{color} 2、rebuild spark project 、depoly binary jars and submit application 3、Get a new Exception as follows: +2021-02-24 08:36:54,391 ERROR org.apache.spark.SparkContext: Error initializing SparkContext. java.lang.IllegalStateException: class org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter is not a javax.servlet.Filter+ The key reason is that spark dirver classloader load class `AmIpFilter` implements javax.servlet.Filter, but in shaded jar the class `Filter` imported like 'import +{color:#de350b}org.apache.hadoop.shaded.javax.servlet.Filter{color}+'. So, AmIpFilter can't be reflected in spark dirver proceess. was (Author: ouyangxc.zte): Hi [~csun], we submit a spark application with command `spark-submit --master yarn --class org.apache.spark.examples.SparkPi /opt/spark/examples/jars/spark*.jar`. 1、Get AMIpFilter ClassNotFoundException , because there is no 'hadoop-client-minicluster.jar' in classpath. So we remove the line {color:#de350b}_'test'_ {color:#172b4d}in parent pom.xml and resource-manager/yarn/pom.xml.{color}{color} {color:#de350b}{color:#172b4d}2、rebuild spark project 、depoly binary jars and submit application{color}{color} {color:#de350b}{color:#172b4d}3、Get a new Exception as follows:{color}{color} {color:#de350b}+2021-02-24 08:36:54,391 ERROR org.apache.spark.SparkContext: Error initializing SparkContext. 
java.lang.IllegalStateException: class org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter is not a javax.servlet.Filter+{color} The key reason is that spark dirver classloader load class `AmIpFilter` implements javax.servlet.Filter, but in shaded jar the class `Filter` imported like 'import org.apache.hadoop.shaded.javax.servlet.Filter'. So, AmIpFilter can't be reflected in spark dirver proceess. > Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile > - > > Key: SPARK-33212 > URL: https://issues.apache.org/jira/browse/SPARK-33212 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Spark Submit, SQL, YARN >Affects Versions: 3.0.1 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Labels: releasenotes > Fix For: 3.2.0 > > > Hadoop 3.x+ offers shaded client jars: hadoop-client-api and > hadoop-client-runtime, which shade 3rd party dependencies such as Guava, > protobuf, jetty etc. This Jira switches Spark to use these jars instead of > hadoop-common, hadoop-client etc. Benefits include: > * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer > versions of Hadoop have migrated to Guava 27.0+ and in order to resolve Guava > conflicts, Spark depends on Hadoop to not leaking dependencies. > * It makes Spark/Hadoop dependency cleaner. Currently Spark uses both > client-side and server-side Hadoop APIs from modules such as hadoop-common, > hadoop-yarn-server-common etc. Moving to hadoop-client-api allows use to only > use public/client API from Hadoop side. > * Provides a better isolation from Hadoop dependencies. In future Spark can > better evolve without worrying about dependencies pulled from Hadoop side > (which used to be a lot). 
> *There are some behavior changes introduced with this JIRA, when people use > Spark compiled with Hadoop 3.x:* > - Users now need to make sure the class path contains the `hadoop-client-api` and > `hadoop-client-runtime` jars when they deploy Spark with the > `hadoop-provided` option. In addition, it is highly recommended that they put > these two jars before other Hadoop jars in the class path. Otherwise, > conflicts such as those from Guava could happen if classes are loaded from the > other, non-shaded Hadoop jars. > - Since the new shaded Hadoop clients no longer include 3rd-party > dependencies, users who used to depend on these now need to explicitly put > the jars in their class path. > Ideally the above should go into the release notes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
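The classloader mismatch described in the comment above has a language-agnostic shape: a class compiled against a relocated (shaded) copy of an interface does not satisfy checks against the original interface, even though the simple names match. A minimal Python analogy (the class names below only stand in for the Java types involved):

```python
# Two distinct "Filter" types, standing in for javax.servlet.Filter and the
# relocated org.apache.hadoop.shaded.javax.servlet.Filter respectively.
class ServletFilter:            # the interface the web server checks against
    pass

class ShadedServletFilter:      # the relocated copy inside the shaded jar
    pass

# AmIpFilter in the shaded jar implements the *relocated* interface.
class AmIpFilter(ShadedServletFilter):
    pass

# The server-side check (in Java: javax.servlet.Filter.isAssignableFrom(cls))
# fails, which is exactly the "is not a javax.servlet.Filter" error above.
assert not issubclass(AmIpFilter, ServletFilter)
assert issubclass(AmIpFilter, ShadedServletFilter)
```

The fix direction follows from this: either the filter class and the interface must come from the same (shaded or unshaded) namespace, or the unshaded servlet API must be what both sides link against.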
[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
[ https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289618#comment-17289618 ] Xiaochen Ouyang commented on SPARK-33212: - Hi [~csun], we submit a Spark application with the command `spark-submit --master yarn --class org.apache.spark.examples.SparkPi /opt/spark/examples/jars/spark*.jar`. 1. We get an AmIpFilter ClassNotFoundException, because 'hadoop-client-minicluster.jar' is not on the classpath, so we remove the 'test' scope in the parent pom.xml and in resource-manager/yarn/pom.xml. 2. We rebuild the Spark project, deploy the binary jars, and submit the application again. 3. We then get a new exception: +2021-02-24 08:36:54,391 ERROR org.apache.spark.SparkContext: Error initializing SparkContext. java.lang.IllegalStateException: class org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter is not a javax.servlet.Filter+ The key reason is that the Spark driver classloader loads the class `AmIpFilter`, which implements javax.servlet.Filter, but in the shaded jar the `Filter` interface is imported as 'import org.apache.hadoop.shaded.javax.servlet.Filter'. As a result, `AmIpFilter` cannot be instantiated via reflection in the Spark driver process. > Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile > - > > Key: SPARK-33212 > URL: https://issues.apache.org/jira/browse/SPARK-33212 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Spark Submit, SQL, YARN >Affects Versions: 3.0.1 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Labels: releasenotes > Fix For: 3.2.0 > > > Hadoop 3.x+ offers shaded client jars: hadoop-client-api and > hadoop-client-runtime, which shade 3rd-party dependencies such as Guava, > protobuf, jetty etc. This Jira switches Spark to use these jars instead of > hadoop-common, hadoop-client etc. Benefits include: > * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer > versions of Hadoop have migrated to Guava 27.0+, and in order to resolve Guava > conflicts, Spark depends on Hadoop not leaking dependencies. > * It makes the Spark/Hadoop dependency cleaner. Currently Spark uses both > client-side and server-side Hadoop APIs from modules such as hadoop-common, > hadoop-yarn-server-common etc. Moving to hadoop-client-api allows us to only > use the public/client API from the Hadoop side. > * It provides better isolation from Hadoop dependencies. In the future Spark can > better evolve without worrying about dependencies pulled in from the Hadoop side > (which used to be a lot). > *There are some behavior changes introduced with this JIRA, when people use > Spark compiled with Hadoop 3.x:* > - Users now need to make sure the class path contains the `hadoop-client-api` and > `hadoop-client-runtime` jars when they deploy Spark with the > `hadoop-provided` option. In addition, it is highly recommended that they put > these two jars before other Hadoop jars in the class path. Otherwise, > conflicts such as those from Guava could happen if classes are loaded from the > other, non-shaded Hadoop jars. > - Since the new shaded Hadoop clients no longer include 3rd-party > dependencies, users who used to depend on these now need to explicitly put > the jars in their class path. > Ideally the above should go into the release notes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
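The class-path ordering advice in the description ("put these two jars before other Hadoop jars") can be sketched as a small helper. The jar paths below are hypothetical placeholders, not real deployment paths:

```python
def order_hadoop_classpath(jars):
    """Put the shaded client jars first so their relocated third-party
    classes (Guava, protobuf, ...) win over copies in non-shaded jars."""
    shaded = [j for j in jars
              if "hadoop-client-api" in j or "hadoop-client-runtime" in j]
    rest = [j for j in jars if j not in shaded]
    return shaded + rest

# Hypothetical jar locations, for illustration only.
jars = [
    "/opt/hadoop/share/hadoop/common/hadoop-common-3.2.2.jar",
    "/opt/hadoop/client/hadoop-client-api-3.2.2.jar",
    "/opt/hadoop/client/hadoop-client-runtime-3.2.2.jar",
]
classpath = ":".join(order_hadoop_classpath(jars))
# The shaded client jars now precede the non-shaded hadoop-common jar.
assert classpath.startswith("/opt/hadoop/client/hadoop-client-api-3.2.2.jar")
```

The same ordering applies however the class path is assembled (e.g. via an environment variable when deploying with `hadoop-provided`); what matters is that the shaded clients are resolved first.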
[jira] [Created] (SPARK-34514) Push down limit for LEFT SEMI and LEFT ANTI join
Cheng Su created SPARK-34514: Summary: Push down limit for LEFT SEMI and LEFT ANTI join Key: SPARK-34514 URL: https://issues.apache.org/jira/browse/SPARK-34514 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Cheng Su I found out during code review of https://github.com/apache/spark/pull/31567 ( https://github.com/apache/spark/pull/31567#discussion_r577379572 ) that we can push down the limit to the left side of a LEFT SEMI or LEFT ANTI join if the join condition is empty. Why it is safe to push down the limit: The semantics of LEFT SEMI join without a condition: (1) if the right side is non-empty, output all rows from the left side; (2) if the right side is empty, output nothing. The semantics of LEFT ANTI join without a condition: (1) if the right side is non-empty, output nothing; (2) if the right side is empty, output all rows from the left side. With these all-or-nothing semantics (output all rows from the left side, or nothing), it is safe to push down the limit to the left side. NOTE: a LEFT SEMI / LEFT ANTI join with a non-empty condition is not safe for limit pushdown, because the output can be a portion of the left-side rows. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
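The all-or-nothing semantics above, and why they make the pushdown safe, can be sketched in a few lines (plain Python lists standing in for relations; this is an illustration of the argument, not Spark's implementation):

```python
def left_semi_no_condition(left, right):
    # LEFT SEMI with no join condition: all left rows if right is non-empty.
    return list(left) if right else []

def left_anti_no_condition(left, right):
    # LEFT ANTI with no join condition: all left rows if right is empty.
    return [] if right else list(left)

left = [1, 2, 3, 4]
right = ["x"]
n = 2

# Applying LIMIT n after the join equals joining the LIMIT-n left side:
# the output is either a prefix of the left side or empty in both cases.
assert left_semi_no_condition(left, right)[:n] == left_semi_no_condition(left[:n], right)
assert left_anti_no_condition(left, right)[:n] == left_anti_no_condition(left[:n], right)

# With a non-empty condition this breaks: the join may keep an arbitrary
# subset of left rows, so limiting the left side first can lose matches.
```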
[jira] [Assigned] (SPARK-32703) Enable dictionary filtering for Parquet vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-32703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32703: Assignee: Apache Spark > Enable dictionary filtering for Parquet vectorized reader > - > > Key: SPARK-32703 > URL: https://issues.apache.org/jira/browse/SPARK-32703 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Chao Sun >Assignee: Apache Spark >Priority: Minor > > Parquet vectorized reader still uses the old API for {{filterRowGroups}} and > only filters on statistics. It should switch to the new API and do dictionary > filtering as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32703) Enable dictionary filtering for Parquet vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-32703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32703: Assignee: (was: Apache Spark) > Enable dictionary filtering for Parquet vectorized reader > - > > Key: SPARK-32703 > URL: https://issues.apache.org/jira/browse/SPARK-32703 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Chao Sun >Priority: Minor > > Parquet vectorized reader still uses the old API for {{filterRowGroups}} and > only filters on statistics. It should switch to the new API and do dictionary > filtering as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32703) Enable dictionary filtering for Parquet vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-32703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32703: - Fix Version/s: (was: 3.2.0) > Enable dictionary filtering for Parquet vectorized reader > - > > Key: SPARK-32703 > URL: https://issues.apache.org/jira/browse/SPARK-32703 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Chao Sun >Priority: Minor > > Parquet vectorized reader still uses the old API for {{filterRowGroups}} and > only filters on statistics. It should switch to the new API and do dictionary > filtering as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-32703) Enable dictionary filtering for Parquet vectorized reader
[ https://issues.apache.org/jira/browse/SPARK-32703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-32703: -- Assignee: (was: Chao Sun) Reverted at https://github.com/apache/spark/commit/80bad086c806fd507b1fb197b171f87333f2fb08 > Enable dictionary filtering for Parquet vectorized reader > - > > Key: SPARK-32703 > URL: https://issues.apache.org/jira/browse/SPARK-32703 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Chao Sun >Priority: Minor > Fix For: 3.2.0 > > > Parquet vectorized reader still uses the old API for {{filterRowGroups}} and > only filters on statistics. It should switch to the new API and do dictionary > filtering as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
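Dictionary filtering, as described in the issue above, can skip a whole row group when the column's dictionary page (the set of all distinct encoded values in that row group) contains no value matching the predicate; statistics-only filtering misses this case whenever the min/max bounds straddle the predicate value. A toy sketch of the idea (not the actual Parquet API):

```python
def can_skip_row_group(dictionary, predicate_value):
    """Skip the row group if its complete dictionary has no matching value.
    `dictionary` is the set of all distinct values in the column chunk."""
    return predicate_value not in dictionary

# With min="aa" and max="zz", statistics alone would NOT prune a predicate
# like col = "mm", but the dictionary shows "mm" never occurs:
assert can_skip_row_group({"aa", "zz"}, "mm")
assert not can_skip_row_group({"aa", "mm", "zz"}, "mm")
```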
[jira] [Updated] (SPARK-34513) Kubernetes Spark Driver Pod Name Length Limitation
[ https://issues.apache.org/jira/browse/SPARK-34513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John updated SPARK-34513: - Description: Hi, we are using Spark in Airflow with the k8s master. Airflow attaches a unique id to our spark-driver pod name, using the k8s subdomain convention ('.'). This creates rather long pod names. We noticed an issue when the total pod name (pod name + Airflow-attached uuid) exceeds 63 chars. Usually pod names can be up to 253 chars long. However, Spark seems to have an issue with driver pod names longer than 63 characters. In our case the driver pod name is exactly 65 chars long, but Spark omits the last 2 chars in its error message; I assume Spark is internally losing those two characters. Reducing our driver pod name to just 63 chars fixed the issue. Here you can see the actual pod name (row 1) and the pod name from the Spark error log (row 2) {code:java} ab-aa--cc-dd.3s092032c69f4639adff835a826e0120 ab-aa--cc-dd.3s092032c69f4639adff835a826e01{code} {code:java} [2021-02-20 00:30:06,289] {pod_launcher.py:136} INFO - Exception in thread "main" org.apache.spark.SparkException: No pod was found named Some(ab-aa--cc-dd.3s092032c69f4639adff835a826e01) in the cluster in the namespace airflow-ns (this was supposed to be the driver pod.).{code} > Kubernetes Spark Driver Pod Name Length Limitation > -- > > Key: SPARK-34513 > URL: https://issues.apache.org/jira/browse/SPARK-34513 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.0, 3.0.1 >Reporter: John >Priority: Major > > Hi, we are using Spark in Airflow with the k8s master. Airflow attaches a > unique id to our spark-driver pod name, using the k8s subdomain convention > ('.'). This creates rather long pod names. > We noticed an issue when the total pod name (pod name + Airflow-attached > uuid) exceeds 63 chars. Usually pod names can be up to 253 chars long. > However, Spark seems to have an issue with driver pod names longer than 63 > characters. > In our case the driver pod name is exactly 65 chars long, but Spark omits > the last 2 chars in its error message; I assume Spark is internally losing > those two characters. Reducing our driver pod name to just 63 chars fixed > the issue. > Here you can see the actual pod name (row 1) and the pod name from the Spark > Error log (row 2) > {code:java} > ab-aa--cc-dd.3s092032c69f4639adff835a826e0120 > ab-aa--cc-dd.3s092032c69f4639adff835a826e01{code} > {code:java} > [2021-02-20 00:30:06,289] {pod_launcher.py:136} INFO - Exception in thread > "main" org.apache.spark.SparkException: No pod was found named > Some(ab-aa--cc-dd.3s092032c69f4639adff835a826e01) in the > cluster in the namespace airflow-ns (this was supposed to be the driver > pod.).{code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
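As a workaround until the limitation is addressed, the generated driver pod name can be checked (or truncated) against the 63-character limit before submission. A rough sketch, using a hypothetical name rather than the redacted ones in the report:

```python
import re

DNS_LABEL_MAX = 63  # the length limit Spark appears to trip over

def fits_driver_name_limit(name: str) -> bool:
    """Rough pre-flight check: at most 63 chars, lowercase alphanumerics
    plus '-' and '.', starting and ending with an alphanumeric."""
    return (len(name) <= DNS_LABEL_MAX
            and re.fullmatch(r"[a-z0-9]([-a-z0-9.]*[a-z0-9])?", name) is not None)

# Hypothetical 66-char name in the Airflow "<prefix>.<uuid>" style.
long_name = "a" * 33 + "." + "b" * 32
assert not fits_driver_name_limit(long_name)
# Truncating to 63 chars (as the reporter effectively did) passes the check.
assert fits_driver_name_limit(long_name[:DNS_LABEL_MAX])
```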
[jira] [Created] (SPARK-34513) Kubernetes Spark Driver Pod Name Length Limitation
John created SPARK-34513: Summary: Kubernetes Spark Driver Pod Name Length Limitation Key: SPARK-34513 URL: https://issues.apache.org/jira/browse/SPARK-34513 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.0.1, 3.0.0 Reporter: John Hi, we are using Spark in Airflow with the k8s master. Airflow attaches a unique id to our spark-driver pod name, using the k8s subdomain convention ('.'). This creates rather long pod names. We noticed an issue when the total pod name (pod name + Airflow-attached uuid) exceeds 63 chars. Usually pod names can be up to 253 chars long. However, Spark seems to have an issue with driver pod names longer than 63 characters. In our case the driver pod name is exactly 65 chars long, but Spark omits the last 2 chars in its error message; I assume Spark is internally losing those two characters. Reducing our driver pod name to just 63 chars fixed the issue. Here you can see the actual pod name (row 1) and the pod name from the Spark error log (row 2) ab-aa--cc-dd.3s092032c69f4639adff835a826e0120 ab-aa--cc-dd.3s092032c69f4639adff835a826e01 [2021-02-20 00:30:06,289] \{pod_launcher.py:136} INFO - Exception in thread "main" org.apache.spark.SparkException: No pod was found named Some(ab-aa--cc-dd.3s092032c69f4639adff835a826e01) in the cluster in the namespace airflow-ns (this was supposed to be the driver pod.). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25390) Data source V2 API refactoring
[ https://issues.apache.org/jira/browse/SPARK-25390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289577#comment-17289577 ] Rafael commented on SPARK-25390: Sorry for the late response. I was able to migrate my project to Spark 3.0.0. Here are some hints on what I did: https://gist.github.com/rafaelkyrdan/2bea8385aadd71be5bf67cddeec59581 > Data source V2 API refactoring > -- > > Key: SPARK-25390 > URL: https://issues.apache.org/jira/browse/SPARK-25390 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > > Currently it's not very clear how we should abstract the data source v2 API. The > abstraction should be unified between batch and streaming, or be similar but > with a well-defined difference between batch and streaming. The > abstraction should also include catalog/table. > An example of the abstraction: > {code} > batch: catalog -> table -> scan > streaming: catalog -> table -> stream -> scan > {code} > We should refactor the data source v2 API according to this abstraction. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
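The layering in the issue description (batch: catalog -> table -> scan; streaming: catalog -> table -> stream -> scan) can be sketched as a minimal object model. All names and behaviors below are illustrative, not the actual DSv2 interfaces:

```python
from dataclasses import dataclass

class Scan:
    """One unit of reading: a batch, or one epoch of a stream."""
    def read(self):
        return [("a", 1), ("b", 2)]  # canned rows for illustration

@dataclass
class Stream:
    table: "Table"
    def scan_for_epoch(self, epoch: int) -> Scan:  # stream -> scan, per epoch
        return Scan()

@dataclass
class Table:
    name: str
    def new_scan(self) -> Scan:       # batch path: table -> scan
        return Scan()
    def new_stream(self) -> Stream:   # streaming path: table -> stream
        return Stream(self)

class Catalog:
    def load_table(self, name: str) -> Table:  # catalog -> table
        return Table(name)

# Batch and streaming share everything down to Table; only the extra
# Stream layer differs, which is the "well-defined difference" sought here.
batch_rows = Catalog().load_table("t").new_scan().read()
stream_rows = Catalog().load_table("t").new_stream().scan_for_epoch(0).read()
assert batch_rows == stream_rows
```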
[jira] [Commented] (SPARK-33044) Add a Jenkins build and test job for Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-33044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289542#comment-17289542 ] shane knapp commented on SPARK-33044: - hey! sorry, i've been pretty slammed these past few weeks. i should be able to get this done by EOW. On Mon, Oct 19, 2020 at 1:36 PM Dongjoon Hyun (Jira) -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu > Add a Jenkins build and test job for Scala 2.13 > --- > > Key: SPARK-33044 > URL: https://issues.apache.org/jira/browse/SPARK-33044 > Project: Spark > Issue Type: Sub-task > Components: jenkins >Affects Versions: 3.1.0 >Reporter: Yang Jie >Assignee: Shane Knapp >Priority: Major > Attachments: Screen Shot 2020-12-08 at 1.56.59 PM.png, Screen Shot > 2020-12-08 at 1.58.07 PM.png > > > {{Master}} branch seems to be almost ready for Scala 2.13 now, we need a > Jenkins test job to verify current work results and CI. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33044) Add a Jenkins build and test job for Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-33044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289527#comment-17289527 ] shane knapp commented on SPARK-33044: - 1) logins are temporarily disabled due to new campus network security standards. i need to find a non-manual-way of dealing with this asap. 2) i will get to this tomorrow. -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu > Add a Jenkins build and test job for Scala 2.13 > --- > > Key: SPARK-33044 > URL: https://issues.apache.org/jira/browse/SPARK-33044 > Project: Spark > Issue Type: Sub-task > Components: jenkins >Affects Versions: 3.1.0 >Reporter: Yang Jie >Assignee: Shane Knapp >Priority: Major > Attachments: Screen Shot 2020-12-08 at 1.56.59 PM.png, Screen Shot > 2020-12-08 at 1.58.07 PM.png > > > {{Master}} branch seems to be almost ready for Scala 2.13 now, we need a > Jenkins test job to verify current work results and CI. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26138) Pushdown limit through InnerLike when condition is empty
[ https://issues.apache.org/jira/browse/SPARK-26138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-26138. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31567 [https://github.com/apache/spark/pull/31567] > Pushdown limit through InnerLike when condition is empty > > > Key: SPARK-26138 > URL: https://issues.apache.org/jira/browse/SPARK-26138 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: guoxiaolong >Assignee: Yuming Wang >Priority: Minor > Fix For: 3.2.0 > > > In the LimitPushDown batch, the limit can be pushed down through a cross join. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
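Why pushing a limit through a condition-less cross join is safe follows from nested-loop intuition: the first n output rows of an (unordered) cross join need at most the first n rows of each input. A small Python sketch of that equivalence:

```python
from itertools import islice, product

def cross_join(left, right):
    # Nested-loop order: every left row paired with every right row.
    return product(left, right)

left = range(1000)
right = range(1000)
n = 5

# Taking the first n rows of the full cross join...
full = list(islice(cross_join(left, right), n))
# ...equals pushing a local limit of n to each side first, while the
# pushed-down version scans far less input.
pushed = list(islice(cross_join(list(left)[:n], list(right)[:n]), n))
assert full == pushed
```

LIMIT without ORDER BY does not guarantee which rows are returned, which is the property that makes this rewrite legal in the optimizer.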
[jira] [Commented] (SPARK-32195) Standardize warning types and messages
[ https://issues.apache.org/jira/browse/SPARK-32195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289505#comment-17289505 ] RISHAV DUTTA commented on SPARK-32195: -- I am working on it. Can you take up another issue? Thanks and regards, Rishav Dutta > Standardize warning types and messages > -- > > Key: SPARK-32195 > URL: https://issues.apache.org/jira/browse/SPARK-32195 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Maciej Szymkiewicz >Priority: Major > > Currently PySpark uses somewhat inconsistent warning types and messages, such > as UserWarning. We should standardize them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
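"Standardizing" here means replacing ad-hoc UserWarning usages with dedicated warning classes and consistently worded messages. A minimal sketch of the pattern (the class and message below are illustrative, not PySpark's actual names):

```python
import warnings

class PySparkDeprecationWarning(FutureWarning):
    """Illustrative dedicated warning type; not PySpark's real class name."""

def warn_deprecated(old: str, new: str) -> None:
    # One helper means every deprecation warning has the same type,
    # the same wording, and points at the caller via stacklevel=2.
    warnings.warn(
        f"{old} is deprecated and will be removed in a future release; "
        f"use {new} instead.",
        PySparkDeprecationWarning,
        stacklevel=2,
    )

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_deprecated("someOldApi", "someNewApi")  # hypothetical API names

assert len(caught) == 1
assert issubclass(caught[0].category, PySparkDeprecationWarning)
```

A dedicated subclass also lets users filter these warnings precisely, e.g. `warnings.filterwarnings("ignore", category=PySparkDeprecationWarning)`.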
[jira] [Assigned] (SPARK-26138) Pushdown limit through InnerLike when condition is empty
[ https://issues.apache.org/jira/browse/SPARK-26138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-26138: --- Assignee: Yuming Wang > Pushdown limit through InnerLike when condition is empty > > > Key: SPARK-26138 > URL: https://issues.apache.org/jira/browse/SPARK-26138 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: guoxiaolong >Assignee: Yuming Wang >Priority: Minor > > In the LimitPushDown batch, the limit can be pushed down through a cross join. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33044) Add a Jenkins build and test job for Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-33044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289470#comment-17289470 ] shane knapp commented on SPARK-33044: - welp, i manually created a test job and it failed pretty early on: https://amplab.cs.berkeley.edu/jenkins/view/All/job/spark-master-test-maven-hadoop-3.2-hive-2.3-scala-2.13/1/ On Tue, Dec 8, 2020 at 10:59 AM Dongjoon Hyun (Jira) -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu > Add a Jenkins build and test job for Scala 2.13 > --- > > Key: SPARK-33044 > URL: https://issues.apache.org/jira/browse/SPARK-33044 > Project: Spark > Issue Type: Sub-task > Components: jenkins >Affects Versions: 3.1.0 >Reporter: Yang Jie >Assignee: Shane Knapp >Priority: Major > Attachments: Screen Shot 2020-12-08 at 1.56.59 PM.png, Screen Shot > 2020-12-08 at 1.58.07 PM.png > > > {{Master}} branch seems to be almost ready for Scala 2.13 now, we need a > Jenkins test job to verify current work results and CI. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34503) Use zstd for spark.eventLog.compression.codec by default
[ https://issues.apache.org/jira/browse/SPARK-34503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-34503. --- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31618 [https://github.com/apache/spark/pull/31618] > Use zstd for spark.eventLog.compression.codec by default > > > Key: SPARK-34503 > URL: https://issues.apache.org/jira/browse/SPARK-34503 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: releasenotes > Fix For: 3.2.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
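For reference, the event-log settings involved look like this in spark-defaults.conf; the codec line shows the new default, so setting it explicitly is only needed to opt back into another codec such as lz4:

```
spark.eventLog.enabled             true
spark.eventLog.compress            true
spark.eventLog.compression.codec   zstd
```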
[jira] [Assigned] (SPARK-34503) Use zstd for spark.eventLog.compression.codec by default
[ https://issues.apache.org/jira/browse/SPARK-34503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-34503: - Assignee: Dongjoon Hyun > Use zstd for spark.eventLog.compression.codec by default > > > Key: SPARK-34503 > URL: https://issues.apache.org/jira/browse/SPARK-34503 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: releasenotes > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34512) Disable validate default values when parsing Avro schemas
[ https://issues.apache.org/jira/browse/SPARK-34512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-34512: Description: This is a regression problem. How to reproduce this issue: {code:scala} // Add this test to HiveSerDeReadWriteSuite test("SPARK-34512") { withTable("t1") { hiveClient.runSqlHive( """ |CREATE TABLE t1 | ROW FORMAT SERDE | 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' | STORED AS INPUTFORMAT | 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' | OUTPUTFORMAT | 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' | TBLPROPERTIES ( |'avro.schema.literal'='{ | "namespace": "org.apache.spark.sql.hive.test", | "name": "schema_with_default_value", | "type": "record", | "fields": [ | { | "name": "ARRAY_WITH_DEFAULT", | "type": {"type": "array", "items": "string"}, | "default": null | } | ] |}') |""".stripMargin) spark.sql("select * from t1").show } } {code} {noformat} org.apache.avro.AvroTypeException: Invalid default for field ARRAY_WITH_DEFAULT: null not a {"type":"array","items":"string"} at org.apache.avro.Schema.validateDefault(Schema.java:1571) at org.apache.avro.Schema.access$500(Schema.java:87) at org.apache.avro.Schema$Field.(Schema.java:544) at org.apache.avro.Schema.parse(Schema.java:1678) at org.apache.avro.Schema$Parser.parse(Schema.java:1425) at org.apache.avro.Schema$Parser.parse(Schema.java:1413) at org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.getSchemaFor(AvroSerdeUtils.java:268) at org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrThrowException(AvroSerdeUtils.java:111) at org.apache.hadoop.hive.serde2.avro.AvroSerDe.determineSchemaOrReturnErrorSchema(AvroSerDe.java:187) at org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:107) at org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:83) at org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:533) at 
org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:450) at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:437) at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:281) at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:263) at org.apache.hadoop.hive.ql.metadata.Table.getColsInternal(Table.java:641) at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:624) at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:831) at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:867) at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:4356) at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:354) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:199) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100) at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2183) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1839) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1526) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$runHive$1(HiveClientImpl.scala:820) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:291) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:224) at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:223) at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:273) at org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:800) at org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:787) {noformat} was: How to reproduce this issue: {code:scala} // Add this test to HiveSerDeReadWriteSuite test("SPARK-34512") { 
withTable("t1") { hiveClient.runSqlHive( """ |CREATE TABLE t1 | ROW FORMAT SERDE | 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' | STORED AS INPUTFORMAT | 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' | OUTPUTFORMAT | 'org.apache.hadoop.hive.ql
[jira] [Updated] (SPARK-34512) Disable validate default values when parsing Avro schemas
[ https://issues.apache.org/jira/browse/SPARK-34512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-34512: Description: How to reproduce this issue: {code:scala} // Add this test to HiveSerDeReadWriteSuite test("SPARK-34512") { withTable("t1") { hiveClient.runSqlHive( """ |CREATE TABLE t1 | ROW FORMAT SERDE | 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' | STORED AS INPUTFORMAT | 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' | OUTPUTFORMAT | 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' | TBLPROPERTIES ( |'avro.schema.literal'='{ | "namespace": "org.apache.spark.sql.hive.test", | "name": "schema_with_default_value", | "type": "record", | "fields": [ | { | "name": "ARRAY_WITH_DEFAULT", | "type": {"type": "array", "items": "string"}, | "default": null | } | ] |}') |""".stripMargin) spark.sql("select * from t1").show } } {code} {noformat} org.apache.avro.AvroTypeException: Invalid default for field ARRAY_WITH_DEFAULT: null not a {"type":"array","items":"string"} at org.apache.avro.Schema.validateDefault(Schema.java:1571) at org.apache.avro.Schema.access$500(Schema.java:87) at org.apache.avro.Schema$Field.(Schema.java:544) at org.apache.avro.Schema.parse(Schema.java:1678) at org.apache.avro.Schema$Parser.parse(Schema.java:1425) at org.apache.avro.Schema$Parser.parse(Schema.java:1413) at org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.getSchemaFor(AvroSerdeUtils.java:268) at org.apache.hadoop.hive.serde2.avro.AvroSerdeUtils.determineSchemaOrThrowException(AvroSerdeUtils.java:111) at org.apache.hadoop.hive.serde2.avro.AvroSerDe.determineSchemaOrReturnErrorSchema(AvroSerDe.java:187) at org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:107) at org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:83) at org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:533) at 
org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:450) at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:437) at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:281) at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:263) at org.apache.hadoop.hive.ql.metadata.Table.getColsInternal(Table.java:641) at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:624) at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:831) at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:867) at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:4356) at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:354) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:199) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:100) at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2183) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1839) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1526) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$runHive$1(HiveClientImpl.scala:820) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:291) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:224) at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:223) at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:273) at org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:800) at org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:787) {noformat} It worked before. 
was: How to reproduce this issue: {code:scala} {code} > Disable validate default values when parsing Avro schemas > - > > Key: SPARK-34512 > URL: https://issues.apache.org/jira/browse/SPARK-34512 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce
[jira] [Created] (SPARK-34512) Disable validate default values when parsing Avro schemas
Yuming Wang created SPARK-34512: --- Summary: Disable validate default values when parsing Avro schemas Key: SPARK-34512 URL: https://issues.apache.org/jira/browse/SPARK-34512 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0 Reporter: Yuming Wang How to reproduce this issue: {code:scala} {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
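The failure above comes from stricter default-value validation during Avro schema parsing: `"default": null` is rejected for a non-nullable array type unless default validation is disabled. A minimal pure-Python sketch of this validation behavior; the `parse_schema` helper and its `validate_defaults` flag are illustrative stand-ins, not Spark's or Avro's actual API:

```python
import json

def default_matches_type(default, avro_type):
    """Loosely mirrors the kind of check Avro performs on field defaults.
    Only the types needed for this sketch are modelled."""
    if avro_type == "null":
        return default is None
    if avro_type == "string":
        return isinstance(default, str)
    if isinstance(avro_type, dict) and avro_type.get("type") == "array":
        return isinstance(default, list)
    return True  # other Avro types are out of scope for this sketch

def parse_schema(schema_json, validate_defaults=True):
    """Parse a record schema, optionally rejecting invalid defaults."""
    schema = json.loads(schema_json)
    for field in schema.get("fields", []):
        if validate_defaults and "default" in field:
            if not default_matches_type(field["default"], field["type"]):
                raise ValueError(
                    f"Invalid default for field {field['name']}: "
                    f"{field['default']!r} not a {field['type']}")
    return schema

schema = """{
  "name": "schema_with_default_value", "type": "record",
  "fields": [{"name": "ARRAY_WITH_DEFAULT",
              "type": {"type": "array", "items": "string"},
              "default": null}]
}"""

try:
    parse_schema(schema)  # strict mode rejects null default for an array
except ValueError as e:
    print("strict parse failed:", e)
parsed = parse_schema(schema, validate_defaults=False)  # lenient mode succeeds
print("lenient parse ok:", parsed["fields"][0]["name"])
```

With validation disabled, the schema parses and the table remains readable, which is the behavior the ticket title asks for.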
[jira] [Assigned] (SPARK-31891) `ALTER TABLE multipartIdentifier RECOVER PARTITIONS` should drop partition if partition specific location is not exist any more
[ https://issues.apache.org/jira/browse/SPARK-31891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-31891: - Assignee: Maxim Gekk > `ALTER TABLE multipartIdentifier RECOVER PARTITIONS` should drop partition if > partition specific location is not exist any more > --- > > Key: SPARK-31891 > URL: https://issues.apache.org/jira/browse/SPARK-31891 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Zhu, Lipeng >Assignee: Maxim Gekk >Priority: Major > > Currently, when executing > {code:sql} > ALTER TABLE multipartIdentifier RECOVER PARTITIONS > {code} > Spark automatically adds partitions based on the table's root location structure. > Spark should add one more step: check whether each existing partition's specific location still exists and, if it does not, drop the partition. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31891) `ALTER TABLE multipartIdentifier RECOVER PARTITIONS` should drop partition if partition specific location is not exist any more
[ https://issues.apache.org/jira/browse/SPARK-31891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31891. --- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31499 [https://github.com/apache/spark/pull/31499] > `ALTER TABLE multipartIdentifier RECOVER PARTITIONS` should drop partition if > partition specific location is not exist any more > --- > > Key: SPARK-31891 > URL: https://issues.apache.org/jira/browse/SPARK-31891 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Zhu, Lipeng >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.2.0 > > > Currently, when executing > {code:sql} > ALTER TABLE multipartIdentifier RECOVER PARTITIONS > {code} > Spark automatically adds partitions based on the table's root location structure. > Spark should add one more step: check whether each existing partition's specific location still exists and, if it does not, drop the partition. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
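The requested behavior amounts to a two-way sync between the catalog and the filesystem: add partitions newly found under the table root, and drop catalog partitions whose directory has disappeared. A toy model in plain Python, where an in-memory dict stands in for the metastore; `recover_partitions` is an illustrative helper, not Spark's implementation:

```python
import os
import tempfile

def recover_partitions(table_root, catalog_partitions):
    """Sketch of RECOVER PARTITIONS with the proposed drop step.
    catalog_partitions maps partition spec (e.g. 'p=1') -> location."""
    # Partitions present on disk: directories named like 'col=value'.
    on_disk = {d for d in os.listdir(table_root)
               if "=" in d and os.path.isdir(os.path.join(table_root, d))}
    # New on disk but missing from the catalog: add them.
    added = on_disk - set(catalog_partitions)
    # In the catalog but location gone: drop them (the proposed extra step).
    dropped = {spec for spec, loc in catalog_partitions.items()
               if not os.path.isdir(loc)}
    for spec in added:
        catalog_partitions[spec] = os.path.join(table_root, spec)
    for spec in dropped:
        del catalog_partitions[spec]
    return added, dropped

root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, "p=1"))           # partition dir present on disk
catalog = {"p=9": os.path.join(root, "p=9")}  # catalog entry whose dir is gone
added, dropped = recover_partitions(root, catalog)
print(added, dropped)  # {'p=1'} {'p=9'}
```

After the call, the catalog contains only `p=1`: the discovered partition was added and the stale `p=9` entry was dropped, which is exactly the semantics the ticket proposes.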
[jira] [Resolved] (SPARK-34508) skip HiveExternalCatalogVersionsSuite if network is down
[ https://issues.apache.org/jira/browse/SPARK-34508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-34508. --- Fix Version/s: 3.2.0 3.1.1 Resolution: Fixed Issue resolved by pull request 31627 [https://github.com/apache/spark/pull/31627] > skip HiveExternalCatalogVersionsSuite if network is down > > > Key: SPARK-34508 > URL: https://issues.apache.org/jira/browse/SPARK-34508 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.1.1, 3.2.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34508) skip HiveExternalCatalogVersionsSuite if network is down
[ https://issues.apache.org/jira/browse/SPARK-34508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-34508: - Assignee: Wenchen Fan > skip HiveExternalCatalogVersionsSuite if network is down > > > Key: SPARK-34508 > URL: https://issues.apache.org/jira/browse/SPARK-34508 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
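A common way to implement "skip if network is down" is a cheap connectivity probe before the network-dependent test body runs. A sketch in plain Python; the probe host and port are illustrative assumptions, and this is not the actual code of HiveExternalCatalogVersionsSuite:

```python
import socket

def network_reachable(host="maven.apache.org", port=443, timeout=2.0):
    """Best-effort connectivity probe: True if a TCP connection succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def run_download_test():
    """Stand-in for a test that downloads previous Spark releases."""
    if not network_reachable():
        print("SKIPPED: network unreachable")
        return "skipped"
    # ... download and exercise old Spark versions here ...
    return "ran"

print(run_download_test())
```

Probing before the test keeps CI green in offline environments without masking real failures when the network is available.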
[jira] [Commented] (SPARK-34511) Current Security vulnerabilities in spark libraries
[ https://issues.apache.org/jira/browse/SPARK-34511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289339#comment-17289339 ] Dongjoon Hyun commented on SPARK-34511: --- Hi, [~emac3060]. This report is already outdated. Please see the 3.0.2 release notes. For example, we are using commons-compress 1.20 and Jetty 9.4.34. - [https://spark.apache.org/releases/spark-release-3-0-2.html] Could you update this report based on 3.0.2 or 3.1.0 RC3? > Current Security vulnerabilities in spark libraries > --- > > Key: SPARK-34511 > URL: https://issues.apache.org/jira/browse/SPARK-34511 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.0.1 >Reporter: eoin >Priority: Major > Labels: security > Original Estimate: 168h > Remaining Estimate: 168h > > The following libraries have the following vulnerabilities that will fail > Nexus security scans. They are deemed threats of level 7 or higher on the > Sonatype/Nexus scale. Many of them can be fixed by upgrading the dependencies, > as they are fixed in subsequent releases. 
> > com.fasterxml.woodstox : woodstox-core : 5.0.3 * > [https://github.com/FasterXML/woodstox/issues/50] > * [https://github.com/FasterXML/woodstox/issues/51] > * [https://github.com/FasterXML/woodstox/issues/61] > com.nimbusds : nimbus-jose-jwt : 4.41.1 * > [https://bitbucket.org/connect2id/nimbus-jose-jwt/src/master/SECURITY-CHANGELOG.txt] > * [https://connect2id.com/blog/nimbus-jose-jwt-7-9] > Log4j : log4j : 1.2.17 > SocketServer class that is vulnerable to deserialization of untrusted data: * > https://issues.apache.org/jira/browse/LOG4J2-1863 > * > [https://lists.apache.org/thread.html/84cc4266238e057b95eb95dfd8b29d46a2592e7672c12c92f68b2917%40%3Cannounce.apache.org%3E] > * [https://bugzilla.redhat.com/show_bug.cgi?id=1785616] > Dynamic-link Library (DLL) Preloading: > * [https://bz.apache.org/bugzilla/show_bug.cgi?id=50323] > > apache-xerces : xercesImpl : 2.9.1 * hash table collisions -> > https://issues.apache.org/jira/browse/XERCESJ-1685 > * > [https://mail-archives.apache.org/mod_mbox/xerces-j-dev/201410.mbox/%3cof3b40f5f7.e6552a8b-on85257d73.00699ed7-85257d73.006a9...@ca.ibm.com%3E] > * [https://bugzilla.redhat.com/show_bug.cgi?id=1019176] > > com.fasterxml.jackson.core : jackson-databind : 2.10.0 * > [https://github.com/FasterXML/jackson-databind/issues/2589] > > commons-beanutils : commons-beanutils : 1.9.3 * > [http://www.rapid7.com/db/modules/exploit/multi/http/struts_code_exec_classloader] > * https://issues.apache.org/jira/browse/BEANUTILS-463 > > commons-io : commons-io : 2.5 * [https://github.com/apache/commons-io/pull/52] > * https://issues.apache.org/jira/browse/IO-556 > * https://issues.apache.org/jira/browse/IO-559 > > io.netty : netty-all : 4.1.47.Final * > [https://github.com/netty/netty/issues/10351] > * [https://github.com/netty/netty/pull/10560] > > org.apache.commons : commons-compress : 1.18 * > [https://commons.apache.org/proper/commons-compress/security-reports.html#Apache_Commons_Compress_Security_Vulnerabilities] > > 
org.apache.hadoop : hadoop-hdfs : 2.7.4 * > [https://lists.apache.org/thread.html/rca4516b00b55b347905df45e5d0432186248223f30497db87aba8710@%3Cannounce.apache.org%3E] > * > [https://lists.apache.org/thread.html/caacbbba2dcc1105163f76f3dfee5fbd22e0417e0783212787086378@%3Cgeneral.hadoop.apache.org%3E] > * [https://hadoop.apache.org/cve_list.html] > * [https://www.openwall.com/lists/oss-security/2019/01/24/3] > > org.apache.hadoop : hadoop-mapreduce-client-core : 2.7.4 * > [https://bugzilla.redhat.com/show_bug.cgi?id=1516399] > * > [https://lists.apache.org/thread.html/2e16689b44bdd1976b6368c143a4017fc7159d1f2d02a5d54fe9310f@%3Cgeneral.hadoop.apache.org%3E] > > org.codehaus.jackson : jackson-mapper-asl : 1.9.13 * > [https://github.com/FasterXML/jackson-databind/issues/1599] > * [https://blog.sonatype.com/jackson-databind-remote-code-execution] > * [https://blog.sonatype.com/jackson-databind-the-end-of-the-blacklist] > * [https://bugzilla.redhat.com/show_bug.cgi?id=CVE-2017-7525] > * [https://access.redhat.com/security/cve/cve-2019-10172] > * [https://bugzilla.redhat.com/show_bug.cgi?id=1715075] > * [https://nvd.nist.gov/vuln/detail/CVE-2019-10172] > > org.eclipse.jetty : jetty-http : 9.3.24.v20180605: * > [https://bugs.eclipse.org/bugs/show_bug.cgi?id=538096] > > org.eclipse.jetty : jetty-webapp : 9.3.24.v20180605 * > [https://bugs.eclipse.org/bugs/show_bug.cgi?id=567921] > * [https://github.com/eclipse/jetty.project/issues/5451] > * > [https://github.com/eclipse/jetty.project/security/advisories/GHSA-g3wg-6mcf-8jj6] > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (SPARK-34511) Current Security vulnerabilities in spark libraries
[ https://issues.apache.org/jira/browse/SPARK-34511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-34511: -- Flags: (was: Patch) > Current Security vulnerabilities in spark libraries > --- > > Key: SPARK-34511 > URL: https://issues.apache.org/jira/browse/SPARK-34511 > Project: Spark > Issue Type: Dependency upgrade > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: eoin >Priority: Major > Labels: security > Original Estimate: 168h > Remaining Estimate: 168h > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34511) Current Security vulnerabilities in spark libraries
[ https://issues.apache.org/jira/browse/SPARK-34511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-34511: -- Component/s: (was: Spark Core) Build > Current Security vulnerabilities in spark libraries > --- > > Key: SPARK-34511 > URL: https://issues.apache.org/jira/browse/SPARK-34511 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.0.1 >Reporter: eoin >Priority: Major > Labels: security > Original Estimate: 168h > Remaining Estimate: 168h > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
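Triaging a report like this usually means comparing each dependency coordinate's version against the first release that fixes the issue. A small sketch of that comparison; the `FIXED_IN` table below is illustrative only, not an authoritative advisory list:

```python
def parse_version(v):
    """Split a dotted version into comparable integer parts; non-numeric
    suffixes (e.g. the 'v20180605' in '9.3.24.v20180605') are ignored."""
    parts = []
    for p in v.split("."):
        if p.isdigit():
            parts.append(int(p))
        else:
            break
    return tuple(parts)

# Minimum fixed versions, for illustration only.
FIXED_IN = {
    "org.apache.commons:commons-compress": "1.19",
    "commons-io:commons-io": "2.7",
}

def flag_vulnerable(coordinate, version):
    """True if the coordinate is known and its version predates the fix."""
    fixed = FIXED_IN.get(coordinate)
    return fixed is not None and parse_version(version) < parse_version(fixed)

print(flag_vulnerable("org.apache.commons:commons-compress", "1.18"))  # True
print(flag_vulnerable("org.apache.commons:commons-compress", "1.20"))  # False
```

This kind of check is what scanners such as Nexus automate; keeping it as a script makes it easy to re-run against a newer Spark release, as the comment on this ticket requests.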
[jira] [Updated] (SPARK-34510) .foreachPartition command hangs when ran inside Python package but works when ran from Python file outside the package on EMR
[ https://issues.apache.org/jira/browse/SPARK-34510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuriy updated SPARK-34510: -- Description: I'm running on EMR Pyspark 3.0.0. with project structure below, process.py is what controls the flow of the application and calls code inside the _file_processor_ package. The command hangs when the .foreachPartition code that is located inside _s3_repo.py_ is called by _process.py_. When the same .foreachPartition code is moved from _s3_repo.py_ and placed inside the _process.py_ it runs just fine. {code:java} process.py file_processor config spark.py repository s3_repo.py structure table_creator.py {code} *process.py* {code:java} from file_processor.structure import table_creator from file_processor.repository import s3_repo def process(): table_creator.create_table() s3_repo.save_to_s3() if __name__ == '__main__': process() {code} *spark.py* {code:java} from pyspark.sql import SparkSession spark_session = SparkSession.builder.appName("Test").getOrCreate() {code} *s3_repo.py* {code:java} from file_processor.config.spark import spark_session def save_to_s3(): spark_session.sql('SELECT * FROM rawFileData').toJSON().foreachPartition(_save_to_s3) def _save_to_s3(iterator): for record in iterator: print(record) {code} *table_creator.py* {code:java} from file_processor.config.spark import spark_session from pyspark.sql import Row def create_table(): file_contents = [ {'line_num': 1, 'contents': 'line 1'}, {'line_num': 2, 'contents': 'line 2'}, {'line_num': 3, 'contents': 'line 3'} ] spark_session.createDataFrame(Row(**row) for row in file_contents).cache().createOrReplaceTempView("rawFileData") {code} was: I'm running on EMR Pyspark 3.0.0. with a project structure below, process.py is what controls the flow of the application and calls code inside the _file_processor_ package. The command hangs when the .foreachPartition code that is located inside _s3_repo.py_ is called by _process.py_. 
When the same .foreachPartition code is moved from _s3_repo.py_ and placed inside the _process.py_ it runs just fine. {code:java} process.py file_processor config spark.py repository s3_repo.py structure table_creator.py {code} *process.py* {code:java} from file_processor.structure import table_creator from file_processor.repository import s3_repo def process(): table_creator.create_table() s3_repo.save_to_s3() if __name__ == '__main__': process() {code} *spark.py* {code:java} from pyspark.sql import SparkSession spark_session = SparkSession.builder.appName("Test").getOrCreate() {code} *s3_repo.py* {code:java} from file_processor.config.spark import spark_session def save_to_s3(): spark_session.sql('SELECT * FROM rawFileData').toJSON().foreachPartition(_save_to_s3) def _save_to_s3(iterator): for record in iterator: print(record) {code} *table_creator.py* {code:java} from file_processor.config.spark import spark_session from pyspark.sql import Row def create_table(): file_contents = [ {'line_num': 1, 'contents': 'line 1'}, {'line_num': 2, 'contents': 'line 2'}, {'line_num': 3, 'contents': 'line 3'} ] spark_session.createDataFrame(Row(**row) for row in file_contents).cache().createOrReplaceTempView("rawFileData") {code} > .foreachPartition command hangs when ran inside Python package but works when > ran from Python file outside the package on EMR > - > > Key: SPARK-34510 > URL: https://issues.apache.org/jira/browse/SPARK-34510 > Project: Spark > Issue Type: Bug > Components: EC2, PySpark >Affects Versions: 3.0.0 >Reporter: Yuriy >Priority: Minor > Attachments: Code.zip > > > I'm running on EMR Pyspark 3.0.0. with project structure below, process.py is > what controls the flow of the application and calls code inside the > _file_processor_ package. The command hangs when the .foreachPartition code > that is located inside _s3_repo.py_ is called by _process.py_. 
When the same > .foreachPartition code is moved from _s3_repo.py_ and placed inside the > _process.py_ it runs just fine. > {code:java} > process.py > file_processor > config > spark.py > repository > s3_repo.py > structure > table_creator.py > {code} > *process.py* > {code:java} > from file_processor.structure import table_creator > from file_processor.repository import s3_repo > def process(): > table_creator.create_table() > s3_repo.save_to_s3() > if __name__ == '__main__': > process() > {code} > *spark.py* > {code:java} > from py
[jira] [Updated] (SPARK-34510) .foreachPartition command hangs when ran inside Python package but works when ran from Python file outside the package on EMR
[ https://issues.apache.org/jira/browse/SPARK-34510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuriy updated SPARK-34510: -- Description: I'm running on EMR Pyspark 3.0.0. with a project structure below, process.py is what controls the flow of the application and calls code inside the _file_processor_ package. The command hangs when the .foreachPartition code that is located inside _s3_repo.py_ is called by _process.py_. When the same .foreachPartition code is moved from _s3_repo.py_ and placed inside the _process.py_ it runs just fine. {code:java} process.py file_processor config spark.py repository s3_repo.py structure table_creator.py {code} *process.py* {code:java} from file_processor.structure import table_creator from file_processor.repository import s3_repo def process(): table_creator.create_table() s3_repo.save_to_s3() if __name__ == '__main__': process() {code} *spark.py* {code:java} from pyspark.sql import SparkSession spark_session = SparkSession.builder.appName("Test").getOrCreate() {code} *s3_repo.py* {code:java} from file_processor.config.spark import spark_session def save_to_s3(): spark_session.sql('SELECT * FROM rawFileData').toJSON().foreachPartition(_save_to_s3) def _save_to_s3(iterator): for record in iterator: print(record) {code} *table_creator.py* {code:java} from file_processor.config.spark import spark_session from pyspark.sql import Row def create_table(): file_contents = [ {'line_num': 1, 'contents': 'line 1'}, {'line_num': 2, 'contents': 'line 2'}, {'line_num': 3, 'contents': 'line 3'} ] spark_session.createDataFrame(Row(**row) for row in file_contents).cache().createOrReplaceTempView("rawFileData") {code} was: I'm running on EMR Pyspark 3.0.0. with a project structure below, process.py is what controls the flow of the application and calls code inside the file_processor package. The command hangs when the .foreachPartition code that is located inside _s3_repo.py_ is called by _process.py_. 
When the same .foreachPartition code is moved from _s3_repo.py_ and placed inside the _process.py_ it runs just fine. {code:java} process.py file_processor config spark.py repository s3_repo.py structure table_creator.py {code} *process.py* {code:java} from file_processor.structure import table_creator from file_processor.repository import s3_repo def process(): table_creator.create_table() s3_repo.save_to_s3() if __name__ == '__main__': process() {code} *spark.py* {code:java} from pyspark.sql import SparkSession spark_session = SparkSession.builder.appName("Test").getOrCreate() {code} *s3_repo.py* {code:java} from file_processor.config.spark import spark_session def save_to_s3(): spark_session.sql('SELECT * FROM rawFileData').toJSON().foreachPartition(_save_to_s3) def _save_to_s3(iterator): for record in iterator: print(record) {code} *table_creator.py* {code:java} from file_processor.config.spark import spark_session from pyspark.sql import Row def create_table(): file_contents = [ {'line_num': 1, 'contents': 'line 1'}, {'line_num': 2, 'contents': 'line 2'}, {'line_num': 3, 'contents': 'line 3'} ] spark_session.createDataFrame(Row(**row) for row in file_contents).cache().createOrReplaceTempView("rawFileData") {code} > .foreachPartition command hangs when ran inside Python package but works when > ran from Python file outside the package on EMR > - > > Key: SPARK-34510 > URL: https://issues.apache.org/jira/browse/SPARK-34510 > Project: Spark > Issue Type: Bug > Components: EC2, PySpark >Affects Versions: 3.0.0 >Reporter: Yuriy >Priority: Minor > Attachments: Code.zip > > > I'm running on EMR Pyspark 3.0.0. with a project structure below, process.py > is what controls the flow of the application and calls code inside the > _file_processor_ package. The command hangs when the .foreachPartition code > that is located inside _s3_repo.py_ is called by _process.py_. 
When the same > .foreachPartition code is moved from _s3_repo.py_ and placed inside the > _process.py_ it runs just fine. > {code:java} > process.py > file_processor > config > spark.py > repository > s3_repo.py > structure > table_creator.py > {code} > *process.py* > {code:java} > from file_processor.structure import table_creator > from file_processor.repository import s3_repo > def process(): > table_creator.create_table() > s3_repo.save_to_s3() > if __name__ == '__main__': > process() > {code} > *spark.py* > {code:java} > from
[jira] [Updated] (SPARK-34510) .foreachPartition command hangs when ran inside Python package but works when ran from Python file outside the package on EMR
[ https://issues.apache.org/jira/browse/SPARK-34510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuriy updated SPARK-34510: -- Description:
I'm running PySpark 3.0.0 on EMR with the project structure below; process.py controls the flow of the application and calls code inside the file_processor package. The command hangs when the .foreachPartition code located inside _s3_repo.py_ is called by _process.py_. When the same .foreachPartition code is moved from _s3_repo.py_ into _process.py_, it runs just fine.
{code:java}
process.py
file_processor
    config
        spark.py
    repository
        s3_repo.py
    structure
        table_creator.py
{code}
*process.py*
{code:java}
from file_processor.structure import table_creator
from file_processor.repository import s3_repo

def process():
    table_creator.create_table()
    s3_repo.save_to_s3()

if __name__ == '__main__':
    process()
{code}
*spark.py*
{code:java}
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName("Test").getOrCreate()
{code}
*s3_repo.py*
{code:java}
from file_processor.config.spark import spark_session

def save_to_s3():
    spark_session.sql('SELECT * FROM rawFileData').toJSON().foreachPartition(_save_to_s3)

def _save_to_s3(iterator):
    for record in iterator:
        print(record)
{code}
*table_creator.py*
{code:java}
from file_processor.config.spark import spark_session
from pyspark.sql import Row

def create_table():
    file_contents = [
        {'line_num': 1, 'contents': 'line 1'},
        {'line_num': 2, 'contents': 'line 2'},
        {'line_num': 3, 'contents': 'line 3'}
    ]
    spark_session.createDataFrame(Row(**row) for row in file_contents).cache().createOrReplaceTempView("rawFileData")
{code}

was:
I'm running PySpark 3.0.0 on EMR with the project structure below; process.py controls the flow of the application and calls code inside the file_processor package. The command hangs when the .foreachPartition code located inside _s3_repo.py_ is called by _process.py_. When the same .foreachPartition code is moved from _s3_repo.py_ into _process.py_, it runs just fine.
{code:java}
process.py
file_processor
    config
        spark.py
    repository
        s3_repo.py
    structure
        table_creator.py
{code}
*process.py*
{code:java}
from file_processor.structure import table_creator
from file_processor.repository import s3_repo

def process():
    table_creator.create_table()
    s3_repo.save_to_s3()

if __name__ == '__main__':
    process()
{code}
*spark.py*
{code:java}
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName("Test").getOrCreate()
{code}
*s3_repo.py*
{code:java}
from file_processor.config.spark import spark_session

def save_to_s3():
    spark_session.sql('SELECT * FROM rawFileData').toJSON().foreachPartition(_save_to_s3)

def _save_to_s3(iterator):
    for record in iterator:
        print(record)
{code}
*table_creator.py*
{code:java}
from file_processor.config.spark import spark_session
from pyspark.sql import Row

def create_table():
    file_contents = [
        {'line_num': 1, 'contents': 'line 1'},
        {'line_num': 2, 'contents': 'line 2'},
        {'line_num': 3, 'contents': 'line 3'}
    ]
    spark_session.createDataFrame(Row(**row) for row in file_contents).cache().createOrReplaceTempView("rawFileData")
{code}

> .foreachPartition command hangs when ran inside Python package but works when
> ran from Python file outside the package on EMR
> -
>
> Key: SPARK-34510
> URL: https://issues.apache.org/jira/browse/SPARK-34510
> Project: Spark
> Issue Type: Bug
> Components: EC2, PySpark
> Affects Versions: 3.0.0
> Reporter: Yuriy
> Priority: Minor
> Attachments: Code.zip
>
>
> I'm running PySpark 3.0.0 on EMR with the project structure below; process.py controls the flow of the application and calls code inside the file_processor package. The command hangs when the .foreachPartition code located inside _s3_repo.py_ is called by _process.py_. When the same .foreachPartition code is moved from _s3_repo.py_ into _process.py_, it runs just fine.
> {code:java}
> process.py
> file_processor
>     config
>         spark.py
>     repository
>         s3_repo.py
>     structure
>         table_creator.py
> {code}
> *process.py*
> {code:java}
> from file_processor.structure import table_creator
> from file_processor.repository import s3_repo
>
> def process():
>     table_creator.create_table()
>     s3_repo.save_to_s3()
>
> if __name__ == '__main__':
>     process()
> {code}
> *spark.py*
> {code:java}
> from py
[jira] [Updated] (SPARK-34510) .foreachPartition command hangs when ran inside Python package but works when ran from Python file outside the package on EMR
[ https://issues.apache.org/jira/browse/SPARK-34510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuriy updated SPARK-34510: -- Description:
I'm running PySpark 3.0.0 on EMR with the project structure below; process.py controls the flow of the application and calls code inside the file_processor package. The command hangs when the .foreachPartition code located inside _s3_repo.py_ is called by _process.py_. When the same .foreachPartition code is moved from _s3_repo.py_ into _process.py_, it runs just fine.
{code:java}
process.py
file_processor
    config
        spark.py
    repository
        s3_repo.py
    structure
        table_creator.py
{code}
*process.py*
{code:java}
from file_processor.structure import table_creator
from file_processor.repository import s3_repo

def process():
    table_creator.create_table()
    s3_repo.save_to_s3()

if __name__ == '__main__':
    process()
{code}
*spark.py*
{code:java}
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName("Test").getOrCreate()
{code}
*s3_repo.py*
{code:java}
from file_processor.config.spark import spark_session

def save_to_s3():
    spark_session.sql('SELECT * FROM rawFileData').toJSON().foreachPartition(_save_to_s3)

def _save_to_s3(iterator):
    for record in iterator:
        print(record)
{code}
*table_creator.py*
{code:java}
from file_processor.config.spark import spark_session
from pyspark.sql import Row

def create_table():
    file_contents = [
        {'line_num': 1, 'contents': 'line 1'},
        {'line_num': 2, 'contents': 'line 2'},
        {'line_num': 3, 'contents': 'line 3'}
    ]
    spark_session.createDataFrame(Row(**row) for row in file_contents).cache().createOrReplaceTempView("rawFileData")
{code}

was:
I'm running PySpark 3.0.0 on EMR with the project structure below; process.py controls the flow of the application and calls code inside the file_processor package.
{code:java}
process.py
file_processor
    config
        spark.py
    repository
        s3_repo.py
    structure
        table_creator.py
{code}
The command hangs when the .foreachPartition code that is located inside s3_repo.py is called by process.py. When the same .foreachPartition code is moved from s3_repo.py and placed inside process.py, it runs just fine.
process.py
{code:java}
from file_processor.structure import table_creator
from file_processor.repository import s3_repo

def process():
    table_creator.create_table()
    s3_repo.save_to_s3()

if __name__ == '__main__':
    process()
{code}
spark.py
{code:java}
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName("Test").getOrCreate()
{code}
s3_repo.py

> .foreachPartition command hangs when ran inside Python package but works when
> ran from Python file outside the package on EMR
> -
>
> Key: SPARK-34510
> URL: https://issues.apache.org/jira/browse/SPARK-34510
> Project: Spark
> Issue Type: Bug
> Components: EC2, PySpark
> Affects Versions: 3.0.0
> Reporter: Yuriy
> Priority: Minor
> Attachments: Code.zip
>
>
> I'm running PySpark 3.0.0 on EMR with the project structure below; process.py controls the flow of the application and calls code inside the file_processor package. The command hangs when the .foreachPartition code located inside _s3_repo.py_ is called by _process.py_. When the same .foreachPartition code is moved from _s3_repo.py_ into _process.py_, it runs just fine.
> {code:java}
> process.py
> file_processor
>     config
>         spark.py
>     repository
>         s3_repo.py
>     structure
>         table_creator.py
> {code}
> *process.py*
> {code:java}
> from file_processor.structure import table_creator
> from file_processor.repository import s3_repo
>
> def process():
>     table_creator.create_table()
>     s3_repo.save_to_s3()
>
> if __name__ == '__main__':
>     process()
> {code}
> *spark.py*
> {code:java}
> from pyspark.sql import SparkSession
> spark_session = SparkSession.builder.appName("Test").getOrCreate()
> {code}
> *s3_repo.py*
> {code:java}
> from file_processor.config.spark import spark_session
> def save_to_s3():
>     spark_session.sql('SELECT * FROM rawFileData').toJSON().foreachPartition(_save_to_s3)
> def _save_to_s3(iterator):
>     for record in iterator:
>         print(record)
> {code}
>
> *table_creator.py*
> {code:java}
> from file_processor.config.spark import spark_session
> from pyspark.sql import Row
> def create_table():
>     file_contents = [
>         {'line_num': 1, 'contents': 'line 1'},
>         {'line_num': 2, 'contents': 'line 2'},
>         {'line_num': 3, 'content
[jira] [Updated] (SPARK-34510) .foreachPartition command hangs when ran inside Python package but works when ran from Python file outside the package on EMR
[ https://issues.apache.org/jira/browse/SPARK-34510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuriy updated SPARK-34510: -- Description:
I'm running PySpark 3.0.0 on EMR with the project structure below; process.py controls the flow of the application and calls code inside the file_processor package.
{code:java}
process.py
file_processor
    config
        spark.py
    repository
        s3_repo.py
    structure
        table_creator.py
{code}
The command hangs when the .foreachPartition code that is located inside s3_repo.py is called by process.py. When the same .foreachPartition code is moved from s3_repo.py and placed inside process.py, it runs just fine.
process.py
{code:java}
from file_processor.structure import table_creator
from file_processor.repository import s3_repo

def process():
    table_creator.create_table()
    s3_repo.save_to_s3()

if __name__ == '__main__':
    process()
{code}
spark.py
{code:java}
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName("Test").getOrCreate()
{code}
s3_repo.py

was: I provided full description of the issue on Stack Overflow via the following link https://stackoverflow.com/questions/66300313

> .foreachPartition command hangs when ran inside Python package but works when
> ran from Python file outside the package on EMR
> ---
>
> Key: SPARK-34510
> URL: https://issues.apache.org/jira/browse/SPARK-34510
> Project: Spark
> Issue Type: Bug
> Components: EC2, PySpark
> Affects Versions: 3.0.0
> Reporter: Yuriy
> Priority: Minor
> Attachments: Code.zip
>
>
> I'm running PySpark 3.0.0 on EMR with the project structure below; process.py controls the flow of the application and calls code inside the file_processor package.
> {code:java}
> process.py
> file_processor
>     config
>         spark.py
>     repository
>         s3_repo.py
>     structure
>         table_creator.py
> {code}
> The command hangs when the .foreachPartition code that is located inside s3_repo.py is called by process.py. When the same .foreachPartition code is moved from s3_repo.py and placed inside process.py, it runs just fine.
>
> process.py
> {code:java}
> from file_processor.structure import table_creator
> from file_processor.repository import s3_repo
>
> def process():
>     table_creator.create_table()
>     s3_repo.save_to_s3()
>
> if __name__ == '__main__':
>     process()
> {code}
> spark.py
> {code:java}
> from pyspark.sql import SparkSession
> spark_session = SparkSession.builder.appName("Test").getOrCreate()
> {code}
> s3_repo.py
>
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
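A mechanism worth knowing when reading reports like this (an editorial illustration, not part of the original report): Python pickles a top-level function by reference (module path plus qualified name), so a worker that unpickles the `_save_to_s3` passed to `.foreachPartition` must import `file_processor.repository.s3_repo`, which imports `file_processor.config.spark` and re-runs its module-level `SparkSession.builder...getOrCreate()` on the worker. The following Spark-free sketch demonstrates only that pickling behavior; the `sideeffect_mod` module is invented for illustration:

```python
import os
import pickle
import sys
import tempfile
import textwrap

# Create a tiny importable module whose import has a visible side effect,
# standing in for file_processor/config/spark.py calling getOrCreate().
mod_src = textwrap.dedent("""
    SIDE_EFFECT_RAN = True  # stand-in for SparkSession.builder...getOrCreate()

    def handler(iterator):
        return list(iterator)
""")
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "sideeffect_mod.py"), "w") as f:
    f.write(mod_src)
sys.path.insert(0, tmp)

import sideeffect_mod

# Functions pickle by reference: the payload names the module, not the code.
payload = pickle.dumps(sideeffect_mod.handler)

# Simulate a fresh worker process: forget the module, then unpickle.
del sys.modules["sideeffect_mod"]
restored = pickle.loads(payload)  # forces "import sideeffect_mod" again

print("sideeffect_mod" in sys.modules)  # True: module-level code ran again
print(restored([1, 2]))
```

Keeping the partition handler in a module whose import has no Spark-session side effects (or creating the session lazily inside a function) avoids re-running session construction on executors.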
[jira] [Assigned] (SPARK-34502) Remove unused parameters in join methods
[ https://issues.apache.org/jira/browse/SPARK-34502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh reassigned SPARK-34502: --- Assignee: Huaxin Gao > Remove unused parameters in join methods > > > Key: SPARK-34502 > URL: https://issues.apache.org/jira/browse/SPARK-34502 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Trivial > > Remove unused parameters in some join methods -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34502) Remove unused parameters in join methods
[ https://issues.apache.org/jira/browse/SPARK-34502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh resolved SPARK-34502. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31617 [https://github.com/apache/spark/pull/31617] > Remove unused parameters in join methods > > > Key: SPARK-34502 > URL: https://issues.apache.org/jira/browse/SPARK-34502 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Trivial > Fix For: 3.2.0 > > > Remove unused parameters in some join methods -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34380) Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES
[ https://issues.apache.org/jira/browse/SPARK-34380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34380: Assignee: Terry Kim (was: Apache Spark) > Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES > > > Key: SPARK-34380 > URL: https://issues.apache.org/jira/browse/SPARK-34380 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > > Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34380) Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES
[ https://issues.apache.org/jira/browse/SPARK-34380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34380: Assignee: Apache Spark (was: Terry Kim) > Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES > > > Key: SPARK-34380 > URL: https://issues.apache.org/jira/browse/SPARK-34380 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Terry Kim >Assignee: Apache Spark >Priority: Major > > Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-34380) Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES
[ https://issues.apache.org/jira/browse/SPARK-34380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reopened SPARK-34380: - > Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES > > > Key: SPARK-34380 > URL: https://issues.apache.org/jira/browse/SPARK-34380 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Fix For: 3.2.0 > > > Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34380) Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES
[ https://issues.apache.org/jira/browse/SPARK-34380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-34380: Fix Version/s: (was: 3.2.0) > Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES > > > Key: SPARK-34380 > URL: https://issues.apache.org/jira/browse/SPARK-34380 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > > Support ifExists for ALTER TABLE ... UNSET TBLPROPERTIES -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34472) SparkContext.addJar with an ivy path fails in cluster mode with a custom ivySettings file
[ https://issues.apache.org/jira/browse/SPARK-34472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289258#comment-17289258 ] Shardul Mahadik commented on SPARK-34472: - [~xkrogen] raised a good point at https://github.com/apache/spark/pull/31591#discussion_r579324686 that we should refactor YarnClusterSuite to extract common parameter handling code to be shared across tests. Will do this as a followup. > SparkContext.addJar with an ivy path fails in cluster mode with a custom > ivySettings file > - > > Key: SPARK-34472 > URL: https://issues.apache.org/jira/browse/SPARK-34472 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Shardul Mahadik >Priority: Major > > SPARK-33084 introduced support for Ivy paths in {{sc.addJar}} or Spark SQL > {{ADD JAR}}. If we use a custom ivySettings file using > {{spark.jars.ivySettings}}, it is loaded at > [https://github.com/apache/spark/blob/b26e7b510bbaee63c4095ab47e75ff2a70e377d7/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L1280.] > However, this file is only accessible on the client machine. In cluster > mode, this file is not available on the driver and so {{addJar}} fails. 
> {code:sh}
> spark-submit --master yarn --deploy-mode cluster --class IvyAddJarExample --conf spark.jars.ivySettings=/path/to/ivySettings.xml example.jar
> {code}
> {code}
> java.lang.IllegalArgumentException: requirement failed: Ivy settings file /path/to/ivySettings.xml does not exist
>   at scala.Predef$.require(Predef.scala:281)
>   at org.apache.spark.deploy.SparkSubmitUtils$.loadIvySettings(SparkSubmit.scala:1331)
>   at org.apache.spark.util.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:176)
>   at org.apache.spark.util.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:156)
>   at org.apache.spark.sql.internal.SessionResourceLoader.resolveJars(SessionState.scala:166)
>   at org.apache.spark.sql.hive.HiveSessionResourceLoader.addJar(HiveSessionStateBuilder.scala:133)
>   at org.apache.spark.sql.execution.command.AddJarCommand.run(resources.scala:40)
> {code}
> We should ship the ivySettings file to the driver so that {{addJar}} is able to find it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
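The failure mode described above can be pictured with a small, Spark-free sketch (the directories and the "shipping" step below are invented for illustration; this is not Spark's actual code): the conf carries a path that exists only on the submitting machine, so an existence check on the driver fails unless the file is distributed with the application.

```python
import os
import shutil
import tempfile

# Two directories stand in for two machines' filesystems.
client_fs = tempfile.mkdtemp(prefix="client-")  # the submitting machine
driver_fs = tempfile.mkdtemp(prefix="driver-")  # the cluster-mode driver's container

settings = os.path.join(client_fs, "ivySettings.xml")
with open(settings, "w") as f:
    f.write("<ivysettings/>")

# The conf carries the client-local absolute path.
conf = {"spark.jars.ivySettings": settings}

# On the driver machine that path does not exist, so a
# require(file.exists) style check raises, as in the stack trace above.
driver_view = conf["spark.jars.ivySettings"].replace(client_fs, driver_fs)
visible_on_driver = os.path.exists(driver_view)
print(visible_on_driver)  # False

# Shape of the proposed fix: ship the file with the app and point the conf
# at the shipped copy on the driver's filesystem.
shipped = shutil.copy(settings, driver_fs)
conf["spark.jars.ivySettings"] = shipped
print(os.path.exists(conf["spark.jars.ivySettings"]))  # True
```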
[jira] [Created] (SPARK-34511) Current Security vulnerabilities in spark libraries
eoin created SPARK-34511: Summary: Current Security vulnerabilities in spark libraries Key: SPARK-34511 URL: https://issues.apache.org/jira/browse/SPARK-34511 Project: Spark Issue Type: Dependency upgrade Components: Spark Core Affects Versions: 3.0.1 Reporter: eoin

The following libraries have vulnerabilities that will fail Nexus security scans; they are rated as threats of level 7 or higher on the Sonatype/Nexus scale. Many of them can be fixed by upgrading the dependencies, as they are fixed in subsequent releases.

com.fasterxml.woodstox : woodstox-core : 5.0.3
* [https://github.com/FasterXML/woodstox/issues/50]
* [https://github.com/FasterXML/woodstox/issues/51]
* [https://github.com/FasterXML/woodstox/issues/61]

com.nimbusds : nimbus-jose-jwt : 4.41.1
* [https://bitbucket.org/connect2id/nimbus-jose-jwt/src/master/SECURITY-CHANGELOG.txt]
* [https://connect2id.com/blog/nimbus-jose-jwt-7-9]

Log4j : log4j : 1.2.17
SocketServer class that is vulnerable to deserialization of untrusted data:
* https://issues.apache.org/jira/browse/LOG4J2-1863
* [https://lists.apache.org/thread.html/84cc4266238e057b95eb95dfd8b29d46a2592e7672c12c92f68b2917%40%3Cannounce.apache.org%3E]
* [https://bugzilla.redhat.com/show_bug.cgi?id=1785616]
Dynamic-link Library (DLL) Preloading:
* [https://bz.apache.org/bugzilla/show_bug.cgi?id=50323]

apache-xerces : xercesImpl : 2.9.1
* hash table collisions -> https://issues.apache.org/jira/browse/XERCESJ-1685
* [https://mail-archives.apache.org/mod_mbox/xerces-j-dev/201410.mbox/%3cof3b40f5f7.e6552a8b-on85257d73.00699ed7-85257d73.006a9...@ca.ibm.com%3E]
* [https://bugzilla.redhat.com/show_bug.cgi?id=1019176]

com.fasterxml.jackson.core : jackson-databind : 2.10.0
* [https://github.com/FasterXML/jackson-databind/issues/2589]

commons-beanutils : commons-beanutils : 1.9.3
* [http://www.rapid7.com/db/modules/exploit/multi/http/struts_code_exec_classloader]
* https://issues.apache.org/jira/browse/BEANUTILS-463

commons-io : commons-io : 2.5
* [https://github.com/apache/commons-io/pull/52]
* https://issues.apache.org/jira/browse/IO-556
* https://issues.apache.org/jira/browse/IO-559

io.netty : netty-all : 4.1.47.Final
* [https://github.com/netty/netty/issues/10351]
* [https://github.com/netty/netty/pull/10560]

org.apache.commons : commons-compress : 1.18
* [https://commons.apache.org/proper/commons-compress/security-reports.html#Apache_Commons_Compress_Security_Vulnerabilities]

org.apache.hadoop : hadoop-hdfs : 2.7.4
* [https://lists.apache.org/thread.html/rca4516b00b55b347905df45e5d0432186248223f30497db87aba8710@%3Cannounce.apache.org%3E]
* [https://lists.apache.org/thread.html/caacbbba2dcc1105163f76f3dfee5fbd22e0417e0783212787086378@%3Cgeneral.hadoop.apache.org%3E]
* [https://hadoop.apache.org/cve_list.html]
* [https://www.openwall.com/lists/oss-security/2019/01/24/3]

org.apache.hadoop : hadoop-mapreduce-client-core : 2.7.4
* [https://bugzilla.redhat.com/show_bug.cgi?id=1516399]
* [https://lists.apache.org/thread.html/2e16689b44bdd1976b6368c143a4017fc7159d1f2d02a5d54fe9310f@%3Cgeneral.hadoop.apache.org%3E]

org.codehaus.jackson : jackson-mapper-asl : 1.9.13
* [https://github.com/FasterXML/jackson-databind/issues/1599]
* [https://blog.sonatype.com/jackson-databind-remote-code-execution]
* [https://blog.sonatype.com/jackson-databind-the-end-of-the-blacklist]
* [https://bugzilla.redhat.com/show_bug.cgi?id=CVE-2017-7525]
* [https://access.redhat.com/security/cve/cve-2019-10172]
* [https://bugzilla.redhat.com/show_bug.cgi?id=1715075]
* [https://nvd.nist.gov/vuln/detail/CVE-2019-10172]

org.eclipse.jetty : jetty-http : 9.3.24.v20180605:
* [https://bugs.eclipse.org/bugs/show_bug.cgi?id=538096]

org.eclipse.jetty : jetty-webapp : 9.3.24.v20180605
* [https://bugs.eclipse.org/bugs/show_bug.cgi?id=567921]
* [https://github.com/eclipse/jetty.project/issues/5451]
* [https://github.com/eclipse/jetty.project/security/advisories/GHSA-g3wg-6mcf-8jj6]

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34510) .foreachPartition command hangs when ran inside Python package but works when ran from Python file outside the package on EMR
[ https://issues.apache.org/jira/browse/SPARK-34510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuriy updated SPARK-34510: -- Summary: .foreachPartition command hangs when ran inside Python package but works when ran from Python file outside the package on EMR (was: .foreachPartition command hangs when ran inside Python package but works when ran from Python file outside the package) > .foreachPartition command hangs when ran inside Python package but works when > ran from Python file outside the package on EMR > - > > Key: SPARK-34510 > URL: https://issues.apache.org/jira/browse/SPARK-34510 > Project: Spark > Issue Type: Bug > Components: EC2, PySpark >Affects Versions: 3.0.0 >Reporter: Yuriy >Priority: Minor > Attachments: Code.zip > > > I provided full description of the issue on Stack Overflow via the following > link https://stackoverflow.com/questions/66300313 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34510) .foreachPartition command hangs when ran inside Python package but works when ran from Python file outside the package
[ https://issues.apache.org/jira/browse/SPARK-34510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuriy updated SPARK-34510: -- Attachment: Code.zip > .foreachPartition command hangs when ran inside Python package but works when > ran from Python file outside the package > -- > > Key: SPARK-34510 > URL: https://issues.apache.org/jira/browse/SPARK-34510 > Project: Spark > Issue Type: Bug > Components: EC2, PySpark >Affects Versions: 3.0.0 >Reporter: Yuriy >Priority: Minor > Attachments: Code.zip > > > I provided full description of the issue on Stack Overflow via the following > link https://stackoverflow.com/questions/66300313 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
[ https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289200#comment-17289200 ] Chao Sun commented on SPARK-33212: -- Thanks for the report [~ouyangxc.zte]. Can you provide more details, such as error messages, stack traces, and steps to reproduce the issue?
> Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
> -
>
> Key: SPARK-33212
> URL: https://issues.apache.org/jira/browse/SPARK-33212
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core, Spark Submit, SQL, YARN
> Affects Versions: 3.0.1
> Reporter: Chao Sun
> Assignee: Chao Sun
> Priority: Major
> Labels: releasenotes
> Fix For: 3.2.0
>
>
> Hadoop 3.x+ offers shaded client jars, hadoop-client-api and hadoop-client-runtime, which shade 3rd-party dependencies such as Guava, protobuf, jetty, etc. This Jira switches Spark to use these jars instead of hadoop-common, hadoop-client, etc. Benefits include:
> * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer versions of Hadoop have migrated to Guava 27.0+, and in order to resolve Guava conflicts, Spark depends on Hadoop not leaking those dependencies.
> * It makes the Spark/Hadoop dependency graph cleaner. Currently Spark uses both client-side and server-side Hadoop APIs from modules such as hadoop-common, hadoop-yarn-server-common, etc. Moving to hadoop-client-api lets Spark use only the public/client API from the Hadoop side.
> * It provides better isolation from Hadoop dependencies. In the future, Spark can evolve without worrying about dependencies pulled in from the Hadoop side (which used to be a lot).
> *There are some behavior changes introduced with this JIRA, when people use Spark compiled with Hadoop 3.x:*
> - Users now need to make sure the class path contains the `hadoop-client-api` and `hadoop-client-runtime` jars when they deploy Spark with the `hadoop-provided` option. In addition, it is highly recommended that they put these two jars before other Hadoop jars in the class path. Otherwise, conflicts such as the Guava one could happen if classes are loaded from the other, non-shaded Hadoop jars.
> - Since the new shaded Hadoop clients no longer include 3rd-party dependencies, users who used to depend on these now need to explicitly put the jars in their class path.
> Ideally the above should go into the release notes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
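The classpath-ordering advice above can be pictured with a toy model (an editorial sketch: the jar names and contents below are invented, and this is Python standing in for JVM classloading): a loader resolves a class from the first classpath entry that contains it, so if a non-shaded Hadoop jar precedes the shaded clients, its copies of Hadoop classes win.

```python
def load_from(classpath, contents, cls):
    """Return the first jar on the classpath providing cls (first match wins)."""
    for jar in classpath:
        if cls in contents.get(jar, set()):
            return jar
    raise ImportError(cls)

# Hypothetical jar contents: the non-shaded jar also bundles an old Guava.
contents = {
    "hadoop-client-api.jar": {"org.apache.hadoop.fs.FileSystem"},
    "legacy-hadoop-common.jar": {
        "org.apache.hadoop.fs.FileSystem",     # duplicate Hadoop class
        "com.google.common.cache.Cache",       # leaked old Guava
    },
}

shaded_first = ["hadoop-client-api.jar", "legacy-hadoop-common.jar"]
legacy_first = ["legacy-hadoop-common.jar", "hadoop-client-api.jar"]

# Ordering decides which jar supplies the Hadoop class.
print(load_from(shaded_first, contents, "org.apache.hadoop.fs.FileSystem"))
print(load_from(legacy_first, contents, "org.apache.hadoop.fs.FileSystem"))
```

Putting the shaded client jars first ensures Hadoop classes come from jars whose Guava references were relocated, which is the conflict the description warns about.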
[jira] [Created] (SPARK-34510) .foreachPartition command hangs when ran inside Python package but works when ran from Python file outside the package
Yuriy created SPARK-34510: - Summary: .foreachPartition command hangs when ran inside Python package but works when ran from Python file outside the package Key: SPARK-34510 URL: https://issues.apache.org/jira/browse/SPARK-34510 Project: Spark Issue Type: Bug Components: EC2, PySpark Affects Versions: 3.0.0 Reporter: Yuriy I provided full description of the issue on Stack Overflow via the following link https://stackoverflow.com/questions/66300313 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34509) Make dynamic allocation upscaling more progressive on K8S
[ https://issues.apache.org/jira/browse/SPARK-34509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289158#comment-17289158 ] Attila Zsolt Piros commented on SPARK-34509: I am working on this. > Make dynamic allocation upscaling more progressive on K8S > - > > Key: SPARK-34509 > URL: https://issues.apache.org/jira/browse/SPARK-34509 > Project: Spark > Issue Type: Improvement > Components: Kubernetes > Affects Versions: 2.3.4, 2.4.7, 3.0.2, 3.2.0 > Reporter: Attila Zsolt Piros > Priority: Major > > Currently, even a single late pod request stops upscaling. As we have an allocation batch size, it would be better to go up to that limit as soon as possible (if serving pod requests is slow, we have to make them as early as possible). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34509) Make dynamic allocation upscaling more progressive on K8S
Attila Zsolt Piros created SPARK-34509: -- Summary: Make dynamic allocation upscaling more progressive on K8S Key: SPARK-34509 URL: https://issues.apache.org/jira/browse/SPARK-34509 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.0.2, 2.4.7, 2.3.4, 3.2.0 Reporter: Attila Zsolt Piros Currently, even a single late pod request stops upscaling. As we have an allocation batch size, it would be better to go up to that limit as soon as possible (if serving pod requests is slow, we have to make them as early as possible). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
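The policy change being argued for can be sketched as follows (an editorial illustration with invented names and a deliberately simplified model, not Spark's actual K8s allocator code): instead of pausing new pod requests while any earlier request is still pending, keep requesting pods up to the allocation batch size.

```python
def pods_to_request(target, running, pending, batch_size):
    """Return (conservative, progressive) pod request counts for one round."""
    # Rough model of the current behavior: request nothing while ANY
    # previously requested pod is still pending.
    conservative = 0 if pending > 0 else max(0, min(batch_size, target - running))
    # Proposed behavior: keep filling the allocation batch even when some
    # pod requests are served late.
    progressive = max(0, min(batch_size - pending, target - running - pending))
    return conservative, progressive

# One late pod (pending=1) freezes the conservative policy entirely,
# while the progressive policy still asks for the rest of the batch.
print(pods_to_request(target=10, running=2, pending=1, batch_size=5))
```

With `target=10, running=2, pending=1, batch_size=5` the conservative policy requests 0 pods while the progressive one requests 4, which is the "go up to that limit as soon as possible" behavior the issue describes.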
[jira] [Commented] (SPARK-34508) skip HiveExternalCatalogVersionsSuite if network is down
[ https://issues.apache.org/jira/browse/SPARK-34508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289140#comment-17289140 ] Apache Spark commented on SPARK-34508: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/31627 > skip HiveExternalCatalogVersionsSuite if network is down > > > Key: SPARK-34508 > URL: https://issues.apache.org/jira/browse/SPARK-34508 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34508) skip HiveExternalCatalogVersionsSuite if network is down
[ https://issues.apache.org/jira/browse/SPARK-34508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34508: Assignee: (was: Apache Spark) > skip HiveExternalCatalogVersionsSuite if network is down > > > Key: SPARK-34508 > URL: https://issues.apache.org/jira/browse/SPARK-34508 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34508) skip HiveExternalCatalogVersionsSuite if network is down
[ https://issues.apache.org/jira/browse/SPARK-34508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34508: Assignee: Apache Spark > skip HiveExternalCatalogVersionsSuite if network is down > > > Key: SPARK-34508 > URL: https://issues.apache.org/jira/browse/SPARK-34508 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34508) skip HiveExternalCatalogVersionsSuite if network is down
Wenchen Fan created SPARK-34508: --- Summary: skip HiveExternalCatalogVersionsSuite if network is down Key: SPARK-34508 URL: https://issues.apache.org/jira/browse/SPARK-34508 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.2.0 Reporter: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
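The guard being proposed for the suite (which downloads old Spark releases from the network) is the common "probe connectivity, skip when offline" test pattern. A hedged Python sketch of that pattern, with an illustrative host and unittest usage rather than the Scala suite's actual code:

```python
import socket
import unittest

def network_reachable(host="archive.apache.org", port=443, timeout=3.0):
    """Best-effort probe: True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # DNS failure, refused connection, timeout, ...
        return False

class HiveExternalCatalogVersionsLikeTest(unittest.TestCase):
    def test_old_versions(self):
        if not network_reachable():
            # Skip rather than fail when the download host is unreachable.
            self.skipTest("network is down; skipping download-dependent test")
        # ... download old releases and run the real compatibility checks ...

# Running this with unittest would skip cleanly on an offline machine.
```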
[jira] [Comment Edited] (SPARK-34448) Binary logistic regression incorrectly computes the intercept and coefficients when data is not centered
[ https://issues.apache.org/jira/browse/SPARK-34448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289113#comment-17289113 ] Yakov Kerzhner edited comment on SPARK-34448 at 2/23/21, 2:54 PM: -- As I said in the description, I do not believe that the starting point should cause this bug; the minimizer should still drift to the proper minimum. I said that the fact that the log(odds) was made the starting point seems to suggest that whoever wrote the code believed that the intercept should be close to the log(odds), which is only true if the data is centered. If I had to guess, I would guess that there is something in the objective function that pulls the intercept towards the log(odds). This would be a bug, as the log(odds) is a good approximation for the intercept if and only if the data is centered. For non-centered data, it is completely wrong to have the intercept equal (or be close to) the log(odds). My test shows precisely this: when the data is not centered, Spark still returns an intercept equal to the log(odds) (test 2.b, Intercept: -3.5428941035683303, log(odds): -3.542495168380248, correct intercept: -4). Indeed, even for centered data (test 1.b), it returns an intercept almost equal to the log(odds) (log(odds): -3.9876303002978997, Intercept: -3.987260922443554, correct intercept: -4). So we need to dig into the objective function and whether somewhere in there is a term that penalizes the intercept moving away from the log(odds). If there is nothing of this sort, then a step-through of the minimization process should shed some clues as to why the intercept isn't budging from the initial value given. was (Author: ykerzhner): As I said in the description, I do not believe that the starting point should cause this bug; the minimizer should still drift to the proper minimum.
I said the fact that the log(odds) was made the starting point seems to suggest that whoever wrote the code believed that the intercept should be close to the log(odds), which is only true if the data is centered. If I had to guess, I would guess that there is something in the objective function that pulls the intercept towards the log(odds). This would be a bug, as the log(odds) is a good approximation for the intercept if and only if the data is centered. For non-centered data, it is completely wrong to have the intercept equal (or be close to) the log(odds). My test shows precisely this, that when the data is not centered, spark still returns an intercept equal to the log(odds) (test 2.b, Intercept: -3.5428941035683303, log(odds): -3.542495168380248, correct intercept: -4). Indeed, even for centered data, (test 1.b), it returns an intercept almost equal to the log(odds), (test 1.b. log(odds): -3.9876303002978997 Intercept: -3.987260922443554, correct intercept: -4). So we need to dig into the objective function, and whether somewhere in there is a term that penalizes the intercept moving away from the log(odds). > Binary logistic regression incorrectly computes the intercept and > coefficients when data is not centered > > > Key: SPARK-34448 > URL: https://issues.apache.org/jira/browse/SPARK-34448 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.4.5, 3.0.0 >Reporter: Yakov Kerzhner >Priority: Major > Labels: correctness > > I have written up a fairly detailed gist that includes code to reproduce the > bug, as well as the output of the code and some commentary: > [https://gist.github.com/ykerzhner/51358780a6a4cc33266515f17bf98a96] > To summarize: under certain conditions, the minimization that fits a binary > logistic regression contains a bug that pulls the intercept value towards the > log(odds) of the target data. This is mathematically only correct when the > data comes from distributions with zero means. 
In general, this gives > incorrect intercept values, and consequently incorrect coefficients as well. > As I am not so familiar with the Spark code base, I have not been able to > find this bug within the Spark code itself. A hint to this bug is here: > [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L894-L904] > Based on the code, I don't believe that the features have zero means at this > point, and so this heuristic is incorrect. But an incorrect starting point > does not explain this bug. The minimizer should drift to the correct place. > I was not able to find the code of the actual objective function that is > being minimized. -- This message was sent by Atlassian Jira (v8.3.4#803005) -
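The centering claim in the report can be checked outside Spark. The sketch below is purely illustrative (it does not use Spark's actual objective, and the data-generating choices are assumptions): it fits a binary logistic regression by Newton's method (IRLS) on non-centered synthetic data and shows that a correct minimizer recovers the true intercept, which differs substantially from the label log(odds).

```python
# Illustrative only: synthetic non-centered data, plain IRLS fit in NumPy.
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
true_intercept, true_coef = -4.0, 1.0

x = rng.normal(loc=2.0, scale=1.0, size=n)   # non-centered feature (mean 2)
p = 1.0 / (1.0 + np.exp(-(true_intercept + true_coef * x)))
y = rng.binomial(1, p)

X = np.column_stack([np.ones(n), x])         # design matrix with intercept column
beta = np.zeros(2)
for _ in range(25):                          # Newton / IRLS iterations
    eta = np.clip(X @ beta, -30, 30)         # clip to avoid overflow in exp
    mu = 1.0 / (1.0 + np.exp(-eta))
    W = mu * (1.0 - mu)                      # per-row IRLS weights
    grad = X.T @ (y - mu)
    hess = X.T @ (X * W[:, None])
    beta += np.linalg.solve(hess, grad)

log_odds = np.log(y.mean() / (1.0 - y.mean()))
print("fitted intercept:", beta[0])  # should land near the true value, -4
print("label log(odds):", log_odds)  # noticeably different for non-centered data
```

If the data were centered (x drawn with mean 0), the two printed values would roughly agree, which is exactly the special case the reporter says makes the log(odds) initialization heuristic look correct.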
[jira] [Commented] (SPARK-34168) Support DPP in AQE When the join is Broadcast hash join before applying the AQE rules
[ https://issues.apache.org/jira/browse/SPARK-34168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289116#comment-17289116 ] Apache Spark commented on SPARK-34168: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/31625 > Support DPP in AQE When the join is Broadcast hash join before applying the > AQE rules > - > > Key: SPARK-34168 > URL: https://issues.apache.org/jira/browse/SPARK-34168 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0, 3.0.1 >Reporter: Ke Jia >Assignee: Ke Jia >Priority: Major > Fix For: 3.2.0 > > > Currently, AQE and DPP cannot be applied at the same time. This PR will > enable both AQE and DPP when the join is a broadcast hash join at the > beginning. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34448) Binary logistic regression incorrectly computes the intercept and coefficients when data is not centered
[ https://issues.apache.org/jira/browse/SPARK-34448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289113#comment-17289113 ] Yakov Kerzhner commented on SPARK-34448: As I said in the description, I do not believe that the starting point should cause this bug; the minimizer should still drift to the proper minimum. The fact that the log(odds) was made the starting point suggests that whoever wrote the code believed that the intercept should be close to the log(odds), which is true only if the data is centered. If I had to guess, there is something in the objective function that pulls the intercept towards the log(odds). This would be a bug, as the log(odds) is a good approximation of the intercept if and only if the data is centered. For non-centered data, it is completely wrong for the intercept to equal (or be close to) the log(odds). My tests show precisely this: when the data is not centered, Spark still returns an intercept equal to the log(odds) (test 2.b, Intercept: -3.5428941035683303, log(odds): -3.542495168380248, correct intercept: -4). Indeed, even for centered data (test 1.b), it returns an intercept almost equal to the log(odds) (log(odds): -3.9876303002978997, Intercept: -3.987260922443554, correct intercept: -4). So we need to dig into the objective function and check whether it contains a term that penalizes the intercept for moving away from the log(odds). 
> Binary logistic regression incorrectly computes the intercept and > coefficients when data is not centered > > > Key: SPARK-34448 > URL: https://issues.apache.org/jira/browse/SPARK-34448 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.4.5, 3.0.0 >Reporter: Yakov Kerzhner >Priority: Major > Labels: correctness > > I have written up a fairly detailed gist that includes code to reproduce the > bug, as well as the output of the code and some commentary: > [https://gist.github.com/ykerzhner/51358780a6a4cc33266515f17bf98a96] > To summarize: under certain conditions, the minimization that fits a binary > logistic regression contains a bug that pulls the intercept value towards the > log(odds) of the target data. This is mathematically only correct when the > data comes from distributions with zero means. In general, this gives > incorrect intercept values, and consequently incorrect coefficients as well. > As I am not so familiar with the Spark code base, I have not been able to > find this bug within the Spark code itself. A hint to this bug is here: > [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L894-L904] > Based on the code, I don't believe that the features have zero means at this > point, and so this heuristic is incorrect. But an incorrect starting point > does not explain this bug. The minimizer should drift to the correct place. > I was not able to find the code of the actual objective function that is > being minimized. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34507) Spark artefacts built against Scala 2.13 incorrectly depend on Scala 2.12
Guillaume Martres created SPARK-34507: - Summary: Spark artefacts built against Scala 2.13 incorrectly depend on Scala 2.12 Key: SPARK-34507 URL: https://issues.apache.org/jira/browse/SPARK-34507 Project: Spark Issue Type: Sub-task Components: Build Affects Versions: 3.2.0 Reporter: Guillaume Martres Snapshots of Spark 3.2 built against Scala 2.13 are available at [https://repository.apache.org/content/repositories/snapshots/org/apache/spark/] but they seem to depend on Scala 2.12. Specifically, if I look at [https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-parent_2.13/3.2.0-SNAPSHOT/spark-parent_2.13-3.2.0-20210223.010629-29.pom] I see:
{code:xml}
<scala.version>2.12.10</scala.version>
<scala.binary.version>2.13</scala.binary.version>
{code}
which suggests that [https://github.com/apache/spark/blob/8f994cbb4a18558c2e81516ef1e339d9c8fa0d41/dev/change-scala-version.sh#L65] needs to be updated to also change the `scala.version` and not just the `scala.binary.version`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289104#comment-17289104 ] Guillaume Martres commented on SPARK-25075: --- I've opened https://issues.apache.org/jira/browse/SPARK-34507. > Build and test Spark against Scala 2.13 > --- > > Key: SPARK-25075 > URL: https://issues.apache.org/jira/browse/SPARK-25075 > Project: Spark > Issue Type: Umbrella > Components: Build, MLlib, Project Infra, Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Guillaume Massé >Priority: Major > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.13 milestone. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org