[jira] [Updated] (SPARK-35252) PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter is null
[ https://issues.apache.org/jira/browse/SPARK-35252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

lynn updated SPARK-35252:
-------------------------
    Description:

The code of "MyPartitionReaderFactory":

{code:scala}
// Implementation class
package com.lynn.spark.sql.v2

import org.apache.spark.internal.Logging
import org.apache.spark.sql.catalyst.InternalRow
import com.lynn.spark.sql.v2.MyPartitionReaderFactory.{MY_VECTORIZED_READER_BATCH_SIZE, MY_VECTORIZED_READER_ENABLED}
import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory}
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.internal.SQLConf.buildConf
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.vectorized.ColumnarBatch

case class MyPartitionReaderFactory(sqlConf: SQLConf,
                                    dataSchema: StructType,
                                    readSchema: StructType)
  extends PartitionReaderFactory with Logging {

  val enableVectorized = sqlConf.getConf(MY_VECTORIZED_READER_ENABLED, false)
  val batchSize = sqlConf.getConf(MY_VECTORIZED_READER_BATCH_SIZE, 4096)

  override def createReader(partition: InputPartition): PartitionReader[InternalRow] = {
    MyRowReader(batchSize, dataSchema, readSchema)
  }

  override def createColumnarReader(partition: InputPartition): PartitionReader[ColumnarBatch] = {
    if (!supportColumnarReads(partition))
      throw new UnsupportedOperationException("Cannot create columnar reader.")
    MyColumnReader(batchSize, dataSchema, readSchema)
  }

  override def supportColumnarReads(partition: InputPartition) = enableVectorized
}

object MyPartitionReaderFactory {

  val MY_VECTORIZED_READER_ENABLED =
    buildConf("spark.sql.my.enableVectorizedReader")
      .doc("Enables vectorized my source scan.")
      .version("1.0.0")
      .booleanConf
      .createWithDefault(false)

  val MY_VECTORIZED_READER_BATCH_SIZE =
    buildConf("spark.sql.my.columnarReaderBatchSize")
      .doc("The number of rows to include in a my source vectorized reader batch. The number should " +
        "be carefully chosen to minimize overhead and avoid OOMs in reading data.")
      .version("1.0.0")
      .intConf
      .createWithDefault(4096)
}
{code}

The driver constructs an RDD instance (DataSourceRDD), and the sqlConf parameter passed to MyPartitionReaderFactory is not null. But when the executor deserializes the RDD, the sqlConf parameter is null. The relevant task code is as follows:

{code:scala}
// RunTask.scala
override def runTask(context: TaskContext): U = {
  // Deserialize the RDD and the func using the broadcast variables.
  val threadMXBean = ManagementFactory.getThreadMXBean
  val deserializeStartTimeNs = System.nanoTime()
  val deserializeStartCpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
    threadMXBean.getCurrentThreadCpuTime
  } else 0L
  val ser = SparkEnv.get.closureSerializer.newInstance()
  // the rdd
  val (rdd, func) = ser.deserialize[(RDD[T], (TaskContext, Iterator[T]) => U)](
    ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
  _executorDeserializeTimeNs = System.nanoTime() - deserializeStartTimeNs
  _executorDeserializeCpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
    threadMXBean.getCurrentThreadCpuTime - deserializeStartCpuTime
  } else 0L
  func(context, rdd.iterator(partition, context))
}
{code}

  was:

The code of "MyPartitionReaderFactory":

{code:scala}
// Implementation class
package com.lynn.spark.sql.v2

import org.apache.spark.internal.Logging
import org.apache.spark.sql.catalyst.InternalRow
import com.lynn.spark.sql.v2.MyPartitionReaderFactory.{MY_VECTORIZED_READER_BATCH_SIZE, MY_VECTORIZED_READER_ENABLED}
import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory}
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.internal.SQLConf.buildConf
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.vectorized.ColumnarBatch

case class MyPartitionReaderFactory(sqlConf: SQLConf,
                                    dataSchema: StructType,
                                    readSchema: StructType)
  extends PartitionReaderFactory with Logging {

  val enableVectorized = sqlConf.getConf(MY_VECTORIZED_READER_ENABLED, false)
  val batchSize = sqlConf.getConf(MY_VECTORIZED_READER_BATCH_SIZE, 4096)

  override def createReader(partition: InputPartition): PartitionReader[InternalRow] = {
    MyRowReader(batchSize, dataSchema, readSchema)
  }

  override def createColumnarReader(partition: InputPartition): PartitionReader[ColumnarBatch] = {
    if (!supportColumnarReads(partition))
      throw new UnsupportedOperationException("Cannot create columnar reader.")
    MyColumnReader(batchSize, dataSchema, readSchema)
  }

  override def supportColumnarReads(partition: InputPartition) = enableVectorized
}
{code}
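A possible workaround for the behaviour described above (a sketch, not a statement of the root cause): resolve the two config values on the driver and give the factory only serializable primitives, so the readers no longer depend on a SQLConf instance surviving task deserialization. The sketch below reuses the reporter's names; MyRowReader and MyColumnReader are assumed to exist as in the original code.

{code:scala}
// Workaround sketch: the factory carries plain values resolved on the driver
// instead of the SQLConf object itself.
import org.apache.spark.internal.Logging
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory}
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.vectorized.ColumnarBatch

case class MyPartitionReaderFactory(enableVectorized: Boolean, // resolved on the driver
                                    batchSize: Int,            // resolved on the driver
                                    dataSchema: StructType,
                                    readSchema: StructType)
  extends PartitionReaderFactory with Logging {

  override def createReader(partition: InputPartition): PartitionReader[InternalRow] =
    MyRowReader(batchSize, dataSchema, readSchema)

  override def createColumnarReader(partition: InputPartition): PartitionReader[ColumnarBatch] = {
    if (!supportColumnarReads(partition))
      throw new UnsupportedOperationException("Cannot create columnar reader.")
    MyColumnReader(batchSize, dataSchema, readSchema)
  }

  // Keeps working on executors because enableVectorized is an ordinary Boolean field.
  override def supportColumnarReads(partition: InputPartition): Boolean = enableVectorized
}
{code}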
[jira] [Updated] (SPARK-35252) PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter is null
[ https://issues.apache.org/jira/browse/SPARK-35252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lynn updated SPARK-35252: - Affects Version/s: 3.1.1 > PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter > is null > - > > Key: SPARK-35252 > URL: https://issues.apache.org/jira/browse/SPARK-35252 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.1 >Reporter: lynn >Priority: Major > Attachments: spark-sqlconf-isnull.png > > > The codes of "MyPartitionReaderFactory" : > {code:scala} > // Implemention Class > package com.lynn.spark.sql.v2 > import org.apache.spark.internal.Logging > import org.apache.spark.sql.catalyst.InternalRow > import > com.lynn.spark.sql.v2.MyPartitionReaderFactory.{MY_VECTORIZED_READER_BATCH_SIZE, > MY_VECTORIZED_READER_ENABLED} > import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, > PartitionReaderFactory} > import org.apache.spark.sql.internal.SQLConf > import org.apache.spark.sql.types.StructType > import org.apache.spark.sql.vectorized.ColumnarBatch > import org.apache.spark.sql.internal.SQLConf.buildConf > case class MyPartitionReaderFactory(sqlConf: SQLConf, > dataSchema: StructType, > readSchema: StructType) > extends PartitionReaderFactory with Logging { > val enableVectorized = sqlConf.getConf(MY_VECTORIZED_READER_ENABLED, false) > val batchSize = sqlConf.getConf(MY_VECTORIZED_READER_BATCH_SIZE, 4096) > override def createReader(partition: InputPartition): > PartitionReader[InternalRow] = { > MyRowReader(batchSize, dataSchema, readSchema) > } > override def createColumnarReader(partition: InputPartition): > PartitionReader[ColumnarBatch] = { > if(!supportColumnarReads(partition)) > throw new UnsupportedOperationException("Cannot create columnar > reader.") >MyColumnReader(batchSize, dataSchema, readSchema) > } > override def supportColumnarReads(partition: InputPartition) = > enableVectorized > } > object MyPartitionReaderFactory { > val MY_VECTORIZED_READER_ENABLED = > buildConf("spark.sql.my.enableVectorizedReader") > .doc("Enables vectorized my source scan.") > .version("1.0.0") > .booleanConf > .createWithDefault(false) > val MY_VECTORIZED_READER_BATCH_SIZE = > buildConf("spark.sql.my.columnarReaderBatchSize") > .doc("The number of rows to include in a my source vectorized reader > batch. The number should " + > "be carefully chosen to minimize overhead and avoid OOMs in reading > data.") > .version("1.0.0") > .intConf > .createWithDefault(4096) > } > {code} > The driver construct a RDD instance(DataSourceRDD), the sqlConf parameter > pass to the MyPartitionReaderFactory is not null. > But when the executor deserialize the RDD, the sqlConf parameter is null. > The codes as follows: > {code:scala} > // RunTask.scala > override def runTask(context: TaskContext): U = { > // Deserialize the RDD and the func using the broadcast variables. 
> val threadMXBean = ManagementFactory.getThreadMXBean > val deserializeStartTimeNs = System.nanoTime() > val deserializeStartCpuTime = if > (threadMXBean.isCurrentThreadCpuTimeSupported) { > threadMXBean.getCurrentThreadCpuTime > } else 0L > val ser = SparkEnv.get.closureSerializer.newInstance() > val (rdd, func) = ser.deserialize[(RDD[T], (TaskContext, Iterator[T]) => > U)]( > ByteBuffer.wrap(taskBinary.value), > Thread.currentThread.getContextClassLoader) > _executorDeserializeTimeNs = System.nanoTime() - deserializeStartTimeNs > _executorDeserializeCpuTime = if > (threadMXBean.isCurrentThreadCpuTimeSupported) { > threadMXBean.getCurrentThreadCpuTime - deserializeStartCpuTime > } else 0L > func(context, rdd.iterator(partition, context)) > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
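The factory quoted above is constructed on the driver, typically from a Batch implementation. A hedged sketch of that driver-side construction, assuming the primitive-argument variant of MyPartitionReaderFactory sketched earlier (MyBatch and the schema fields are illustrative names, not part of the reporter's code):

{code:scala}
// Hypothetical Batch: the SQL conf is read on the driver inside
// createReaderFactory(), so only plain values travel to the executors.
import com.lynn.spark.sql.v2.MyPartitionReaderFactory.{MY_VECTORIZED_READER_BATCH_SIZE, MY_VECTORIZED_READER_ENABLED}
import org.apache.spark.sql.connector.read.{Batch, InputPartition, PartitionReaderFactory}
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.StructType

class MyBatch(dataSchema: StructType, readSchema: StructType) extends Batch {

  override def planInputPartitions(): Array[InputPartition] =
    Array.empty[InputPartition] // partition planning omitted for brevity

  override def createReaderFactory(): PartitionReaderFactory = {
    val conf = SQLConf.get // executed on the driver during planning
    MyPartitionReaderFactory(
      conf.getConf(MY_VECTORIZED_READER_ENABLED),
      conf.getConf(MY_VECTORIZED_READER_BATCH_SIZE),
      dataSchema,
      readSchema)
  }
}
{code}

On the user side the two entries are then set like any other SQL conf before the scan is planned, e.g. spark.conf.set("spark.sql.my.enableVectorizedReader", "true").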
[jira] [Updated] (SPARK-35252) PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter is null
[ https://issues.apache.org/jira/browse/SPARK-35252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lynn updated SPARK-35252: - Attachment: (was: znbase-sqlconf-isnull.png) > PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter > is null > - > > Key: SPARK-35252 > URL: https://issues.apache.org/jira/browse/SPARK-35252 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2 >Reporter: lynn >Priority: Major > Attachments: spark-sqlconf-isnull.png > > > The codes of "MyPartitionReaderFactory" : > {code:scala} > // Implemention Class > package com.lynn.spark.sql.v2 > import org.apache.spark.internal.Logging > import org.apache.spark.sql.catalyst.InternalRow > import > com.lynn.spark.sql.v2.MyPartitionReaderFactory.{MY_VECTORIZED_READER_BATCH_SIZE, > MY_VECTORIZED_READER_ENABLED} > import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, > PartitionReaderFactory} > import org.apache.spark.sql.internal.SQLConf > import org.apache.spark.sql.types.StructType > import org.apache.spark.sql.vectorized.ColumnarBatch > import org.apache.spark.sql.internal.SQLConf.buildConf > case class MyPartitionReaderFactory(sqlConf: SQLConf, > dataSchema: StructType, > readSchema: StructType) > extends PartitionReaderFactory with Logging { > val enableVectorized = sqlConf.getConf(MY_VECTORIZED_READER_ENABLED, false) > val batchSize = sqlConf.getConf(MY_VECTORIZED_READER_BATCH_SIZE, 4096) > override def createReader(partition: InputPartition): > PartitionReader[InternalRow] = { > MyRowReader(batchSize, dataSchema, readSchema) > } > override def createColumnarReader(partition: InputPartition): > PartitionReader[ColumnarBatch] = { > if(!supportColumnarReads(partition)) > throw new UnsupportedOperationException("Cannot create columnar > reader.") >MyColumnReader(batchSize, dataSchema, readSchema) > } > override def supportColumnarReads(partition: InputPartition) = > enableVectorized > } > object MyPartitionReaderFactory { > val MY_VECTORIZED_READER_ENABLED = > buildConf("spark.sql.my.enableVectorizedReader") > .doc("Enables vectorized my source scan.") > .version("1.0.0") > .booleanConf > .createWithDefault(false) > val MY_VECTORIZED_READER_BATCH_SIZE = > buildConf("spark.sql.my.columnarReaderBatchSize") > .doc("The number of rows to include in a my source vectorized reader > batch. The number should " + > "be carefully chosen to minimize overhead and avoid OOMs in reading > data.") > .version("1.0.0") > .intConf > .createWithDefault(4096) > } > {code} > The driver construct a RDD instance(DataSourceRDD), the sqlConf parameter > pass to the MyPartitionReaderFactory is not null. > But when the executor deserialize the RDD, the sqlConf parameter is null. > The codes as follows: > {code:scala} > // RunTask.scala > override def runTask(context: TaskContext): U = { > // Deserialize the RDD and the func using the broadcast variables. 
> val threadMXBean = ManagementFactory.getThreadMXBean > val deserializeStartTimeNs = System.nanoTime() > val deserializeStartCpuTime = if > (threadMXBean.isCurrentThreadCpuTimeSupported) { > threadMXBean.getCurrentThreadCpuTime > } else 0L > val ser = SparkEnv.get.closureSerializer.newInstance() > val (rdd, func) = ser.deserialize[(RDD[T], (TaskContext, Iterator[T]) => > U)]( > ByteBuffer.wrap(taskBinary.value), > Thread.currentThread.getContextClassLoader) > _executorDeserializeTimeNs = System.nanoTime() - deserializeStartTimeNs > _executorDeserializeCpuTime = if > (threadMXBean.isCurrentThreadCpuTimeSupported) { > threadMXBean.getCurrentThreadCpuTime - deserializeStartCpuTime > } else 0L > func(context, rdd.iterator(partition, context)) > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
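The runTask path quoted above deserializes the whole RDD, including the PartitionReaderFactory it carries. As a general, self-contained illustration of how a constructor argument can come back null after such a round trip (for instance when the field is @transient or excluded by custom serialization; whether something like that happens to SQLConf here is exactly what this ticket asks), consider:

{code:scala}
import java.io._

// A field marked @transient is skipped by Java serialization and comes back
// as null after the same kind of round trip that task deserialization performs,
// while the plain Int survives.
class Holder(@transient val conf: Map[String, String], val batchSize: Int) extends Serializable

object TransientDemo extends App {
  val before = new Holder(Map("spark.sql.my.columnarReaderBatchSize" -> "4096"), 4096)

  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(before)
  oos.close()

  val after = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray))
    .readObject().asInstanceOf[Holder]

  println(after.conf)      // null
  println(after.batchSize) // 4096
}
{code}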
[jira] [Updated] (SPARK-35252) PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter is null
[ https://issues.apache.org/jira/browse/SPARK-35252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lynn updated SPARK-35252: - Description: The codes of "MyPartitionReaderFactory" : {code:scala} // Implemention Class package com.lynn.spark.sql.v2 import org.apache.spark.internal.Logging import org.apache.spark.sql.catalyst.InternalRow import com.lynn.spark.sql.v2.MyPartitionReaderFactory.{MY_VECTORIZED_READER_BATCH_SIZE, MY_VECTORIZED_READER_ENABLED} import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory} import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types.StructType import org.apache.spark.sql.vectorized.ColumnarBatch import org.apache.spark.sql.internal.SQLConf.buildConf case class MyPartitionReaderFactory(sqlConf: SQLConf, dataSchema: StructType, readSchema: StructType) extends PartitionReaderFactory with Logging { val enableVectorized = sqlConf.getConf(MY_VECTORIZED_READER_ENABLED, false) val batchSize = sqlConf.getConf(MY_VECTORIZED_READER_BATCH_SIZE, 4096) override def createReader(partition: InputPartition): PartitionReader[InternalRow] = { MyRowReader(batchSize, dataSchema, readSchema) } override def createColumnarReader(partition: InputPartition): PartitionReader[ColumnarBatch] = { if(!supportColumnarReads(partition)) throw new UnsupportedOperationException("Cannot create columnar reader.") MyColumnReader(batchSize, dataSchema, readSchema) } override def supportColumnarReads(partition: InputPartition) = enableVectorized } object MyPartitionReaderFactory { val MY_VECTORIZED_READER_ENABLED = buildConf("spark.sql.my.enableVectorizedReader") .doc("Enables vectorized my source scan.") .version("1.0.0") .booleanConf .createWithDefault(false) val MY_VECTORIZED_READER_BATCH_SIZE = buildConf("spark.sql.my.columnarReaderBatchSize") .doc("The number of rows to include in a my source vectorized reader batch. The number should " + "be carefully chosen to minimize overhead and avoid OOMs in reading data.") .version("1.0.0") .intConf .createWithDefault(4096) } {code} The driver construct a RDD instance(DataSourceRDD), the sqlConf parameter pass to the MyPartitionReaderFactory is not null. But when the executor deserialize the RDD, the sqlConf parameter is null. The codes as follows: {code:scala} // RunTask.scala override def runTask(context: TaskContext): U = { // Deserialize the RDD and the func using the broadcast variables. 
val threadMXBean = ManagementFactory.getThreadMXBean val deserializeStartTimeNs = System.nanoTime() val deserializeStartCpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported) { threadMXBean.getCurrentThreadCpuTime } else 0L val ser = SparkEnv.get.closureSerializer.newInstance() val (rdd, func) = ser.deserialize[(RDD[T], (TaskContext, Iterator[T]) => U)]( ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader) _executorDeserializeTimeNs = System.nanoTime() - deserializeStartTimeNs _executorDeserializeCpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported) { threadMXBean.getCurrentThreadCpuTime - deserializeStartCpuTime } else 0L func(context, rdd.iterator(partition, context)) } {code} was: The codes of "MyPartitionReaderFactory" : {code:java} // Implemention Class package com.lynn.spark.sql.v2 import org.apache.spark.internal.Logging import org.apache.spark.sql.catalyst.InternalRow import com.lynn.spark.sql.v2.MyPartitionReaderFactory.{MY_VECTORIZED_READER_BATCH_SIZE, MY_VECTORIZED_READER_ENABLED} import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory} import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types.StructType import org.apache.spark.sql.vectorized.ColumnarBatch import org.apache.spark.sql.internal.SQLConf.buildConf case class MyPartitionReaderFactory(sqlConf: SQLConf, dataSchema: StructType, readSchema: StructType) extends PartitionReaderFactory with Logging { val enableVectorized = sqlConf.getConf(MY_VECTORIZED_READER_ENABLED, false) val batchSize = sqlConf.getConf(MY_VECTORIZED_READER_BATCH_SIZE, 4096) override def createReader(partition: InputPartition): PartitionReader[InternalRow] = { MyRowReader(batchSize, dataSchema, readSchema) } override def createColumnarReader(partition: InputPartition): PartitionReader[ColumnarBatch] = { if(!supportColumnarReads(partition)) throw new UnsupportedOperationException("Cannot create columnar reader.") MyColumnReader(batchSize, dataSchema, readSchema) } override def supportColumnarReads(partition: InputPartition) =
[jira] [Updated] (SPARK-35252) PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter is null
[ https://issues.apache.org/jira/browse/SPARK-35252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lynn updated SPARK-35252: - Attachment: spark-sqlconf-isnull.png > PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter > is null > - > > Key: SPARK-35252 > URL: https://issues.apache.org/jira/browse/SPARK-35252 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2 >Reporter: lynn >Priority: Major > Attachments: spark-sqlconf-isnull.png, znbase-sqlconf-isnull.png > > > The codes of "MyPartitionReaderFactory" : > {code:java} > // Implemention Class > package com.lynn.spark.sql.v2 > import org.apache.spark.internal.Logging > import org.apache.spark.sql.catalyst.InternalRow > import > com.lynn.spark.sql.v2.MyPartitionReaderFactory.{MY_VECTORIZED_READER_BATCH_SIZE, > MY_VECTORIZED_READER_ENABLED} > import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, > PartitionReaderFactory} > import org.apache.spark.sql.internal.SQLConf > import org.apache.spark.sql.types.StructType > import org.apache.spark.sql.vectorized.ColumnarBatch > import org.apache.spark.sql.internal.SQLConf.buildConf > case class MyPartitionReaderFactory(sqlConf: SQLConf, > dataSchema: StructType, > readSchema: StructType) > extends PartitionReaderFactory with Logging { > val enableVectorized = sqlConf.getConf(MY_VECTORIZED_READER_ENABLED, false) > val batchSize = sqlConf.getConf(MY_VECTORIZED_READER_BATCH_SIZE, 4096) > override def createReader(partition: InputPartition): > PartitionReader[InternalRow] = { > MyRowReader(batchSize, dataSchema, readSchema) > } > override def createColumnarReader(partition: InputPartition): > PartitionReader[ColumnarBatch] = { > if(!supportColumnarReads(partition)) > throw new UnsupportedOperationException("Cannot create columnar > reader.") >MyColumnReader(batchSize, dataSchema, readSchema) > } > override def supportColumnarReads(partition: InputPartition) = > enableVectorized > } > object MyPartitionReaderFactory { > val MY_VECTORIZED_READER_ENABLED = > buildConf("spark.sql.my.enableVectorizedReader") > .doc("Enables vectorized my source scan.") > .version("1.0.0") > .booleanConf > .createWithDefault(false) > val MY_VECTORIZED_READER_BATCH_SIZE = > buildConf("spark.sql.my.columnarReaderBatchSize") > .doc("The number of rows to include in a my source vectorized reader > batch. The number should " + > "be carefully chosen to minimize overhead and avoid OOMs in reading > data.") > .version("1.0.0") > .intConf > .createWithDefault(4096) > } > {code} > The driver construct a RDD instance, the sqlConf parameter pass to the > MyPartitionReaderFactory is not null. > But when the executor deserialize the RDD, the sqlConf parameter is null. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35252) PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter is null
[ https://issues.apache.org/jira/browse/SPARK-35252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lynn updated SPARK-35252: - Attachment: znbase-sqlconf-isnull.png > PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter > is null > - > > Key: SPARK-35252 > URL: https://issues.apache.org/jira/browse/SPARK-35252 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2 >Reporter: lynn >Priority: Major > Attachments: spark-sqlconf-isnull.png, znbase-sqlconf-isnull.png > > > The codes of "MyPartitionReaderFactory" : > {code:java} > // Implemention Class > package com.lynn.spark.sql.v2 > import org.apache.spark.internal.Logging > import org.apache.spark.sql.catalyst.InternalRow > import > com.lynn.spark.sql.v2.MyPartitionReaderFactory.{MY_VECTORIZED_READER_BATCH_SIZE, > MY_VECTORIZED_READER_ENABLED} > import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, > PartitionReaderFactory} > import org.apache.spark.sql.internal.SQLConf > import org.apache.spark.sql.types.StructType > import org.apache.spark.sql.vectorized.ColumnarBatch > import org.apache.spark.sql.internal.SQLConf.buildConf > case class MyPartitionReaderFactory(sqlConf: SQLConf, > dataSchema: StructType, > readSchema: StructType) > extends PartitionReaderFactory with Logging { > val enableVectorized = sqlConf.getConf(MY_VECTORIZED_READER_ENABLED, false) > val batchSize = sqlConf.getConf(MY_VECTORIZED_READER_BATCH_SIZE, 4096) > override def createReader(partition: InputPartition): > PartitionReader[InternalRow] = { > MyRowReader(batchSize, dataSchema, readSchema) > } > override def createColumnarReader(partition: InputPartition): > PartitionReader[ColumnarBatch] = { > if(!supportColumnarReads(partition)) > throw new UnsupportedOperationException("Cannot create columnar > reader.") >MyColumnReader(batchSize, dataSchema, readSchema) > } > override def supportColumnarReads(partition: InputPartition) = > enableVectorized > } > object MyPartitionReaderFactory { > val MY_VECTORIZED_READER_ENABLED = > buildConf("spark.sql.my.enableVectorizedReader") > .doc("Enables vectorized my source scan.") > .version("1.0.0") > .booleanConf > .createWithDefault(false) > val MY_VECTORIZED_READER_BATCH_SIZE = > buildConf("spark.sql.my.columnarReaderBatchSize") > .doc("The number of rows to include in a my source vectorized reader > batch. The number should " + > "be carefully chosen to minimize overhead and avoid OOMs in reading > data.") > .version("1.0.0") > .intConf > .createWithDefault(4096) > } > {code} > The driver construct a RDD instance, the sqlConf parameter pass to the > MyPartitionReaderFactory is not null. > But when the executor deserialize the RDD, the sqlConf parameter is null. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34878) Test actual size of year-month and day-time intervals
[ https://issues.apache.org/jira/browse/SPARK-34878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Max Gekk resolved SPARK-34878.
------------------------------
    Fix Version/s: 3.2.0
       Resolution: Fixed

Issue resolved by pull request 32366
[https://github.com/apache/spark/pull/32366]

> Test actual size of year-month and day-time intervals
> ------------------------------------------------------
>
>                 Key: SPARK-34878
>                 URL: https://issues.apache.org/jira/browse/SPARK-34878
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Max Gekk
>            Assignee: PengLei
>            Priority: Major
>             Fix For: 3.2.0
>
> Add tests to
> https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/ColumnTypeSuite.scala#L52

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34878) Test actual size of year-month and day-time intervals
[ https://issues.apache.org/jira/browse/SPARK-34878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-34878: Assignee: PengLei > Test actual size of year-month and day-time intervals > - > > Key: SPARK-34878 > URL: https://issues.apache.org/jira/browse/SPARK-34878 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: PengLei >Priority: Major > > Add tests to > https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/ColumnTypeSuite.scala#L52 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32083) Unnecessary tasks are launched when input is empty with AQE
[ https://issues.apache.org/jira/browse/SPARK-32083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17334443#comment-17334443 ]

Manu Zhang commented on SPARK-32083:
------------------------------------

This issue has been fixed in https://issues.apache.org/jira/browse/SPARK-35239

> Unnecessary tasks are launched when input is empty with AQE
> ------------------------------------------------------------
>
>                 Key: SPARK-32083
>                 URL: https://issues.apache.org/jira/browse/SPARK-32083
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Manu Zhang
>            Assignee: Wenchen Fan
>            Priority: Minor
>             Fix For: 3.0.1, 3.1.0
>
> [https://github.com/apache/spark/pull/28226] meant to avoid launching
> unnecessary tasks for 0-size partitions when AQE is enabled. However, when
> all partitions are empty, the number of partitions will be
> `spark.sql.adaptive.coalescePartitions.initialPartitionNum` and (a lot of)
> unnecessary tasks are launched in this case.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
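For reference, the behaviour is driven by the AQE coalescing settings named in the description above. A minimal illustration of the configs involved (values are examples only, e.g. for spark-shell):

{code:scala}
import org.apache.spark.sql.SparkSession

// With AQE enabled, an empty input used to fall back to
// initialPartitionNum partitions, so that many unnecessary tasks could be
// launched; SPARK-35239 addresses this.
val spark = SparkSession.builder()
  .appName("aqe-empty-input-example")
  .master("local[*]")
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "1000")
  .getOrCreate()

// An empty dataset that still goes through a shuffle:
val counts = spark.range(0).groupBy("id").count()
counts.collect()
{code}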
[jira] [Updated] (SPARK-35252) PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter is null
[ https://issues.apache.org/jira/browse/SPARK-35252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lynn updated SPARK-35252: - Description: The codes of "MyPartitionReaderFactory" : {code:java} // Implemention Class package com.lynn.spark.sql.v2 import org.apache.spark.internal.Logging import org.apache.spark.sql.catalyst.InternalRow import com.lynn.spark.sql.v2.MyPartitionReaderFactory.{MY_VECTORIZED_READER_BATCH_SIZE, MY_VECTORIZED_READER_ENABLED} import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory} import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types.StructType import org.apache.spark.sql.vectorized.ColumnarBatch import org.apache.spark.sql.internal.SQLConf.buildConf case class MyPartitionReaderFactory(sqlConf: SQLConf, dataSchema: StructType, readSchema: StructType) extends PartitionReaderFactory with Logging { val enableVectorized = sqlConf.getConf(MY_VECTORIZED_READER_ENABLED, false) val batchSize = sqlConf.getConf(MY_VECTORIZED_READER_BATCH_SIZE, 4096) override def createReader(partition: InputPartition): PartitionReader[InternalRow] = { MyRowReader(batchSize, dataSchema, readSchema) } override def createColumnarReader(partition: InputPartition): PartitionReader[ColumnarBatch] = { if(!supportColumnarReads(partition)) throw new UnsupportedOperationException("Cannot create columnar reader.") MyColumnReader(batchSize, dataSchema, readSchema) } override def supportColumnarReads(partition: InputPartition) = enableVectorized } object MyPartitionReaderFactory { val MY_VECTORIZED_READER_ENABLED = buildConf("spark.sql.my.enableVectorizedReader") .doc("Enables vectorized my source scan.") .version("1.0.0") .booleanConf .createWithDefault(false) val MY_VECTORIZED_READER_BATCH_SIZE = buildConf("spark.sql.my.columnarReaderBatchSize") .doc("The number of rows to include in a my source vectorized reader batch. The number should " + "be carefully chosen to minimize overhead and avoid OOMs in reading data.") .version("1.0.0") .intConf .createWithDefault(4096) } {code} The driver construct a RDD instance, the sqlConf parameter pass to the MyPartitionReaderFactory is not null. But when the executor deserialize the RDD, the sqlConf parameter is null. 
was: The codes of "MyPartitionReaderFactory" : {code:java} // Implemention Class package com.lynn.spark.sql.v2 import org.apache.spark.internal.Logging import org.apache.spark.sql.catalyst.InternalRow import com.lynn.spark.sql.v2.MyPartitionReaderFactory.{MY_VECTORIZED_READER_BATCH_SIZE, MY_VECTORIZED_READER_ENABLED} import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory} import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types.StructType import org.apache.spark.sql.vectorized.ColumnarBatch import org.apache.spark.sql.internal.SQLConf.buildConf case class MyPartitionReaderFactory(sqlConf: SQLConf, dataSchema: StructType, readSchema: StructType) extends PartitionReaderFactory with Logging { val enableVectorized = sqlConf.getConf(MY_VECTORIZED_READER_ENABLED, false) val batchSize = sqlConf.getConf(MY_VECTORIZED_READER_BATCH_SIZE, 4096) override def createReader(partition: InputPartition): PartitionReader[InternalRow] = { MyRowReader(batchSize, dataSchema, readSchema) } override def createColumnarReader(partition: InputPartition): PartitionReader[ColumnarBatch] = { if(!supportColumnarReads(partition)) throw new UnsupportedOperationException("Cannot create columnar reader.") MyColumnReader(batchSize, dataSchema, readSchema) } override def supportColumnarReads(partition: InputPartition) = true } object MyPartitionReaderFactory { val MY_VECTORIZED_READER_ENABLED = buildConf("spark.sql.my.enableVectorizedReader") .doc("Enables vectorized my source scan.") .version("1.0.0") .booleanConf .createWithDefault(false) val MY_VECTORIZED_READER_BATCH_SIZE = buildConf("spark.sql.my.columnarReaderBatchSize") .doc("The number of rows to include in a my source vectorized reader batch. The number should " + "be carefully chosen to minimize overhead and avoid OOMs in reading data.") .version("1.0.0") .intConf .createWithDefault(4096) } {code} The driver construct a RDD instance, the sqlConf parameter pass to the MyPartitionReaderFactory is not null. But when the executor deserialize the RDD, the sqlConf parameter is null. > PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter > is null > - >
[jira] [Updated] (SPARK-35252) PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter is null
[ https://issues.apache.org/jira/browse/SPARK-35252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lynn updated SPARK-35252: - Description: The codes of "MyPartitionReaderFactory" : {code:java} // Implemention Class package com.lynn.spark.sql.v2 import org.apache.spark.internal.Logging import org.apache.spark.sql.catalyst.InternalRow import com.lynn.spark.sql.v2.MyPartitionReaderFactory.{MY_VECTORIZED_READER_BATCH_SIZE, MY_VECTORIZED_READER_ENABLED} import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory} import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types.StructType import org.apache.spark.sql.vectorized.ColumnarBatch import org.apache.spark.sql.internal.SQLConf.buildConf case class MyPartitionReaderFactory(sqlConf: SQLConf, dataSchema: StructType, readSchema: StructType) extends PartitionReaderFactory with Logging { val enableVectorized = sqlConf.getConf(MY_VECTORIZED_READER_ENABLED, false) val batchSize = sqlConf.getConf(MY_VECTORIZED_READER_BATCH_SIZE, 4096) override def createReader(partition: InputPartition): PartitionReader[InternalRow] = { MyRowReader(batchSize, dataSchema, readSchema) } override def createColumnarReader(partition: InputPartition): PartitionReader[ColumnarBatch] = { if(!supportColumnarReads(partition)) throw new UnsupportedOperationException("Cannot create columnar reader.") MyColumnReader(batchSize, dataSchema, readSchema) } override def supportColumnarReads(partition: InputPartition) = true } object MyPartitionReaderFactory { val MY_VECTORIZED_READER_ENABLED = buildConf("spark.sql.my.enableVectorizedReader") .doc("Enables vectorized my source scan.") .version("1.0.0") .booleanConf .createWithDefault(false) val MY_VECTORIZED_READER_BATCH_SIZE = buildConf("spark.sql.my.columnarReaderBatchSize") .doc("The number of rows to include in a my source vectorized reader batch. The number should " + "be carefully chosen to minimize overhead and avoid OOMs in reading data.") .version("1.0.0") .intConf .createWithDefault(4096) } {code} The driver construct a RDD instance, the sqlConf parameter pass to the MyPartitionReaderFactory is not null. But when the executor deserialize the RDD, the sqlConf parameter is null. 
was: The codes of "MyPartitionReaderFactory" : {code:java} // Implemention Class package com.lynn.spark.sql.v2 import org.apache.spark.internal.Logging import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory} import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types.StructType import org.apache.spark.sql.vectorized.ColumnarBatch case class MyPartitionReaderFactory(sqlConf: SQLConf, dataSchema: StructType, readSchema: StructType) extends PartitionReaderFactory with Logging { val enableVectorized = sqlConf.getConf(MY_VECTORIZED_READER_ENABLED, false) val batchSize = sqlConf.getConf(MY_VECTORIZED_READER_BATCH_SIZE, 4096) override def createReader(partition: InputPartition): PartitionReader[InternalRow] = { MyRowReader(batchSize, dataSchema, readSchema) } override def createColumnarReader(partition: InputPartition): PartitionReader[ColumnarBatch] = { if(!supportColumnarReads(partition)) throw new UnsupportedOperationException("Cannot create columnar reader.") MyColumnReader(batchSize, dataSchema, readSchema) } override def supportColumnarReads(partition: InputPartition) = true } object MyPartitionReaderFactory { val MY_VECTORIZED_READER_ENABLED = buildConf("spark.sql.my.enableVectorizedReader") .doc("Enables vectorized my source scan.") .version("1.0.0") .booleanConf .createWithDefault(false) val MY_VECTORIZED_READER_BATCH_SIZE = buildConf("spark.sql.my.columnarReaderBatchSize") .doc("The number of rows to include in a my source vectorized reader batch. The number should " + "be carefully chosen to minimize overhead and avoid OOMs in reading data.") .version("1.0.0") .intConf .createWithDefault(4096) } {code} The driver construct a RDD instance, the sqlConf parameter pass to the MyPartitionReaderFactory is not null. But when the executor deserialize the RDD, the sqlConf parameter is null. > PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter > is null > - > > Key: SPARK-35252 > URL: https://issues.apache.org/jira/browse/SPARK-35252 > Project: Spark > Issue Type: Bug > Components:
[jira] [Updated] (SPARK-35252) PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter is null
[ https://issues.apache.org/jira/browse/SPARK-35252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lynn updated SPARK-35252: - Description: The codes of "MyPartitionReaderFactory" : {code:java} // Implemention Class package com.lynn.spark.sql.v2 import org.apache.spark.internal.Logging import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory} import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types.StructType import org.apache.spark.sql.vectorized.ColumnarBatch case class MyPartitionReaderFactory(sqlConf: SQLConf, dataSchema: StructType, readSchema: StructType) extends PartitionReaderFactory with Logging { val enableVectorized = sqlConf.getConf(MY_VECTORIZED_READER_ENABLED, false) val batchSize = sqlConf.getConf(MY_VECTORIZED_READER_BATCH_SIZE, 4096) override def createReader(partition: InputPartition): PartitionReader[InternalRow] = { MyRowReader(batchSize, dataSchema, readSchema) } override def createColumnarReader(partition: InputPartition): PartitionReader[ColumnarBatch] = { if(!supportColumnarReads(partition)) throw new UnsupportedOperationException("Cannot create columnar reader.") MyColumnReader(batchSize, dataSchema, readSchema) } override def supportColumnarReads(partition: InputPartition) = true } object MyPartitionReaderFactory { val MY_VECTORIZED_READER_ENABLED = buildConf("spark.sql.my.enableVectorizedReader") .doc("Enables vectorized my source scan.") .version("1.0.0") .booleanConf .createWithDefault(false) val MY_VECTORIZED_READER_BATCH_SIZE = buildConf("spark.sql.my.columnarReaderBatchSize") .doc("The number of rows to include in a my source vectorized reader batch. The number should " + "be carefully chosen to minimize overhead and avoid OOMs in reading data.") .version("1.0.0") .intConf .createWithDefault(4096) } {code} The driver construct a RDD instance, the sqlConf parameter pass to the MyPartitionReaderFactory is not null. But when the executor deserialize the RDD, the sqlConf parameter is null. was: The codes of "MyPartitionReaderFactory" : {code:java} // Implemention Class package com.lynn.spark.sql.v2 import org.apache.spark.internal.Logging import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory} import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types.StructType import org.apache.spark.sql.vectorized.ColumnarBatch case class MyPartitionReaderFactory(sqlConf: SQLConf, dataSchema: StructType, readSchema: StructType) extends PartitionReaderFactory with Logging { //TODO: delete this line println(sqlConf.getAllConfs) def batchSize = sqlConf.getConf("spark.sql.my.vectorized.batch.size") override def createReader(partition: InputPartition): PartitionReader[InternalRow] = { MyRowReader(batchSize, dataSchema, readSchema) } override def createColumnarReader(partition: InputPartition): PartitionReader[ColumnarBatch] = { if(!supportColumnarReads(partition)) throw new UnsupportedOperationException("Cannot create columnar reader.") MyColumnReader(batchSize, dataSchema, readSchema) } override def supportColumnarReads(partition: InputPartition) = true } {code} The driver construct a RDD instance, the sqlConf parameter pass to the MyPartitionReaderFactory is not null. But when the executor deserialize the RDD, the sqlConf parameter is null. 
> PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter > is null > - > > Key: SPARK-35252 > URL: https://issues.apache.org/jira/browse/SPARK-35252 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2 >Reporter: lynn >Priority: Major > > The codes of "MyPartitionReaderFactory" : > {code:java} > // Implemention Class > package com.lynn.spark.sql.v2 > import org.apache.spark.internal.Logging > import org.apache.spark.sql.catalyst.InternalRow > import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, > PartitionReaderFactory} > import org.apache.spark.sql.internal.SQLConf > import org.apache.spark.sql.types.StructType > import org.apache.spark.sql.vectorized.ColumnarBatch > case class MyPartitionReaderFactory(sqlConf: SQLConf, > dataSchema: StructType, > readSchema: StructType) > extends PartitionReaderFactory with Logging { > val
[jira] [Updated] (SPARK-35252) PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter is null
[ https://issues.apache.org/jira/browse/SPARK-35252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lynn updated SPARK-35252: - Description: The codes of "MyPartitionReaderFactory" : {code:java} // Implemention Class package com.lynn.spark.sql.v2 import org.apache.spark.internal.Logging import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory} import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types.StructType import org.apache.spark.sql.vectorized.ColumnarBatch case class MyPartitionReaderFactory(sqlConf: SQLConf, dataSchema: StructType, readSchema: StructType) extends PartitionReaderFactory with Logging { //TODO: delete this line println(sqlConf.getAllConfs) def batchSize = sqlConf.getConf("spark.sql.my.vectorized.batch.size") override def createReader(partition: InputPartition): PartitionReader[InternalRow] = { MyRowReader(batchSize, dataSchema, readSchema) } override def createColumnarReader(partition: InputPartition): PartitionReader[ColumnarBatch] = { if(!supportColumnarReads(partition)) throw new UnsupportedOperationException("Cannot create columnar reader.") MyColumnReader(batchSize, dataSchema, readSchema) } override def supportColumnarReads(partition: InputPartition) = true } {code} The driver construct a RDD instance, the sqlConf parameter pass to the MyPartitionReaderFactory is not null. But when the executor deserialize the RDD, the sqlConf parameter is null. was: The codes of "MyPartitionReaderFactory" : {code:java} // Implemention Class package com.lynn.spark.sql.v2 import org.apache.spark.internal.Logging import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory} import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types.StructType import org.apache.spark.sql.vectorized.ColumnarBatch case class MyPartitionReaderFactory(sqlConf: SQLConf, dataSchema: StructType, readSchema: StructType) extends PartitionReaderFactory with Logging { def batchSize = sqlConf.getConf("spark.sql.my.vectorized.batch.size") override def createReader(partition: InputPartition): PartitionReader[InternalRow] = { MyRowReader(batchSize, dataSchema, readSchema) } override def createColumnarReader(partition: InputPartition): PartitionReader[ColumnarBatch] = { if(!supportColumnarReads(partition)) throw new UnsupportedOperationException("Cannot create columnar reader.") MyColumnReader(batchSize, dataSchema, readSchema) } override def supportColumnarReads(partition: InputPartition) = true } {code} The driver construct a RDD instance, the sqlConf parameter pass to the MyPartitionReaderFactory is not null. But when the executor deserialize the RDD, the sqlConf parameter is null. 
> PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter > is null > - > > Key: SPARK-35252 > URL: https://issues.apache.org/jira/browse/SPARK-35252 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2 >Reporter: lynn >Priority: Major > > The codes of "MyPartitionReaderFactory" : > {code:java} > // Implemention Class > package com.lynn.spark.sql.v2 > import org.apache.spark.internal.Logging > import org.apache.spark.sql.catalyst.InternalRow > import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, > PartitionReaderFactory} > import org.apache.spark.sql.internal.SQLConf > import org.apache.spark.sql.types.StructType > import org.apache.spark.sql.vectorized.ColumnarBatch > case class MyPartitionReaderFactory(sqlConf: SQLConf, > dataSchema: StructType, > readSchema: StructType) > extends PartitionReaderFactory with Logging { > //TODO: delete this line > println(sqlConf.getAllConfs) > > def batchSize = sqlConf.getConf("spark.sql.my.vectorized.batch.size") > override def createReader(partition: InputPartition): > PartitionReader[InternalRow] = { > MyRowReader(batchSize, dataSchema, readSchema) > } > override def createColumnarReader(partition: InputPartition): > PartitionReader[ColumnarBatch] = { > if(!supportColumnarReads(partition)) > throw new UnsupportedOperationException("Cannot create columnar > reader.") >MyColumnReader(batchSize, dataSchema, readSchema) > } > override def supportColumnarReads(partition: InputPartition) = true > } > {code} > The driver construct a
[jira] [Updated] (SPARK-35252) PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter is null
[ https://issues.apache.org/jira/browse/SPARK-35252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lynn updated SPARK-35252: - Description: The codes of "MyPartitionReaderFactory" : {code:java} // Implemention Class package com.lynn.spark.sql.v2 import org.apache.spark.internal.Logging import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory} import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types.StructType import org.apache.spark.sql.vectorized.ColumnarBatch case class MyPartitionReaderFactory(sqlConf: SQLConf, dataSchema: StructType, readSchema: StructType) extends PartitionReaderFactory with Logging { def batchSize = sqlConf.getConf("spark.sql.my.vectorized.batch.size") override def createReader(partition: InputPartition): PartitionReader[InternalRow] = { MyRowReader(batchSize, dataSchema, readSchema) } override def createColumnarReader(partition: InputPartition): PartitionReader[ColumnarBatch] = { if(!supportColumnarReads(partition)) throw new UnsupportedOperationException("Cannot create columnar reader.") MyColumnReader(batchSize, dataSchema, readSchema) } override def supportColumnarReads(partition: InputPartition) = true } {code} The driver construct a RDD instance, the sqlConf parameter pass to the MyPartitionReaderFactory is not null. But when the executor deserialize the RDD, the sqlConf parameter is null. was: The codes of "MyPartitionReaderFactory" : {code:java} // Implemention Class package com.lynn.spark.sql.v2 import org.apache.spark.internal.Logging import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory} import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types.StructType import org.apache.spark.sql.vectorized.ColumnarBatch case class MyPartitionReaderFactory(sqlConf: SQLConf, dataSchema: StructType, readSchema: StructType) extends PartitionReaderFactory with Logging { def batchSize = sqlConf.getConf("spark.sql.my.vectorized.batch.size") override def createReader(partition: InputPartition): PartitionReader[InternalRow] = { MyRowReader(batchSize, dataSchema, readSchema) } override def createColumnarReader(partition: InputPartition): PartitionReader[ColumnarBatch] = { if(!supportColumnarReads(partition)) throw new UnsupportedOperationException("Cannot create columnar reader.") //TODO: delete this line println(sqlConf.getAllConfs) MyColumnReader(batchSize, dataSchema, readSchema) } override def supportColumnarReads(partition: InputPartition) = true } {code} The driver construct a RDD instance, the sqlConf parameter pass to the MyPartitionReaderFactory is not null. But when the executor deserialize the RDD, the sqlConf parameter is null. 
> PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter > is null > - > > Key: SPARK-35252 > URL: https://issues.apache.org/jira/browse/SPARK-35252 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2 >Reporter: lynn >Priority: Major > > The codes of "MyPartitionReaderFactory" : > {code:java} > // Implemention Class > package com.lynn.spark.sql.v2 > import org.apache.spark.internal.Logging > import org.apache.spark.sql.catalyst.InternalRow > import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, > PartitionReaderFactory} > import org.apache.spark.sql.internal.SQLConf > import org.apache.spark.sql.types.StructType > import org.apache.spark.sql.vectorized.ColumnarBatch > case class MyPartitionReaderFactory(sqlConf: SQLConf, > dataSchema: StructType, > readSchema: StructType) > extends PartitionReaderFactory with Logging { > def batchSize = sqlConf.getConf("spark.sql.my.vectorized.batch.size") > override def createReader(partition: InputPartition): > PartitionReader[InternalRow] = { > MyRowReader(batchSize, dataSchema, readSchema) > } > override def createColumnarReader(partition: InputPartition): > PartitionReader[ColumnarBatch] = { > if(!supportColumnarReads(partition)) > throw new UnsupportedOperationException("Cannot create columnar > reader.") >MyColumnReader(batchSize, dataSchema, readSchema) > } > override def supportColumnarReads(partition: InputPartition) = true > } > {code} > The driver construct a RDD instance, the sqlConf parameter pass to the >
[jira] [Updated] (SPARK-35252) PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter is null
[ https://issues.apache.org/jira/browse/SPARK-35252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lynn updated SPARK-35252: - Description: The codes of "MyPartitionReaderFactory" : {code:java} // Implemention Class package com.lynn.spark.sql.v2 import org.apache.spark.internal.Logging import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory} import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types.StructType import org.apache.spark.sql.vectorized.ColumnarBatch case class MyPartitionReaderFactory(sqlConf: SQLConf, dataSchema: StructType, readSchema: StructType) extends PartitionReaderFactory with Logging { def batchSize = sqlConf.getConf("spark.sql.my.vectorized.batch.size") override def createReader(partition: InputPartition): PartitionReader[InternalRow] = { MyRowReader(batchSize, dataSchema, readSchema) } override def createColumnarReader(partition: InputPartition): PartitionReader[ColumnarBatch] = { if(!supportColumnarReads(partition)) throw new UnsupportedOperationException("Cannot create columnar reader.") //TODO: delete this line println(sqlConf.getAllConfs) MyColumnReader(batchSize, dataSchema, readSchema) } override def supportColumnarReads(partition: InputPartition) = true } {code} The driver construct a RDD instance, the sqlConf parameter pass to the MyPartitionReaderFactory is not null. But when the executor deserialize the RDD, the sqlConf parameter is null. was: The codes of "MyPartitionReaderFactory" of {code:java} // Implemention Class package com.lynn.spark.sql.v2 import org.apache.spark.internal.Logging import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory} import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types.StructType import org.apache.spark.sql.vectorized.ColumnarBatch case class MyPartitionReaderFactory(sqlConf: SQLConf, dataSchema: StructType, readSchema: StructType) extends PartitionReaderFactory with Logging { def batchSize = sqlConf.getConf("spark.sql.my.vectorized.batch.size") override def createReader(partition: InputPartition): PartitionReader[InternalRow] = { MyRowReader(batchSize, dataSchema, readSchema) } override def createColumnarReader(partition: InputPartition): PartitionReader[ColumnarBatch] = { if(!supportColumnarReads(partition)) throw new UnsupportedOperationException("Cannot create columnar reader.") //TODO: delete this line println(sqlConf.getAllConfs) MyColumnReader(batchSize, dataSchema, readSchema) } override def supportColumnarReads(partition: InputPartition) = true } {code} > PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter > is null > - > > Key: SPARK-35252 > URL: https://issues.apache.org/jira/browse/SPARK-35252 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2 >Reporter: lynn >Priority: Major > > The codes of "MyPartitionReaderFactory" : > {code:java} > // Implemention Class > package com.lynn.spark.sql.v2 > import org.apache.spark.internal.Logging > import org.apache.spark.sql.catalyst.InternalRow > import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, > PartitionReaderFactory} > import org.apache.spark.sql.internal.SQLConf > import org.apache.spark.sql.types.StructType > import org.apache.spark.sql.vectorized.ColumnarBatch > case class MyPartitionReaderFactory(sqlConf: SQLConf, > dataSchema: StructType, > 
readSchema: StructType) > extends PartitionReaderFactory with Logging { > def batchSize = sqlConf.getConf("spark.sql.my.vectorized.batch.size") > override def createReader(partition: InputPartition): > PartitionReader[InternalRow] = { > MyRowReader(batchSize, dataSchema, readSchema) > } > override def createColumnarReader(partition: InputPartition): > PartitionReader[ColumnarBatch] = { > if(!supportColumnarReads(partition)) > throw new UnsupportedOperationException("Cannot create columnar > reader.") > //TODO: delete this line > println(sqlConf.getAllConfs) > MyColumnReader(batchSize, dataSchema, readSchema) > } > override def supportColumnarReads(partition: InputPartition) = true > } > {code} > The driver construct a RDD instance, the sqlConf parameter pass to the > MyPartitionReaderFactory is not null. > But when the executor deserialize the
[jira] [Updated] (SPARK-35252) PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter is null
[ https://issues.apache.org/jira/browse/SPARK-35252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lynn updated SPARK-35252: - Description: The codes of "MyPartitionReaderFactory" of {code:java} // Implemention Class package com.lynn.spark.sql.v2 import org.apache.spark.internal.Logging import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory} import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types.StructType import org.apache.spark.sql.vectorized.ColumnarBatch case class MyPartitionReaderFactory(sqlConf: SQLConf, dataSchema: StructType, readSchema: StructType) extends PartitionReaderFactory with Logging { def batchSize = sqlConf.getConf("spark.sql.my.vectorized.batch.size") override def createReader(partition: InputPartition): PartitionReader[InternalRow] = { MyRowReader(batchSize, dataSchema, readSchema) } override def createColumnarReader(partition: InputPartition): PartitionReader[ColumnarBatch] = { if(!supportColumnarReads(partition)) throw new UnsupportedOperationException("Cannot create columnar reader.") //TODO: delete this line println(sqlConf.getAllConfs) MyColumnReader(batchSize, dataSchema, readSchema) } override def supportColumnarReads(partition: InputPartition) = true } {code} was: The codes of {code:java} // Implemention Class package com.lynn.spark.sql.v2 import org.apache.spark.internal.Logging import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory} import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types.StructType import org.apache.spark.sql.vectorized.ColumnarBatch case class MyPartitionReaderFactory(sqlConf: SQLConf, dataSchema: StructType, readSchema: StructType) extends PartitionReaderFactory with Logging { def batchSize = sqlConf.getConf("spark.sql.my.vectorized.batch.size") override def createReader(partition: InputPartition): PartitionReader[InternalRow] = { MyRowReader(batchSize, dataSchema, readSchema) } override def createColumnarReader(partition: InputPartition): PartitionReader[ColumnarBatch] = { if(!supportColumnarReads(partition)) throw new UnsupportedOperationException("Cannot create columnar reader.") //TODO: delete this line println(sqlConf.getAllConfs) MyColumnReader(batchSize, dataSchema, readSchema) } override def supportColumnarReads(partition: InputPartition) = true } {code} > PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter > is null > - > > Key: SPARK-35252 > URL: https://issues.apache.org/jira/browse/SPARK-35252 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2 >Reporter: lynn >Priority: Major > > The codes of "MyPartitionReaderFactory" of > {code:java} > // Implemention Class > package com.lynn.spark.sql.v2 > import org.apache.spark.internal.Logging > import org.apache.spark.sql.catalyst.InternalRow > import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, > PartitionReaderFactory} > import org.apache.spark.sql.internal.SQLConf > import org.apache.spark.sql.types.StructType > import org.apache.spark.sql.vectorized.ColumnarBatch > case class MyPartitionReaderFactory(sqlConf: SQLConf, > dataSchema: StructType, > readSchema: StructType) > extends PartitionReaderFactory with Logging { > def batchSize = sqlConf.getConf("spark.sql.my.vectorized.batch.size") > override def createReader(partition: InputPartition): > 
PartitionReader[InternalRow] = { > MyRowReader(batchSize, dataSchema, readSchema) > } > override def createColumnarReader(partition: InputPartition): > PartitionReader[ColumnarBatch] = { > if(!supportColumnarReads(partition)) > throw new UnsupportedOperationException("Cannot create columnar > reader.") > //TODO: delete this line > println(sqlConf.getAllConfs) > MyColumnReader(batchSize, dataSchema, readSchema) > } > override def supportColumnarReads(partition: InputPartition) = true > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
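Since the factory is serialized into the task binary, one way to avoid depending on SQLConf surviving deserialization on the executor is to resolve the settings on the driver and pass plain serializable values into the factory. The following is a minimal sketch under that assumption: the {{fromConf}} helper and the config keys are illustrative only, not the reporter's implementation or a confirmed fix.

{code:scala}
// Sketch only: read config on the driver, ship plain values to executors.
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory}
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.StructType

case class MyPartitionReaderFactory(enableVectorized: Boolean,
                                    batchSize: Int,
                                    dataSchema: StructType,
                                    readSchema: StructType)
  extends PartitionReaderFactory {

  override def createReader(partition: InputPartition): PartitionReader[InternalRow] = {
    // batchSize is already a plain Int captured on the driver, so nothing here
    // relies on SQLConf being available after deserialization.
    new PartitionReader[InternalRow] {
      // Empty reader used only to keep the sketch self-contained.
      override def next(): Boolean = false
      override def get(): InternalRow = throw new NoSuchElementException("empty reader")
      override def close(): Unit = ()
    }
  }

  override def supportColumnarReads(partition: InputPartition): Boolean = enableVectorized
}

object MyPartitionReaderFactory {
  // Called on the driver, where the session's SQLConf is populated.
  def fromConf(conf: SQLConf, dataSchema: StructType, readSchema: StructType): MyPartitionReaderFactory = {
    val enableVectorized = conf.getConfString("spark.sql.my.enableVectorizedReader", "false").toBoolean
    val batchSize = conf.getConfString("spark.sql.my.columnarReaderBatchSize", "4096").toInt
    MyPartitionReaderFactory(enableVectorized, batchSize, dataSchema, readSchema)
  }
}
{code}

A scan builder would call {{MyPartitionReaderFactory.fromConf(SQLConf.get, dataSchema, readSchema)}} on the driver, so only primitives travel inside the serialized factory.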
[jira] [Updated] (SPARK-35252) PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter is null
[ https://issues.apache.org/jira/browse/SPARK-35252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lynn updated SPARK-35252: - Summary: PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter is null (was: PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConfi parameter is null) > PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter > is null > - > > Key: SPARK-35252 > URL: https://issues.apache.org/jira/browse/SPARK-35252 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2 >Reporter: lynn >Priority: Major > > The codes of > {code:java} > // Implemention Class > package com.lynn.spark.sql.v2 > import org.apache.spark.internal.Logging > import org.apache.spark.sql.catalyst.InternalRow > import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, > PartitionReaderFactory} > import org.apache.spark.sql.internal.SQLConf > import org.apache.spark.sql.types.StructType > import org.apache.spark.sql.vectorized.ColumnarBatch > case class MyPartitionReaderFactory(sqlConf: SQLConf, > dataSchema: StructType, > readSchema: StructType) > extends PartitionReaderFactory with Logging { > def batchSize = sqlConf.getConf("spark.sql.my.vectorized.batch.size") > override def createReader(partition: InputPartition): > PartitionReader[InternalRow] = { > MyRowReader(batchSize, dataSchema, readSchema) > } > override def createColumnarReader(partition: InputPartition): > PartitionReader[ColumnarBatch] = { > if(!supportColumnarReads(partition)) > throw new UnsupportedOperationException("Cannot create columnar > reader.") > //TODO: delete this line > println(sqlConf.getAllConfs) > MyColumnReader(batchSize, dataSchema, readSchema) > } > override def supportColumnarReads(partition: InputPartition) = true > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35252) PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConfi parameter is null
[ https://issues.apache.org/jira/browse/SPARK-35252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lynn updated SPARK-35252: - Description: The codes of {code:java} // Implemention Class package com.lynn.spark.sql.v2 import org.apache.spark.internal.Logging import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory} import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types.StructType import org.apache.spark.sql.vectorized.ColumnarBatch case class MyPartitionReaderFactory(sqlConf: SQLConf, dataSchema: StructType, readSchema: StructType) extends PartitionReaderFactory with Logging { def batchSize = sqlConf.getConf("spark.sql.my.vectorized.batch.size") override def createReader(partition: InputPartition): PartitionReader[InternalRow] = { MyRowReader(batchSize, dataSchema, readSchema) } override def createColumnarReader(partition: InputPartition): PartitionReader[ColumnarBatch] = { if(!supportColumnarReads(partition)) throw new UnsupportedOperationException("Cannot create columnar reader.") //TODO: delete this line println(sqlConf.getAllConfs) MyColumnReader(batchSize, dataSchema, readSchema) } override def supportColumnarReads(partition: InputPartition) = true } {code} was: {code:java} // Implemention Class package com.lynn.spark.sql.v2 import org.apache.spark.internal.Logging import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory} import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types.StructType import org.apache.spark.sql.vectorized.ColumnarBatch case class MyPartitionReaderFactory(sqlConf: SQLConf, dataSchema: StructType, readSchema: StructType) extends PartitionReaderFactory with Logging { def batchSize = sqlConf.getConf("spark.sql.my.vectorized.batch.size") override def createReader(partition: InputPartition): PartitionReader[InternalRow] = { MyRowReader(batchSize, dataSchema, readSchema) } override def createColumnarReader(partition: InputPartition): PartitionReader[ColumnarBatch] = { if(!supportColumnarReads(partition)) throw new UnsupportedOperationException("Cannot create columnar reader.") //TODO: delete this line println(sqlConf.getAllConfs) MyColumnReader(batchSize, dataSchema, readSchema) } override def supportColumnarReads(partition: InputPartition) = true } {code} > PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConfi > parameter is null > -- > > Key: SPARK-35252 > URL: https://issues.apache.org/jira/browse/SPARK-35252 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2 >Reporter: lynn >Priority: Major > > The codes of > {code:java} > // Implemention Class > package com.lynn.spark.sql.v2 > import org.apache.spark.internal.Logging > import org.apache.spark.sql.catalyst.InternalRow > import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, > PartitionReaderFactory} > import org.apache.spark.sql.internal.SQLConf > import org.apache.spark.sql.types.StructType > import org.apache.spark.sql.vectorized.ColumnarBatch > case class MyPartitionReaderFactory(sqlConf: SQLConf, > dataSchema: StructType, > readSchema: StructType) > extends PartitionReaderFactory with Logging { > def batchSize = sqlConf.getConf("spark.sql.my.vectorized.batch.size") > override def createReader(partition: InputPartition): > PartitionReader[InternalRow] = { > MyRowReader(batchSize, dataSchema, readSchema) > 
} > override def createColumnarReader(partition: InputPartition): > PartitionReader[ColumnarBatch] = { > if(!supportColumnarReads(partition)) > throw new UnsupportedOperationException("Cannot create columnar > reader.") > //TODO: delete this line > println(sqlConf.getAllConfs) > MyColumnReader(batchSize, dataSchema, readSchema) > } > override def supportColumnarReads(partition: InputPartition) = true > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35252) PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConfi parameter is null
lynn created SPARK-35252: Summary: PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConfi parameter is null Key: SPARK-35252 URL: https://issues.apache.org/jira/browse/SPARK-35252 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.2 Reporter: lynn
{code:java}
// Implemention Class
package com.lynn.spark.sql.v2

import org.apache.spark.internal.Logging
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory}
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.vectorized.ColumnarBatch

case class MyPartitionReaderFactory(sqlConf: SQLConf,
                                    dataSchema: StructType,
                                    readSchema: StructType)
  extends PartitionReaderFactory with Logging {

  def batchSize = sqlConf.getConf("spark.sql.my.vectorized.batch.size")

  override def createReader(partition: InputPartition): PartitionReader[InternalRow] = {
    MyRowReader(batchSize, dataSchema, readSchema)
  }

  override def createColumnarReader(partition: InputPartition): PartitionReader[ColumnarBatch] = {
    if(!supportColumnarReads(partition))
      throw new UnsupportedOperationException("Cannot create columnar reader.")
    //TODO: delete this line
    println(sqlConf.getAllConfs)
    MyColumnReader(batchSize, dataSchema, readSchema)
  }

  override def supportColumnarReads(partition: InputPartition) = true
}
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35251) Improve LiveEntityHelpers.newAccumulatorInfos performace
Yuming Wang created SPARK-35251: --- Summary: Improve LiveEntityHelpers.newAccumulatorInfos performace Key: SPARK-35251 URL: https://issues.apache.org/jira/browse/SPARK-35251 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.2.0 Reporter: Yuming Wang [It|https://github.com/apache/spark/blob/3854ad87c78f2a331f9c9c1a34f9ec281900f8fe/core/src/main/scala/org/apache/spark/status/LiveEntity.scala#L646-L661] will impact performance if there are lot of {{AccumulableInfo}} instances. {noformat} num #instances #bytes class name -- 1: 33189785050708371592 [C 2:79535824591381896 [J 3: 33083623410586759488 java.lang.String 4: 139726297 8942483008 org.apache.spark.sql.execution.metric.SQLMetric 5: 249040156 5976963744 scala.Some 6: 146278719 5851148760 org.apache.spark.util.AccumulatorMetadata 7: 38448113 5536528272 java.net.URI 8: 37540162 360382 org.apache.hadoop.fs.FileStatus 9: 69724130 3346758240 java.util.Hashtable$Entry 10: 61521559 2953034832 java.util.concurrent.ConcurrentHashMap$Node 11: 50421974 2823630544 scala.collection.mutable.LinkedEntry 12: 43349222 2774350208 org.apache.spark.scheduler.AccumulableInfo {noformat} {noformat} --- 15430388364 ns (2.03%), 1543 samples [ 0] scala.collection.TraversableLike.noneIn$1 [ 1] scala.collection.TraversableLike.filterImpl [ 2] scala.collection.TraversableLike.filterImpl$ [ 3] scala.collection.AbstractTraversable.filterImpl [ 4] scala.collection.TraversableLike.filter [ 5] scala.collection.TraversableLike.filter$ [ 6] scala.collection.AbstractTraversable.filter [ 7] org.apache.spark.status.LiveEntityHelpers$.newAccumulatorInfos [ 8] org.apache.spark.status.LiveTask.doUpdate [ 9] org.apache.spark.status.LiveEntity.write {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
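The profile above points at chained TraversableLike.filter calls inside newAccumulatorInfos over very large collections of AccumulableInfo. A minimal sketch of the kind of single-pass rewrite this implies, avoiding one intermediate collection per filter step; the predicate only mirrors the intent of the original filters and is not the actual Spark patch.

{code:scala}
// Sketch only: one traversal, one output buffer, no per-step intermediates.
import org.apache.spark.scheduler.AccumulableInfo
import scala.collection.mutable.ArrayBuffer

object NewAccumulatorInfosSketch {
  def newAccumulatorInfos(accums: Iterable[AccumulableInfo]): Seq[AccumulableInfo] = {
    val out = new ArrayBuffer[AccumulableInfo](accums.size)
    val it = accums.iterator
    while (it.hasNext) {
      val acc = it.next()
      // Keep only named, non-internal accumulators with an update value,
      // mirroring the intent of the original chained filters.
      if (!acc.internal && acc.name.isDefined && acc.update.isDefined) {
        out += acc
      }
    }
    out.toSeq
  }
}
{code}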
[jira] [Updated] (SPARK-35250) SQL DataFrameReader unescapedQuoteHandling parameter is misdocumented
[ https://issues.apache.org/jira/browse/SPARK-35250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim McNamara updated SPARK-35250: - Summary: SQL DataFrameReader unescapedQuoteHandling parameter is misdocumented (was: SQL DataFrameReader mode is misdocumented) > SQL DataFrameReader unescapedQuoteHandling parameter is misdocumented > - > > Key: SPARK-35250 > URL: https://issues.apache.org/jira/browse/SPARK-35250 > Project: Spark > Issue Type: Documentation > Components: docs, Documentation >Affects Versions: 3.1.2 > Environment: >Reporter: Tim McNamara >Priority: Major > Labels: GoodForNewContributors, easy-fix > Original Estimate: 1h > Remaining Estimate: 1h > > The unescapedQuoteHandling parameter of DataFrameReader isn't correctly > documented. STOP_AT_DELIMITER appears twice, and it looks like that's > overwritten an intended option, e.g. > [https://github.com/apache/spark/blob/1d550c4e90275ab418b9161925049239227f3dc9/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L744-L749] > To view instances where this error occurs, this is a useful query > [https://github.com/apache/spark/search?q=STOP_AT_DELIMITER] > It appears that this bug was introduced here: > [https://github.com/apache/spark/pull/30518|https://github.com/apache/spark/pull/30518,] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35248) Spark shall load system class first in IsolatedClientLoader
[ https://issues.apache.org/jira/browse/SPARK-35248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-35248: - Target Version/s: (was: 3.1.2) > Spark shall load system class first in IsolatedClientLoader > --- > > Key: SPARK-35248 > URL: https://issues.apache.org/jira/browse/SPARK-35248 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.7, 3.1.1 >Reporter: Daniel Dai >Priority: Major > > This happens when Spark try to load HMS client jars using > IsolatedClientLoader, in particular, when > "spark.sql.hive.metastore.jars"="builtin" (default). Spark try to load a > conflicting thrift lib in user jar, which results in exception: > {code:java} > 21/04/22 04:10:53 ERROR [main] yarn.Client: Application diagnostics message: > User class threw exception: org.apache.spark.sql.AnalysisException: > java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient; > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106) > at > org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:220) > at > org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114) > at > org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102) > at > org.apache.spark.sql.internal.SharedState.globalTempViewManager$lzycompute(SharedState.scala:141) > at > org.apache.spark.sql.internal.SharedState.globalTempViewManager(SharedState.scala:136) > at > org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$2.apply(HiveSessionStateBuilder.scala:55) > at > org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$2.apply(HiveSessionStateBuilder.scala:55) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager$lzycompute(SessionCatalog.scala:91) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager(SessionCatalog.scala:91) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.isTemporaryTable(SessionCatalog.scala:738) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.isRunningDirectlyOnFiles(Analyzer.scala:748) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.resolveRelation(Analyzer.scala:682) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:714) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:707) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:89) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsUp(AnalysisHelper.scala:86) > at > 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:329) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:87) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194) > at >
[jira] [Resolved] (SPARK-35246) Streaming-batch intersects are incorrectly allowed through UnsupportedOperationsChecker
[ https://issues.apache.org/jira/browse/SPARK-35246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-35246. -- Assignee: Jose Torres Resolution: Fixed Fixed in [https://github.com/apache/spark/pull/32371] > Streaming-batch intersects are incorrectly allowed through > UnsupportedOperationsChecker > --- > > Key: SPARK-35246 > URL: https://issues.apache.org/jira/browse/SPARK-35246 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0, 3.1.0 >Reporter: Jose Torres >Assignee: Jose Torres >Priority: Major > Fix For: 3.2.0 > > > The UnsupportedOperationChecker currently rejects streaming intersects only > if both sides are streaming, but they don't work if even one side is > streaming. The following simple test, for example, fails with a cryptic > None.get error because the state store can't plan itself properly. > {code:java} > test("intersect") { > val input = MemoryStream[Long] > val df = input.toDS().intersect(spark.range(10).as[Long]) > testStream(df) ( > AddData(input, 1L), > CheckAnswer(1) > ) > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
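The test in the description uses Spark's internal harness (MemoryStream, testStream). A hypothetical standalone reproduction of the same streaming-batch intersect, assuming the built-in rate source as the streaming side; behavior before and after the fix is described in the comments, not guaranteed by this sketch.

{code:scala}
import org.apache.spark.sql.SparkSession

object StreamingIntersectRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("streaming-intersect-repro").getOrCreate()
    import spark.implicits._

    // Streaming side: the built-in rate source; batch side: a plain range.
    val streaming = spark.readStream.format("rate").load().select($"value")
    val batch = spark.range(10).toDF("value")

    // Before the fix this plan slipped past the UnsupportedOperationChecker and
    // failed later when the state store was planned; with the fix it should be
    // rejected up front as an unsupported streaming-batch intersect.
    val query = streaming.intersect(batch)
      .writeStream
      .format("console")
      .start()

    query.awaitTermination()
  }
}
{code}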
[jira] [Resolved] (SPARK-35244) invoke should throw the original exception
[ https://issues.apache.org/jira/browse/SPARK-35244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-35244. -- Fix Version/s: 3.2.0 Resolution: Fixed [https://github.com/apache/spark/pull/32370] > invoke should throw the original exception > -- > > Key: SPARK-35244 > URL: https://issues.apache.org/jira/browse/SPARK-35244 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.1, 3.2.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > Fix For: 3.2.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35250) SQL DataFrameReader mode is misdocumented
[ https://issues.apache.org/jira/browse/SPARK-35250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim McNamara updated SPARK-35250: - Description: The unescapedQuoteHandling parameter of DataFrameReader isn't correctly documented. STOP_AT_DELIMITER appears twice, and it looks like that's overwritten an intended option, e.g. [https://github.com/apache/spark/blob/1d550c4e90275ab418b9161925049239227f3dc9/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L744-L749] To view instances where this error occurs, this is a useful query https://github.com/apache/spark/search?q=STOP_AT_DELIMITER It appears that this bug was introduced here: [https://github.com/apache/spark/pull/30518|https://github.com/apache/spark/pull/30518,] was: The unescapedQuoteHandling parameter of DataFrameReader isn't correctly documented. STOP_AT_DELIMITER appears twice, and it looks like that's overwritten an intended option. e.g. * [https://github.com/apache/spark/blob/1d550c4e90275ab418b9161925049239227f3dc9/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L744-L749] It appears that this bug was introduced here: [https://github.com/apache/spark/pull/30518|https://github.com/apache/spark/pull/30518,] Environment: (was: ) > SQL DataFrameReader mode is misdocumented > - > > Key: SPARK-35250 > URL: https://issues.apache.org/jira/browse/SPARK-35250 > Project: Spark > Issue Type: Documentation > Components: docs, Documentation >Affects Versions: 3.1.2 > Environment: >Reporter: Tim McNamara >Priority: Major > Labels: GoodForNewContributors, easy-fix > Original Estimate: 1h > Remaining Estimate: 1h > > The unescapedQuoteHandling parameter of DataFrameReader isn't correctly > documented. STOP_AT_DELIMITER appears twice, and it looks like that's > overwritten an intended option, e.g. > [https://github.com/apache/spark/blob/1d550c4e90275ab418b9161925049239227f3dc9/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L744-L749] > To view instances where this error occurs, this is a useful query > https://github.com/apache/spark/search?q=STOP_AT_DELIMITER > > It appears that this bug was introduced here: > [https://github.com/apache/spark/pull/30518|https://github.com/apache/spark/pull/30518,] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35250) SQL DataFrameReader mode is misdocumented
[ https://issues.apache.org/jira/browse/SPARK-35250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim McNamara updated SPARK-35250: - Description: The unescapedQuoteHandling parameter of DataFrameReader isn't correctly documented. STOP_AT_DELIMITER appears twice, and it looks like that's overwritten an intended option, e.g. [https://github.com/apache/spark/blob/1d550c4e90275ab418b9161925049239227f3dc9/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L744-L749] To view instances where this error occurs, this is a useful query [https://github.com/apache/spark/search?q=STOP_AT_DELIMITER] It appears that this bug was introduced here: [https://github.com/apache/spark/pull/30518|https://github.com/apache/spark/pull/30518,] was: The unescapedQuoteHandling parameter of DataFrameReader isn't correctly documented. STOP_AT_DELIMITER appears twice, and it looks like that's overwritten an intended option, e.g. [https://github.com/apache/spark/blob/1d550c4e90275ab418b9161925049239227f3dc9/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L744-L749] To view instances where this error occurs, this is a useful query https://github.com/apache/spark/search?q=STOP_AT_DELIMITER It appears that this bug was introduced here: [https://github.com/apache/spark/pull/30518|https://github.com/apache/spark/pull/30518,] > SQL DataFrameReader mode is misdocumented > - > > Key: SPARK-35250 > URL: https://issues.apache.org/jira/browse/SPARK-35250 > Project: Spark > Issue Type: Documentation > Components: docs, Documentation >Affects Versions: 3.1.2 > Environment: >Reporter: Tim McNamara >Priority: Major > Labels: GoodForNewContributors, easy-fix > Original Estimate: 1h > Remaining Estimate: 1h > > The unescapedQuoteHandling parameter of DataFrameReader isn't correctly > documented. STOP_AT_DELIMITER appears twice, and it looks like that's > overwritten an intended option, e.g. > [https://github.com/apache/spark/blob/1d550c4e90275ab418b9161925049239227f3dc9/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L744-L749] > To view instances where this error occurs, this is a useful query > [https://github.com/apache/spark/search?q=STOP_AT_DELIMITER] > It appears that this bug was introduced here: > [https://github.com/apache/spark/pull/30518|https://github.com/apache/spark/pull/30518,] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35250) SQL DataFrameReader mode is misdocumented
Tim McNamara created SPARK-35250: Summary: SQL DataFrameReader mode is misdocumented Key: SPARK-35250 URL: https://issues.apache.org/jira/browse/SPARK-35250 Project: Spark Issue Type: Documentation Components: docs, Documentation Affects Versions: 3.1.2 Environment: Reporter: Tim McNamara The unescapedQuoteHandling parameter of DataFrameReader isn't correctly documented. STOP_AT_DELIMITER appears twice, and it looks like that's overwritten an intended option. e.g. * [https://github.com/apache/spark/blob/1d550c4e90275ab418b9161925049239227f3dc9/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L744-L749] It appears that this bug was introduced here: [https://github.com/apache/spark/pull/30518|https://github.com/apache/spark/pull/30518,] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35236) Support archive files as resources for CREATE FUNCTION USING syntax
[ https://issues.apache.org/jira/browse/SPARK-35236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-35236. -- Fix Version/s: 3.2.0 Resolution: Fixed Fixed in [https://github.com/apache/spark/pull/32359] > Support archive files as resources for CREATE FUNCTION USING syntax > --- > > Key: SPARK-35236 > URL: https://issues.apache.org/jira/browse/SPARK-35236 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 3.2.0 > > > CREATE FUNCTION USING syntax doesn't support archives as resources. > The reason is archives were not supported in Spark SQL. > Now Spark SQL supports archives so I think we can support them for the syntax. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35243) Support columnar execution on ANSI interval types
[ https://issues.apache.org/jira/browse/SPARK-35243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334393#comment-17334393 ] PengLei commented on SPARK-35243: - Working on this > Support columnar execution on ANSI interval types > - > > Key: SPARK-35243 > URL: https://issues.apache.org/jira/browse/SPARK-35243 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Priority: Major > > See SPARK-30066 as reference implementation for CalendarIntervalType () -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35249) to_timestamp can't parse 6 digit microsecond SSSSSS
t oo created SPARK-35249: Summary: to_timestamp can't parse 6 digit microsecond SSSSSS Key: SPARK-35249 URL: https://issues.apache.org/jira/browse/SPARK-35249 Project: Spark Issue Type: Wish Components: SQL Affects Versions: 2.4.6 Reporter: t oo spark-sql> select x, to_timestamp(x,"MMM dd hh:mm:ss.SSSSSS") from (select 'Apr 13 2021 12:00:00.001000AM' x); Apr 13 2021 12:00:00.001000AM NULL Why doesn't the to_timestamp work? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
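The pattern in the report appears truncated by the mail digest, so the format string below is an assumption. A minimal sketch of parsing a six-digit fraction, assuming Spark 3.x's java.time-based parser; on 2.4.x the legacy SimpleDateFormat parser treats 'S' as milliseconds, which is likely why the six-digit fraction does not parse there.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_timestamp}

object ToTimestampMicros {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("to_timestamp-micros").getOrCreate()
    import spark.implicits._

    val df = Seq("Apr 13 2021 12:00:00.001000AM").toDF("x")

    // 'SSSSSS' keeps all six fractional digits and 'a' consumes the AM/PM marker.
    // The full pattern (year, am/pm) is an assumption about the reporter's data.
    df.select(col("x"), to_timestamp(col("x"), "MMM dd yyyy hh:mm:ss.SSSSSSa").as("ts"))
      .show(truncate = false)

    spark.stop()
  }
}
{code}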
[jira] [Resolved] (SPARK-34979) Failed to install pyspark[sql] (due to pyarrow error) on aarch64
[ https://issues.apache.org/jira/browse/SPARK-34979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-34979. -- Fix Version/s: 3.2.0 Assignee: Yikun Jiang Resolution: Fixed Fixed in [https://github.com/apache/spark/pull/32363] > Failed to install pyspark[sql] (due to pyarrow error) on aarch64 > > > Key: SPARK-34979 > URL: https://issues.apache.org/jira/browse/SPARK-34979 > Project: Spark > Issue Type: Task > Components: Documentation, PySpark >Affects Versions: 3.2.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Trivial > Fix For: 3.2.0 > > > ~^$ pip install pyspark[sql]^~ > ~^Collecting pyarrow>=1.0.0^~ > ~^Using cached pyarrow-3.0.0.tar.gz (682 kB)^~ > ~^Installing build dependencies ... done^~ > ~^Getting requirements to build wheel ... done^~ > ~^Preparing wheel metadata ... done^~ > ~^// ... ...^~ > ~^Building wheels for collected packages: pyarrow^~ > ~^Building wheel for pyarrow (PEP 517) ... error^~ > ~^ERROR: Command errored out with exit status 1:^~ > ~^command: /root/venv/bin/python3.8 /tmp/tmpv35m1o0g build_wheel > /tmp/tmpq0n5juib^~ > ~^cwd: /tmp/pip-install-sh0myu71/pyarrow^~ > ~^Complete output (183 lines):^~ > ~^– Running cmake for pyarrow^~ > ~^cmake -DPYTHON_EXECUTABLE=/root/venv/bin/python3.8 > -DPython3_EXECUTABLE=/root/venv/bin/python3.8 -DPYARROW_BUILD_CUDA=off > -DPYARROW_BUILD_FLIGHT=off -DPYARROW_BUILD_GANDIVA=off > -DPYARROW_BUILD_DATASET=off -DPYARROW_BUILD_ORC=off > -DPYARROW_BUILD_PARQUET=off -DPYARROW_BUILD_PLASMA=off -DPYARROW_BUILD_S3=off > -DPYARROW_BUILD_HDFS=off -DPYARROW_USE_TENSORFLOW=off > -DPYARROW_BUNDLE_ARROW_CPP=off -DPYARROW_BUNDLE_BOOST=off > -DPYARROW_GENERATE_COVERAGE=off -DPYARROW_BOOST_USE_SHARED=on > -DPYARROW_PARQUET_USE_SHARED=on -DCMAKE_BUILD_TYPE=release > /tmp/pip-install-sh0myu71/pyarrow^~ > ~^error: command 'cmake' failed with exit status 1^~ > ~^^~ > ~^ERROR: Failed building wheel for pyarrow^~ > ~^Failed to build pyarrow^~ > ~^ERROR: Could not build wheels for pyarrow which use PEP 517 and cannot be > installed directly^~ > > The pip installation would be failed, due to the dependency pyarrow install > failed. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35248) Spark shall load system class first in IsolatedClientLoader
[ https://issues.apache.org/jira/browse/SPARK-35248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333555#comment-17333555 ] Apache Spark commented on SPARK-35248: -- User 'daijyc' has created a pull request for this issue: https://github.com/apache/spark/pull/32373 > Spark shall load system class first in IsolatedClientLoader > --- > > Key: SPARK-35248 > URL: https://issues.apache.org/jira/browse/SPARK-35248 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.7, 3.1.1 >Reporter: Daniel Dai >Priority: Major > > This happens when Spark try to load HMS client jars using > IsolatedClientLoader, in particular, when > "spark.sql.hive.metastore.jars"="builtin" (default). Spark try to load a > conflicting thrift lib in user jar, which results in exception: > {code:java} > 21/04/22 04:10:53 ERROR [main] yarn.Client: Application diagnostics message: > User class threw exception: org.apache.spark.sql.AnalysisException: > java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient; > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106) > at > org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:220) > at > org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114) > at > org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102) > at > org.apache.spark.sql.internal.SharedState.globalTempViewManager$lzycompute(SharedState.scala:141) > at > org.apache.spark.sql.internal.SharedState.globalTempViewManager(SharedState.scala:136) > at > org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$2.apply(HiveSessionStateBuilder.scala:55) > at > org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$2.apply(HiveSessionStateBuilder.scala:55) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager$lzycompute(SessionCatalog.scala:91) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager(SessionCatalog.scala:91) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.isTemporaryTable(SessionCatalog.scala:738) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.isRunningDirectlyOnFiles(Analyzer.scala:748) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.resolveRelation(Analyzer.scala:682) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:714) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:707) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:89) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194) > at > 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsUp(AnalysisHelper.scala:86) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:329) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:87) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86) > at >
[jira] [Commented] (SPARK-35248) Spark shall load system class first in IsolatedClientLoader
[ https://issues.apache.org/jira/browse/SPARK-35248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333554#comment-17333554 ] Apache Spark commented on SPARK-35248: -- User 'daijyc' has created a pull request for this issue: https://github.com/apache/spark/pull/32373 > Spark shall load system class first in IsolatedClientLoader > --- > > Key: SPARK-35248 > URL: https://issues.apache.org/jira/browse/SPARK-35248 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.7, 3.1.1 >Reporter: Daniel Dai >Priority: Major > > This happens when Spark try to load HMS client jars using > IsolatedClientLoader, in particular, when > "spark.sql.hive.metastore.jars"="builtin" (default). Spark try to load a > conflicting thrift lib in user jar, which results in exception: > {code:java} > 21/04/22 04:10:53 ERROR [main] yarn.Client: Application diagnostics message: > User class threw exception: org.apache.spark.sql.AnalysisException: > java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient; > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106) > at > org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:220) > at > org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114) > at > org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102) > at > org.apache.spark.sql.internal.SharedState.globalTempViewManager$lzycompute(SharedState.scala:141) > at > org.apache.spark.sql.internal.SharedState.globalTempViewManager(SharedState.scala:136) > at > org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$2.apply(HiveSessionStateBuilder.scala:55) > at > org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$2.apply(HiveSessionStateBuilder.scala:55) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager$lzycompute(SessionCatalog.scala:91) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager(SessionCatalog.scala:91) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.isTemporaryTable(SessionCatalog.scala:738) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.isRunningDirectlyOnFiles(Analyzer.scala:748) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.resolveRelation(Analyzer.scala:682) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:714) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:707) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:89) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194) > at > 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsUp(AnalysisHelper.scala:86) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:329) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:87) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86) > at >
[jira] [Assigned] (SPARK-35248) Spark shall load system class first in IsolatedClientLoader
[ https://issues.apache.org/jira/browse/SPARK-35248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35248: Assignee: (was: Apache Spark) > Spark shall load system class first in IsolatedClientLoader > --- > > Key: SPARK-35248 > URL: https://issues.apache.org/jira/browse/SPARK-35248 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.7, 3.1.1 >Reporter: Daniel Dai >Priority: Major > > This happens when Spark try to load HMS client jars using > IsolatedClientLoader, in particular, when > "spark.sql.hive.metastore.jars"="builtin" (default). Spark try to load a > conflicting thrift lib in user jar, which results in exception: > {code:java} > 21/04/22 04:10:53 ERROR [main] yarn.Client: Application diagnostics message: > User class threw exception: org.apache.spark.sql.AnalysisException: > java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient; > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106) > at > org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:220) > at > org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114) > at > org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102) > at > org.apache.spark.sql.internal.SharedState.globalTempViewManager$lzycompute(SharedState.scala:141) > at > org.apache.spark.sql.internal.SharedState.globalTempViewManager(SharedState.scala:136) > at > org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$2.apply(HiveSessionStateBuilder.scala:55) > at > org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$2.apply(HiveSessionStateBuilder.scala:55) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager$lzycompute(SessionCatalog.scala:91) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager(SessionCatalog.scala:91) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.isTemporaryTable(SessionCatalog.scala:738) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.isRunningDirectlyOnFiles(Analyzer.scala:748) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.resolveRelation(Analyzer.scala:682) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:714) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:707) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:89) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsUp(AnalysisHelper.scala:86) > at > 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:329) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:87) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194) > at >
[jira] [Assigned] (SPARK-35248) Spark shall load system class first in IsolatedClientLoader
[ https://issues.apache.org/jira/browse/SPARK-35248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35248: Assignee: Apache Spark > Spark shall load system class first in IsolatedClientLoader > --- > > Key: SPARK-35248 > URL: https://issues.apache.org/jira/browse/SPARK-35248 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.7, 3.1.1 >Reporter: Daniel Dai >Assignee: Apache Spark >Priority: Major > > This happens when Spark try to load HMS client jars using > IsolatedClientLoader, in particular, when > "spark.sql.hive.metastore.jars"="builtin" (default). Spark try to load a > conflicting thrift lib in user jar, which results in exception: > {code:java} > 21/04/22 04:10:53 ERROR [main] yarn.Client: Application diagnostics message: > User class threw exception: org.apache.spark.sql.AnalysisException: > java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient; > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106) > at > org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:220) > at > org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114) > at > org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102) > at > org.apache.spark.sql.internal.SharedState.globalTempViewManager$lzycompute(SharedState.scala:141) > at > org.apache.spark.sql.internal.SharedState.globalTempViewManager(SharedState.scala:136) > at > org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$2.apply(HiveSessionStateBuilder.scala:55) > at > org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$2.apply(HiveSessionStateBuilder.scala:55) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager$lzycompute(SessionCatalog.scala:91) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager(SessionCatalog.scala:91) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.isTemporaryTable(SessionCatalog.scala:738) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.isRunningDirectlyOnFiles(Analyzer.scala:748) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.resolveRelation(Analyzer.scala:682) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:714) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:707) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:89) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsUp(AnalysisHelper.scala:86) > at > 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:329) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:87) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194) > at >
[jira] [Created] (SPARK-35248) Spark shall load system class first in IsolatedClientLoader
Daniel Dai created SPARK-35248: -- Summary: Spark shall load system class first in IsolatedClientLoader Key: SPARK-35248 URL: https://issues.apache.org/jira/browse/SPARK-35248 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.1, 2.4.7 Reporter: Daniel Dai This happens when Spark try to load HMS client jars using IsolatedClientLoader, in particular, when "spark.sql.hive.metastore.jars"="builtin" (default). Spark try to load a conflicting thrift lib in user jar, which results in exception: {code:java} 21/04/22 04:10:53 ERROR [main] yarn.Client: Application diagnostics message: User class threw exception: org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient; at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106) at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:220) at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114) at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102) at org.apache.spark.sql.internal.SharedState.globalTempViewManager$lzycompute(SharedState.scala:141) at org.apache.spark.sql.internal.SharedState.globalTempViewManager(SharedState.scala:136) at org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$2.apply(HiveSessionStateBuilder.scala:55) at org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$2.apply(HiveSessionStateBuilder.scala:55) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager$lzycompute(SessionCatalog.scala:91) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager(SessionCatalog.scala:91) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.isTemporaryTable(SessionCatalog.scala:738) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.isRunningDirectlyOnFiles(Analyzer.scala:748) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.resolveRelation(Analyzer.scala:682) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:714) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:707) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:89) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsUp(AnalysisHelper.scala:86) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87) at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:329) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:87) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsUp(AnalysisHelper.scala:86) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87) at
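The gist of the proposal is parent-first (system-first) delegation for shared packages so that a duplicate thrift bundled in a user jar cannot shadow the classes the built-in Hive client expects. A minimal sketch of that delegation pattern, not the actual IsolatedClientLoader change; the prefix list is illustrative only.

{code:scala}
// Sketch only: classes matching shared prefixes are resolved by the parent
// (system) loader before the isolated/user URLs are consulted.
import java.net.{URL, URLClassLoader}

class SystemFirstClassLoader(urls: Array[URL], parent: ClassLoader)
    extends URLClassLoader(urls, null) {

  private val sharedPrefixes =
    Seq("java.", "scala.", "org.apache.spark.", "org.apache.hadoop.")

  override def loadClass(name: String, resolve: Boolean): Class[_] = {
    if (sharedPrefixes.exists(name.startsWith)) {
      // Delegate shared/system classes to the parent first, so conflicting
      // copies in user jars (e.g. a different thrift) are never picked up.
      parent.loadClass(name)
    } else {
      super.loadClass(name, resolve)
    }
  }
}
{code}

Spark already ships similar delegating loaders (for example a child-first URL classloader in its util package); the question raised here is which delegation order IsolatedClientLoader should use for these shared classes.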
[jira] [Commented] (SPARK-35247) Add `mask` built-in function as a data masking function
[ https://issues.apache.org/jira/browse/SPARK-35247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333548#comment-17333548 ] Apache Spark commented on SPARK-35247: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/32372 > Add `mask` built-in function as a data masking function > --- > > Key: SPARK-35247 > URL: https://issues.apache.org/jira/browse/SPARK-35247 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > In the current master, there are no built-in data masking function. > Hive already has such functions so it's great to have. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35247) Add `mask` built-in function as a data masking function
[ https://issues.apache.org/jira/browse/SPARK-35247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35247: Assignee: Kousuke Saruta (was: Apache Spark) > Add `mask` built-in function as a data masking function > --- > > Key: SPARK-35247 > URL: https://issues.apache.org/jira/browse/SPARK-35247 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > In the current master, there are no built-in data masking function. > Hive already has such functions so it's great to have. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35247) Add `mask` built-in function as a data masking function
[ https://issues.apache.org/jira/browse/SPARK-35247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35247: Assignee: Apache Spark (was: Kousuke Saruta) > Add `mask` built-in function as a data masking function > --- > > Key: SPARK-35247 > URL: https://issues.apache.org/jira/browse/SPARK-35247 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark >Priority: Major > > In the current master, there are no built-in data masking function. > Hive already has such functions so it's great to have. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35247) Add `mask` built-in function as a data masking function
Kousuke Saruta created SPARK-35247: -- Summary: Add `mask` built-in function as a data masking function Key: SPARK-35247 URL: https://issues.apache.org/jira/browse/SPARK-35247 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta In the current master, there are no built-in data masking function. Hive already has such functions so it's great to have. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35247) Add `mask` built-in function as a data masking function
[ https://issues.apache.org/jira/browse/SPARK-35247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-35247: --- Issue Type: New Feature (was: Bug) > Add `mask` built-in function as a data masking function > --- > > Key: SPARK-35247 > URL: https://issues.apache.org/jira/browse/SPARK-35247 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > In the current master, there are no built-in data masking function. > Hive already has such functions so it's great to have. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35150) Accelerate fallback BLAS with dev.ludovic.netlib
[ https://issues.apache.org/jira/browse/SPARK-35150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-35150: Assignee: Ludovic Henry > Accelerate fallback BLAS with dev.ludovic.netlib > > > Key: SPARK-35150 > URL: https://issues.apache.org/jira/browse/SPARK-35150 > Project: Spark > Issue Type: Improvement > Components: GraphX, ML, MLlib >Affects Versions: 3.2.0 >Reporter: Ludovic Henry >Assignee: Ludovic Henry >Priority: Major > > Following https://github.com/apache/spark/pull/30810, I've continued looking > for ways to accelerate the usage of BLAS in Spark. With this PR, I integrate > work done in the [{{dev.ludovic.netlib}}|https://github.com/luhenry/netlib/] > Maven package. > The {{dev.ludovic.netlib}} library wraps the original > {{com.github.fommil.netlib}} library and focus on accelerating the linear > algebra routines in use in Spark. When running the > {{org.apache.spark.ml.linalg.BLASBenchmark}}benchmarking suite, I get the > results at [1] on an Intel machine. Moreover, this library is thoroughly > tested to return the exact same results as the reference implementation. > Under the hood, it reimplements the necessary algorithms in pure > autovectorization-friendly Java 8, as well as takes advantage of the Vector > API and Foreign Linker API introduced in JDK 16 when available. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35150) Accelerate fallback BLAS with dev.ludovic.netlib
[ https://issues.apache.org/jira/browse/SPARK-35150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-35150. -- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32253 [https://github.com/apache/spark/pull/32253] > Accelerate fallback BLAS with dev.ludovic.netlib > > > Key: SPARK-35150 > URL: https://issues.apache.org/jira/browse/SPARK-35150 > Project: Spark > Issue Type: Improvement > Components: GraphX, ML, MLlib >Affects Versions: 3.2.0 >Reporter: Ludovic Henry >Assignee: Ludovic Henry >Priority: Major > Fix For: 3.2.0 > > > Following https://github.com/apache/spark/pull/30810, I've continued looking > for ways to accelerate the usage of BLAS in Spark. With this PR, I integrate > work done in the [{{dev.ludovic.netlib}}|https://github.com/luhenry/netlib/] > Maven package. > The {{dev.ludovic.netlib}} library wraps the original > {{com.github.fommil.netlib}} library and focus on accelerating the linear > algebra routines in use in Spark. When running the > {{org.apache.spark.ml.linalg.BLASBenchmark}}benchmarking suite, I get the > results at [1] on an Intel machine. Moreover, this library is thoroughly > tested to return the exact same results as the reference implementation. > Under the hood, it reimplements the necessary algorithms in pure > autovectorization-friendly Java 8, as well as takes advantage of the Vector > API and Foreign Linker API introduced in JDK 16 when available. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
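The change is transparent to user code: existing MLlib linear-algebra calls simply route to the accelerated backend. A small sketch of one such call path (dense matrix-vector multiplication, which goes through the BLAS gemv routine); it runs on any recent Spark, the acceleration only changes which implementation handles it.
{code:scala}
import org.apache.spark.ml.linalg.{Matrices, Vectors}

// Column-major values, so this is the matrix [[1.0, 2.0], [3.0, 4.0]].
val m = Matrices.dense(2, 2, Array(1.0, 3.0, 2.0, 4.0))
val v = Vectors.dense(1.0, 1.0).toDense

// Dispatches to BLAS gemv under the hood; dev.ludovic.netlib accelerates it transparently.
println(m.multiply(v))   // [3.0,7.0]
{code}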
[jira] [Updated] (SPARK-34979) Failed to install pyspark[sql] (due to pyarrow error) on aarch64
[ https://issues.apache.org/jira/browse/SPARK-34979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-34979: - Component/s: Documentation > Failed to install pyspark[sql] (due to pyarrow error) on aarch64 > > > Key: SPARK-34979 > URL: https://issues.apache.org/jira/browse/SPARK-34979 > Project: Spark > Issue Type: Task > Components: Documentation, PySpark >Affects Versions: 3.2.0 >Reporter: Yikun Jiang >Priority: Trivial > > ~^$ pip install pyspark[sql]^~ > ~^Collecting pyarrow>=1.0.0^~ > ~^Using cached pyarrow-3.0.0.tar.gz (682 kB)^~ > ~^Installing build dependencies ... done^~ > ~^Getting requirements to build wheel ... done^~ > ~^Preparing wheel metadata ... done^~ > ~^// ... ...^~ > ~^Building wheels for collected packages: pyarrow^~ > ~^Building wheel for pyarrow (PEP 517) ... error^~ > ~^ERROR: Command errored out with exit status 1:^~ > ~^command: /root/venv/bin/python3.8 /tmp/tmpv35m1o0g build_wheel > /tmp/tmpq0n5juib^~ > ~^cwd: /tmp/pip-install-sh0myu71/pyarrow^~ > ~^Complete output (183 lines):^~ > ~^– Running cmake for pyarrow^~ > ~^cmake -DPYTHON_EXECUTABLE=/root/venv/bin/python3.8 > -DPython3_EXECUTABLE=/root/venv/bin/python3.8 -DPYARROW_BUILD_CUDA=off > -DPYARROW_BUILD_FLIGHT=off -DPYARROW_BUILD_GANDIVA=off > -DPYARROW_BUILD_DATASET=off -DPYARROW_BUILD_ORC=off > -DPYARROW_BUILD_PARQUET=off -DPYARROW_BUILD_PLASMA=off -DPYARROW_BUILD_S3=off > -DPYARROW_BUILD_HDFS=off -DPYARROW_USE_TENSORFLOW=off > -DPYARROW_BUNDLE_ARROW_CPP=off -DPYARROW_BUNDLE_BOOST=off > -DPYARROW_GENERATE_COVERAGE=off -DPYARROW_BOOST_USE_SHARED=on > -DPYARROW_PARQUET_USE_SHARED=on -DCMAKE_BUILD_TYPE=release > /tmp/pip-install-sh0myu71/pyarrow^~ > ~^error: command 'cmake' failed with exit status 1^~ > ~^^~ > ~^ERROR: Failed building wheel for pyarrow^~ > ~^Failed to build pyarrow^~ > ~^ERROR: Could not build wheels for pyarrow which use PEP 517 and cannot be > installed directly^~ > > The pip installation would be failed, due to the dependency pyarrow install > failed. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34979) Failed to install pyspark[sql] (due to pyarrow error) on aarch64
[ https://issues.apache.org/jira/browse/SPARK-34979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-34979: - Priority: Trivial (was: Major) > Failed to install pyspark[sql] (due to pyarrow error) on aarch64 > > > Key: SPARK-34979 > URL: https://issues.apache.org/jira/browse/SPARK-34979 > Project: Spark > Issue Type: Task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Yikun Jiang >Priority: Trivial > > ~^$ pip install pyspark[sql]^~ > ~^Collecting pyarrow>=1.0.0^~ > ~^Using cached pyarrow-3.0.0.tar.gz (682 kB)^~ > ~^Installing build dependencies ... done^~ > ~^Getting requirements to build wheel ... done^~ > ~^Preparing wheel metadata ... done^~ > ~^// ... ...^~ > ~^Building wheels for collected packages: pyarrow^~ > ~^Building wheel for pyarrow (PEP 517) ... error^~ > ~^ERROR: Command errored out with exit status 1:^~ > ~^command: /root/venv/bin/python3.8 /tmp/tmpv35m1o0g build_wheel > /tmp/tmpq0n5juib^~ > ~^cwd: /tmp/pip-install-sh0myu71/pyarrow^~ > ~^Complete output (183 lines):^~ > ~^– Running cmake for pyarrow^~ > ~^cmake -DPYTHON_EXECUTABLE=/root/venv/bin/python3.8 > -DPython3_EXECUTABLE=/root/venv/bin/python3.8 -DPYARROW_BUILD_CUDA=off > -DPYARROW_BUILD_FLIGHT=off -DPYARROW_BUILD_GANDIVA=off > -DPYARROW_BUILD_DATASET=off -DPYARROW_BUILD_ORC=off > -DPYARROW_BUILD_PARQUET=off -DPYARROW_BUILD_PLASMA=off -DPYARROW_BUILD_S3=off > -DPYARROW_BUILD_HDFS=off -DPYARROW_USE_TENSORFLOW=off > -DPYARROW_BUNDLE_ARROW_CPP=off -DPYARROW_BUNDLE_BOOST=off > -DPYARROW_GENERATE_COVERAGE=off -DPYARROW_BOOST_USE_SHARED=on > -DPYARROW_PARQUET_USE_SHARED=on -DCMAKE_BUILD_TYPE=release > /tmp/pip-install-sh0myu71/pyarrow^~ > ~^error: command 'cmake' failed with exit status 1^~ > ~^^~ > ~^ERROR: Failed building wheel for pyarrow^~ > ~^Failed to build pyarrow^~ > ~^ERROR: Could not build wheels for pyarrow which use PEP 517 and cannot be > installed directly^~ > > The pip installation would be failed, due to the dependency pyarrow install > failed. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35246) Streaming-batch intersects are incorrectly allowed through UnsupportedOperationsChecker
[ https://issues.apache.org/jira/browse/SPARK-35246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35246: Assignee: (was: Apache Spark) > Streaming-batch intersects are incorrectly allowed through > UnsupportedOperationsChecker > --- > > Key: SPARK-35246 > URL: https://issues.apache.org/jira/browse/SPARK-35246 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0, 3.1.0 >Reporter: Jose Torres >Priority: Major > Fix For: 3.2.0 > > > The UnsupportedOperationChecker currently rejects streaming intersects only > if both sides are streaming, but they don't work if even one side is > streaming. The following simple test, for example, fails with a cryptic > None.get error because the state store can't plan itself properly. > {code:java} > test("intersect") { > val input = MemoryStream[Long] > val df = input.toDS().intersect(spark.range(10).as[Long]) > testStream(df) ( > AddData(input, 1L), > CheckAnswer(1) > ) > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35246) Streaming-batch intersects are incorrectly allowed through UnsupportedOperationsChecker
[ https://issues.apache.org/jira/browse/SPARK-35246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35246: Assignee: Apache Spark > Streaming-batch intersects are incorrectly allowed through > UnsupportedOperationsChecker > --- > > Key: SPARK-35246 > URL: https://issues.apache.org/jira/browse/SPARK-35246 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0, 3.1.0 >Reporter: Jose Torres >Assignee: Apache Spark >Priority: Major > Fix For: 3.2.0 > > > The UnsupportedOperationChecker currently rejects streaming intersects only > if both sides are streaming, but they don't work if even one side is > streaming. The following simple test, for example, fails with a cryptic > None.get error because the state store can't plan itself properly. > {code:java} > test("intersect") { > val input = MemoryStream[Long] > val df = input.toDS().intersect(spark.range(10).as[Long]) > testStream(df) ( > AddData(input, 1L), > CheckAnswer(1) > ) > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35246) Streaming-batch intersects are incorrectly allowed through UnsupportedOperationsChecker
[ https://issues.apache.org/jira/browse/SPARK-35246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333404#comment-17333404 ] Apache Spark commented on SPARK-35246: -- User 'jose-torres' has created a pull request for this issue: https://github.com/apache/spark/pull/32371 > Streaming-batch intersects are incorrectly allowed through > UnsupportedOperationsChecker > --- > > Key: SPARK-35246 > URL: https://issues.apache.org/jira/browse/SPARK-35246 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0, 3.1.0 >Reporter: Jose Torres >Priority: Major > Fix For: 3.2.0 > > > The UnsupportedOperationChecker currently rejects streaming intersects only > if both sides are streaming, but they don't work if even one side is > streaming. The following simple test, for example, fails with a cryptic > None.get error because the state store can't plan itself properly. > {code:java} > test("intersect") { > val input = MemoryStream[Long] > val df = input.toDS().intersect(spark.range(10).as[Long]) > testStream(df) ( > AddData(input, 1L), > CheckAnswer(1) > ) > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35246) Streaming-batch intersects are incorrectly allowed through UnsupportedOperationsChecker
[ https://issues.apache.org/jira/browse/SPARK-35246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jose Torres updated SPARK-35246: Summary: Streaming-batch intersects are incorrectly allowed through UnsupportedOperationsChecker (was: Disable intersects for all streaming queries) > Streaming-batch intersects are incorrectly allowed through > UnsupportedOperationsChecker > --- > > Key: SPARK-35246 > URL: https://issues.apache.org/jira/browse/SPARK-35246 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0, 3.1.0 >Reporter: Jose Torres >Priority: Major > Fix For: 3.2.0 > > > The UnsupportedOperationChecker currently rejects streaming intersects only > if both sides are streaming, but they don't work if even one side is > streaming. The following simple test, for example, fails with a cryptic > None.get error because the state store can't plan itself properly. > {code:java} > test("intersect") { > val input = MemoryStream[Long] > val df = input.toDS().intersect(spark.range(10).as[Long]) > testStream(df) ( > AddData(input, 1L), > CheckAnswer(1) > ) > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35246) Disable intersects for all streaming queries
Jose Torres created SPARK-35246: --- Summary: Disable intersects for all streaming queries Key: SPARK-35246 URL: https://issues.apache.org/jira/browse/SPARK-35246 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 3.1.0, 3.0.0 Reporter: Jose Torres Fix For: 3.2.0 The UnsupportedOperationChecker currently rejects streaming intersects only if both sides are streaming, but they don't work if even one side is streaming. The following simple test, for example, fails with a cryptic None.get error because the state store can't plan itself properly. {code:java} test("intersect") { val input = MemoryStream[Long] val df = input.toDS().intersect(spark.range(10).as[Long]) testStream(df) ( AddData(input, 1L), CheckAnswer(1) ) } {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
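A sketch, not the actual Spark source, of the tightened rule this ticket calls for: an Intersect should be rejected by the UnsupportedOperationChecker when either child is streaming, not only when both are.
{code:scala}
import org.apache.spark.sql.catalyst.plans.logical.{Intersect, LogicalPlan}

// Illustrative helper only: true if the Intersect involves at least one streaming child
// and therefore cannot be planned in a streaming query.
def isUnsupportedStreamingIntersect(plan: LogicalPlan): Boolean = plan match {
  case i: Intersect => i.children.exists(_.isStreaming)
  case _            => false
}
{code}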
[jira] [Created] (SPARK-35245) DynamicFilter pushdown not working
jean-claude created SPARK-35245: --- Summary: DynamicFilter pushdown not working Key: SPARK-35245 URL: https://issues.apache.org/jira/browse/SPARK-35245 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.1.1 Reporter: jean-claude The pushed filters is always empty. `PushedFilters: []` I was expecting the filters to be pushed down on the probe side of the join. Not sure how to properly configure this to work. For example how to set fallbackFilterRatio ? spark = ( SparkSession.builder .master('local') .config("spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio", 100) .getOrCreate() ) df = spark.read.parquet('abfss://warehouse@/iceberg/opensource//data/timeperiod=2021-04-25/0-0-929b48ef-7ec3-47bd-b0a1-e9172c2dca6a-1.parquet') df.createOrReplaceTempView('TMP') spark.sql(''' explain cost select timeperiod, rrname from TMP where timeperiod in ( select TO_DATE(d, 'MM-dd-') AS timeperiod from values ('01-01-2021'), ('01-01-2021'), ('01-01-2021') tbl(d) ) group by timeperiod, rrname ''').show(truncate=False) |== Optimized Logical Plan == Aggregate [timeperiod#597, rrname#594], [timeperiod#597, rrname#594], Statistics(sizeInBytes=69.0 MiB) +- Join LeftSemi, (timeperiod#597 = timeperiod#669), Statistics(sizeInBytes=69.0 MiB) :- Project [rrname#594, timeperiod#597], Statistics(sizeInBytes=69.0 MiB) : +- Relation[count#591,time_first#592L,time_last#593L,rrname#594,rrtype#595,rdata#596,timeperiod#597] parquet, Statistics(sizeInBytes=198.5 MiB) +- LocalRelation [timeperiod#669], Statistics(sizeInBytes=36.0 B) == Physical Plan == *(2) HashAggregate(keys=[timeperiod#597, rrname#594], functions=[], output=[timeperiod#597, rrname#594]) +- Exchange hashpartitioning(timeperiod#597, rrname#594, 200), ENSURE_REQUIREMENTS, [id=#839] +- *(1) HashAggregate(keys=[timeperiod#597, rrname#594], functions=[], output=[timeperiod#597, rrname#594]) +- *(1) BroadcastHashJoin [timeperiod#597], [timeperiod#669], LeftSemi, BuildRight, false :- *(1) ColumnarToRow : +- FileScan parquet [rrname#594,timeperiod#597] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[abfss://warehouse@/iceberg/opensource/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, date, true]),false), [id=#822] +- LocalTableScan [timeperiod#669] ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
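A note on the report above: dynamic partition pruning surfaces as a runtime partition filter, not as a pushed data filter, so even when it works the scan's PushedFilters list stays empty and the pruning appears under PartitionFilters as a dynamicpruningexpression. Below is a minimal sketch of a query shape DPP targets, with assumed table and column names (a fact_table partitioned by timeperiod and a small dim_dates dimension); it is not the reporter's setup.
{code:scala}
// Assumed names; DPP also requires the probe-side table to actually be partitioned
// by the join column.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

val q = spark.sql("""
  SELECT f.timeperiod, f.rrname
  FROM fact_table f
  JOIN dim_dates d ON f.timeperiod = d.timeperiod
  WHERE d.region = 'EU'
""")

// Look for dynamicpruningexpression(...) under PartitionFilters on the fact_table scan.
q.explain(true)
{code}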
[jira] [Assigned] (SPARK-35244) invoke should throw the original exception
[ https://issues.apache.org/jira/browse/SPARK-35244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35244: Assignee: Wenchen Fan (was: Apache Spark) > invoke should throw the original exception > -- > > Key: SPARK-35244 > URL: https://issues.apache.org/jira/browse/SPARK-35244 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.1, 3.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35244) invoke should throw the original exception
[ https://issues.apache.org/jira/browse/SPARK-35244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35244: Assignee: Apache Spark (was: Wenchen Fan) > invoke should throw the original exception > -- > > Key: SPARK-35244 > URL: https://issues.apache.org/jira/browse/SPARK-35244 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.1, 3.2.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35244) invoke should throw the original exception
[ https://issues.apache.org/jira/browse/SPARK-35244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35244: Assignee: Apache Spark (was: Wenchen Fan) > invoke should throw the original exception > -- > > Key: SPARK-35244 > URL: https://issues.apache.org/jira/browse/SPARK-35244 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.1, 3.2.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35244) invoke should throw the original exception
[ https://issues.apache.org/jira/browse/SPARK-35244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1756#comment-1756 ] Apache Spark commented on SPARK-35244: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/32370 > invoke should throw the original exception > -- > > Key: SPARK-35244 > URL: https://issues.apache.org/jira/browse/SPARK-35244 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.1, 3.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35244) invoke should throw the original exception
Wenchen Fan created SPARK-35244: --- Summary: invoke should throw the original exception Key: SPARK-35244 URL: https://issues.apache.org/jira/browse/SPARK-35244 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.1, 3.0.2, 3.2.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
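A sketch of the general pattern the ticket title refers to (not the actual Spark code): when a reflective invoke fails, rethrow the underlying cause instead of the InvocationTargetException wrapper so the caller sees the original error.
{code:scala}
import java.lang.reflect.{InvocationTargetException, Method}

// Illustrative only: unwrap the reflection wrapper and surface the real exception.
def invokeUnwrapped(target: AnyRef, method: Method, args: AnyRef*): AnyRef = {
  try {
    method.invoke(target, args: _*)
  } catch {
    case e: InvocationTargetException if e.getCause != null => throw e.getCause
  }
}
{code}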
[jira] [Resolved] (SPARK-35238) Add JindoFS SDK in cloud integration documents
[ https://issues.apache.org/jira/browse/SPARK-35238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-35238. -- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32360 [https://github.com/apache/spark/pull/32360] > Add JindoFS SDK in cloud integration documents > -- > > Key: SPARK-35238 > URL: https://issues.apache.org/jira/browse/SPARK-35238 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.3.4, 2.4.7, 3.0.2, 3.1.1 >Reporter: Adrian Wang >Assignee: Adrian Wang >Priority: Minor > Fix For: 3.2.0 > > > As an important cloud provider, Alibaba Cloud presents JindoFS SDK to > maximize the performance for workloads interacting with Alibaba Cloud OSS. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35238) Add JindoFS SDK in cloud integration documents
[ https://issues.apache.org/jira/browse/SPARK-35238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-35238: Assignee: Adrian Wang > Add JindoFS SDK in cloud integration documents > -- > > Key: SPARK-35238 > URL: https://issues.apache.org/jira/browse/SPARK-35238 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.3.4, 2.4.7, 3.0.2, 3.1.1 >Reporter: Adrian Wang >Assignee: Adrian Wang >Priority: Minor > > As an important cloud provider, Alibaba Cloud presents JindoFS SDK to > maximize the performance for workloads interacting with Alibaba Cloud OSS. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35243) Support columnar execution on ANSI interval types
Max Gekk created SPARK-35243: Summary: Support columnar execution on ANSI interval types Key: SPARK-35243 URL: https://issues.apache.org/jira/browse/SPARK-35243 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Max Gekk See SPARK-30066 as a reference implementation for CalendarIntervalType. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35176) Raise TypeError in inappropriate type case rather than ValueError
[ https://issues.apache.org/jira/browse/SPARK-35176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333246#comment-17333246 ] Apache Spark commented on SPARK-35176: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/32368 > Raise TypeError in inappropriate type case rather than ValueError > -- > > Key: SPARK-35176 > URL: https://issues.apache.org/jira/browse/SPARK-35176 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Yikun Jiang >Priority: Minor > > There are many wrong error type usages on ValueError type. > When an operation or function is applied to an object of inappropriate type, > we should use TypeError rather than ValueError. > such as: > [https://github.com/apache/spark/blob/355c39939d9e4c87ffc9538eb822a41cb2ff93fb/python/pyspark/sql/dataframe.py#L1137] > [https://github.com/apache/spark/blob/355c39939d9e4c87ffc9538eb822a41cb2ff93fb/python/pyspark/sql/dataframe.py#L1228] > > We should do some correction in some right time, note that if we do these > corrections, it will break some catch on original ValueError. > > [1] https://docs.python.org/3/library/exceptions.html#TypeError -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35091) Support ANSI intervals by date_part()
[ https://issues.apache.org/jira/browse/SPARK-35091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-35091: --- Assignee: Kent Yao > Support ANSI intervals by date_part() > - > > Key: SPARK-35091 > URL: https://issues.apache.org/jira/browse/SPARK-35091 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Kent Yao >Priority: Major > > Support year-month and day-time intervals by date_part(). For example: > {code:sql} > > SELECT date_part('days', interval '5 0:0:0' day to second); >5 > > SELECT date_part('seconds', interval '10 11:12:30.001001' DAY TO > SECOND); >30.001001 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35091) Support ANSI intervals by date_part()
[ https://issues.apache.org/jira/browse/SPARK-35091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-35091. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32351 [https://github.com/apache/spark/pull/32351] > Support ANSI intervals by date_part() > - > > Key: SPARK-35091 > URL: https://issues.apache.org/jira/browse/SPARK-35091 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Kent Yao >Priority: Major > Fix For: 3.2.0 > > > Support year-month and day-time intervals by date_part(). For example: > {code:sql} > > SELECT date_part('days', interval '5 0:0:0' day to second); >5 > > SELECT date_part('seconds', interval '10 11:12:30.001001' DAY TO > SECOND); >30.001001 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35239) Coalesce shuffle partition should handle empty input RDD
[ https://issues.apache.org/jira/browse/SPARK-35239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-35239: --- Assignee: ulysses you > Coalesce shuffle partition should handle empty input RDD > - > > Key: SPARK-35239 > URL: https://issues.apache.org/jira/browse/SPARK-35239 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: ulysses you >Assignee: ulysses you >Priority: Minor > > If an input RDD partition is empty, the map output statistics will be null. > And if all of a shuffle stage's input RDD partitions are empty, we will skip it and > lose the chance to coalesce partitions. > > We can simply create an empty partition for these custom shuffle readers to > reduce the partition number. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35239) Coalesce shuffle partition should handle empty input RDD
[ https://issues.apache.org/jira/browse/SPARK-35239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-35239. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32362 [https://github.com/apache/spark/pull/32362] > Coalesce shuffle partition should handle empty input RDD > - > > Key: SPARK-35239 > URL: https://issues.apache.org/jira/browse/SPARK-35239 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: ulysses you >Assignee: ulysses you >Priority: Minor > Fix For: 3.2.0 > > > If an input RDD partition is empty, the map output statistics will be null. > And if all of a shuffle stage's input RDD partitions are empty, we will skip it and > lose the chance to coalesce partitions. > > We can simply create an empty partition for these custom shuffle readers to > reduce the partition number. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
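A minimal reproduction sketch of the scenario described in the ticket, under the assumption that adaptive execution is enabled: a shuffle over an empty input should still be coalesced rather than keeping the configured shuffle partition count.
{code:scala}
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

// groupBy forces a shuffle, but the input RDD is empty.
val grouped = spark.range(0).groupBy("id").count()
grouped.collect()                       // no rows
println(grouped.rdd.getNumPartitions)   // expected to be 1 once empty inputs are handled
{code}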
[jira] [Issue Comment Deleted] (SPARK-35020) Group exception messages in catalyst/util
[ https://issues.apache.org/jira/browse/SPARK-35020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-35020: --- Comment: was deleted (was: I'm working on.) > Group exception messages in catalyst/util > - > > Key: SPARK-35020 > URL: https://issues.apache.org/jira/browse/SPARK-35020 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Allison Wang >Priority: Major > > 'sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util' > || Filename || Count || > | ArrayBasedMapBuilder.scala| 3 | > | DateTimeFormatterHelper.scala | 3 | > | DateTimeUtils.scala | 3 | > | IntervalUtils.scala | 2 | > | TimestampFormatter.scala | 1 | > | TypeUtils.scala | 1 | -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35020) Group exception messages in catalyst/util
[ https://issues.apache.org/jira/browse/SPARK-35020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333131#comment-17333131 ] Apache Spark commented on SPARK-35020: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/32367 > Group exception messages in catalyst/util > - > > Key: SPARK-35020 > URL: https://issues.apache.org/jira/browse/SPARK-35020 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Allison Wang >Priority: Major > > 'sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util' > || Filename || Count || > | ArrayBasedMapBuilder.scala| 3 | > | DateTimeFormatterHelper.scala | 3 | > | DateTimeUtils.scala | 3 | > | IntervalUtils.scala | 2 | > | TimestampFormatter.scala | 1 | > | TypeUtils.scala | 1 | -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35020) Group exception messages in catalyst/util
[ https://issues.apache.org/jira/browse/SPARK-35020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35020: Assignee: Apache Spark > Group exception messages in catalyst/util > - > > Key: SPARK-35020 > URL: https://issues.apache.org/jira/browse/SPARK-35020 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Allison Wang >Assignee: Apache Spark >Priority: Major > > 'sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util' > || Filename || Count || > | ArrayBasedMapBuilder.scala| 3 | > | DateTimeFormatterHelper.scala | 3 | > | DateTimeUtils.scala | 3 | > | IntervalUtils.scala | 2 | > | TimestampFormatter.scala | 1 | > | TypeUtils.scala | 1 | -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35020) Group exception messages in catalyst/util
[ https://issues.apache.org/jira/browse/SPARK-35020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35020: Assignee: (was: Apache Spark) > Group exception messages in catalyst/util > - > > Key: SPARK-35020 > URL: https://issues.apache.org/jira/browse/SPARK-35020 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Allison Wang >Priority: Major > > 'sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util' > || Filename || Count || > | ArrayBasedMapBuilder.scala| 3 | > | DateTimeFormatterHelper.scala | 3 | > | DateTimeUtils.scala | 3 | > | IntervalUtils.scala | 2 | > | TimestampFormatter.scala | 1 | > | TypeUtils.scala | 1 | -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34878) Test actual size of year-month and day-time intervals
[ https://issues.apache.org/jira/browse/SPARK-34878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34878: Assignee: (was: Apache Spark) > Test actual size of year-month and day-time intervals > - > > Key: SPARK-34878 > URL: https://issues.apache.org/jira/browse/SPARK-34878 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Priority: Major > > Add tests to > https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/ColumnTypeSuite.scala#L52 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34878) Test actual size of year-month and day-time intervals
[ https://issues.apache.org/jira/browse/SPARK-34878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34878: Assignee: Apache Spark > Test actual size of year-month and day-time intervals > - > > Key: SPARK-34878 > URL: https://issues.apache.org/jira/browse/SPARK-34878 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Add tests to > https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/ColumnTypeSuite.scala#L52 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34878) Test actual size of year-month and day-time intervals
[ https://issues.apache.org/jira/browse/SPARK-34878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333106#comment-17333106 ] Apache Spark commented on SPARK-34878: -- User 'Peng-Lei' has created a pull request for this issue: https://github.com/apache/spark/pull/32366 > Test actual size of year-month and day-time intervals > - > > Key: SPARK-34878 > URL: https://issues.apache.org/jira/browse/SPARK-34878 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Priority: Major > > Add tests to > https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/ColumnTypeSuite.scala#L52 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-35060) Group exception messages in sql/types
[ https://issues.apache.org/jira/browse/SPARK-35060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-35060: --- Comment: was deleted (was: I'm working on.) > Group exception messages in sql/types > - > > Key: SPARK-35060 > URL: https://issues.apache.org/jira/browse/SPARK-35060 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Allison Wang >Assignee: jiaan.geng >Priority: Major > Fix For: 3.2.0 > > > 'sql/catalyst/src/main/scala/org/apache/spark/sql/types' > || Filename || Count || > | Decimal.scala| 4 | > | Metadata.scala | 4 | > | StructType.scala | 1 | > | numerics.scala | 7 | -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35020) Group exception messages in catalyst/util
[ https://issues.apache.org/jira/browse/SPARK-35020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333101#comment-17333101 ] jiaan.geng commented on SPARK-35020: I'm working on. > Group exception messages in catalyst/util > - > > Key: SPARK-35020 > URL: https://issues.apache.org/jira/browse/SPARK-35020 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Allison Wang >Priority: Major > > 'sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util' > || Filename || Count || > | ArrayBasedMapBuilder.scala| 3 | > | DateTimeFormatterHelper.scala | 3 | > | DateTimeUtils.scala | 3 | > | IntervalUtils.scala | 2 | > | TimestampFormatter.scala | 1 | > | TypeUtils.scala | 1 | -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35228) Add expression ToHiveString for keep consistent between hive/spark format in df.show and transform
[ https://issues.apache.org/jira/browse/SPARK-35228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333096#comment-17333096 ] Apache Spark commented on SPARK-35228: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/32365 > Add expression ToHiveString for keep consistent between hive/spark format in > df.show and transform > -- > > Key: SPARK-35228 > URL: https://issues.apache.org/jira/browse/SPARK-35228 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.1 >Reporter: angerszhu >Priority: Major > > According to > [https://github.com/apache/spark/pull/32335#discussion_r620027850] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35228) Add expression ToHiveString for keep consistent between hive/spark format in df.show and transform
[ https://issues.apache.org/jira/browse/SPARK-35228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35228: Assignee: Apache Spark > Add expression ToHiveString for keep consistent between hive/spark format in > df.show and transform > -- > > Key: SPARK-35228 > URL: https://issues.apache.org/jira/browse/SPARK-35228 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.1 >Reporter: angerszhu >Assignee: Apache Spark >Priority: Major > > According to > [https://github.com/apache/spark/pull/32335#discussion_r620027850] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35228) Add expression ToHiveString for keep consistent between hive/spark format in df.show and transform
[ https://issues.apache.org/jira/browse/SPARK-35228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35228: Assignee: (was: Apache Spark) > Add expression ToHiveString for keep consistent between hive/spark format in > df.show and transform > -- > > Key: SPARK-35228 > URL: https://issues.apache.org/jira/browse/SPARK-35228 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.1 >Reporter: angerszhu >Priority: Major > > According to > [https://github.com/apache/spark/pull/32335#discussion_r620027850] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35228) Add expression ToHiveString for keep consistent between hive/spark format in df.show and transform
[ https://issues.apache.org/jira/browse/SPARK-35228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333094#comment-17333094 ] Apache Spark commented on SPARK-35228: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/32365 > Add expression ToHiveString for keep consistent between hive/spark format in > df.show and transform > -- > > Key: SPARK-35228 > URL: https://issues.apache.org/jira/browse/SPARK-35228 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.1 >Reporter: angerszhu >Priority: Major > > According to > [https://github.com/apache/spark/pull/32335#discussion_r620027850] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35217) com.google.protobuf.Parser.parseFrom() method Can't use in spark
[ https://issues.apache.org/jira/browse/SPARK-35217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ShuDaoNan resolved SPARK-35217. --- Resolution: Incomplete Caused by the protobuf-java-version.jar bundled with Amazon EMR being too old. > com.google.protobuf.Parser.parseFrom() method Can't use in spark > -- > > Key: SPARK-35217 > URL: https://issues.apache.org/jira/browse/SPARK-35217 > Project: Spark > Issue Type: Question > Components: Deploy, ML >Affects Versions: 2.4.7 > Environment: *platform:* > Linux ip-172-17-1-1 4.14.219-164.354.amzn2.x86_64 > [#1|https://github.com/apache/spark/pull/1] SMP Mon Feb 22 21:18:39 UTC 2021 > x86_64 x86_64 x86_64 GNU/Linux > Hive 2.3.7, ZooKeeper 3.4.14, Spark 2.4.7, TensorFlow 2.4.1 > (Windows10 do not have this problem.) >Reporter: ShuDaoNan >Priority: Trivial > > *platform:* > Linux ip-172-17-1-1 4.14.219-164.354.amzn2.x86_64 > [#1|https://github.com/apache/spark/pull/1] SMP Mon Feb 22 21:18:39 UTC 2021 > x86_64 x86_64 x86_64 GNU/Linux > Hive 2.3.7, ZooKeeper 3.4.14, Spark 2.4.7, TensorFlow 2.4.1 > (Windows10 do not have this problem.) > *problem:* > Can not load TensorFlow model(Java or scala): > {code:java} > [hadoop@ip-172-17-1-1 ~]$ spark-shell --master local[2] --jars > s3://jars/jasypt-1.9.2.jar,s3://jars/commons-pool2-2.0.jar,s3://jars/tensorflow-core-api-0.3.1.jar,s3://jars/tensorflow-core-api-0.3.1-linux-x86_64-mkl.jar,s3://jars/ndarray-0.3.1.jar,s3://jars/javacpp-1.5.4.jar,s3://jars/tensorflow-core-platform-0.3.1.jar,s3://jars/protobuf-java-3.8.0.jar > Spark session available as 'spark'. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.4.7-amzn-1 > /_/ > > Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_282) > Type in expressions to have them evaluated. > Type :help for more information. > scala> import org.{tensorflow => tf} > import org.{tensorflow=>tf} > scala> val bundle = tf.SavedModelBundle.load("/home/hadoop/xDeepFM","serve") > 2021-04-23 07:32:56.223881: I > external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:32] Reading > SavedModel from: /home/hadoop/xDeepFM > 2021-04-23 07:32:56.266424: I > external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:55] Reading meta > graph with tags { serve } > 2021-04-23 07:32:56.266468: I > external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:93] Reading > SavedModel debug info (if present) from: /home/hadoop/xDeepFM > 2021-04-23 07:32:56.346757: I > external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:206] Restoring > SavedModel bundle. > 2021-04-23 07:32:56.873838: I > external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:190] Running > initialization op on SavedModel bundle at path: /home/hadoop/xDeepFM > 2021-04-23 07:32:56.928656: I > external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:277] SavedModel > load for tags { serve }; Status: success: OK. Took 704788 microseconds. > java.lang.NoSuchMethodError: > com.google.protobuf.Parser.parseFrom(Ljava/nio/ByteBuffer;)Ljava/lang/Object; > at > org.tensorflow.proto.framework.MetaGraphDef.parseFrom(MetaGraphDef.java:3067) > at org.tensorflow.SavedModelBundle.load(SavedModelBundle.java:422) > at org.tensorflow.SavedModelBundle.access$000(SavedModelBundle.java:59) > at org.tensorflow.SavedModelBundle$Loader.load(SavedModelBundle.java:68) > at org.tensorflow.SavedModelBundle.load(SavedModelBundle.java:242) > ... 
49 elided > scala> > {code} > *But I can load the model in a single 'Testtf.java' like that:* > {code:java} > [hadoop@ip-172-17-1-1 ~]$ vi Testtf.java > import org.tensorflow.SavedModelBundle; > public class Testtf { > public static void main(String[] args) { > System.out.println("test load..."); > SavedModelBundle bundle = > org.tensorflow.SavedModelBundle.load("/home/hadoop/xDeepFM","serve"); > System.out.println("loaded bundle..."); > System.out.println(bundle); > } > } > [hadoop@ip-172-17-1-1 ~]$ javac -cp > javacpp-1.5.4-linux-x86_64.jar:tensorflow-core-api-0.3.1-linux-x86_64.jar:tensorflow-core-api-0.3.1.jar:ndarray-0.3.1.jar:javacpp-1.5.4.jar:tensorflow-core-platform-0.3.1.jar:tensorflow-core-platform-mkl-0.3.1.jar:protobuf-java-3.8.0.jar > Testtf.java > [hadoop@ip-172-17-1-1 ~]$ java -cp > javacpp-1.5.4-linux-x86_64.jar:tensorflow-core-api-0.3.1-linux-x86_64.jar:tensorflow-core-api-0.3.1.jar:ndarray-0.3.1.jar:javacpp-1.5.4.jar:tensorflow-core-platform-0.3.1.jar:tensorflow-core-platform-mkl-0.3.1.jar:protobuf-java-3.8.0.jar:. > Testtf > test load... > 2021-04-25 02:33:47.247120: I > external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:32] Reading > SavedModel from: /home/hadoop/xDeepFM > 2021-04-25
[jira] [Updated] (SPARK-35217) com.google.protobuf.Parser.parseFrom() method Can't use in spark
[ https://issues.apache.org/jira/browse/SPARK-35217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ShuDaoNan updated SPARK-35217: -- Priority: Trivial (was: Major) > com.google.protobuf.Parser.parseFrom() method Can't use in spark > -- > > Key: SPARK-35217 > URL: https://issues.apache.org/jira/browse/SPARK-35217 > Project: Spark > Issue Type: Question > Components: Deploy, ML >Affects Versions: 2.4.7 > Environment: *platform:* > Linux ip-172-17-1-1 4.14.219-164.354.amzn2.x86_64 > [#1|https://github.com/apache/spark/pull/1] SMP Mon Feb 22 21:18:39 UTC 2021 > x86_64 x86_64 x86_64 GNU/Linux > Hive 2.3.7, ZooKeeper 3.4.14, Spark 2.4.7, TensorFlow 2.4.1 > (Windows10 do not have this problem.) >Reporter: ShuDaoNan >Priority: Trivial > > *platform:* > Linux ip-172-17-1-1 4.14.219-164.354.amzn2.x86_64 > [#1|https://github.com/apache/spark/pull/1] SMP Mon Feb 22 21:18:39 UTC 2021 > x86_64 x86_64 x86_64 GNU/Linux > Hive 2.3.7, ZooKeeper 3.4.14, Spark 2.4.7, TensorFlow 2.4.1 > (Windows10 do not have this problem.) > *problem:* > Can not load TensorFlow model(Java or scala): > {code:java} > [hadoop@ip-172-17-1-1 ~]$ spark-shell --master local[2] --jars > s3://jars/jasypt-1.9.2.jar,s3://jars/commons-pool2-2.0.jar,s3://jars/tensorflow-core-api-0.3.1.jar,s3://jars/tensorflow-core-api-0.3.1-linux-x86_64-mkl.jar,s3://jars/ndarray-0.3.1.jar,s3://jars/javacpp-1.5.4.jar,s3://jars/tensorflow-core-platform-0.3.1.jar,s3://jars/protobuf-java-3.8.0.jar > Spark session available as 'spark'. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.4.7-amzn-1 > /_/ > > Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_282) > Type in expressions to have them evaluated. > Type :help for more information. > scala> import org.{tensorflow => tf} > import org.{tensorflow=>tf} > scala> val bundle = tf.SavedModelBundle.load("/home/hadoop/xDeepFM","serve") > 2021-04-23 07:32:56.223881: I > external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:32] Reading > SavedModel from: /home/hadoop/xDeepFM > 2021-04-23 07:32:56.266424: I > external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:55] Reading meta > graph with tags { serve } > 2021-04-23 07:32:56.266468: I > external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:93] Reading > SavedModel debug info (if present) from: /home/hadoop/xDeepFM > 2021-04-23 07:32:56.346757: I > external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:206] Restoring > SavedModel bundle. > 2021-04-23 07:32:56.873838: I > external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:190] Running > initialization op on SavedModel bundle at path: /home/hadoop/xDeepFM > 2021-04-23 07:32:56.928656: I > external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:277] SavedModel > load for tags { serve }; Status: success: OK. Took 704788 microseconds. > java.lang.NoSuchMethodError: > com.google.protobuf.Parser.parseFrom(Ljava/nio/ByteBuffer;)Ljava/lang/Object; > at > org.tensorflow.proto.framework.MetaGraphDef.parseFrom(MetaGraphDef.java:3067) > at org.tensorflow.SavedModelBundle.load(SavedModelBundle.java:422) > at org.tensorflow.SavedModelBundle.access$000(SavedModelBundle.java:59) > at org.tensorflow.SavedModelBundle$Loader.load(SavedModelBundle.java:68) > at org.tensorflow.SavedModelBundle.load(SavedModelBundle.java:242) > ... 
49 elided > scala> > {code} > *But I can load the model in a single 'Testtf.java' like that:* > {code:java} > [hadoop@ip-172-17-1-1 ~]$ vi Testtf.java > import org.tensorflow.SavedModelBundle; > public class Testtf { > public static void main(String[] args) { > System.out.println("test load..."); > SavedModelBundle bundle = > org.tensorflow.SavedModelBundle.load("/home/hadoop/xDeepFM","serve"); > System.out.println("loaded bundle..."); > System.out.println(bundle); > } > } > [hadoop@ip-172-17-1-1 ~]$ javac -cp > javacpp-1.5.4-linux-x86_64.jar:tensorflow-core-api-0.3.1-linux-x86_64.jar:tensorflow-core-api-0.3.1.jar:ndarray-0.3.1.jar:javacpp-1.5.4.jar:tensorflow-core-platform-0.3.1.jar:tensorflow-core-platform-mkl-0.3.1.jar:protobuf-java-3.8.0.jar > Testtf.java > [hadoop@ip-172-17-1-1 ~]$ java -cp > javacpp-1.5.4-linux-x86_64.jar:tensorflow-core-api-0.3.1-linux-x86_64.jar:tensorflow-core-api-0.3.1.jar:ndarray-0.3.1.jar:javacpp-1.5.4.jar:tensorflow-core-platform-0.3.1.jar:tensorflow-core-platform-mkl-0.3.1.jar:protobuf-java-3.8.0.jar:. > Testtf > test load... > 2021-04-25 02:33:47.247120: I > external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:32] Reading > SavedModel from: /home/hadoop/xDeepFM > 2021-04-25 02:33:47.275349: I >
[jira] [Assigned] (SPARK-35242) Support change catalog default database for spark
[ https://issues.apache.org/jira/browse/SPARK-35242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35242: Assignee: (was: Apache Spark) > Support change catalog default database for spark > - > > Key: SPARK-35242 > URL: https://issues.apache.org/jira/browse/SPARK-35242 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: hong dongdong >Priority: Major > > Spark catalog default database can only be 'default'. When we cannot access > 'default', we will get the exception 'Permission denied:'. We should support > changing the default database for the catalog, as 'jdbc/thrift' does. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35242) Support change catalog default database for spark
[ https://issues.apache.org/jira/browse/SPARK-35242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333055#comment-17333055 ] Apache Spark commented on SPARK-35242: -- User 'hddong' has created a pull request for this issue: https://github.com/apache/spark/pull/32364 > Support change catalog default database for spark > - > > Key: SPARK-35242 > URL: https://issues.apache.org/jira/browse/SPARK-35242 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: hong dongdong >Priority: Major > > Spark catalog default database can only be 'default'. When we cannot access > 'default', we will get the exception 'Permission denied:'. We should support > changing the default database for the catalog, as 'jdbc/thrift' does. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35242) Support change catalog default database for spark
[ https://issues.apache.org/jira/browse/SPARK-35242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35242: Assignee: Apache Spark > Support change catalog default database for spark > - > > Key: SPARK-35242 > URL: https://issues.apache.org/jira/browse/SPARK-35242 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: hong dongdong >Assignee: Apache Spark >Priority: Major > > Spark catalog default database can only be 'default'. When we cannot access > 'default', we will get the exception 'Permission denied:'. We should support > changing the default database for the catalog, as 'jdbc/thrift' does. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35242) Support change catalog default database for spark
[ https://issues.apache.org/jira/browse/SPARK-35242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hong dongdong updated SPARK-35242: -- Summary: Support change catalog default database for spark (was: Support change default database for spark sql) > Support change catalog default database for spark > - > > Key: SPARK-35242 > URL: https://issues.apache.org/jira/browse/SPARK-35242 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: hong dongdong >Priority: Major > > Spark catalog default database can only be 'default'. When we cannot access > 'default', we will get the exception 'Permission denied:'. We should support > changing the default database for the catalog, as 'jdbc/thrift' does. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35242) Support change default database for spark sql
[ https://issues.apache.org/jira/browse/SPARK-35242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hong dongdong updated SPARK-35242: -- Description: Spark catalog default database can only be 'default'. When we cannot access 'default', we will get the exception 'Permission denied:'. We should support changing the default database for the catalog, as 'jdbc/thrift' does. > Support change default database for spark sql > - > > Key: SPARK-35242 > URL: https://issues.apache.org/jira/browse/SPARK-35242 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: hong dongdong >Priority: Major > > Spark catalog default database can only be 'default'. When we cannot access > 'default', we will get the exception 'Permission denied:'. We should support > changing the default database for the catalog, as 'jdbc/thrift' does. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-34979) Failed to install pyspark[sql] (due to pyarrow error) on aarch64
[ https://issues.apache.org/jira/browse/SPARK-34979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333038#comment-17333038 ] Yikun Jiang edited comment on SPARK-34979 at 4/27/21, 8:34 AM: --- Note that arrow 4.0.0 has been published, and pyarrow has been published today, [1] [https://github.com/apache/arrow/releases/tag/apache-arrow-4.0.0] [2] [https://pypi.org/project/pyarrow/4.0.0/] was (Author: yikunkero): Note that arrow 4.0.0 has been published, and pyarrow has been published today, [1] [https://github.com/apache/arrow/releases/tag/apache-arrow-4.0.[2]0|https://github.com/apache/arrow/releases/tag/apache-arrow-4.0.0] [2] [https://pypi.org/project/pyarrow/4.0.0/] > Failed to install pyspark[sql] (due to pyarrow error) on aarch64 > > > Key: SPARK-34979 > URL: https://issues.apache.org/jira/browse/SPARK-34979 > Project: Spark > Issue Type: Task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Yikun Jiang >Priority: Major > > ~^$ pip install pyspark[sql]^~ > ~^Collecting pyarrow>=1.0.0^~ > ~^Using cached pyarrow-3.0.0.tar.gz (682 kB)^~ > ~^Installing build dependencies ... done^~ > ~^Getting requirements to build wheel ... done^~ > ~^Preparing wheel metadata ... done^~ > ~^// ... ...^~ > ~^Building wheels for collected packages: pyarrow^~ > ~^Building wheel for pyarrow (PEP 517) ... error^~ > ~^ERROR: Command errored out with exit status 1:^~ > ~^command: /root/venv/bin/python3.8 /tmp/tmpv35m1o0g build_wheel > /tmp/tmpq0n5juib^~ > ~^cwd: /tmp/pip-install-sh0myu71/pyarrow^~ > ~^Complete output (183 lines):^~ > ~^– Running cmake for pyarrow^~ > ~^cmake -DPYTHON_EXECUTABLE=/root/venv/bin/python3.8 > -DPython3_EXECUTABLE=/root/venv/bin/python3.8 -DPYARROW_BUILD_CUDA=off > -DPYARROW_BUILD_FLIGHT=off -DPYARROW_BUILD_GANDIVA=off > -DPYARROW_BUILD_DATASET=off -DPYARROW_BUILD_ORC=off > -DPYARROW_BUILD_PARQUET=off -DPYARROW_BUILD_PLASMA=off -DPYARROW_BUILD_S3=off > -DPYARROW_BUILD_HDFS=off -DPYARROW_USE_TENSORFLOW=off > -DPYARROW_BUNDLE_ARROW_CPP=off -DPYARROW_BUNDLE_BOOST=off > -DPYARROW_GENERATE_COVERAGE=off -DPYARROW_BOOST_USE_SHARED=on > -DPYARROW_PARQUET_USE_SHARED=on -DCMAKE_BUILD_TYPE=release > /tmp/pip-install-sh0myu71/pyarrow^~ > ~^error: command 'cmake' failed with exit status 1^~ > ~^^~ > ~^ERROR: Failed building wheel for pyarrow^~ > ~^Failed to build pyarrow^~ > ~^ERROR: Could not build wheels for pyarrow which use PEP 517 and cannot be > installed directly^~ > > The pip installation would be failed, due to the dependency pyarrow install > failed. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34979) Failed to install pyspark[sql] (due to pyarrow error) on aarch64
[ https://issues.apache.org/jira/browse/SPARK-34979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333039#comment-17333039 ] Apache Spark commented on SPARK-34979: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/32363 > Failed to install pyspark[sql] (due to pyarrow error) on aarch64 > > > Key: SPARK-34979 > URL: https://issues.apache.org/jira/browse/SPARK-34979 > Project: Spark > Issue Type: Task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Yikun Jiang >Priority: Major > > ~^$ pip install pyspark[sql]^~ > ~^Collecting pyarrow>=1.0.0^~ > ~^Using cached pyarrow-3.0.0.tar.gz (682 kB)^~ > ~^Installing build dependencies ... done^~ > ~^Getting requirements to build wheel ... done^~ > ~^Preparing wheel metadata ... done^~ > ~^// ... ...^~ > ~^Building wheels for collected packages: pyarrow^~ > ~^Building wheel for pyarrow (PEP 517) ... error^~ > ~^ERROR: Command errored out with exit status 1:^~ > ~^command: /root/venv/bin/python3.8 /tmp/tmpv35m1o0g build_wheel > /tmp/tmpq0n5juib^~ > ~^cwd: /tmp/pip-install-sh0myu71/pyarrow^~ > ~^Complete output (183 lines):^~ > ~^– Running cmake for pyarrow^~ > ~^cmake -DPYTHON_EXECUTABLE=/root/venv/bin/python3.8 > -DPython3_EXECUTABLE=/root/venv/bin/python3.8 -DPYARROW_BUILD_CUDA=off > -DPYARROW_BUILD_FLIGHT=off -DPYARROW_BUILD_GANDIVA=off > -DPYARROW_BUILD_DATASET=off -DPYARROW_BUILD_ORC=off > -DPYARROW_BUILD_PARQUET=off -DPYARROW_BUILD_PLASMA=off -DPYARROW_BUILD_S3=off > -DPYARROW_BUILD_HDFS=off -DPYARROW_USE_TENSORFLOW=off > -DPYARROW_BUNDLE_ARROW_CPP=off -DPYARROW_BUNDLE_BOOST=off > -DPYARROW_GENERATE_COVERAGE=off -DPYARROW_BOOST_USE_SHARED=on > -DPYARROW_PARQUET_USE_SHARED=on -DCMAKE_BUILD_TYPE=release > /tmp/pip-install-sh0myu71/pyarrow^~ > ~^error: command 'cmake' failed with exit status 1^~ > ~^^~ > ~^ERROR: Failed building wheel for pyarrow^~ > ~^Failed to build pyarrow^~ > ~^ERROR: Could not build wheels for pyarrow which use PEP 517 and cannot be > installed directly^~ > > The pip installation would be failed, due to the dependency pyarrow install > failed. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34979) Failed to install pyspark[sql] (due to pyarrow error) on aarch64
[ https://issues.apache.org/jira/browse/SPARK-34979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333038#comment-17333038 ] Yikun Jiang commented on SPARK-34979: - Note that arrow 4.0.0 has been published, and pyarrow has been published today, [1] [https://github.com/apache/arrow/releases/tag/apache-arrow-4.0.[2]0|https://github.com/apache/arrow/releases/tag/apache-arrow-4.0.0] [2] [https://pypi.org/project/pyarrow/4.0.0/] > Failed to install pyspark[sql] (due to pyarrow error) on aarch64 > > > Key: SPARK-34979 > URL: https://issues.apache.org/jira/browse/SPARK-34979 > Project: Spark > Issue Type: Task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Yikun Jiang >Priority: Major > > ~^$ pip install pyspark[sql]^~ > ~^Collecting pyarrow>=1.0.0^~ > ~^Using cached pyarrow-3.0.0.tar.gz (682 kB)^~ > ~^Installing build dependencies ... done^~ > ~^Getting requirements to build wheel ... done^~ > ~^Preparing wheel metadata ... done^~ > ~^// ... ...^~ > ~^Building wheels for collected packages: pyarrow^~ > ~^Building wheel for pyarrow (PEP 517) ... error^~ > ~^ERROR: Command errored out with exit status 1:^~ > ~^command: /root/venv/bin/python3.8 /tmp/tmpv35m1o0g build_wheel > /tmp/tmpq0n5juib^~ > ~^cwd: /tmp/pip-install-sh0myu71/pyarrow^~ > ~^Complete output (183 lines):^~ > ~^– Running cmake for pyarrow^~ > ~^cmake -DPYTHON_EXECUTABLE=/root/venv/bin/python3.8 > -DPython3_EXECUTABLE=/root/venv/bin/python3.8 -DPYARROW_BUILD_CUDA=off > -DPYARROW_BUILD_FLIGHT=off -DPYARROW_BUILD_GANDIVA=off > -DPYARROW_BUILD_DATASET=off -DPYARROW_BUILD_ORC=off > -DPYARROW_BUILD_PARQUET=off -DPYARROW_BUILD_PLASMA=off -DPYARROW_BUILD_S3=off > -DPYARROW_BUILD_HDFS=off -DPYARROW_USE_TENSORFLOW=off > -DPYARROW_BUNDLE_ARROW_CPP=off -DPYARROW_BUNDLE_BOOST=off > -DPYARROW_GENERATE_COVERAGE=off -DPYARROW_BOOST_USE_SHARED=on > -DPYARROW_PARQUET_USE_SHARED=on -DCMAKE_BUILD_TYPE=release > /tmp/pip-install-sh0myu71/pyarrow^~ > ~^error: command 'cmake' failed with exit status 1^~ > ~^^~ > ~^ERROR: Failed building wheel for pyarrow^~ > ~^Failed to build pyarrow^~ > ~^ERROR: Could not build wheels for pyarrow which use PEP 517 and cannot be > installed directly^~ > > The pip installation would be failed, due to the dependency pyarrow install > failed. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34979) Failed to install pyspark[sql] (due to pyarrow error) on aarch64
[ https://issues.apache.org/jira/browse/SPARK-34979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34979: Assignee: Apache Spark > Failed to install pyspark[sql] (due to pyarrow error) on aarch64 > > > Key: SPARK-34979 > URL: https://issues.apache.org/jira/browse/SPARK-34979 > Project: Spark > Issue Type: Task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Yikun Jiang >Assignee: Apache Spark >Priority: Major > > ~^$ pip install pyspark[sql]^~ > ~^Collecting pyarrow>=1.0.0^~ > ~^Using cached pyarrow-3.0.0.tar.gz (682 kB)^~ > ~^Installing build dependencies ... done^~ > ~^Getting requirements to build wheel ... done^~ > ~^Preparing wheel metadata ... done^~ > ~^// ... ...^~ > ~^Building wheels for collected packages: pyarrow^~ > ~^Building wheel for pyarrow (PEP 517) ... error^~ > ~^ERROR: Command errored out with exit status 1:^~ > ~^command: /root/venv/bin/python3.8 /tmp/tmpv35m1o0g build_wheel > /tmp/tmpq0n5juib^~ > ~^cwd: /tmp/pip-install-sh0myu71/pyarrow^~ > ~^Complete output (183 lines):^~ > ~^– Running cmake for pyarrow^~ > ~^cmake -DPYTHON_EXECUTABLE=/root/venv/bin/python3.8 > -DPython3_EXECUTABLE=/root/venv/bin/python3.8 -DPYARROW_BUILD_CUDA=off > -DPYARROW_BUILD_FLIGHT=off -DPYARROW_BUILD_GANDIVA=off > -DPYARROW_BUILD_DATASET=off -DPYARROW_BUILD_ORC=off > -DPYARROW_BUILD_PARQUET=off -DPYARROW_BUILD_PLASMA=off -DPYARROW_BUILD_S3=off > -DPYARROW_BUILD_HDFS=off -DPYARROW_USE_TENSORFLOW=off > -DPYARROW_BUNDLE_ARROW_CPP=off -DPYARROW_BUNDLE_BOOST=off > -DPYARROW_GENERATE_COVERAGE=off -DPYARROW_BOOST_USE_SHARED=on > -DPYARROW_PARQUET_USE_SHARED=on -DCMAKE_BUILD_TYPE=release > /tmp/pip-install-sh0myu71/pyarrow^~ > ~^error: command 'cmake' failed with exit status 1^~ > ~^^~ > ~^ERROR: Failed building wheel for pyarrow^~ > ~^Failed to build pyarrow^~ > ~^ERROR: Could not build wheels for pyarrow which use PEP 517 and cannot be > installed directly^~ > > The pip installation would be failed, due to the dependency pyarrow install > failed. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34979) Failed to install pyspark[sql] (due to pyarrow error) on aarch64
[ https://issues.apache.org/jira/browse/SPARK-34979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34979: Assignee: (was: Apache Spark) > Failed to install pyspark[sql] (due to pyarrow error) on aarch64 > > > Key: SPARK-34979 > URL: https://issues.apache.org/jira/browse/SPARK-34979 > Project: Spark > Issue Type: Task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Yikun Jiang >Priority: Major > > ~^$ pip install pyspark[sql]^~ > ~^Collecting pyarrow>=1.0.0^~ > ~^Using cached pyarrow-3.0.0.tar.gz (682 kB)^~ > ~^Installing build dependencies ... done^~ > ~^Getting requirements to build wheel ... done^~ > ~^Preparing wheel metadata ... done^~ > ~^// ... ...^~ > ~^Building wheels for collected packages: pyarrow^~ > ~^Building wheel for pyarrow (PEP 517) ... error^~ > ~^ERROR: Command errored out with exit status 1:^~ > ~^command: /root/venv/bin/python3.8 /tmp/tmpv35m1o0g build_wheel > /tmp/tmpq0n5juib^~ > ~^cwd: /tmp/pip-install-sh0myu71/pyarrow^~ > ~^Complete output (183 lines):^~ > ~^– Running cmake for pyarrow^~ > ~^cmake -DPYTHON_EXECUTABLE=/root/venv/bin/python3.8 > -DPython3_EXECUTABLE=/root/venv/bin/python3.8 -DPYARROW_BUILD_CUDA=off > -DPYARROW_BUILD_FLIGHT=off -DPYARROW_BUILD_GANDIVA=off > -DPYARROW_BUILD_DATASET=off -DPYARROW_BUILD_ORC=off > -DPYARROW_BUILD_PARQUET=off -DPYARROW_BUILD_PLASMA=off -DPYARROW_BUILD_S3=off > -DPYARROW_BUILD_HDFS=off -DPYARROW_USE_TENSORFLOW=off > -DPYARROW_BUNDLE_ARROW_CPP=off -DPYARROW_BUNDLE_BOOST=off > -DPYARROW_GENERATE_COVERAGE=off -DPYARROW_BOOST_USE_SHARED=on > -DPYARROW_PARQUET_USE_SHARED=on -DCMAKE_BUILD_TYPE=release > /tmp/pip-install-sh0myu71/pyarrow^~ > ~^error: command 'cmake' failed with exit status 1^~ > ~^^~ > ~^ERROR: Failed building wheel for pyarrow^~ > ~^Failed to build pyarrow^~ > ~^ERROR: Could not build wheels for pyarrow which use PEP 517 and cannot be > installed directly^~ > > The pip installation would be failed, due to the dependency pyarrow install > failed. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35242) Support change default database for spark sql
hong dongdong created SPARK-35242: - Summary: Support change default database for spark sql Key: SPARK-35242 URL: https://issues.apache.org/jira/browse/SPARK-35242 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.1 Reporter: hong dongdong -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35239) Coalesce shuffle partition should handle empty input RDD
[ https://issues.apache.org/jira/browse/SPARK-35239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35239: Assignee: (was: Apache Spark) > Coalesce shuffle partition should handle empty input RDD > - > > Key: SPARK-35239 > URL: https://issues.apache.org/jira/browse/SPARK-35239 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: ulysses you >Priority: Minor > > If an input RDD partition is empty, the map output statistics will be null. > And if all shuffle stages' input RDD partitions are empty, we will skip them and > lose the chance to coalesce partitions. > > We can simply create an empty partition for these custom shuffle readers to > reduce the partition number. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35239) Coalesce shuffle partition should handle empty input RDD
[ https://issues.apache.org/jira/browse/SPARK-35239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35239: Assignee: Apache Spark > Coalesce shuffle partition should handle empty input RDD > - > > Key: SPARK-35239 > URL: https://issues.apache.org/jira/browse/SPARK-35239 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: ulysses you >Assignee: Apache Spark >Priority: Minor > > If an input RDD partition is empty, the map output statistics will be null. > And if all shuffle stages' input RDD partitions are empty, we will skip them and > lose the chance to coalesce partitions. > > We can simply create an empty partition for these custom shuffle readers to > reduce the partition number. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35239) Coalesce shuffle partition should handle empty input RDD
[ https://issues.apache.org/jira/browse/SPARK-35239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333016#comment-17333016 ] Apache Spark commented on SPARK-35239: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/32362 > Coalesce shuffle partition should handle empty input RDD > - > > Key: SPARK-35239 > URL: https://issues.apache.org/jira/browse/SPARK-35239 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: ulysses you >Priority: Minor > > If an input RDD partition is empty, the map output statistics will be null. > And if all shuffle stages' input RDD partitions are empty, we will skip them and > lose the chance to coalesce partitions. > > We can simply create an empty partition for these custom shuffle readers to > reduce the partition number. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
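A minimal sketch of the scenario the ticket describes, assuming a local session with AQE enabled. The schema, values, and the final partition count are illustrative, not taken from the pull request; the point is that a shuffle over an empty input produces no map output statistics to drive coalescing.

{code:scala}
// Aggregate over an empty, non-local input so the optimizer cannot simply fold
// the query away: with no map output statistics, AQE partition coalescing may be
// skipped and the plan can keep the configured spark.sql.shuffle.partitions.
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("empty-input-coalesce")
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
  .config("spark.sql.shuffle.partitions", "200")
  .getOrCreate()

val schema = StructType(Seq(StructField("key", IntegerType), StructField("value", IntegerType)))
val empty = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)

val aggregated = empty.groupBy("key").count()
aggregated.collect()

// Before the proposed change this may still report the full 200 shuffle partitions
// instead of a single (empty) partition.
println(aggregated.rdd.getNumPartitions)
{code}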
[jira] [Created] (SPARK-35241) Investigate to prefer vectorized hash map in hash aggregate selectively
Cheng Su created SPARK-35241: Summary: Investigate to prefer vectorized hash map in hash aggregate selectively Key: SPARK-35241 URL: https://issues.apache.org/jira/browse/SPARK-35241 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Cheng Su In hash aggregate, we always use the row-based hash map as the first-level hash map in production, and use the vectorized hash map only in testing / benchmarking. However, we do find in micro-benchmarks that the vectorized hash map is better than the row-based hash map, e.g. with a single key - [https://github.com/apache/spark/pull/32357#discussion_r620914345] . So we should re-evaluate the decision to always use the row-based hash map, and maybe come up with a more adaptive decision policy that chooses which map to use depending on the keys / values. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
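A sketch of the kind of single-key aggregation the linked benchmark discussion refers to. The vectorized first-level hash map is normally enabled only through an internal, test-oriented flag; the conf name below is as recalled and may differ between Spark versions, so treat it as an assumption rather than a documented setting.

{code:scala}
// Single integer key, sum aggregation: the shape where the vectorized hash map
// tends to beat the row-based map in micro-benchmarks. The conf below is an
// internal/test-only flag (name assumed; verify against the SQLConf of your version).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("hash-agg-sketch").getOrCreate()
spark.conf.set("spark.sql.codegen.aggregate.map.vectorized.enable", "true")

val result = spark.range(0L, 10000000L)
  .selectExpr("id % 1000 AS key", "id AS value")
  .groupBy("key")
  .sum("value")

// Force execution so the hash aggregate actually runs.
println(result.count())
{code}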
[jira] [Comment Edited] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
[ https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333000#comment-17333000 ] Anthony edited comment on SPARK-18105 at 4/27/21, 7:26 AM: --- We are seeing similar issues with Spark 3.0.1 as well. Not exactly reproducible but often seen {code:java} org.apache.spark.shuffle.FetchFailedException: Stream is corrupted at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:748) at org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:823) at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at java.io.BufferedInputStream.read(BufferedInputStream.java:265) at java.io.DataInputStream.readInt(DataInputStream.java:387) at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.readSize(UnsafeRowSerializer.scala:113) at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.(UnsafeRowSerializer.scala:120) at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.asKeyValueIterator(UnsafeRowSerializer.scala:110) at org.apache.spark.shuffle.BlockStoreShuffleReader.$anonfun$read$2(BlockStoreShuffleReader.scala:92) at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage72.sort_addToSorter_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage72.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729) at org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83) at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedBufferedToRowWithNullFreeJoinKey(SortMergeJoinExec.scala:860) at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.(SortMergeJoinExec.scala:730) at org.apache.spark.sql.execution.joins.SortMergeJoinExec.$anonfun$doExecute$1(SortMergeJoinExec.scala:254) at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.IOException: Stream is corrupted 
at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:250) at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157) at org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:819) ... 35 more Caused by: net.jpountz.lz4.LZ4Exception: Error decoding offset 10122 of input buffer at net.jpountz.lz4.LZ4JNIFastDecompressor.decompress(LZ4JNIFastDecompressor.java:39) at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:245) ... 37 more {code} was (Author: jangryeo): We are seeing similar issues with Spark 3.0.1 as well. Not exactly reproducible but often seen org.apache.spark.shuffle.FetchFailedException: Stream is corrupted at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:748) at org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:823) at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at java.io.BufferedInputStream.read(BufferedInputStream.java:265) at java.io.DataInputStream.readInt(DataInputStream.java:387) at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.readSize(UnsafeRowSerializer.scala:113) at
[jira] [Commented] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
[ https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333000#comment-17333000 ] Anthony commented on SPARK-18105: - We are seeing similar issues with Spark 3.0.1 as well. Not exactly reproducible but often seen org.apache.spark.shuffle.FetchFailedException: Stream is corrupted at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:748) at org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:823) at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at java.io.BufferedInputStream.read(BufferedInputStream.java:265) at java.io.DataInputStream.readInt(DataInputStream.java:387) at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.readSize(UnsafeRowSerializer.scala:113) at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.(UnsafeRowSerializer.scala:120) at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.asKeyValueIterator(UnsafeRowSerializer.scala:110) at org.apache.spark.shuffle.BlockStoreShuffleReader.$anonfun$read$2(BlockStoreShuffleReader.scala:92) at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage72.sort_addToSorter_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage72.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729) at org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83) at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedBufferedToRowWithNullFreeJoinKey(SortMergeJoinExec.scala:860) at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.(SortMergeJoinExec.scala:730) at org.apache.spark.sql.execution.joins.SortMergeJoinExec.$anonfun$doExecute$1(SortMergeJoinExec.scala:254) at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.IOException: Stream is corrupted at 
net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:250) at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157) at org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:819) ... 35 more Caused by: net.jpountz.lz4.LZ4Exception: Error decoding offset 10122 of input buffer at net.jpountz.lz4.LZ4JNIFastDecompressor.decompress(LZ4JNIFastDecompressor.java:39) at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:245) ... 37 more > LZ4 failed to decompress a stream of shuffled data > -- > > Key: SPARK-18105 > URL: https://issues.apache.org/jira/browse/SPARK-18105 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Davies Liu >Priority: Major > > When lz4 is used to compress the shuffle files, it may fail to decompress it > as "stream is corrupt" > {code} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 92 in stage 5.0 failed 4 times, most recent failure: Lost task 92.3 in > stage 5.0 (TID 16616, 10.0.27.18): java.io.IOException: Stream is corrupted > at > org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:220) > at >
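Not a fix for the ticket itself, but a mitigation commonly tried while the root cause of the corrupted stream is investigated: move the shuffle/spill compression off lz4 via the standard codec setting. The codec names are existing, documented values; whether this avoids the failure in a given cluster is an assumption to verify.

{code:scala}
// "spark.io.compression.codec" accepts lz4 (the default), lzf, snappy and zstd.
// Switching codecs only sidesteps the lz4 decode path; it does not explain the corruption.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("lz4-corruption-mitigation")
  .config("spark.io.compression.codec", "zstd")
  .getOrCreate()
{code}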
[jira] [Commented] (SPARK-35240) Use CheckpointFileManager for checkpoint manipulation
[ https://issues.apache.org/jira/browse/SPARK-35240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17332996#comment-17332996 ] Apache Spark commented on SPARK-35240: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/32361 > Use CheckpointFileManager for checkpoint manipulation > - > > Key: SPARK-35240 > URL: https://issues.apache.org/jira/browse/SPARK-35240 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > `CheckpointFileManager` is designed to handle checkpoint file manipulation. > However, there are a few places exposing FileSystem from checkpoint > files/paths. We should use `CheckpointFileManager` to manipulate checkpoint > files. For example, we may want to have one storage system for checkpoint > file. If all checkpoint file manipulation is performed through > `CheckpointFileManager`, we can only implement `CheckpointFileManager` for > the storage system, and don't need to implement FileSystem API for it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35240) Use CheckpointFileManager for checkpoint manipulation
[ https://issues.apache.org/jira/browse/SPARK-35240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35240: Assignee: Apache Spark (was: L. C. Hsieh) > Use CheckpointFileManager for checkpoint manipulation > - > > Key: SPARK-35240 > URL: https://issues.apache.org/jira/browse/SPARK-35240 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: Apache Spark >Priority: Major > > `CheckpointFileManager` is designed to handle checkpoint file manipulation. > However, there are a few places exposing FileSystem from checkpoint > files/paths. We should use `CheckpointFileManager` to manipulate checkpoint > files. For example, we may want to have one storage system for checkpoint > file. If all checkpoint file manipulation is performed through > `CheckpointFileManager`, we can only implement `CheckpointFileManager` for > the storage system, and don't need to implement FileSystem API for it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35240) Use CheckpointFileManager for checkpoint manipulation
[ https://issues.apache.org/jira/browse/SPARK-35240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35240: Assignee: L. C. Hsieh (was: Apache Spark) > Use CheckpointFileManager for checkpoint manipulation > - > > Key: SPARK-35240 > URL: https://issues.apache.org/jira/browse/SPARK-35240 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > `CheckpointFileManager` is designed to handle checkpoint file manipulation. > However, there are a few places exposing FileSystem from checkpoint > files/paths. We should use `CheckpointFileManager` to manipulate checkpoint > files. For example, we may want to have one storage system for checkpoint > file. If all checkpoint file manipulation is performed through > `CheckpointFileManager`, we can only implement `CheckpointFileManager` for > the storage system, and don't need to implement FileSystem API for it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35240) Use CheckpointFileManager for checkpoint manipulation
[ https://issues.apache.org/jira/browse/SPARK-35240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17332995#comment-17332995 ] Apache Spark commented on SPARK-35240: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/32361 > Use CheckpointFileManager for checkpoint manipulation > - > > Key: SPARK-35240 > URL: https://issues.apache.org/jira/browse/SPARK-35240 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > `CheckpointFileManager` is designed to handle checkpoint file manipulation. > However, there are a few places exposing FileSystem from checkpoint > files/paths. We should use `CheckpointFileManager` to manipulate checkpoint > files. For example, we may want to have one storage system for checkpoint > file. If all checkpoint file manipulation is performed through > `CheckpointFileManager`, we can only implement `CheckpointFileManager` for > the storage system, and don't need to implement FileSystem API for it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
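A rough sketch of the direction the ticket describes: route checkpoint path manipulation through Spark's internal streaming checkpoint abstraction instead of resolving a Hadoop FileSystem from the path. This is internal API, so the method names below are as recalled and may differ across Spark versions; the helper itself is a hypothetical illustration, not code from the linked pull request.

{code:scala}
// Illustrative only: ensure a checkpoint directory exists via CheckpointFileManager,
// so a custom storage system only needs a CheckpointFileManager implementation
// rather than a full Hadoop FileSystem. Method names assumed from the internal trait.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.execution.streaming.CheckpointFileManager

def ensureCheckpointDir(checkpointLocation: String, hadoopConf: Configuration): Unit = {
  val path = new Path(checkpointLocation)
  // Picks a concrete manager (rename-based or direct-write) for the path's scheme.
  val fm = CheckpointFileManager.create(path, hadoopConf)
  if (!fm.exists(path)) {
    fm.mkdirs(path)
  }
}
{code}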