[jira] [Updated] (SPARK-35252) PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter is null

2021-04-27 Thread lynn (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lynn updated SPARK-35252:
-
Description: 
The code of "MyPartitionReaderFactory":
{code:scala}
// Implementation class
package com.lynn.spark.sql.v2

import org.apache.spark.internal.Logging
import org.apache.spark.sql.catalyst.InternalRow
import com.lynn.spark.sql.v2.MyPartitionReaderFactory.{MY_VECTORIZED_READER_BATCH_SIZE, MY_VECTORIZED_READER_ENABLED}
import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory}
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.internal.SQLConf.buildConf
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.vectorized.ColumnarBatch

case class MyPartitionReaderFactory(sqlConf: SQLConf,
                                    dataSchema: StructType,
                                    readSchema: StructType)
  extends PartitionReaderFactory with Logging {

  // Both settings are read from the SQLConf that is passed in on the driver.
  val enableVectorized = sqlConf.getConf(MY_VECTORIZED_READER_ENABLED, false)
  val batchSize = sqlConf.getConf(MY_VECTORIZED_READER_BATCH_SIZE, 4096)

  override def createReader(partition: InputPartition): PartitionReader[InternalRow] = {
    MyRowReader(batchSize, dataSchema, readSchema)
  }

  override def createColumnarReader(partition: InputPartition): PartitionReader[ColumnarBatch] = {
    if (!supportColumnarReads(partition))
      throw new UnsupportedOperationException("Cannot create columnar reader.")

    MyColumnReader(batchSize, dataSchema, readSchema)
  }

  override def supportColumnarReads(partition: InputPartition) = enableVectorized
}

object MyPartitionReaderFactory {

  val MY_VECTORIZED_READER_ENABLED =
    buildConf("spark.sql.my.enableVectorizedReader")
      .doc("Enables vectorized my source scan.")
      .version("1.0.0")
      .booleanConf
      .createWithDefault(false)

  val MY_VECTORIZED_READER_BATCH_SIZE =
    buildConf("spark.sql.my.columnarReaderBatchSize")
      .doc("The number of rows to include in a my source vectorized reader batch. The number should " +
        "be carefully chosen to minimize overhead and avoid OOMs in reading data.")
      .version("1.0.0")
      .intConf
      .createWithDefault(4096)
}
{code}

The driver constructs an RDD instance (DataSourceRDD), and the sqlConf parameter passed to MyPartitionReaderFactory is not null at that point.
But when the executor deserializes the RDD, the sqlConf field of the factory is null.

The executor-side code is as follows:

{code:scala}
// ResultTask.runTask (Spark core)
override def runTask(context: TaskContext): U = {
  // Deserialize the RDD and the func using the broadcast variables.
  val threadMXBean = ManagementFactory.getThreadMXBean
  val deserializeStartTimeNs = System.nanoTime()
  val deserializeStartCpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
    threadMXBean.getCurrentThreadCpuTime
  } else 0L
  val ser = SparkEnv.get.closureSerializer.newInstance()
  // The RDD (and the reader factory it references) is deserialized here;
  // at this point the factory's sqlConf field is observed to be null.
  val (rdd, func) = ser.deserialize[(RDD[T], (TaskContext, Iterator[T]) => U)](
    ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
  _executorDeserializeTimeNs = System.nanoTime() - deserializeStartTimeNs
  _executorDeserializeCpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
    threadMXBean.getCurrentThreadCpuTime - deserializeStartCpuTime
  } else 0L

  func(context, rdd.iterator(partition, context))
}
{code}
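
One possible workaround (a minimal sketch, assuming the factory only needs the two settings above; MyPartitionReaderFactory2 and the wiring shown are hypothetical, not code from this report): resolve the values from SQLConf on the driver and keep only plain serializable fields in the factory, so nothing depends on the SQLConf reference surviving task serialization.

{code:scala}
// Hypothetical sketch: the settings are resolved on the driver and the factory
// carries only primitives, so it no longer holds a SQLConf reference when it is
// serialized to the executors. MyRowReader/MyColumnReader are the readers above.
case class MyPartitionReaderFactory2(enableVectorized: Boolean,
                                     batchSize: Int,
                                     dataSchema: StructType,
                                     readSchema: StructType)
  extends PartitionReaderFactory with Logging {

  override def createReader(partition: InputPartition): PartitionReader[InternalRow] =
    MyRowReader(batchSize, dataSchema, readSchema)

  override def createColumnarReader(partition: InputPartition): PartitionReader[ColumnarBatch] = {
    if (!supportColumnarReads(partition))
      throw new UnsupportedOperationException("Cannot create columnar reader.")
    MyColumnReader(batchSize, dataSchema, readSchema)
  }

  override def supportColumnarReads(partition: InputPartition): Boolean = enableVectorized
}

// Built on the driver (e.g. in the Batch implementation's createReaderFactory),
// where the session's SQLConf is still available:
// val factory = MyPartitionReaderFactory2(
//   sqlConf.getConf(MY_VECTORIZED_READER_ENABLED, false),
//   sqlConf.getConf(MY_VECTORIZED_READER_BATCH_SIZE, 4096),
//   dataSchema,
//   readSchema)
{code}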



[jira] [Updated] (SPARK-35252) PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter is null

2021-04-27 Thread lynn (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lynn updated SPARK-35252:
-
Affects Version/s: 3.1.1

> PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter 
> is null
> -
>
> Key: SPARK-35252
> URL: https://issues.apache.org/jira/browse/SPARK-35252
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1
>Reporter: lynn
>Priority: Major
> Attachments: spark-sqlconf-isnull.png
>
>






[jira] [Updated] (SPARK-35252) PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter is null

2021-04-27 Thread lynn (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lynn updated SPARK-35252:
-
Attachment: (was: znbase-sqlconf-isnull.png)

> PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter 
> is null
> -
>
> Key: SPARK-35252
> URL: https://issues.apache.org/jira/browse/SPARK-35252
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2
>Reporter: lynn
>Priority: Major
> Attachments: spark-sqlconf-isnull.png
>
>






[jira] [Updated] (SPARK-35252) PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter is null

2021-04-27 Thread lynn (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lynn updated SPARK-35252:
-
Attachment: spark-sqlconf-isnull.png

> PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter 
> is null
> -
>
> Key: SPARK-35252
> URL: https://issues.apache.org/jira/browse/SPARK-35252
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2
>Reporter: lynn
>Priority: Major
> Attachments: spark-sqlconf-isnull.png, znbase-sqlconf-isnull.png
>
>






[jira] [Updated] (SPARK-35252) PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter is null

2021-04-27 Thread lynn (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lynn updated SPARK-35252:
-
Attachment: znbase-sqlconf-isnull.png

> PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter 
> is null
> -
>
> Key: SPARK-35252
> URL: https://issues.apache.org/jira/browse/SPARK-35252
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2
>Reporter: lynn
>Priority: Major
> Attachments: spark-sqlconf-isnull.png, znbase-sqlconf-isnull.png
>
>






[jira] [Resolved] (SPARK-34878) Test actual size of year-month and day-time intervals

2021-04-27 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-34878.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32366
[https://github.com/apache/spark/pull/32366]

> Test actual size of year-month and day-time intervals
> -
>
> Key: SPARK-34878
> URL: https://issues.apache.org/jira/browse/SPARK-34878
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: PengLei
>Priority: Major
> Fix For: 3.2.0
>
>
> Add tests to 
> https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/ColumnTypeSuite.scala#L52






[jira] [Assigned] (SPARK-34878) Test actual size of year-month and day-time intervals

2021-04-27 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-34878:


Assignee: PengLei







[jira] [Commented] (SPARK-32083) Unnecessary tasks are launched when input is empty with AQE

2021-04-27 Thread Manu Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334443#comment-17334443
 ] 

Manu Zhang commented on SPARK-32083:


This issue has been fixed in https://issues.apache.org/jira/browse/SPARK-35239

> Unnecessary tasks are launched when input is empty with AQE
> ---
>
> Key: SPARK-32083
> URL: https://issues.apache.org/jira/browse/SPARK-32083
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Assignee: Wenchen Fan
>Priority: Minor
> Fix For: 3.0.1, 3.1.0
>
>
> [https://github.com/apache/spark/pull/28226] meant to avoid launching 
> unnecessary tasks for 0-size partitions when AQE is enabled. However, when 
> all partitions are empty, the number of partitions will be 
> `spark.sql.adaptive.coalescePartitions.initialPartitionNum` and (a lot of) 
> unnecessary tasks are launched in this case.
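
For reference, the settings involved can be illustrated with a hypothetical spark-shell snippet (the value 1000 is arbitrary, not taken from this issue):

{code:scala}
// With AQE enabled, a shuffle starts from the configured initial partition number
// before coalescing; per the description above, when every partition is empty this
// is also the number of tasks that may be launched unnecessarily.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "1000")
{code}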











[jira] [Updated] (SPARK-35252) PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter is null

2021-04-27 Thread lynn (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lynn updated SPARK-35252:
-
Description: 
The codes of "MyPartitionReaderFactory" :
{code:java}
// Implemention Class
package com.lynn.spark.sql.v2

import org.apache.spark.internal.Logging
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, 
PartitionReaderFactory}
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.vectorized.ColumnarBatch

case class MyPartitionReaderFactory(sqlConf: SQLConf,
dataSchema: StructType,
readSchema: StructType)
  extends PartitionReaderFactory with Logging {

  def batchSize = sqlConf.getConf("spark.sql.my.vectorized.batch.size")

  override def createReader(partition: InputPartition): 
PartitionReader[InternalRow] = {
MyRowReader(batchSize, dataSchema, readSchema)
  }

  override def createColumnarReader(partition: InputPartition): 
PartitionReader[ColumnarBatch] = {
if(!supportColumnarReads(partition))
  throw new UnsupportedOperationException("Cannot create columnar reader.")

//TODO: delete this line
println(sqlConf.getAllConfs)

MyColumnReader(batchSize, dataSchema, readSchema)

  }

  override def supportColumnarReads(partition: InputPartition) = true

}
{code}

The driver construct a RDD instance, the sqlConf parameter pass to the 
MyPartitionReaderFactory  is not null.
But when the executor deserialize the RDD, the sqlConf parameter is null.

  was:
The codes of "MyPartitionReaderFactory" of 
{code:java}
// Implemention Class
package com.lynn.spark.sql.v2

import org.apache.spark.internal.Logging
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, 
PartitionReaderFactory}
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.vectorized.ColumnarBatch

case class MyPartitionReaderFactory(sqlConf: SQLConf,
dataSchema: StructType,
readSchema: StructType)
  extends PartitionReaderFactory with Logging {

  def batchSize = sqlConf.getConf("spark.sql.my.vectorized.batch.size")

  override def createReader(partition: InputPartition): 
PartitionReader[InternalRow] = {
MyRowReader(batchSize, dataSchema, readSchema)
  }

  override def createColumnarReader(partition: InputPartition): 
PartitionReader[ColumnarBatch] = {
if(!supportColumnarReads(partition))
  throw new UnsupportedOperationException("Cannot create columnar reader.")

//TODO: delete this line
println(sqlConf.getAllConfs)

MyColumnReader(batchSize, dataSchema, readSchema)

  }

  override def supportColumnarReads(partition: InputPartition) = true

}
{code}


> PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter 
> is null
> -
>
> Key: SPARK-35252
> URL: https://issues.apache.org/jira/browse/SPARK-35252
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2
>Reporter: lynn
>Priority: Major
>
> The codes of "MyPartitionReaderFactory" :
> {code:java}
> // Implemention Class
> package com.lynn.spark.sql.v2
> import org.apache.spark.internal.Logging
> import org.apache.spark.sql.catalyst.InternalRow
> import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, 
> PartitionReaderFactory}
> import org.apache.spark.sql.internal.SQLConf
> import org.apache.spark.sql.types.StructType
> import org.apache.spark.sql.vectorized.ColumnarBatch
> case class MyPartitionReaderFactory(sqlConf: SQLConf,
> dataSchema: StructType,
> readSchema: StructType)
>   extends PartitionReaderFactory with Logging {
>   def batchSize = sqlConf.getConf("spark.sql.my.vectorized.batch.size")
>   override def createReader(partition: InputPartition): 
> PartitionReader[InternalRow] = {
> MyRowReader(batchSize, dataSchema, readSchema)
>   }
>   override def createColumnarReader(partition: InputPartition): 
> PartitionReader[ColumnarBatch] = {
> if(!supportColumnarReads(partition))
>   throw new UnsupportedOperationException("Cannot create columnar 
> reader.")
> //TODO: delete this line
> println(sqlConf.getAllConfs)
> MyColumnReader(batchSize, dataSchema, readSchema)
>   }
>   override def supportColumnarReads(partition: InputPartition) = true
> }
> {code}
> The driver construct a RDD instance, the sqlConf parameter pass to the 
> MyPartitionReaderFactory  is not null.
> But when the executor deserialize the 

[jira] [Updated] (SPARK-35252) PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter is null

2021-04-27 Thread lynn (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lynn updated SPARK-35252:
-
Description: 
The codes of "MyPartitionReaderFactory" of 
{code:java}
// Implemention Class
package com.lynn.spark.sql.v2

import org.apache.spark.internal.Logging
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, 
PartitionReaderFactory}
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.vectorized.ColumnarBatch

case class MyPartitionReaderFactory(sqlConf: SQLConf,
dataSchema: StructType,
readSchema: StructType)
  extends PartitionReaderFactory with Logging {

  def batchSize = sqlConf.getConf("spark.sql.my.vectorized.batch.size")

  override def createReader(partition: InputPartition): 
PartitionReader[InternalRow] = {
MyRowReader(batchSize, dataSchema, readSchema)
  }

  override def createColumnarReader(partition: InputPartition): 
PartitionReader[ColumnarBatch] = {
if(!supportColumnarReads(partition))
  throw new UnsupportedOperationException("Cannot create columnar reader.")

//TODO: delete this line
println(sqlConf.getAllConfs)

MyColumnReader(batchSize, dataSchema, readSchema)

  }

  override def supportColumnarReads(partition: InputPartition) = true

}
{code}

  was:
The codes of 
{code:java}
// Implemention Class
package com.lynn.spark.sql.v2

import org.apache.spark.internal.Logging
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, 
PartitionReaderFactory}
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.vectorized.ColumnarBatch

case class MyPartitionReaderFactory(sqlConf: SQLConf,
dataSchema: StructType,
readSchema: StructType)
  extends PartitionReaderFactory with Logging {

  def batchSize = sqlConf.getConf("spark.sql.my.vectorized.batch.size")

  override def createReader(partition: InputPartition): 
PartitionReader[InternalRow] = {
MyRowReader(batchSize, dataSchema, readSchema)
  }

  override def createColumnarReader(partition: InputPartition): 
PartitionReader[ColumnarBatch] = {
if(!supportColumnarReads(partition))
  throw new UnsupportedOperationException("Cannot create columnar reader.")

//TODO: delete this line
println(sqlConf.getAllConfs)

MyColumnReader(batchSize, dataSchema, readSchema)

  }

  override def supportColumnarReads(partition: InputPartition) = true

}
{code}


> PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter 
> is null
> -
>
> Key: SPARK-35252
> URL: https://issues.apache.org/jira/browse/SPARK-35252
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2
>Reporter: lynn
>Priority: Major
>
> The code of "MyPartitionReaderFactory":
> {code:java}
> // Implemention Class
> package com.lynn.spark.sql.v2
> import org.apache.spark.internal.Logging
> import org.apache.spark.sql.catalyst.InternalRow
> import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, 
> PartitionReaderFactory}
> import org.apache.spark.sql.internal.SQLConf
> import org.apache.spark.sql.types.StructType
> import org.apache.spark.sql.vectorized.ColumnarBatch
> case class MyPartitionReaderFactory(sqlConf: SQLConf,
> dataSchema: StructType,
> readSchema: StructType)
>   extends PartitionReaderFactory with Logging {
>   def batchSize = sqlConf.getConf("spark.sql.my.vectorized.batch.size")
>   override def createReader(partition: InputPartition): 
> PartitionReader[InternalRow] = {
> MyRowReader(batchSize, dataSchema, readSchema)
>   }
>   override def createColumnarReader(partition: InputPartition): 
> PartitionReader[ColumnarBatch] = {
> if(!supportColumnarReads(partition))
>   throw new UnsupportedOperationException("Cannot create columnar 
> reader.")
> //TODO: delete this line
> println(sqlConf.getAllConfs)
> MyColumnReader(batchSize, dataSchema, readSchema)
>   }
>   override def supportColumnarReads(partition: InputPartition) = true
> }
> {code}






[jira] [Updated] (SPARK-35252) PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter is null

2021-04-27 Thread lynn (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lynn updated SPARK-35252:
-
Summary: PartitionReaderFactory's Implemention Class of DataSourceV2 
sqlConf parameter is null  (was: PartitionReaderFactory's Implemention Class of 
DataSourceV2 sqlConfi parameter is null)

> PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConf parameter 
> is null
> -
>
> Key: SPARK-35252
> URL: https://issues.apache.org/jira/browse/SPARK-35252
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2
>Reporter: lynn
>Priority: Major
>
> The codes of 
> {code:java}
> // Implemention Class
> package com.lynn.spark.sql.v2
> import org.apache.spark.internal.Logging
> import org.apache.spark.sql.catalyst.InternalRow
> import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, 
> PartitionReaderFactory}
> import org.apache.spark.sql.internal.SQLConf
> import org.apache.spark.sql.types.StructType
> import org.apache.spark.sql.vectorized.ColumnarBatch
> case class MyPartitionReaderFactory(sqlConf: SQLConf,
> dataSchema: StructType,
> readSchema: StructType)
>   extends PartitionReaderFactory with Logging {
>   def batchSize = sqlConf.getConf("spark.sql.my.vectorized.batch.size")
>   override def createReader(partition: InputPartition): 
> PartitionReader[InternalRow] = {
> MyRowReader(batchSize, dataSchema, readSchema)
>   }
>   override def createColumnarReader(partition: InputPartition): 
> PartitionReader[ColumnarBatch] = {
> if(!supportColumnarReads(partition))
>   throw new UnsupportedOperationException("Cannot create columnar 
> reader.")
> //TODO: delete this line
> println(sqlConf.getAllConfs)
> MyColumnReader(batchSize, dataSchema, readSchema)
>   }
>   override def supportColumnarReads(partition: InputPartition) = true
> }
> {code}






[jira] [Updated] (SPARK-35252) PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConfi parameter is null

2021-04-27 Thread lynn (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lynn updated SPARK-35252:
-
Description: 
The codes of 
{code:java}
// Implemention Class
package com.lynn.spark.sql.v2

import org.apache.spark.internal.Logging
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, 
PartitionReaderFactory}
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.vectorized.ColumnarBatch

case class MyPartitionReaderFactory(sqlConf: SQLConf,
dataSchema: StructType,
readSchema: StructType)
  extends PartitionReaderFactory with Logging {

  def batchSize = sqlConf.getConf("spark.sql.my.vectorized.batch.size")

  override def createReader(partition: InputPartition): 
PartitionReader[InternalRow] = {
MyRowReader(batchSize, dataSchema, readSchema)
  }

  override def createColumnarReader(partition: InputPartition): 
PartitionReader[ColumnarBatch] = {
if(!supportColumnarReads(partition))
  throw new UnsupportedOperationException("Cannot create columnar reader.")

//TODO: delete this line
println(sqlConf.getAllConfs)

MyColumnReader(batchSize, dataSchema, readSchema)

  }

  override def supportColumnarReads(partition: InputPartition) = true

}
{code}

  was:
{code:java}
// Implemention Class
package com.lynn.spark.sql.v2

import org.apache.spark.internal.Logging
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, 
PartitionReaderFactory}
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.vectorized.ColumnarBatch

case class MyPartitionReaderFactory(sqlConf: SQLConf,
dataSchema: StructType,
readSchema: StructType)
  extends PartitionReaderFactory with Logging {

  def batchSize = sqlConf.getConf("spark.sql.my.vectorized.batch.size")

  override def createReader(partition: InputPartition): 
PartitionReader[InternalRow] = {
MyRowReader(batchSize, dataSchema, readSchema)
  }

  override def createColumnarReader(partition: InputPartition): 
PartitionReader[ColumnarBatch] = {
if(!supportColumnarReads(partition))
  throw new UnsupportedOperationException("Cannot create columnar reader.")

//TODO: delete this line
println(sqlConf.getAllConfs)

MyColumnReader(batchSize, dataSchema, readSchema)

  }

  override def supportColumnarReads(partition: InputPartition) = true

}
{code}


> PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConfi 
> parameter is null
> --
>
> Key: SPARK-35252
> URL: https://issues.apache.org/jira/browse/SPARK-35252
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2
>Reporter: lynn
>Priority: Major
>
> The codes of 
> {code:java}
> // Implemention Class
> package com.lynn.spark.sql.v2
> import org.apache.spark.internal.Logging
> import org.apache.spark.sql.catalyst.InternalRow
> import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, 
> PartitionReaderFactory}
> import org.apache.spark.sql.internal.SQLConf
> import org.apache.spark.sql.types.StructType
> import org.apache.spark.sql.vectorized.ColumnarBatch
> case class MyPartitionReaderFactory(sqlConf: SQLConf,
> dataSchema: StructType,
> readSchema: StructType)
>   extends PartitionReaderFactory with Logging {
>   def batchSize = sqlConf.getConf("spark.sql.my.vectorized.batch.size")
>   override def createReader(partition: InputPartition): 
> PartitionReader[InternalRow] = {
> MyRowReader(batchSize, dataSchema, readSchema)
>   }
>   override def createColumnarReader(partition: InputPartition): 
> PartitionReader[ColumnarBatch] = {
> if(!supportColumnarReads(partition))
>   throw new UnsupportedOperationException("Cannot create columnar 
> reader.")
> //TODO: delete this line
> println(sqlConf.getAllConfs)
> MyColumnReader(batchSize, dataSchema, readSchema)
>   }
>   override def supportColumnarReads(partition: InputPartition) = true
> }
> {code}






[jira] [Created] (SPARK-35252) PartitionReaderFactory's Implemention Class of DataSourceV2 sqlConfi parameter is null

2021-04-27 Thread lynn (Jira)
lynn created SPARK-35252:


 Summary: PartitionReaderFactory's Implemention Class of 
DataSourceV2 sqlConfi parameter is null
 Key: SPARK-35252
 URL: https://issues.apache.org/jira/browse/SPARK-35252
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.2
Reporter: lynn


{code:java}
// Implemention Class
package com.lynn.spark.sql.v2

import org.apache.spark.internal.Logging
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, 
PartitionReaderFactory}
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.vectorized.ColumnarBatch

case class MyPartitionReaderFactory(sqlConf: SQLConf,
dataSchema: StructType,
readSchema: StructType)
  extends PartitionReaderFactory with Logging {

  def batchSize = sqlConf.getConf("spark.sql.my.vectorized.batch.size")

  override def createReader(partition: InputPartition): 
PartitionReader[InternalRow] = {
MyRowReader(batchSize, dataSchema, readSchema)
  }

  override def createColumnarReader(partition: InputPartition): 
PartitionReader[ColumnarBatch] = {
if(!supportColumnarReads(partition))
  throw new UnsupportedOperationException("Cannot create columnar reader.")

//TODO: delete this line
println(sqlConf.getAllConfs)

MyColumnReader(batchSize, dataSchema, readSchema)

  }

  override def supportColumnarReads(partition: InputPartition) = true

}
{code}






[jira] [Created] (SPARK-35251) Improve LiveEntityHelpers.newAccumulatorInfos performance

2021-04-27 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-35251:
---

 Summary: Improve LiveEntityHelpers.newAccumulatorInfos performance
 Key: SPARK-35251
 URL: https://issues.apache.org/jira/browse/SPARK-35251
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.2.0
Reporter: Yuming Wang


[It|https://github.com/apache/spark/blob/3854ad87c78f2a331f9c9c1a34f9ec281900f8fe/core/src/main/scala/org/apache/spark/status/LiveEntity.scala#L646-L661]
 will impact performance if there are a lot of {{AccumulableInfo}} instances.

{noformat}
 num #instances #bytes  class name
--
   1: 33189785050708371592  [C
   2:79535824591381896  [J
   3: 33083623410586759488  java.lang.String
   4: 139726297 8942483008  
org.apache.spark.sql.execution.metric.SQLMetric
   5: 249040156 5976963744  scala.Some
   6: 146278719 5851148760  org.apache.spark.util.AccumulatorMetadata
   7:  38448113 5536528272  java.net.URI
   8:  37540162 360382  org.apache.hadoop.fs.FileStatus
   9:  69724130 3346758240  java.util.Hashtable$Entry
  10:  61521559 2953034832  java.util.concurrent.ConcurrentHashMap$Node
  11:  50421974 2823630544  scala.collection.mutable.LinkedEntry
  12:  43349222 2774350208  org.apache.spark.scheduler.AccumulableInfo
{noformat}


{noformat}
--- 15430388364 ns (2.03%), 1543 samples
  [ 0] scala.collection.TraversableLike.noneIn$1
  [ 1] scala.collection.TraversableLike.filterImpl
  [ 2] scala.collection.TraversableLike.filterImpl$
  [ 3] scala.collection.AbstractTraversable.filterImpl
  [ 4] scala.collection.TraversableLike.filter
  [ 5] scala.collection.TraversableLike.filter$
  [ 6] scala.collection.AbstractTraversable.filter
  [ 7] org.apache.spark.status.LiveEntityHelpers$.newAccumulatorInfos
  [ 8] org.apache.spark.status.LiveTask.doUpdate
  [ 9] org.apache.spark.status.LiveEntity.write
{noformat}
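 
The hot path above is a plain Scala collection filter over very large Seq[AccumulableInfo] collections. As an illustration only (the actual LiveEntityHelpers code and its eventual fix may differ), this kind of overhead is usually reduced by doing the work in a single pass with a pre-sized builder instead of chaining collection combinators; the filtering conditions below are assumptions, not the real newAccumulatorInfos criteria.

{code:scala}
import org.apache.spark.scheduler.AccumulableInfo

// One pass, no intermediate collections; sizeHint avoids repeated growth of
// the builder when the input is very large.
def compactAccumulatorInfos(accums: Seq[AccumulableInfo]): Seq[AccumulableInfo] = {
  val out = Seq.newBuilder[AccumulableInfo]
  out.sizeHint(accums.size)
  accums.foreach { acc =>
    if (acc.name.isDefined && acc.update.isDefined) { // assumed criteria
      out += acc
    }
  }
  out.result()
}
{code}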










[jira] [Updated] (SPARK-35250) SQL DataFrameReader unescapedQuoteHandling parameter is misdocumented

2021-04-27 Thread Tim McNamara (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim McNamara updated SPARK-35250:
-
Summary: SQL DataFrameReader unescapedQuoteHandling parameter is 
misdocumented  (was: SQL DataFrameReader mode is misdocumented)

> SQL DataFrameReader unescapedQuoteHandling parameter is misdocumented
> -
>
> Key: SPARK-35250
> URL: https://issues.apache.org/jira/browse/SPARK-35250
> Project: Spark
>  Issue Type: Documentation
>  Components: docs, Documentation
>Affects Versions: 3.1.2
> Environment:  
>Reporter: Tim McNamara
>Priority: Major
>  Labels: GoodForNewContributors, easy-fix
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The unescapedQuoteHandling parameter of DataFrameReader isn't correctly 
> documented. STOP_AT_DELIMITER appears twice, and it looks like that's 
> overwritten an intended option, e.g. 
> [https://github.com/apache/spark/blob/1d550c4e90275ab418b9161925049239227f3dc9/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L744-L749]
> To view instances where this error occurs, this is a useful query 
> [https://github.com/apache/spark/search?q=STOP_AT_DELIMITER]
> It appears that this bug was introduced here: 
> [https://github.com/apache/spark/pull/30518|https://github.com/apache/spark/pull/30518,]
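 
A usage sketch of the option whose documentation is broken, since only the scaladoc is garbled and the option itself behaves normally. The file path is a placeholder; the listed values come from the univocity parser backing Spark's CSV source.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-quotes").master("local[*]").getOrCreate()

// Accepted values include STOP_AT_CLOSING_QUOTE, BACK_TO_DELIMITER,
// STOP_AT_DELIMITER, SKIP_VALUE and RAISE_ERROR.
val df = spark.read
  .option("header", "true")
  .option("unescapedQuoteHandling", "STOP_AT_CLOSING_QUOTE")
  .csv("/path/to/data.csv") // placeholder path
{code}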






[jira] [Updated] (SPARK-35248) Spark shall load system class first in IsolatedClientLoader

2021-04-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-35248:
-
Target Version/s:   (was: 3.1.2)

> Spark shall load system class first in IsolatedClientLoader
> ---
>
> Key: SPARK-35248
> URL: https://issues.apache.org/jira/browse/SPARK-35248
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7, 3.1.1
>Reporter: Daniel Dai
>Priority: Major
>
> This happens when Spark tries to load HMS client jars using 
> IsolatedClientLoader, in particular when 
> "spark.sql.hive.metastore.jars"="builtin" (the default). Spark tries to load a 
> conflicting thrift lib from the user jar, which results in an exception:
> {code:java}
>  21/04/22 04:10:53 ERROR [main] yarn.Client: Application diagnostics message: 
> User class threw exception: org.apache.spark.sql.AnalysisException: 
> java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:220)
>   at 
> org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114)
>   at 
> org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102)
>   at 
> org.apache.spark.sql.internal.SharedState.globalTempViewManager$lzycompute(SharedState.scala:141)
>   at 
> org.apache.spark.sql.internal.SharedState.globalTempViewManager(SharedState.scala:136)
>   at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$2.apply(HiveSessionStateBuilder.scala:55)
>   at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$2.apply(HiveSessionStateBuilder.scala:55)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager$lzycompute(SessionCatalog.scala:91)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager(SessionCatalog.scala:91)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.isTemporaryTable(SessionCatalog.scala:738)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.isRunningDirectlyOnFiles(Analyzer.scala:748)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.resolveRelation(Analyzer.scala:682)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:714)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:707)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:89)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsUp(AnalysisHelper.scala:86)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:329)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:87)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
>   at 
> 
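 
The "system class first" idea in the summary boils down to parent-first delegation for classes that must be shared with Spark itself. A minimal sketch of that delegation pattern only; it is not Spark's actual IsolatedClientLoader, which distinguishes several more class categories.

{code:scala}
import java.net.{URL, URLClassLoader}

// Classes matched by isSharedClass (for example the thrift classes Spark was
// started with) always come from the base loader, never from user jars.
class SystemFirstClassLoader(
    urls: Array[URL],
    baseLoader: ClassLoader,
    isSharedClass: String => Boolean)
  extends URLClassLoader(urls, null) {

  override def loadClass(name: String, resolve: Boolean): Class[_] = {
    if (isSharedClass(name)) baseLoader.loadClass(name)
    else super.loadClass(name, resolve)
  }
}
{code}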

[jira] [Resolved] (SPARK-35246) Streaming-batch intersects are incorrectly allowed through UnsupportedOperationsChecker

2021-04-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35246.
--
  Assignee: Jose Torres
Resolution: Fixed

Fixed in [https://github.com/apache/spark/pull/32371]

> Streaming-batch intersects are incorrectly allowed through 
> UnsupportedOperationsChecker
> ---
>
> Key: SPARK-35246
> URL: https://issues.apache.org/jira/browse/SPARK-35246
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Jose Torres
>Assignee: Jose Torres
>Priority: Major
> Fix For: 3.2.0
>
>
> The UnsupportedOperationChecker currently rejects streaming intersects only 
> if both sides are streaming, but they don't work if even one side is 
> streaming. The following simple test, for example, fails with a cryptic 
> None.get error because the state store can't plan itself properly.
> {code:java}
>   test("intersect") {
>     val input = MemoryStream[Long]
>     val df = input.toDS().intersect(spark.range(10).as[Long])
>     testStream(df) (
>       AddData(input, 1L),
>       CheckAnswer(1)
>     )
>   }
> {code}






[jira] [Resolved] (SPARK-35244) invoke should throw the original exception

2021-04-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35244.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

[https://github.com/apache/spark/pull/32370]

> invoke should throw the original exception
> --
>
> Key: SPARK-35244
> URL: https://issues.apache.org/jira/browse/SPARK-35244
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.2.0
>
>







[jira] [Updated] (SPARK-35250) SQL DataFrameReader mode is misdocumented

2021-04-27 Thread Tim McNamara (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim McNamara updated SPARK-35250:
-
Description: 
The unescapedQuoteHandling parameter of DataFrameReader isn't correctly 
documented. STOP_AT_DELIMITER appears twice, and it looks like that's 
overwritten an intended option, e.g. 
[https://github.com/apache/spark/blob/1d550c4e90275ab418b9161925049239227f3dc9/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L744-L749]

To view instances where this error occurs, this is a useful query 
https://github.com/apache/spark/search?q=STOP_AT_DELIMITER

 

It appears that this bug was introduced here: 
[https://github.com/apache/spark/pull/30518|https://github.com/apache/spark/pull/30518,]

  was:
The unescapedQuoteHandling parameter of DataFrameReader isn't correctly 
documented. STOP_AT_DELIMITER appears twice, and it looks like that's 
overwritten an intended option.

e.g.
 * 
[https://github.com/apache/spark/blob/1d550c4e90275ab418b9161925049239227f3dc9/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L744-L749]

 

It appears that this bug was introduced here: 
[https://github.com/apache/spark/pull/30518|https://github.com/apache/spark/pull/30518,]

Environment:    (was: 
 )

> SQL DataFrameReader mode is misdocumented
> -
>
> Key: SPARK-35250
> URL: https://issues.apache.org/jira/browse/SPARK-35250
> Project: Spark
>  Issue Type: Documentation
>  Components: docs, Documentation
>Affects Versions: 3.1.2
> Environment:  
>Reporter: Tim McNamara
>Priority: Major
>  Labels: GoodForNewContributors, easy-fix
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The unescapedQuoteHandling parameter of DataFrameReader isn't correctly 
> documented. STOP_AT_DELIMITER appears twice, and it looks like that's 
> overwritten an intended option, e.g. 
> [https://github.com/apache/spark/blob/1d550c4e90275ab418b9161925049239227f3dc9/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L744-L749]
> To view instances where this error occurs, this is a useful query 
> https://github.com/apache/spark/search?q=STOP_AT_DELIMITER
>  
> It appears that this bug was introduced here: 
> [https://github.com/apache/spark/pull/30518|https://github.com/apache/spark/pull/30518,]






[jira] [Updated] (SPARK-35250) SQL DataFrameReader mode is misdocumented

2021-04-27 Thread Tim McNamara (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim McNamara updated SPARK-35250:
-
Description: 
The unescapedQuoteHandling parameter of DataFrameReader isn't correctly 
documented. STOP_AT_DELIMITER appears twice, and it looks like that's 
overwritten an intended option, e.g. 
[https://github.com/apache/spark/blob/1d550c4e90275ab418b9161925049239227f3dc9/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L744-L749]

To view instances where this error occurs, this is a useful query 
[https://github.com/apache/spark/search?q=STOP_AT_DELIMITER]

It appears that this bug was introduced here: 
[https://github.com/apache/spark/pull/30518|https://github.com/apache/spark/pull/30518,]

  was:
The unescapedQuoteHandling parameter of DataFrameReader isn't correctly 
documented. STOP_AT_DELIMITER appears twice, and it looks like that's 
overwritten an intended option, e.g. 
[https://github.com/apache/spark/blob/1d550c4e90275ab418b9161925049239227f3dc9/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L744-L749]

To view instances where this error occurs, this is a useful query 
https://github.com/apache/spark/search?q=STOP_AT_DELIMITER

 

It appears that this bug was introduced here: 
[https://github.com/apache/spark/pull/30518|https://github.com/apache/spark/pull/30518,]


> SQL DataFrameReader mode is misdocumented
> -
>
> Key: SPARK-35250
> URL: https://issues.apache.org/jira/browse/SPARK-35250
> Project: Spark
>  Issue Type: Documentation
>  Components: docs, Documentation
>Affects Versions: 3.1.2
> Environment:  
>Reporter: Tim McNamara
>Priority: Major
>  Labels: GoodForNewContributors, easy-fix
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The unescapedQuoteHandling parameter of DataFrameReader isn't correctly 
> documented. STOP_AT_DELIMITER appears twice, and it looks like that's 
> overwritten an intended option, e.g. 
> [https://github.com/apache/spark/blob/1d550c4e90275ab418b9161925049239227f3dc9/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L744-L749]
> To view instances where this error occurs, this is a useful query 
> [https://github.com/apache/spark/search?q=STOP_AT_DELIMITER]
> It appears that this bug was introduced here: 
> [https://github.com/apache/spark/pull/30518|https://github.com/apache/spark/pull/30518,]






[jira] [Created] (SPARK-35250) SQL DataFrameReader mode is misdocumented

2021-04-27 Thread Tim McNamara (Jira)
Tim McNamara created SPARK-35250:


 Summary: SQL DataFrameReader mode is misdocumented
 Key: SPARK-35250
 URL: https://issues.apache.org/jira/browse/SPARK-35250
 Project: Spark
  Issue Type: Documentation
  Components: docs, Documentation
Affects Versions: 3.1.2
 Environment: 
 
Reporter: Tim McNamara


The unescapedQuoteHandling parameter of DataFrameReader isn't correctly 
documented. STOP_AT_DELIMITER appears twice, and it looks like that's 
overwritten an intended option.

e.g.
 * 
[https://github.com/apache/spark/blob/1d550c4e90275ab418b9161925049239227f3dc9/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L744-L749]

 

It appears that this bug was introduced here: 
[https://github.com/apache/spark/pull/30518|https://github.com/apache/spark/pull/30518,]






[jira] [Resolved] (SPARK-35236) Support archive files as resources for CREATE FUNCTION USING syntax

2021-04-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35236.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Fixed in [https://github.com/apache/spark/pull/32359]

> Support archive files as resources for CREATE FUNCTION USING syntax
> ---
>
> Key: SPARK-35236
> URL: https://issues.apache.org/jira/browse/SPARK-35236
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.2.0
>
>
> CREATE FUNCTION USING syntax doesn't support archives as resources.
> The reason is that archives were not supported in Spark SQL.
> Now that Spark SQL supports archives, I think we can support them for this syntax as well.
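 
A sketch of the syntax in question; the class name and paths are hypothetical, and an active SparkSession named `spark` is assumed.

{code:scala}
// ARCHIVE resources are what this ticket enables; JAR and FILE already work.
spark.sql(
  """CREATE FUNCTION my_udf AS 'com.example.MyUDF'
    |USING JAR 'hdfs:///libs/my-udf.jar',
    |      ARCHIVE 'hdfs:///libs/my-udf-deps.zip'""".stripMargin)
{code}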






[jira] [Commented] (SPARK-35243) Support columnar execution on ANSI interval types

2021-04-27 Thread PengLei (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334393#comment-17334393
 ] 

PengLei commented on SPARK-35243:
-

Working on this

> Support columnar execution on ANSI interval types
> -
>
> Key: SPARK-35243
> URL: https://issues.apache.org/jira/browse/SPARK-35243
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> See SPARK-30066 as a reference implementation for CalendarIntervalType.






[jira] [Created] (SPARK-35249) to_timestamp can't parse 6 digit microsecond SSSSSS

2021-04-27 Thread t oo (Jira)
t oo created SPARK-35249:


 Summary: to_timestamp can't parse 6 digit microsecond SSSSSS
 Key: SPARK-35249
 URL: https://issues.apache.org/jira/browse/SPARK-35249
 Project: Spark
  Issue Type: Wish
  Components: SQL
Affects Versions: 2.4.6
Reporter: t oo


spark-sql> select x, to_timestamp(x,"MMM dd  hh:mm:ss.SS") from (select 
'Apr 13 2021 12:00:00.001000AM' x);
Apr 13 2021 12:00:00.001000AM NULL

 

Why doesn't the to_timestamp work?
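 
An untested sketch of what could be tried on Spark 3.x, where the new datetime parser uses repeated 'S' for fractional seconds and 'a' for the AM/PM marker; the pattern below is an assumption, not the reporter's original query.

{code:scala}
// Assumes an active SparkSession named `spark` and Spark 3.x pattern letters.
spark.sql(
  """SELECT to_timestamp('Apr 13 2021 12:00:00.001000AM',
    |                    'MMM dd yyyy hh:mm:ss.SSSSSSa') AS ts""".stripMargin
).show(false)
{code}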

 






[jira] [Resolved] (SPARK-34979) Failed to install pyspark[sql] (due to pyarrow error) on aarch64

2021-04-27 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-34979.
--
Fix Version/s: 3.2.0
 Assignee: Yikun Jiang
   Resolution: Fixed

Fixed in [https://github.com/apache/spark/pull/32363]

> Failed to install pyspark[sql] (due to pyarrow error) on aarch64
> 
>
> Key: SPARK-34979
> URL: https://issues.apache.org/jira/browse/SPARK-34979
> Project: Spark
>  Issue Type: Task
>  Components: Documentation, PySpark
>Affects Versions: 3.2.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Trivial
> Fix For: 3.2.0
>
>
> ~^$ pip install pyspark[sql]^~
> ~^Collecting pyarrow>=1.0.0^~
>  ~^Using cached pyarrow-3.0.0.tar.gz (682 kB)^~
>  ~^Installing build dependencies ... done^~
>  ~^Getting requirements to build wheel ... done^~
>  ~^Preparing wheel metadata ... done^~
>  ~^// ... ...^~
>  ~^Building wheels for collected packages: pyarrow^~
>  ~^Building wheel for pyarrow (PEP 517) ... error^~
>  ~^ERROR: Command errored out with exit status 1:^~
>  ~^command: /root/venv/bin/python3.8 /tmp/tmpv35m1o0g build_wheel 
> /tmp/tmpq0n5juib^~
>  ~^cwd: /tmp/pip-install-sh0myu71/pyarrow^~
>  ~^Complete output (183 lines):^~
> ~^– Running cmake for pyarrow^~
>  ~^cmake -DPYTHON_EXECUTABLE=/root/venv/bin/python3.8 
> -DPython3_EXECUTABLE=/root/venv/bin/python3.8 -DPYARROW_BUILD_CUDA=off 
> -DPYARROW_BUILD_FLIGHT=off -DPYARROW_BUILD_GANDIVA=off 
> -DPYARROW_BUILD_DATASET=off -DPYARROW_BUILD_ORC=off 
> -DPYARROW_BUILD_PARQUET=off -DPYARROW_BUILD_PLASMA=off -DPYARROW_BUILD_S3=off 
> -DPYARROW_BUILD_HDFS=off -DPYARROW_USE_TENSORFLOW=off 
> -DPYARROW_BUNDLE_ARROW_CPP=off -DPYARROW_BUNDLE_BOOST=off 
> -DPYARROW_GENERATE_COVERAGE=off -DPYARROW_BOOST_USE_SHARED=on 
> -DPYARROW_PARQUET_USE_SHARED=on -DCMAKE_BUILD_TYPE=release 
> /tmp/pip-install-sh0myu71/pyarrow^~
>  ~^error: command 'cmake' failed with exit status 1^~
>  ~^^~
>  ~^ERROR: Failed building wheel for pyarrow^~
>  ~^Failed to build pyarrow^~
>  ~^ERROR: Could not build wheels for pyarrow which use PEP 517 and cannot be 
> installed directly^~
>  
> The pip installation fails because the pyarrow dependency fails to install.
>  
>  






[jira] [Commented] (SPARK-35248) Spark shall load system class first in IsolatedClientLoader

2021-04-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333555#comment-17333555
 ] 

Apache Spark commented on SPARK-35248:
--

User 'daijyc' has created a pull request for this issue:
https://github.com/apache/spark/pull/32373

> Spark shall load system class first in IsolatedClientLoader
> ---
>
> Key: SPARK-35248
> URL: https://issues.apache.org/jira/browse/SPARK-35248
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7, 3.1.1
>Reporter: Daniel Dai
>Priority: Major
>
> This happens when Spark tries to load HMS client jars using 
> IsolatedClientLoader, in particular when 
> "spark.sql.hive.metastore.jars"="builtin" (the default). Spark tries to load a 
> conflicting thrift lib from the user jar, which results in an exception:
> {code:java}
>  21/04/22 04:10:53 ERROR [main] yarn.Client: Application diagnostics message: 
> User class threw exception: org.apache.spark.sql.AnalysisException: 
> java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:220)
>   at 
> org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114)
>   at 
> org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102)
>   at 
> org.apache.spark.sql.internal.SharedState.globalTempViewManager$lzycompute(SharedState.scala:141)
>   at 
> org.apache.spark.sql.internal.SharedState.globalTempViewManager(SharedState.scala:136)
>   at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$2.apply(HiveSessionStateBuilder.scala:55)
>   at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$2.apply(HiveSessionStateBuilder.scala:55)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager$lzycompute(SessionCatalog.scala:91)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager(SessionCatalog.scala:91)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.isTemporaryTable(SessionCatalog.scala:738)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.isRunningDirectlyOnFiles(Analyzer.scala:748)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.resolveRelation(Analyzer.scala:682)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:714)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:707)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:89)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsUp(AnalysisHelper.scala:86)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:329)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:87)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86)
>   at 
> 

[jira] [Commented] (SPARK-35248) Spark shall load system class first in IsolatedClientLoader

2021-04-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333554#comment-17333554
 ] 

Apache Spark commented on SPARK-35248:
--

User 'daijyc' has created a pull request for this issue:
https://github.com/apache/spark/pull/32373

> Spark shall load system class first in IsolatedClientLoader
> ---
>
> Key: SPARK-35248
> URL: https://issues.apache.org/jira/browse/SPARK-35248
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7, 3.1.1
>Reporter: Daniel Dai
>Priority: Major
>
> This happens when Spark tries to load HMS client jars using 
> IsolatedClientLoader, in particular when 
> "spark.sql.hive.metastore.jars"="builtin" (the default). Spark tries to load a 
> conflicting thrift lib from the user jar, which results in an exception:
> {code:java}
>  21/04/22 04:10:53 ERROR [main] yarn.Client: Application diagnostics message: 
> User class threw exception: org.apache.spark.sql.AnalysisException: 
> java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:220)
>   at 
> org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114)
>   at 
> org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102)
>   at 
> org.apache.spark.sql.internal.SharedState.globalTempViewManager$lzycompute(SharedState.scala:141)
>   at 
> org.apache.spark.sql.internal.SharedState.globalTempViewManager(SharedState.scala:136)
>   at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$2.apply(HiveSessionStateBuilder.scala:55)
>   at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$2.apply(HiveSessionStateBuilder.scala:55)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager$lzycompute(SessionCatalog.scala:91)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager(SessionCatalog.scala:91)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.isTemporaryTable(SessionCatalog.scala:738)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.isRunningDirectlyOnFiles(Analyzer.scala:748)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.resolveRelation(Analyzer.scala:682)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:714)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:707)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:89)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsUp(AnalysisHelper.scala:86)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:329)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:87)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86)
>   at 
> 

[jira] [Assigned] (SPARK-35248) Spark shall load system class first in IsolatedClientLoader

2021-04-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35248:


Assignee: (was: Apache Spark)

> Spark shall load system class first in IsolatedClientLoader
> ---
>
> Key: SPARK-35248
> URL: https://issues.apache.org/jira/browse/SPARK-35248
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7, 3.1.1
>Reporter: Daniel Dai
>Priority: Major
>
> This happens when Spark tries to load HMS client jars using 
> IsolatedClientLoader, in particular when 
> "spark.sql.hive.metastore.jars"="builtin" (the default). Spark tries to load a 
> conflicting thrift lib from the user jar, which results in an exception:
> {code:java}
>  21/04/22 04:10:53 ERROR [main] yarn.Client: Application diagnostics message: 
> User class threw exception: org.apache.spark.sql.AnalysisException: 
> java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:220)
>   at 
> org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114)
>   at 
> org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102)
>   at 
> org.apache.spark.sql.internal.SharedState.globalTempViewManager$lzycompute(SharedState.scala:141)
>   at 
> org.apache.spark.sql.internal.SharedState.globalTempViewManager(SharedState.scala:136)
>   at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$2.apply(HiveSessionStateBuilder.scala:55)
>   at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$2.apply(HiveSessionStateBuilder.scala:55)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager$lzycompute(SessionCatalog.scala:91)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager(SessionCatalog.scala:91)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.isTemporaryTable(SessionCatalog.scala:738)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.isRunningDirectlyOnFiles(Analyzer.scala:748)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.resolveRelation(Analyzer.scala:682)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:714)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:707)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:89)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsUp(AnalysisHelper.scala:86)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:329)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:87)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
>   at 
> 

[jira] [Assigned] (SPARK-35248) Spark shall load system class first in IsolatedClientLoader

2021-04-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35248:


Assignee: Apache Spark

> Spark shall load system class first in IsolatedClientLoader
> ---
>
> Key: SPARK-35248
> URL: https://issues.apache.org/jira/browse/SPARK-35248
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7, 3.1.1
>Reporter: Daniel Dai
>Assignee: Apache Spark
>Priority: Major
>
> This happens when Spark tries to load HMS client jars using 
> IsolatedClientLoader, in particular when 
> "spark.sql.hive.metastore.jars"="builtin" (the default). Spark tries to load a 
> conflicting thrift lib from the user jar, which results in an exception:
> {code:java}
>  21/04/22 04:10:53 ERROR [main] yarn.Client: Application diagnostics message: 
> User class threw exception: org.apache.spark.sql.AnalysisException: 
> java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:220)
>   at 
> org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114)
>   at 
> org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102)
>   at 
> org.apache.spark.sql.internal.SharedState.globalTempViewManager$lzycompute(SharedState.scala:141)
>   at 
> org.apache.spark.sql.internal.SharedState.globalTempViewManager(SharedState.scala:136)
>   at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$2.apply(HiveSessionStateBuilder.scala:55)
>   at 
> org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$2.apply(HiveSessionStateBuilder.scala:55)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager$lzycompute(SessionCatalog.scala:91)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager(SessionCatalog.scala:91)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.isTemporaryTable(SessionCatalog.scala:738)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.isRunningDirectlyOnFiles(Analyzer.scala:748)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.resolveRelation(Analyzer.scala:682)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:714)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:707)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:89)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsUp(AnalysisHelper.scala:86)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:329)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:87)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
>   at 
> 

[jira] [Created] (SPARK-35248) Spark shall load system class first in IsolatedClientLoader

2021-04-27 Thread Daniel Dai (Jira)
Daniel Dai created SPARK-35248:
--

 Summary: Spark shall load system class first in 
IsolatedClientLoader
 Key: SPARK-35248
 URL: https://issues.apache.org/jira/browse/SPARK-35248
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.1, 2.4.7
Reporter: Daniel Dai


This happens when Spark tries to load HMS client jars using IsolatedClientLoader, 
in particular when "spark.sql.hive.metastore.jars"="builtin" (the default). Spark 
tries to load a conflicting thrift lib from the user jar, which results in an exception:
{code:java}
 21/04/22 04:10:53 ERROR [main] yarn.Client: Application diagnostics message: 
User class threw exception: org.apache.spark.sql.AnalysisException: 
java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:220)
at 
org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114)
at 
org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102)
at 
org.apache.spark.sql.internal.SharedState.globalTempViewManager$lzycompute(SharedState.scala:141)
at 
org.apache.spark.sql.internal.SharedState.globalTempViewManager(SharedState.scala:136)
at 
org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$2.apply(HiveSessionStateBuilder.scala:55)
at 
org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$2.apply(HiveSessionStateBuilder.scala:55)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager$lzycompute(SessionCatalog.scala:91)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager(SessionCatalog.scala:91)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.isTemporaryTable(SessionCatalog.scala:738)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.isRunningDirectlyOnFiles(Analyzer.scala:748)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.resolveRelation(Analyzer.scala:682)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:714)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:707)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$apply$1.apply(AnalysisHelper.scala:90)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:89)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsUp(AnalysisHelper.scala:86)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:329)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:87)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1.apply(AnalysisHelper.scala:86)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsUp(AnalysisHelper.scala:86)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsUp(LogicalPlan.scala:29)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsUp$1$$anonfun$1.apply(AnalysisHelper.scala:87)
{code}

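A possible workaround, if the conflicting thrift classes do come from the application jar, is to stop using the built-in Hive client jars and let Spark load the metastore client from an isolated classpath instead. A minimal sketch (the metastore version is an assumption and must match the actual HMS deployment):

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch only: load the Hive metastore client from isolated jars rather than
// "builtin", so a thrift jar inside the user jar cannot shadow the HMS client.
val spark = SparkSession.builder()
  .appName("isolated-hms-client")
  .enableHiveSupport()
  // assumed metastore version; set it to whatever the cluster actually runs
  .config("spark.sql.hive.metastore.version", "2.3.7")
  // "maven" downloads matching Hive jars; a local classpath also works
  .config("spark.sql.hive.metastore.jars", "maven")
  .getOrCreate()

spark.sql("SHOW DATABASES").show()
{code}
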
[jira] [Commented] (SPARK-35247) Add `mask` built-in function as a data masking function

2021-04-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333548#comment-17333548
 ] 

Apache Spark commented on SPARK-35247:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/32372

> Add `mask` built-in function as a data masking function
> ---
>
> Key: SPARK-35247
> URL: https://issues.apache.org/jira/browse/SPARK-35247
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> In the current master, there is no built-in data masking function.
> Hive already has such functions, so it would be great to have one in Spark.
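For reference, Hive's mask() replaces upper-case letters with 'X', lower-case letters with 'x' and digits with 'n'; the proposed Spark built-in would presumably follow the same convention. A sketch of that behaviour expressed with functions that exist today (the final name and signature of the Spark function are not decided by this ticket's description):

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch only: emulates Hive-style masking with regexp_replace until a
// built-in `mask` function lands.
val spark = SparkSession.builder().master("local[*]").appName("mask-sketch").getOrCreate()

spark.sql(
  "SELECT regexp_replace(regexp_replace(regexp_replace(" +
  "'AbCd-1234', '[A-Z]', 'X'), '[a-z]', 'x'), '[0-9]', 'n') AS masked"
).show()
// +---------+
// |   masked|
// +---------+
// |XxXx-nnnn|
// +---------+
{code}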



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35247) Add `mask` built-in function as a data masking function

2021-04-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35247:


Assignee: Kousuke Saruta  (was: Apache Spark)

> Add `mask` built-in function as a data masking function
> ---
>
> Key: SPARK-35247
> URL: https://issues.apache.org/jira/browse/SPARK-35247
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> In the current master, there is no built-in data masking function.
> Hive already has such functions, so it would be great to have one in Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35247) Add `mask` built-in function as a data masking function

2021-04-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35247:


Assignee: Apache Spark  (was: Kousuke Saruta)

> Add `mask` built-in function as a data masking function
> ---
>
> Key: SPARK-35247
> URL: https://issues.apache.org/jira/browse/SPARK-35247
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Major
>
> In the current master, there is no built-in data masking function.
> Hive already has such functions, so it would be great to have one in Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35247) Add `mask` built-in function as a data masking function

2021-04-27 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-35247:
--

 Summary: Add `mask` built-in function as a data masking function
 Key: SPARK-35247
 URL: https://issues.apache.org/jira/browse/SPARK-35247
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


In the current master, there is no built-in data masking function.
Hive already has such functions, so it would be great to have one in Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35247) Add `mask` built-in function as a data masking function

2021-04-27 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-35247:
---
Issue Type: New Feature  (was: Bug)

> Add `mask` built-in function as a data masking function
> ---
>
> Key: SPARK-35247
> URL: https://issues.apache.org/jira/browse/SPARK-35247
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> In the current master, there is no built-in data masking function.
> Hive already has such functions, so it would be great to have one in Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35150) Accelerate fallback BLAS with dev.ludovic.netlib

2021-04-27 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-35150:


Assignee: Ludovic Henry

> Accelerate fallback BLAS with dev.ludovic.netlib
> 
>
> Key: SPARK-35150
> URL: https://issues.apache.org/jira/browse/SPARK-35150
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX, ML, MLlib
>Affects Versions: 3.2.0
>Reporter: Ludovic Henry
>Assignee: Ludovic Henry
>Priority: Major
>
> Following https://github.com/apache/spark/pull/30810, I've continued looking 
> for ways to accelerate the usage of BLAS in Spark. With this PR, I integrate 
> work done in the [{{dev.ludovic.netlib}}|https://github.com/luhenry/netlib/] 
> Maven package.
> The {{dev.ludovic.netlib}} library wraps the original 
> {{com.github.fommil.netlib}} library and focuses on accelerating the linear 
> algebra routines in use in Spark. When running the 
> {{org.apache.spark.ml.linalg.BLASBenchmark}} benchmarking suite, I get the 
> results at [1] on an Intel machine. Moreover, this library is thoroughly 
> tested to return the exact same results as the reference implementation.
> Under the hood, it reimplements the necessary algorithms in pure 
> autovectorization-friendly Java 8, as well as takes advantage of the Vector 
> API and Foreign Linker API introduced in JDK 16 when available.
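For context, the routines in question are ordinary dense linear algebra operations on {{org.apache.spark.ml.linalg}} types; a small sketch of a matrix-vector multiply of the kind that is dispatched through the (possibly accelerated) BLAS backend (the sizes and values here are arbitrary):

{code:scala}
import org.apache.spark.ml.linalg.{DenseMatrix, Vectors}

// Sketch only: a dense gemv-style multiply; with a netlib-backed BLAS this is
// the kind of call that benefits from the accelerated implementation.
val m = new DenseMatrix(2, 3, Array(1.0, 4.0, 2.0, 5.0, 3.0, 6.0)) // column-major [[1,2,3],[4,5,6]]
val v = Vectors.dense(1.0, 1.0, 1.0).toDense
println(m.multiply(v)) // [6.0,15.0]
{code}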



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35150) Accelerate fallback BLAS with dev.ludovic.netlib

2021-04-27 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-35150.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32253
[https://github.com/apache/spark/pull/32253]

> Accelerate fallback BLAS with dev.ludovic.netlib
> 
>
> Key: SPARK-35150
> URL: https://issues.apache.org/jira/browse/SPARK-35150
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX, ML, MLlib
>Affects Versions: 3.2.0
>Reporter: Ludovic Henry
>Assignee: Ludovic Henry
>Priority: Major
> Fix For: 3.2.0
>
>
> Following https://github.com/apache/spark/pull/30810, I've continued looking 
> for ways to accelerate the usage of BLAS in Spark. With this PR, I integrate 
> work done in the [{{dev.ludovic.netlib}}|https://github.com/luhenry/netlib/] 
> Maven package.
> The {{dev.ludovic.netlib}} library wraps the original 
> {{com.github.fommil.netlib}} library and focuses on accelerating the linear 
> algebra routines in use in Spark. When running the 
> {{org.apache.spark.ml.linalg.BLASBenchmark}} benchmarking suite, I get the 
> results at [1] on an Intel machine. Moreover, this library is thoroughly 
> tested to return the exact same results as the reference implementation.
> Under the hood, it reimplements the necessary algorithms in pure 
> autovectorization-friendly Java 8, as well as takes advantage of the Vector 
> API and Foreign Linker API introduced in JDK 16 when available.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34979) Failed to install pyspark[sql] (due to pyarrow error) on aarch64

2021-04-27 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-34979:
-
Component/s: Documentation

> Failed to install pyspark[sql] (due to pyarrow error) on aarch64
> 
>
> Key: SPARK-34979
> URL: https://issues.apache.org/jira/browse/SPARK-34979
> Project: Spark
>  Issue Type: Task
>  Components: Documentation, PySpark
>Affects Versions: 3.2.0
>Reporter: Yikun Jiang
>Priority: Trivial
>
> ~^$ pip install pyspark[sql]^~
> ~^Collecting pyarrow>=1.0.0^~
>  ~^Using cached pyarrow-3.0.0.tar.gz (682 kB)^~
>  ~^Installing build dependencies ... done^~
>  ~^Getting requirements to build wheel ... done^~
>  ~^Preparing wheel metadata ... done^~
>  ~^// ... ...^~
>  ~^Building wheels for collected packages: pyarrow^~
>  ~^Building wheel for pyarrow (PEP 517) ... error^~
>  ~^ERROR: Command errored out with exit status 1:^~
>  ~^command: /root/venv/bin/python3.8 /tmp/tmpv35m1o0g build_wheel 
> /tmp/tmpq0n5juib^~
>  ~^cwd: /tmp/pip-install-sh0myu71/pyarrow^~
>  ~^Complete output (183 lines):^~
> ~^– Running cmake for pyarrow^~
>  ~^cmake -DPYTHON_EXECUTABLE=/root/venv/bin/python3.8 
> -DPython3_EXECUTABLE=/root/venv/bin/python3.8 -DPYARROW_BUILD_CUDA=off 
> -DPYARROW_BUILD_FLIGHT=off -DPYARROW_BUILD_GANDIVA=off 
> -DPYARROW_BUILD_DATASET=off -DPYARROW_BUILD_ORC=off 
> -DPYARROW_BUILD_PARQUET=off -DPYARROW_BUILD_PLASMA=off -DPYARROW_BUILD_S3=off 
> -DPYARROW_BUILD_HDFS=off -DPYARROW_USE_TENSORFLOW=off 
> -DPYARROW_BUNDLE_ARROW_CPP=off -DPYARROW_BUNDLE_BOOST=off 
> -DPYARROW_GENERATE_COVERAGE=off -DPYARROW_BOOST_USE_SHARED=on 
> -DPYARROW_PARQUET_USE_SHARED=on -DCMAKE_BUILD_TYPE=release 
> /tmp/pip-install-sh0myu71/pyarrow^~
>  ~^error: command 'cmake' failed with exit status 1^~
>  ~^^~
>  ~^ERROR: Failed building wheel for pyarrow^~
>  ~^Failed to build pyarrow^~
>  ~^ERROR: Could not build wheels for pyarrow which use PEP 517 and cannot be 
> installed directly^~
>  
> The pip installation fails because the pyarrow dependency fails to build.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34979) Failed to install pyspark[sql] (due to pyarrow error) on aarch64

2021-04-27 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-34979:
-
Priority: Trivial  (was: Major)

> Failed to install pyspark[sql] (due to pyarrow error) on aarch64
> 
>
> Key: SPARK-34979
> URL: https://issues.apache.org/jira/browse/SPARK-34979
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Yikun Jiang
>Priority: Trivial
>
> ~^$ pip install pyspark[sql]^~
> ~^Collecting pyarrow>=1.0.0^~
>  ~^Using cached pyarrow-3.0.0.tar.gz (682 kB)^~
>  ~^Installing build dependencies ... done^~
>  ~^Getting requirements to build wheel ... done^~
>  ~^Preparing wheel metadata ... done^~
>  ~^// ... ...^~
>  ~^Building wheels for collected packages: pyarrow^~
>  ~^Building wheel for pyarrow (PEP 517) ... error^~
>  ~^ERROR: Command errored out with exit status 1:^~
>  ~^command: /root/venv/bin/python3.8 /tmp/tmpv35m1o0g build_wheel 
> /tmp/tmpq0n5juib^~
>  ~^cwd: /tmp/pip-install-sh0myu71/pyarrow^~
>  ~^Complete output (183 lines):^~
> ~^– Running cmake for pyarrow^~
>  ~^cmake -DPYTHON_EXECUTABLE=/root/venv/bin/python3.8 
> -DPython3_EXECUTABLE=/root/venv/bin/python3.8 -DPYARROW_BUILD_CUDA=off 
> -DPYARROW_BUILD_FLIGHT=off -DPYARROW_BUILD_GANDIVA=off 
> -DPYARROW_BUILD_DATASET=off -DPYARROW_BUILD_ORC=off 
> -DPYARROW_BUILD_PARQUET=off -DPYARROW_BUILD_PLASMA=off -DPYARROW_BUILD_S3=off 
> -DPYARROW_BUILD_HDFS=off -DPYARROW_USE_TENSORFLOW=off 
> -DPYARROW_BUNDLE_ARROW_CPP=off -DPYARROW_BUNDLE_BOOST=off 
> -DPYARROW_GENERATE_COVERAGE=off -DPYARROW_BOOST_USE_SHARED=on 
> -DPYARROW_PARQUET_USE_SHARED=on -DCMAKE_BUILD_TYPE=release 
> /tmp/pip-install-sh0myu71/pyarrow^~
>  ~^error: command 'cmake' failed with exit status 1^~
>  ~^^~
>  ~^ERROR: Failed building wheel for pyarrow^~
>  ~^Failed to build pyarrow^~
>  ~^ERROR: Could not build wheels for pyarrow which use PEP 517 and cannot be 
> installed directly^~
>  
> The pip installation fails because the pyarrow dependency fails to build.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35246) Streaming-batch intersects are incorrectly allowed through UnsupportedOperationsChecker

2021-04-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35246:


Assignee: (was: Apache Spark)

> Streaming-batch intersects are incorrectly allowed through 
> UnsupportedOperationsChecker
> ---
>
> Key: SPARK-35246
> URL: https://issues.apache.org/jira/browse/SPARK-35246
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Jose Torres
>Priority: Major
> Fix For: 3.2.0
>
>
> The UnsupportedOperationChecker currently rejects streaming intersects only 
> if both sides are streaming, but they don't work if even one side is 
> streaming. The following simple test, for example, fails with a cryptic 
> None.get error because the state store can't plan itself properly.
> {code:java}
>   test("intersect") {
>     val input = MemoryStream[Long]
>     val df = input.toDS().intersect(spark.range(10).as[Long])
>     testStream(df) (
>       AddData(input, 1L),
>       CheckAnswer(1)
>     )
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35246) Streaming-batch intersects are incorrectly allowed through UnsupportedOperationsChecker

2021-04-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35246:


Assignee: Apache Spark

> Streaming-batch intersects are incorrectly allowed through 
> UnsupportedOperationsChecker
> ---
>
> Key: SPARK-35246
> URL: https://issues.apache.org/jira/browse/SPARK-35246
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Jose Torres
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.2.0
>
>
> The UnsupportedOperationChecker currently rejects streaming intersects only 
> if both sides are streaming, but they don't work if even one side is 
> streaming. The following simple test, for example, fails with a cryptic 
> None.get error because the state store can't plan itself properly.
> {code:java}
>   test("intersect") {
>     val input = MemoryStream[Long]
>     val df = input.toDS().intersect(spark.range(10).as[Long])
>     testStream(df) (
>       AddData(input, 1L),
>       CheckAnswer(1)
>     )
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35246) Streaming-batch intersects are incorrectly allowed through UnsupportedOperationsChecker

2021-04-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333404#comment-17333404
 ] 

Apache Spark commented on SPARK-35246:
--

User 'jose-torres' has created a pull request for this issue:
https://github.com/apache/spark/pull/32371

> Streaming-batch intersects are incorrectly allowed through 
> UnsupportedOperationsChecker
> ---
>
> Key: SPARK-35246
> URL: https://issues.apache.org/jira/browse/SPARK-35246
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Jose Torres
>Priority: Major
> Fix For: 3.2.0
>
>
> The UnsupportedOperationChecker currently rejects streaming intersects only 
> if both sides are streaming, but they don't work if even one side is 
> streaming. The following simple test, for example, fails with a cryptic 
> None.get error because the state store can't plan itself properly.
> {code:java}
>   test("intersect") {
>     val input = MemoryStream[Long]
>     val df = input.toDS().intersect(spark.range(10).as[Long])
>     testStream(df) (
>       AddData(input, 1L),
>       CheckAnswer(1)
>     )
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35246) Streaming-batch intersects are incorrectly allowed through UnsupportedOperationsChecker

2021-04-27 Thread Jose Torres (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jose Torres updated SPARK-35246:

Summary: Streaming-batch intersects are incorrectly allowed through 
UnsupportedOperationsChecker  (was: Disable intersects for all streaming 
queries)

> Streaming-batch intersects are incorrectly allowed through 
> UnsupportedOperationsChecker
> ---
>
> Key: SPARK-35246
> URL: https://issues.apache.org/jira/browse/SPARK-35246
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Jose Torres
>Priority: Major
> Fix For: 3.2.0
>
>
> The UnsupportedOperationChecker currently rejects streaming intersects only 
> if both sides are streaming, but they don't work if even one side is 
> streaming. The following simple test, for example, fails with a cryptic 
> None.get error because the state store can't plan itself properly.
> {code:java}
>   test("intersect") {
>     val input = MemoryStream[Long]
>     val df = input.toDS().intersect(spark.range(10).as[Long])
>     testStream(df) (
>       AddData(input, 1L),
>       CheckAnswer(1)
>     )
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35246) Disable intersects for all streaming queries

2021-04-27 Thread Jose Torres (Jira)
Jose Torres created SPARK-35246:
---

 Summary: Disable intersects for all streaming queries
 Key: SPARK-35246
 URL: https://issues.apache.org/jira/browse/SPARK-35246
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.1.0, 3.0.0
Reporter: Jose Torres
 Fix For: 3.2.0


The UnsupportedOperationChecker currently rejects streaming intersects only if 
both sides are streaming, but they don't work if even one side is streaming. 
The following simple test, for example, fails with a cryptic None.get error 
because the state store can't plan itself properly.
{code:java}
  test("intersect") {
    val input = MemoryStream[Long]
    val df = input.toDS().intersect(spark.range(10).as[Long])
    testStream(df) (
      AddData(input, 1L),
      CheckAnswer(1)
    )
  }
{code}
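Until the checker rejects this, a rewrite of the same query that Structured Streaming does support is a stream-static left semi join. A sketch, assuming a stream-static left semi join is available in the Spark version in use (the rate source here just stands in for any streaming input with a {{value}} column):

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch only: replaces the unsupported streaming-batch intersect with a
// stream-static left semi join on the value column.
val spark = SparkSession.builder().master("local[*]").appName("semi-join-sketch").getOrCreate()
import spark.implicits._

val streaming = spark.readStream.format("rate").load().select($"value") // streaming side
val static    = spark.range(10).toDF("value")                           // batch side

val filtered = streaming.join(static, Seq("value"), "left_semi")

filtered.writeStream.format("console").start().awaitTermination()
{code}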



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35245) DynamicFilter pushdown not working

2021-04-27 Thread jean-claude (Jira)
jean-claude created SPARK-35245:
---

 Summary: DynamicFilter pushdown not working
 Key: SPARK-35245
 URL: https://issues.apache.org/jira/browse/SPARK-35245
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.1.1
Reporter: jean-claude


 

The pushed filters list is always empty: `PushedFilters: []`

I was expecting the filters to be pushed down on the probe side of the join.

I am not sure how to configure this properly. For example, how should 
fallbackFilterRatio be set?

{code:python}
spark = (SparkSession.builder
 .master('local')
 .config("spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio", 100)
 .getOrCreate()
 )

df = spark.read.parquet('abfss://warehouse@/iceberg/opensource//data/timeperiod=2021-04-25/0-0-929b48ef-7ec3-47bd-b0a1-e9172c2dca6a-1.parquet')
df.createOrReplaceTempView('TMP')

spark.sql('''
explain cost
select
   timeperiod, rrname
from
   TMP
where
   timeperiod in (
 select
   TO_DATE(d, 'MM-dd-') AS timeperiod
 from
 values
   ('01-01-2021'),
   ('01-01-2021'),
   ('01-01-2021') tbl(d)
 )
group by
   timeperiod, rrname
''').show(truncate=False)
{code}

{code:java}
== Optimized Logical Plan ==
Aggregate [timeperiod#597, rrname#594], [timeperiod#597, rrname#594], Statistics(sizeInBytes=69.0 MiB)
+- Join LeftSemi, (timeperiod#597 = timeperiod#669), Statistics(sizeInBytes=69.0 MiB)
 :- Project [rrname#594, timeperiod#597], Statistics(sizeInBytes=69.0 MiB)
 : +- Relation[count#591,time_first#592L,time_last#593L,rrname#594,rrtype#595,rdata#596,timeperiod#597] parquet, Statistics(sizeInBytes=198.5 MiB)
 +- LocalRelation [timeperiod#669], Statistics(sizeInBytes=36.0 B)

== Physical Plan ==
*(2) HashAggregate(keys=[timeperiod#597, rrname#594], functions=[], output=[timeperiod#597, rrname#594])
+- Exchange hashpartitioning(timeperiod#597, rrname#594, 200), ENSURE_REQUIREMENTS, [id=#839]
 +- *(1) HashAggregate(keys=[timeperiod#597, rrname#594], functions=[], output=[timeperiod#597, rrname#594])
 +- *(1) BroadcastHashJoin [timeperiod#597], [timeperiod#669], LeftSemi, BuildRight, false
 :- *(1) ColumnarToRow
 : +- FileScan parquet [rrname#594,timeperiod#597] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[abfss://warehouse@/iceberg/opensource/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct
 +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, date, true]),false), [id=#822]
 +- LocalTableScan [timeperiod#669]
{code}
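For reference, dynamic partition pruning is controlled by a handful of SQL configs, and it only kicks in when the probe side is scanned as a partitioned table on the join key and the build side carries a selective filter. Reading a single data file from inside a partition directory, as above, is not treated as a partitioned scan, which may explain the empty PartitionFilters. A sketch of the relevant knobs (the values are illustrative, not recommendations):

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch only: the configs that govern dynamic partition pruning.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("dpp-config-sketch")
  .config("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
  // allow pruning even when the filter cannot reuse a broadcast exchange
  .config("spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly", "false")
  // heuristic ratio applied when column statistics are missing
  .config("spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio", "0.5")
  .getOrCreate()

spark.sql("SET spark.sql.optimizer.dynamicPartitionPruning.enabled").show(false)
{code}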



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35244) invoke should throw the original exception

2021-04-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35244:


Assignee: Wenchen Fan  (was: Apache Spark)

> invoke should throw the original exception
> --
>
> Key: SPARK-35244
> URL: https://issues.apache.org/jira/browse/SPARK-35244
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35244) invoke should throw the original exception

2021-04-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35244:


Assignee: Apache Spark  (was: Wenchen Fan)

> invoke should throw the original exception
> --
>
> Key: SPARK-35244
> URL: https://issues.apache.org/jira/browse/SPARK-35244
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35244) invoke should throw the original exception

2021-04-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35244:


Assignee: Apache Spark  (was: Wenchen Fan)

> invoke should throw the original exception
> --
>
> Key: SPARK-35244
> URL: https://issues.apache.org/jira/browse/SPARK-35244
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35244) invoke should throw the original exception

2021-04-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1756#comment-1756
 ] 

Apache Spark commented on SPARK-35244:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/32370

> invoke should throw the original exception
> --
>
> Key: SPARK-35244
> URL: https://issues.apache.org/jira/browse/SPARK-35244
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35244) invoke should throw the original exception

2021-04-27 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-35244:
---

 Summary: invoke should throw the original exception
 Key: SPARK-35244
 URL: https://issues.apache.org/jira/browse/SPARK-35244
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.1, 3.0.2, 3.2.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35238) Add JindoFS SDK in cloud integration documents

2021-04-27 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-35238.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32360
[https://github.com/apache/spark/pull/32360]

> Add JindoFS SDK in cloud integration documents
> --
>
> Key: SPARK-35238
> URL: https://issues.apache.org/jira/browse/SPARK-35238
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.4, 2.4.7, 3.0.2, 3.1.1
>Reporter: Adrian Wang
>Assignee: Adrian Wang
>Priority: Minor
> Fix For: 3.2.0
>
>
> As an important cloud provider, Alibaba Cloud provides the JindoFS SDK to 
> maximize performance for workloads interacting with Alibaba Cloud OSS.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35238) Add JindoFS SDK in cloud integration documents

2021-04-27 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-35238:


Assignee: Adrian Wang

> Add JindoFS SDK in cloud integration documents
> --
>
> Key: SPARK-35238
> URL: https://issues.apache.org/jira/browse/SPARK-35238
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.4, 2.4.7, 3.0.2, 3.1.1
>Reporter: Adrian Wang
>Assignee: Adrian Wang
>Priority: Minor
>
> As an important cloud provider, Alibaba Cloud provides the JindoFS SDK to 
> maximize performance for workloads interacting with Alibaba Cloud OSS.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35243) Support columnar execution on ANSI interval types

2021-04-27 Thread Max Gekk (Jira)
Max Gekk created SPARK-35243:


 Summary: Support columnar execution on ANSI interval types
 Key: SPARK-35243
 URL: https://issues.apache.org/jira/browse/SPARK-35243
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Max Gekk


See SPARK-30066 as a reference implementation for CalendarIntervalType.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35176) Raise TypeError in inappropriate type case rather than ValueError

2021-04-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333246#comment-17333246
 ] 

Apache Spark commented on SPARK-35176:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/32368

>  Raise TypeError in inappropriate type case rather than ValueError
> --
>
> Key: SPARK-35176
> URL: https://issues.apache.org/jira/browse/SPARK-35176
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Yikun Jiang
>Priority: Minor
>
> There are many places where ValueError is used as the wrong error type.
> When an operation or function is applied to an object of inappropriate type, 
> we should use TypeError rather than ValueError.
> such as:
> [https://github.com/apache/spark/blob/355c39939d9e4c87ffc9538eb822a41cb2ff93fb/python/pyspark/sql/dataframe.py#L1137]
> [https://github.com/apache/spark/blob/355c39939d9e4c87ffc9538eb822a41cb2ff93fb/python/pyspark/sql/dataframe.py#L1228]
>  
> We should make these corrections at an appropriate time; note that doing so 
> will break any code that catches the original ValueError.
>  
> [1] https://docs.python.org/3/library/exceptions.html#TypeError



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35091) Support ANSI intervals by date_part()

2021-04-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-35091:
---

Assignee: Kent Yao

> Support ANSI intervals by date_part()
> -
>
> Key: SPARK-35091
> URL: https://issues.apache.org/jira/browse/SPARK-35091
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Kent Yao
>Priority: Major
>
> Support year-month and day-time intervals by date_part(). For example:
> {code:sql}
>   > SELECT date_part('days', interval '5 0:0:0' day to second);
>5
>   > SELECT date_part('seconds', interval '10 11:12:30.001001' DAY TO 
> SECOND);
>30.001001
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35091) Support ANSI intervals by date_part()

2021-04-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-35091.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32351
[https://github.com/apache/spark/pull/32351]

> Support ANSI intervals by date_part()
> -
>
> Key: SPARK-35091
> URL: https://issues.apache.org/jira/browse/SPARK-35091
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.2.0
>
>
> Support year-month and day-time intervals by date_part(). For example:
> {code:sql}
>   > SELECT date_part('days', interval '5 0:0:0' day to second);
>5
>   > SELECT date_part('seconds', interval '10 11:12:30.001001' DAY TO 
> SECOND);
>30.001001
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35239) Coalesce shuffle partition should handle empty input RDD

2021-04-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-35239:
---

Assignee: ulysses you

>  Coalesce shuffle partition should handle empty input RDD
> -
>
> Key: SPARK-35239
> URL: https://issues.apache.org/jira/browse/SPARK-35239
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
>
> If the input RDD has no partitions, the map output statistics will be null. 
> And if every shuffle stage's input RDD is empty, we skip the stage and lose 
> the chance to coalesce partitions.
>  
> We can simply create an empty partition for these custom shuffle readers to 
> reduce the partition number.
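For context, this coalescing happens inside adaptive query execution; a minimal sketch of the settings involved and of the empty-input case the issue describes (values are illustrative):

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch only: AQE settings that drive post-shuffle partition coalescing.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("aqe-coalesce-sketch")
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
  .getOrCreate()
import spark.implicits._

// An empty input still produces a shuffle with spark.sql.shuffle.partitions
// (200 by default) partitions; the change proposed here lets AQE coalesce that
// instead of skipping the stage because its map output statistics are null.
val out = Seq.empty[(Int, Int)].toDF("k", "v").groupBy("k").count()
println(out.rdd.getNumPartitions)
{code}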



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35239) Coalesce shuffle partition should handle empty input RDD

2021-04-27 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-35239.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32362
[https://github.com/apache/spark/pull/32362]

>  Coalesce shuffle partition should handle empty input RDD
> -
>
> Key: SPARK-35239
> URL: https://issues.apache.org/jira/browse/SPARK-35239
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
> Fix For: 3.2.0
>
>
> If the input RDD has no partitions, the map output statistics will be null. 
> And if every shuffle stage's input RDD is empty, we skip the stage and lose 
> the chance to coalesce partitions.
>  
> We can simply create an empty partition for these custom shuffle readers to 
> reduce the partition number.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-35020) Group exception messages in catalyst/util

2021-04-27 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-35020:
---
Comment: was deleted

(was: I'm working on.)

> Group exception messages in catalyst/util
> -
>
> Key: SPARK-35020
> URL: https://issues.apache.org/jira/browse/SPARK-35020
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Priority: Major
>
> 'sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util'
> || Filename  ||   Count ||
> | ArrayBasedMapBuilder.scala|   3 |
> | DateTimeFormatterHelper.scala |   3 |
> | DateTimeUtils.scala   |   3 |
> | IntervalUtils.scala   |   2 |
> | TimestampFormatter.scala  |   1 |
> | TypeUtils.scala   |   1 |
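The pattern behind this umbrella is to replace ad-hoc {{throw new SomeException(...)}} calls with methods on a central error object, so that user-facing messages are defined in one place. A rough sketch of the before/after shape (the object and method below are illustrative stand-ins, not the exact names added by the PR):

{code:scala}
// Sketch only: illustrates the error-grouping pattern.

// Before: the message is assembled inline at the call site.
def toIntExact(value: Long): Int = {
  if (value > Int.MaxValue || value < Int.MinValue) {
    throw new ArithmeticException(s"Casting $value to int causes overflow")
  }
  value.toInt
}

// After: the call site delegates to a grouped error object (QueryExecutionErrors
// in Spark; the method name here is assumed for illustration).
object GroupedErrors {
  def castingCauseOverflowError(value: Any, target: String): ArithmeticException =
    new ArithmeticException(s"Casting $value to $target causes overflow")
}

def toIntExactGrouped(value: Long): Int = {
  if (value > Int.MaxValue || value < Int.MinValue) {
    throw GroupedErrors.castingCauseOverflowError(value, "int")
  }
  value.toInt
}
{code}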



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35020) Group exception messages in catalyst/util

2021-04-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333131#comment-17333131
 ] 

Apache Spark commented on SPARK-35020:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/32367

> Group exception messages in catalyst/util
> -
>
> Key: SPARK-35020
> URL: https://issues.apache.org/jira/browse/SPARK-35020
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Priority: Major
>
> 'sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util'
> || Filename  ||   Count ||
> | ArrayBasedMapBuilder.scala|   3 |
> | DateTimeFormatterHelper.scala |   3 |
> | DateTimeUtils.scala   |   3 |
> | IntervalUtils.scala   |   2 |
> | TimestampFormatter.scala  |   1 |
> | TypeUtils.scala   |   1 |



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35020) Group exception messages in catalyst/util

2021-04-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35020:


Assignee: Apache Spark

> Group exception messages in catalyst/util
> -
>
> Key: SPARK-35020
> URL: https://issues.apache.org/jira/browse/SPARK-35020
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Assignee: Apache Spark
>Priority: Major
>
> 'sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util'
> || Filename  ||   Count ||
> | ArrayBasedMapBuilder.scala|   3 |
> | DateTimeFormatterHelper.scala |   3 |
> | DateTimeUtils.scala   |   3 |
> | IntervalUtils.scala   |   2 |
> | TimestampFormatter.scala  |   1 |
> | TypeUtils.scala   |   1 |



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35020) Group exception messages in catalyst/util

2021-04-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35020:


Assignee: (was: Apache Spark)

> Group exception messages in catalyst/util
> -
>
> Key: SPARK-35020
> URL: https://issues.apache.org/jira/browse/SPARK-35020
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Priority: Major
>
> 'sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util'
> || Filename  ||   Count ||
> | ArrayBasedMapBuilder.scala|   3 |
> | DateTimeFormatterHelper.scala |   3 |
> | DateTimeUtils.scala   |   3 |
> | IntervalUtils.scala   |   2 |
> | TimestampFormatter.scala  |   1 |
> | TypeUtils.scala   |   1 |



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34878) Test actual size of year-month and day-time intervals

2021-04-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34878:


Assignee: (was: Apache Spark)

> Test actual size of year-month and day-time intervals
> -
>
> Key: SPARK-34878
> URL: https://issues.apache.org/jira/browse/SPARK-34878
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> Add tests to 
> https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/ColumnTypeSuite.scala#L52



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34878) Test actual size of year-month and day-time intervals

2021-04-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34878:


Assignee: Apache Spark

> Test actual size of year-month and day-time intervals
> -
>
> Key: SPARK-34878
> URL: https://issues.apache.org/jira/browse/SPARK-34878
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Add tests to 
> https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/ColumnTypeSuite.scala#L52



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34878) Test actual size of year-month and day-time intervals

2021-04-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333106#comment-17333106
 ] 

Apache Spark commented on SPARK-34878:
--

User 'Peng-Lei' has created a pull request for this issue:
https://github.com/apache/spark/pull/32366

> Test actual size of year-month and day-time intervals
> -
>
> Key: SPARK-34878
> URL: https://issues.apache.org/jira/browse/SPARK-34878
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> Add tests to 
> https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/ColumnTypeSuite.scala#L52



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-35060) Group exception messages in sql/types

2021-04-27 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-35060:
---
Comment: was deleted

(was: I'm working on.)

> Group exception messages in sql/types
> -
>
> Key: SPARK-35060
> URL: https://issues.apache.org/jira/browse/SPARK-35060
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.2.0
>
>
> 'sql/catalyst/src/main/scala/org/apache/spark/sql/types'
> || Filename ||   Count ||
> | Decimal.scala|   4 |
> | Metadata.scala   |   4 |
> | StructType.scala |   1 |
> | numerics.scala   |   7 |



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35020) Group exception messages in catalyst/util

2021-04-27 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333101#comment-17333101
 ] 

jiaan.geng commented on SPARK-35020:


I'm working on.

> Group exception messages in catalyst/util
> -
>
> Key: SPARK-35020
> URL: https://issues.apache.org/jira/browse/SPARK-35020
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Priority: Major
>
> 'sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util'
> || Filename  ||   Count ||
> | ArrayBasedMapBuilder.scala|   3 |
> | DateTimeFormatterHelper.scala |   3 |
> | DateTimeUtils.scala   |   3 |
> | IntervalUtils.scala   |   2 |
> | TimestampFormatter.scala  |   1 |
> | TypeUtils.scala   |   1 |



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35228) Add expression ToHiveString for keep consistent between hive/spark format in df.show and transform

2021-04-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333096#comment-17333096
 ] 

Apache Spark commented on SPARK-35228:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/32365

> Add expression ToHiveString for keep consistent between hive/spark format in 
> df.show and transform
> --
>
> Key: SPARK-35228
> URL: https://issues.apache.org/jira/browse/SPARK-35228
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: angerszhu
>Priority: Major
>
> According to 
> [https://github.com/apache/spark/pull/32335#discussion_r620027850] 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35228) Add expression ToHiveString for keep consistent between hive/spark format in df.show and transform

2021-04-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35228:


Assignee: Apache Spark

> Add expression ToHiveString for keep consistent between hive/spark format in 
> df.show and transform
> --
>
> Key: SPARK-35228
> URL: https://issues.apache.org/jira/browse/SPARK-35228
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>
> According to 
> [https://github.com/apache/spark/pull/32335#discussion_r620027850] 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35228) Add expression ToHiveString for keep consistent between hive/spark format in df.show and transform

2021-04-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35228:


Assignee: (was: Apache Spark)

> Add expression ToHiveString for keep consistent between hive/spark format in 
> df.show and transform
> --
>
> Key: SPARK-35228
> URL: https://issues.apache.org/jira/browse/SPARK-35228
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: angerszhu
>Priority: Major
>
> According to 
> [https://github.com/apache/spark/pull/32335#discussion_r620027850] 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35228) Add expression ToHiveString for keep consistent between hive/spark format in df.show and transform

2021-04-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333094#comment-17333094
 ] 

Apache Spark commented on SPARK-35228:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/32365

> Add expression ToHiveString for keep consistent between hive/spark format in 
> df.show and transform
> --
>
> Key: SPARK-35228
> URL: https://issues.apache.org/jira/browse/SPARK-35228
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: angerszhu
>Priority: Major
>
> According to 
> [https://github.com/apache/spark/pull/32335#discussion_r620027850] 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35217) com.google.protobuf.Parser.parseFrom() method Can't use in spark

2021-04-27 Thread ShuDaoNan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ShuDaoNan resolved SPARK-35217.
---
Resolution: Incomplete

Caused by the protobuf-java-version.jar bundled with Amazon EMR being too old.

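If upgrading the cluster-provided jar is not possible, a common way around this kind of conflict is to shade (relocate) protobuf inside the application's fat jar so the old protobuf-java on the EMR classpath can no longer shadow it. A sketch with sbt-assembly (the relocation prefix is arbitrary, and this assumes the TensorFlow jars are bundled into the same assembly):

{code:scala}
// build.sbt sketch only: relocate com.google.protobuf inside the fat jar so the
// application and the bundled TensorFlow classes use the shaded 3.8.0 copy
// instead of the older protobuf-java shipped with the cluster.
assembly / assemblyShadeRules := Seq(
  ShadeRule.rename("com.google.protobuf.**" -> "shaded.com.google.protobuf.@1").inAll
)

libraryDependencies += "com.google.protobuf" % "protobuf-java" % "3.8.0"
{code}
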
>  com.google.protobuf.Parser.parseFrom() method Can't use in spark 
> --
>
> Key: SPARK-35217
> URL: https://issues.apache.org/jira/browse/SPARK-35217
> Project: Spark
>  Issue Type: Question
>  Components: Deploy, ML
>Affects Versions: 2.4.7
> Environment: *platform:*
>  Linux ip-172-17-1-1 4.14.219-164.354.amzn2.x86_64 
> [#1|https://github.com/apache/spark/pull/1] SMP Mon Feb 22 21:18:39 UTC 2021 
> x86_64 x86_64 x86_64 GNU/Linux
>  Hive 2.3.7, ZooKeeper 3.4.14, Spark 2.4.7, TensorFlow 2.4.1
>  (Windows10 do not have this problem.)
>Reporter: ShuDaoNan
>Priority: Trivial
>
> *platform:*
>  Linux ip-172-17-1-1 4.14.219-164.354.amzn2.x86_64 
> [#1|https://github.com/apache/spark/pull/1] SMP Mon Feb 22 21:18:39 UTC 2021 
> x86_64 x86_64 x86_64 GNU/Linux
>  Hive 2.3.7, ZooKeeper 3.4.14, Spark 2.4.7, TensorFlow 2.4.1
>  (Windows10 do not have this problem.)
> *problem:*
>  Cannot load a TensorFlow model (Java or Scala):
> {code:java}
> [hadoop@ip-172-17-1-1 ~]$ spark-shell  --master local[2] --jars 
> s3://jars/jasypt-1.9.2.jar,s3://jars/commons-pool2-2.0.jar,s3://jars/tensorflow-core-api-0.3.1.jar,s3://jars/tensorflow-core-api-0.3.1-linux-x86_64-mkl.jar,s3://jars/ndarray-0.3.1.jar,s3://jars/javacpp-1.5.4.jar,s3://jars/tensorflow-core-platform-0.3.1.jar,s3://jars/protobuf-java-3.8.0.jar
> Spark session available as 'spark'.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.4.7-amzn-1
>   /_/
>  
> Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_282)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> import org.{tensorflow => tf}
> import org.{tensorflow=>tf}
> scala> val bundle = tf.SavedModelBundle.load("/home/hadoop/xDeepFM","serve")
> 2021-04-23 07:32:56.223881: I 
> external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:32] Reading 
> SavedModel from: /home/hadoop/xDeepFM
> 2021-04-23 07:32:56.266424: I 
> external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:55] Reading meta 
> graph with tags { serve }
> 2021-04-23 07:32:56.266468: I 
> external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:93] Reading 
> SavedModel debug info (if present) from: /home/hadoop/xDeepFM
> 2021-04-23 07:32:56.346757: I 
> external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:206] Restoring 
> SavedModel bundle.
> 2021-04-23 07:32:56.873838: I 
> external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:190] Running 
> initialization op on SavedModel bundle at path: /home/hadoop/xDeepFM
> 2021-04-23 07:32:56.928656: I 
> external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:277] SavedModel 
> load for tags { serve }; Status: success: OK. Took 704788 microseconds.
> java.lang.NoSuchMethodError: 
> com.google.protobuf.Parser.parseFrom(Ljava/nio/ByteBuffer;)Ljava/lang/Object;
>   at 
> org.tensorflow.proto.framework.MetaGraphDef.parseFrom(MetaGraphDef.java:3067)
>   at org.tensorflow.SavedModelBundle.load(SavedModelBundle.java:422)
>   at org.tensorflow.SavedModelBundle.access$000(SavedModelBundle.java:59)
>   at org.tensorflow.SavedModelBundle$Loader.load(SavedModelBundle.java:68)
>   at org.tensorflow.SavedModelBundle.load(SavedModelBundle.java:242)
>   ... 49 elided
> scala>
> {code}
> *But I can load the model in a single 'Testtf.java' like that:*
> {code:java}
> [hadoop@ip-172-17-1-1 ~]$ vi Testtf.java 
> import org.tensorflow.SavedModelBundle;
> public class Testtf {
> public static void main(String[] args) {
> System.out.println("test load...");
> SavedModelBundle bundle = 
> org.tensorflow.SavedModelBundle.load("/home/hadoop/xDeepFM","serve");
> System.out.println("loaded bundle...");
> System.out.println(bundle);
> }
> }
> [hadoop@ip-172-17-1-1 ~]$ javac -cp 
> javacpp-1.5.4-linux-x86_64.jar:tensorflow-core-api-0.3.1-linux-x86_64.jar:tensorflow-core-api-0.3.1.jar:ndarray-0.3.1.jar:javacpp-1.5.4.jar:tensorflow-core-platform-0.3.1.jar:tensorflow-core-platform-mkl-0.3.1.jar:protobuf-java-3.8.0.jar
>  Testtf.java
> [hadoop@ip-172-17-1-1 ~]$ java -cp 
> javacpp-1.5.4-linux-x86_64.jar:tensorflow-core-api-0.3.1-linux-x86_64.jar:tensorflow-core-api-0.3.1.jar:ndarray-0.3.1.jar:javacpp-1.5.4.jar:tensorflow-core-platform-0.3.1.jar:tensorflow-core-platform-mkl-0.3.1.jar:protobuf-java-3.8.0.jar:.
>  Testtf
> test load...
> 2021-04-25 02:33:47.247120: I 
> external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:32] Reading 
> SavedModel from: /home/hadoop/xDeepFM
> 2021-04-25 

[jira] [Updated] (SPARK-35217) com.google.protobuf.Parser.parseFrom() method Can't use in spark

2021-04-27 Thread ShuDaoNan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ShuDaoNan updated SPARK-35217:
--
Priority: Trivial  (was: Major)

>  com.google.protobuf.Parser.parseFrom() method Can't use in spark 
> --
>
> Key: SPARK-35217
> URL: https://issues.apache.org/jira/browse/SPARK-35217
> Project: Spark
>  Issue Type: Question
>  Components: Deploy, ML
>Affects Versions: 2.4.7
> Environment: *platform:*
>  Linux ip-172-17-1-1 4.14.219-164.354.amzn2.x86_64 
> [#1|https://github.com/apache/spark/pull/1] SMP Mon Feb 22 21:18:39 UTC 2021 
> x86_64 x86_64 x86_64 GNU/Linux
>  Hive 2.3.7, ZooKeeper 3.4.14, Spark 2.4.7, TensorFlow 2.4.1
>  (Windows10 do not have this problem.)
>Reporter: ShuDaoNan
>Priority: Trivial
>
> *platform:*
>  Linux ip-172-17-1-1 4.14.219-164.354.amzn2.x86_64 
> [#1|https://github.com/apache/spark/pull/1] SMP Mon Feb 22 21:18:39 UTC 2021 
> x86_64 x86_64 x86_64 GNU/Linux
>  Hive 2.3.7, ZooKeeper 3.4.14, Spark 2.4.7, TensorFlow 2.4.1
>  (Windows10 do not have this problem.)
> *problem:*
>  Cannot load a TensorFlow model (Java or Scala):
> {code:java}
> [hadoop@ip-172-17-1-1 ~]$ spark-shell  --master local[2] --jars 
> s3://jars/jasypt-1.9.2.jar,s3://jars/commons-pool2-2.0.jar,s3://jars/tensorflow-core-api-0.3.1.jar,s3://jars/tensorflow-core-api-0.3.1-linux-x86_64-mkl.jar,s3://jars/ndarray-0.3.1.jar,s3://jars/javacpp-1.5.4.jar,s3://jars/tensorflow-core-platform-0.3.1.jar,s3://jars/protobuf-java-3.8.0.jar
> Spark session available as 'spark'.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.4.7-amzn-1
>   /_/
>  
> Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_282)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> import org.{tensorflow => tf}
> import org.{tensorflow=>tf}
> scala> val bundle = tf.SavedModelBundle.load("/home/hadoop/xDeepFM","serve")
> 2021-04-23 07:32:56.223881: I 
> external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:32] Reading 
> SavedModel from: /home/hadoop/xDeepFM
> 2021-04-23 07:32:56.266424: I 
> external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:55] Reading meta 
> graph with tags { serve }
> 2021-04-23 07:32:56.266468: I 
> external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:93] Reading 
> SavedModel debug info (if present) from: /home/hadoop/xDeepFM
> 2021-04-23 07:32:56.346757: I 
> external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:206] Restoring 
> SavedModel bundle.
> 2021-04-23 07:32:56.873838: I 
> external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:190] Running 
> initialization op on SavedModel bundle at path: /home/hadoop/xDeepFM
> 2021-04-23 07:32:56.928656: I 
> external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:277] SavedModel 
> load for tags { serve }; Status: success: OK. Took 704788 microseconds.
> java.lang.NoSuchMethodError: 
> com.google.protobuf.Parser.parseFrom(Ljava/nio/ByteBuffer;)Ljava/lang/Object;
>   at 
> org.tensorflow.proto.framework.MetaGraphDef.parseFrom(MetaGraphDef.java:3067)
>   at org.tensorflow.SavedModelBundle.load(SavedModelBundle.java:422)
>   at org.tensorflow.SavedModelBundle.access$000(SavedModelBundle.java:59)
>   at org.tensorflow.SavedModelBundle$Loader.load(SavedModelBundle.java:68)
>   at org.tensorflow.SavedModelBundle.load(SavedModelBundle.java:242)
>   ... 49 elided
> scala>
> {code}
> *But I can load the model in a standalone 'Testtf.java' like this:*
> {code:java}
> [hadoop@ip-172-17-1-1 ~]$ vi Testtf.java 
> import org.tensorflow.SavedModelBundle;
> public class Testtf {
> public static void main(String[] args) {
> System.out.println("test load...");
> SavedModelBundle bundle = 
> org.tensorflow.SavedModelBundle.load("/home/hadoop/xDeepFM","serve");
> System.out.println("loaded bundle...");
> System.out.println(bundle);
> }
> }
> [hadoop@ip-172-17-1-1 ~]$ javac -cp 
> javacpp-1.5.4-linux-x86_64.jar:tensorflow-core-api-0.3.1-linux-x86_64.jar:tensorflow-core-api-0.3.1.jar:ndarray-0.3.1.jar:javacpp-1.5.4.jar:tensorflow-core-platform-0.3.1.jar:tensorflow-core-platform-mkl-0.3.1.jar:protobuf-java-3.8.0.jar
>  Testtf.java
> [hadoop@ip-172-17-1-1 ~]$ java -cp 
> javacpp-1.5.4-linux-x86_64.jar:tensorflow-core-api-0.3.1-linux-x86_64.jar:tensorflow-core-api-0.3.1.jar:ndarray-0.3.1.jar:javacpp-1.5.4.jar:tensorflow-core-platform-0.3.1.jar:tensorflow-core-platform-mkl-0.3.1.jar:protobuf-java-3.8.0.jar:.
>  Testtf
> test load...
> 2021-04-25 02:33:47.247120: I 
> external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:32] Reading 
> SavedModel from: /home/hadoop/xDeepFM
> 2021-04-25 02:33:47.275349: I 
> 
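The NoSuchMethodError in the spark-shell session above looks like a protobuf version conflict: Spark 2.4 / Hadoop distributions typically place an older protobuf (2.5.x, whose Parser has no parseFrom(ByteBuffer) overload) ahead of the protobuf-java 3.8.0 jar supplied via --jars, while the standalone java run only sees the 3.8.0 jar. The snippet below is an editorial diagnostic sketch, not part of the original report; pasted into the same spark-shell, it prints which jar the Parser class was actually loaded from. If that points at an old protobuf, launching with spark.driver.userClassPathFirst=true / spark.executor.userClassPathFirst=true, or shading protobuf inside the application jar, are the usual workarounds.

{code:scala}
// Editorial diagnostic sketch (assumption: an older protobuf on Spark's classpath is
// shadowing protobuf-java 3.8.0). Prints the location the JVM loaded Parser from.
val parserClass = Class.forName("com.google.protobuf.Parser")
val source = Option(parserClass.getProtectionDomain.getCodeSource)
println(source.map(_.getLocation.toString).getOrElse("loaded from the bootstrap classpath"))
{code}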

[jira] [Assigned] (SPARK-35242) Support change catalog default database for spark

2021-04-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35242:


Assignee: (was: Apache Spark)

> Support change catalog default database for spark
> -
>
> Key: SPARK-35242
> URL: https://issues.apache.org/jira/browse/SPARK-35242
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: hong dongdong
>Priority: Major
>
> The Spark catalog default database can only be 'default'. When we cannot access 
> 'default', we get a 'Permission denied' exception. We should support changing the 
> default database for the catalog, as 'jdbc/thrift' does.
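A minimal sketch of how such a setting might be used from an application, assuming a hypothetical configuration key; spark.sql.catalog.defaultDatabase is an invented name used only to illustrate the proposal, not an existing Spark option:

{code:scala}
import org.apache.spark.sql.SparkSession

// Hypothetical usage sketch: the configuration key below does not exist today; the
// session catalog currently always starts in the hard-coded 'default' database.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("default-db-sketch")
  .config("spark.sql.catalog.defaultDatabase", "analytics_db") // invented knob
  .getOrCreate()

// With the proposal, this would operate on 'analytics_db' without ever needing
// access to the (here inaccessible) 'default' database.
spark.sql("SHOW TABLES").show()
{code}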



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35242) Support change catalog default database for spark

2021-04-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333055#comment-17333055
 ] 

Apache Spark commented on SPARK-35242:
--

User 'hddong' has created a pull request for this issue:
https://github.com/apache/spark/pull/32364

> Support change catalog default database for spark
> -
>
> Key: SPARK-35242
> URL: https://issues.apache.org/jira/browse/SPARK-35242
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: hong dongdong
>Priority: Major
>
> The Spark catalog default database can only be 'default'. When we cannot access 
> 'default', we get a 'Permission denied' exception. We should support changing the 
> default database for the catalog, as 'jdbc/thrift' does.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35242) Support change catalog default database for spark

2021-04-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35242:


Assignee: Apache Spark

> Support change catalog default database for spark
> -
>
> Key: SPARK-35242
> URL: https://issues.apache.org/jira/browse/SPARK-35242
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: hong dongdong
>Assignee: Apache Spark
>Priority: Major
>
> The Spark catalog default database can only be 'default'. When we cannot access 
> 'default', we get a 'Permission denied' exception. We should support changing the 
> default database for the catalog, as 'jdbc/thrift' does.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35242) Support change catalog default database for spark

2021-04-27 Thread hong dongdong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hong dongdong updated SPARK-35242:
--
Summary: Support change catalog default database for spark  (was: Support 
change default database for spark sql)

> Support change catalog default database for spark
> -
>
> Key: SPARK-35242
> URL: https://issues.apache.org/jira/browse/SPARK-35242
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: hong dongdong
>Priority: Major
>
> The Spark catalog default database can only be 'default'. When we cannot access 
> 'default', we get a 'Permission denied' exception. We should support changing the 
> default database for the catalog, as 'jdbc/thrift' does.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35242) Support change default database for spark sql

2021-04-27 Thread hong dongdong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hong dongdong updated SPARK-35242:
--
Description: The Spark catalog default database can only be 'default'. When we 
cannot access 'default', we get a 'Permission denied' exception. We should support 
changing the default database for the catalog, as 'jdbc/thrift' does.

> Support change default database for spark sql
> -
>
> Key: SPARK-35242
> URL: https://issues.apache.org/jira/browse/SPARK-35242
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: hong dongdong
>Priority: Major
>
> The Spark catalog default database can only be 'default'. When we cannot access 
> 'default', we get a 'Permission denied' exception. We should support changing the 
> default database for the catalog, as 'jdbc/thrift' does.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34979) Failed to install pyspark[sql] (due to pyarrow error) on aarch64

2021-04-27 Thread Yikun Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333038#comment-17333038
 ] 

Yikun Jiang edited comment on SPARK-34979 at 4/27/21, 8:34 AM:
---

Note that arrow 4.0.0 has been released, and pyarrow 4.0.0 has been published today:

[1] [https://github.com/apache/arrow/releases/tag/apache-arrow-4.0.0]

[2] [https://pypi.org/project/pyarrow/4.0.0/]

 


was (Author: yikunkero):
Note that arrow 4.0.0 has been published, and pyarrow has been published today, 

[1] 
[https://github.com/apache/arrow/releases/tag/apache-arrow-4.0.[2]0|https://github.com/apache/arrow/releases/tag/apache-arrow-4.0.0]

[2] [https://pypi.org/project/pyarrow/4.0.0/]

 

> Failed to install pyspark[sql] (due to pyarrow error) on aarch64
> 
>
> Key: SPARK-34979
> URL: https://issues.apache.org/jira/browse/SPARK-34979
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Yikun Jiang
>Priority: Major
>
> ~^$ pip install pyspark[sql]^~
> ~^Collecting pyarrow>=1.0.0^~
>  ~^Using cached pyarrow-3.0.0.tar.gz (682 kB)^~
>  ~^Installing build dependencies ... done^~
>  ~^Getting requirements to build wheel ... done^~
>  ~^Preparing wheel metadata ... done^~
>  ~^// ... ...^~
>  ~^Building wheels for collected packages: pyarrow^~
>  ~^Building wheel for pyarrow (PEP 517) ... error^~
>  ~^ERROR: Command errored out with exit status 1:^~
>  ~^command: /root/venv/bin/python3.8 /tmp/tmpv35m1o0g build_wheel 
> /tmp/tmpq0n5juib^~
>  ~^cwd: /tmp/pip-install-sh0myu71/pyarrow^~
>  ~^Complete output (183 lines):^~
> ~^– Running cmake for pyarrow^~
>  ~^cmake -DPYTHON_EXECUTABLE=/root/venv/bin/python3.8 
> -DPython3_EXECUTABLE=/root/venv/bin/python3.8 -DPYARROW_BUILD_CUDA=off 
> -DPYARROW_BUILD_FLIGHT=off -DPYARROW_BUILD_GANDIVA=off 
> -DPYARROW_BUILD_DATASET=off -DPYARROW_BUILD_ORC=off 
> -DPYARROW_BUILD_PARQUET=off -DPYARROW_BUILD_PLASMA=off -DPYARROW_BUILD_S3=off 
> -DPYARROW_BUILD_HDFS=off -DPYARROW_USE_TENSORFLOW=off 
> -DPYARROW_BUNDLE_ARROW_CPP=off -DPYARROW_BUNDLE_BOOST=off 
> -DPYARROW_GENERATE_COVERAGE=off -DPYARROW_BOOST_USE_SHARED=on 
> -DPYARROW_PARQUET_USE_SHARED=on -DCMAKE_BUILD_TYPE=release 
> /tmp/pip-install-sh0myu71/pyarrow^~
>  ~^error: command 'cmake' failed with exit status 1^~
>  ~^^~
>  ~^ERROR: Failed building wheel for pyarrow^~
>  ~^Failed to build pyarrow^~
>  ~^ERROR: Could not build wheels for pyarrow which use PEP 517 and cannot be 
> installed directly^~
>  
> The pip installation fails because the pyarrow dependency fails to build and 
> install.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34979) Failed to install pyspark[sql] (due to pyarrow error) on aarch64

2021-04-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333039#comment-17333039
 ] 

Apache Spark commented on SPARK-34979:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/32363

> Failed to install pyspark[sql] (due to pyarrow error) on aarch64
> 
>
> Key: SPARK-34979
> URL: https://issues.apache.org/jira/browse/SPARK-34979
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Yikun Jiang
>Priority: Major
>
> ~^$ pip install pyspark[sql]^~
> ~^Collecting pyarrow>=1.0.0^~
>  ~^Using cached pyarrow-3.0.0.tar.gz (682 kB)^~
>  ~^Installing build dependencies ... done^~
>  ~^Getting requirements to build wheel ... done^~
>  ~^Preparing wheel metadata ... done^~
>  ~^// ... ...^~
>  ~^Building wheels for collected packages: pyarrow^~
>  ~^Building wheel for pyarrow (PEP 517) ... error^~
>  ~^ERROR: Command errored out with exit status 1:^~
>  ~^command: /root/venv/bin/python3.8 /tmp/tmpv35m1o0g build_wheel 
> /tmp/tmpq0n5juib^~
>  ~^cwd: /tmp/pip-install-sh0myu71/pyarrow^~
>  ~^Complete output (183 lines):^~
> ~^– Running cmake for pyarrow^~
>  ~^cmake -DPYTHON_EXECUTABLE=/root/venv/bin/python3.8 
> -DPython3_EXECUTABLE=/root/venv/bin/python3.8 -DPYARROW_BUILD_CUDA=off 
> -DPYARROW_BUILD_FLIGHT=off -DPYARROW_BUILD_GANDIVA=off 
> -DPYARROW_BUILD_DATASET=off -DPYARROW_BUILD_ORC=off 
> -DPYARROW_BUILD_PARQUET=off -DPYARROW_BUILD_PLASMA=off -DPYARROW_BUILD_S3=off 
> -DPYARROW_BUILD_HDFS=off -DPYARROW_USE_TENSORFLOW=off 
> -DPYARROW_BUNDLE_ARROW_CPP=off -DPYARROW_BUNDLE_BOOST=off 
> -DPYARROW_GENERATE_COVERAGE=off -DPYARROW_BOOST_USE_SHARED=on 
> -DPYARROW_PARQUET_USE_SHARED=on -DCMAKE_BUILD_TYPE=release 
> /tmp/pip-install-sh0myu71/pyarrow^~
>  ~^error: command 'cmake' failed with exit status 1^~
>  ~^^~
>  ~^ERROR: Failed building wheel for pyarrow^~
>  ~^Failed to build pyarrow^~
>  ~^ERROR: Could not build wheels for pyarrow which use PEP 517 and cannot be 
> installed directly^~
>  
> The pip installation fails because the pyarrow dependency fails to build and 
> install.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34979) Failed to install pyspark[sql] (due to pyarrow error) on aarch64

2021-04-27 Thread Yikun Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333038#comment-17333038
 ] 

Yikun Jiang commented on SPARK-34979:
-

Note that arrow 4.0.0 has been released, and pyarrow 4.0.0 has been published today:

[1] [https://github.com/apache/arrow/releases/tag/apache-arrow-4.0.0]

[2] [https://pypi.org/project/pyarrow/4.0.0/]

 

> Failed to install pyspark[sql] (due to pyarrow error) on aarch64
> 
>
> Key: SPARK-34979
> URL: https://issues.apache.org/jira/browse/SPARK-34979
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Yikun Jiang
>Priority: Major
>
> ~^$ pip install pyspark[sql]^~
> ~^Collecting pyarrow>=1.0.0^~
>  ~^Using cached pyarrow-3.0.0.tar.gz (682 kB)^~
>  ~^Installing build dependencies ... done^~
>  ~^Getting requirements to build wheel ... done^~
>  ~^Preparing wheel metadata ... done^~
>  ~^// ... ...^~
>  ~^Building wheels for collected packages: pyarrow^~
>  ~^Building wheel for pyarrow (PEP 517) ... error^~
>  ~^ERROR: Command errored out with exit status 1:^~
>  ~^command: /root/venv/bin/python3.8 /tmp/tmpv35m1o0g build_wheel 
> /tmp/tmpq0n5juib^~
>  ~^cwd: /tmp/pip-install-sh0myu71/pyarrow^~
>  ~^Complete output (183 lines):^~
> ~^– Running cmake for pyarrow^~
>  ~^cmake -DPYTHON_EXECUTABLE=/root/venv/bin/python3.8 
> -DPython3_EXECUTABLE=/root/venv/bin/python3.8 -DPYARROW_BUILD_CUDA=off 
> -DPYARROW_BUILD_FLIGHT=off -DPYARROW_BUILD_GANDIVA=off 
> -DPYARROW_BUILD_DATASET=off -DPYARROW_BUILD_ORC=off 
> -DPYARROW_BUILD_PARQUET=off -DPYARROW_BUILD_PLASMA=off -DPYARROW_BUILD_S3=off 
> -DPYARROW_BUILD_HDFS=off -DPYARROW_USE_TENSORFLOW=off 
> -DPYARROW_BUNDLE_ARROW_CPP=off -DPYARROW_BUNDLE_BOOST=off 
> -DPYARROW_GENERATE_COVERAGE=off -DPYARROW_BOOST_USE_SHARED=on 
> -DPYARROW_PARQUET_USE_SHARED=on -DCMAKE_BUILD_TYPE=release 
> /tmp/pip-install-sh0myu71/pyarrow^~
>  ~^error: command 'cmake' failed with exit status 1^~
>  ~^^~
>  ~^ERROR: Failed building wheel for pyarrow^~
>  ~^Failed to build pyarrow^~
>  ~^ERROR: Could not build wheels for pyarrow which use PEP 517 and cannot be 
> installed directly^~
>  
> The pip installation fails because the pyarrow dependency fails to build and 
> install.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34979) Failed to install pyspark[sql] (due to pyarrow error) on aarch64

2021-04-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34979:


Assignee: Apache Spark

> Failed to install pyspark[sql] (due to pyarrow error) on aarch64
> 
>
> Key: SPARK-34979
> URL: https://issues.apache.org/jira/browse/SPARK-34979
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Yikun Jiang
>Assignee: Apache Spark
>Priority: Major
>
> ~^$ pip install pyspark[sql]^~
> ~^Collecting pyarrow>=1.0.0^~
>  ~^Using cached pyarrow-3.0.0.tar.gz (682 kB)^~
>  ~^Installing build dependencies ... done^~
>  ~^Getting requirements to build wheel ... done^~
>  ~^Preparing wheel metadata ... done^~
>  ~^// ... ...^~
>  ~^Building wheels for collected packages: pyarrow^~
>  ~^Building wheel for pyarrow (PEP 517) ... error^~
>  ~^ERROR: Command errored out with exit status 1:^~
>  ~^command: /root/venv/bin/python3.8 /tmp/tmpv35m1o0g build_wheel 
> /tmp/tmpq0n5juib^~
>  ~^cwd: /tmp/pip-install-sh0myu71/pyarrow^~
>  ~^Complete output (183 lines):^~
> ~^– Running cmake for pyarrow^~
>  ~^cmake -DPYTHON_EXECUTABLE=/root/venv/bin/python3.8 
> -DPython3_EXECUTABLE=/root/venv/bin/python3.8 -DPYARROW_BUILD_CUDA=off 
> -DPYARROW_BUILD_FLIGHT=off -DPYARROW_BUILD_GANDIVA=off 
> -DPYARROW_BUILD_DATASET=off -DPYARROW_BUILD_ORC=off 
> -DPYARROW_BUILD_PARQUET=off -DPYARROW_BUILD_PLASMA=off -DPYARROW_BUILD_S3=off 
> -DPYARROW_BUILD_HDFS=off -DPYARROW_USE_TENSORFLOW=off 
> -DPYARROW_BUNDLE_ARROW_CPP=off -DPYARROW_BUNDLE_BOOST=off 
> -DPYARROW_GENERATE_COVERAGE=off -DPYARROW_BOOST_USE_SHARED=on 
> -DPYARROW_PARQUET_USE_SHARED=on -DCMAKE_BUILD_TYPE=release 
> /tmp/pip-install-sh0myu71/pyarrow^~
>  ~^error: command 'cmake' failed with exit status 1^~
>  ~^^~
>  ~^ERROR: Failed building wheel for pyarrow^~
>  ~^Failed to build pyarrow^~
>  ~^ERROR: Could not build wheels for pyarrow which use PEP 517 and cannot be 
> installed directly^~
>  
> The pip installation fails because the pyarrow dependency fails to build and 
> install.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34979) Failed to install pyspark[sql] (due to pyarrow error) on aarch64

2021-04-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34979:


Assignee: (was: Apache Spark)

> Failed to install pyspark[sql] (due to pyarrow error) on aarch64
> 
>
> Key: SPARK-34979
> URL: https://issues.apache.org/jira/browse/SPARK-34979
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Yikun Jiang
>Priority: Major
>
> ~^$ pip install pyspark[sql]^~
> ~^Collecting pyarrow>=1.0.0^~
>  ~^Using cached pyarrow-3.0.0.tar.gz (682 kB)^~
>  ~^Installing build dependencies ... done^~
>  ~^Getting requirements to build wheel ... done^~
>  ~^Preparing wheel metadata ... done^~
>  ~^// ... ...^~
>  ~^Building wheels for collected packages: pyarrow^~
>  ~^Building wheel for pyarrow (PEP 517) ... error^~
>  ~^ERROR: Command errored out with exit status 1:^~
>  ~^command: /root/venv/bin/python3.8 /tmp/tmpv35m1o0g build_wheel 
> /tmp/tmpq0n5juib^~
>  ~^cwd: /tmp/pip-install-sh0myu71/pyarrow^~
>  ~^Complete output (183 lines):^~
> ~^– Running cmake for pyarrow^~
>  ~^cmake -DPYTHON_EXECUTABLE=/root/venv/bin/python3.8 
> -DPython3_EXECUTABLE=/root/venv/bin/python3.8 -DPYARROW_BUILD_CUDA=off 
> -DPYARROW_BUILD_FLIGHT=off -DPYARROW_BUILD_GANDIVA=off 
> -DPYARROW_BUILD_DATASET=off -DPYARROW_BUILD_ORC=off 
> -DPYARROW_BUILD_PARQUET=off -DPYARROW_BUILD_PLASMA=off -DPYARROW_BUILD_S3=off 
> -DPYARROW_BUILD_HDFS=off -DPYARROW_USE_TENSORFLOW=off 
> -DPYARROW_BUNDLE_ARROW_CPP=off -DPYARROW_BUNDLE_BOOST=off 
> -DPYARROW_GENERATE_COVERAGE=off -DPYARROW_BOOST_USE_SHARED=on 
> -DPYARROW_PARQUET_USE_SHARED=on -DCMAKE_BUILD_TYPE=release 
> /tmp/pip-install-sh0myu71/pyarrow^~
>  ~^error: command 'cmake' failed with exit status 1^~
>  ~^^~
>  ~^ERROR: Failed building wheel for pyarrow^~
>  ~^Failed to build pyarrow^~
>  ~^ERROR: Could not build wheels for pyarrow which use PEP 517 and cannot be 
> installed directly^~
>  
> The pip installation fails because the pyarrow dependency fails to build and 
> install.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35242) Support change default database for spark sql

2021-04-27 Thread hong dongdong (Jira)
hong dongdong created SPARK-35242:
-

 Summary: Support change default database for spark sql
 Key: SPARK-35242
 URL: https://issues.apache.org/jira/browse/SPARK-35242
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.1
Reporter: hong dongdong






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35239) Coalesce shuffle partition should handle empty input RDD

2021-04-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35239:


Assignee: (was: Apache Spark)

>  Coalesce shuffle partition should handle empty input RDD
> -
>
> Key: SPARK-35239
> URL: https://issues.apache.org/jira/browse/SPARK-35239
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Priority: Minor
>
> If a shuffle's input RDD has no partitions, its map output statistics will be null. 
> And if every shuffle stage's input RDD is empty, we skip the rule and lose the 
> chance to coalesce partitions.
>  
> We can simply create an empty partition for these custom shuffle readers to 
> reduce the partition number.
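A conceptual sketch of the idea; the helper below is hypothetical and not Spark's internal API. Treating an absent MapOutputStatistics as "zero bytes in every reduce partition" lets the size-based coalescing still run and shrink the partition count instead of being skipped.

{code:scala}
// Hypothetical helper for illustration only: when a shuffle's input RDD has no
// partitions its MapOutputStatistics is null, so substitute an all-zero size array
// and let the normal size-based coalescing proceed over it.
def bytesByReducePartition(
    statsBytes: Option[Array[Long]],  // None <=> empty input RDD for that shuffle
    numReducePartitions: Int): Array[Long] =
  statsBytes.getOrElse(Array.fill(numReducePartitions)(0L))

// Example: a join where the right side produced no map output at all.
val left  = Some(Array(64L << 20, 0L, 32L << 20))
val right = None
val sizes = Seq(left, right).map(bytesByReducePartition(_, 3))
// sizes == Seq(Array(67108864, 0, 33554432), Array(0, 0, 0))
{code}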



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35239) Coalesce shuffle partition should handle empty input RDD

2021-04-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35239:


Assignee: Apache Spark

>  Coalesce shuffle partition should handle empty input RDD
> -
>
> Key: SPARK-35239
> URL: https://issues.apache.org/jira/browse/SPARK-35239
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Assignee: Apache Spark
>Priority: Minor
>
> If a shuffle's input RDD has no partitions, its map output statistics will be null. 
> And if every shuffle stage's input RDD is empty, we skip the rule and lose the 
> chance to coalesce partitions.
>  
> We can simply create an empty partition for these custom shuffle readers to 
> reduce the partition number.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35239) Coalesce shuffle partition should handle empty input RDD

2021-04-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333016#comment-17333016
 ] 

Apache Spark commented on SPARK-35239:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/32362

>  Coalesce shuffle partition should handle empty input RDD
> -
>
> Key: SPARK-35239
> URL: https://issues.apache.org/jira/browse/SPARK-35239
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Priority: Minor
>
> If a shuffle's input RDD has no partitions, its map output statistics will be null. 
> And if every shuffle stage's input RDD is empty, we skip the rule and lose the 
> chance to coalesce partitions.
>  
> We can simply create an empty partition for these custom shuffle readers to 
> reduce the partition number.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35241) Investigate to prefer vectorized hash map in hash aggregate selectively

2021-04-27 Thread Cheng Su (Jira)
Cheng Su created SPARK-35241:


 Summary: Investigate to prefer vectorized hash map in hash 
aggregate selectively
 Key: SPARK-35241
 URL: https://issues.apache.org/jira/browse/SPARK-35241
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Cheng Su


In hash aggregate, we always use the row-based hash map as the first-level hash map 
in production, and only use the vectorized hash map in testing / benchmarking. 
However, micro-benchmarks do show that the vectorized hash map can be better than 
the row-based one, e.g. with a single key - 
[https://github.com/apache/spark/pull/32357#discussion_r620914345] . So we should 
re-evaluate the decision to always use the row-based hash map, and maybe come up 
with a more adaptive policy that chooses which map to use depending on the keys / 
values.
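One shape such an adaptive policy could take, as an illustrative sketch only; the rule and the key-arity threshold below are assumptions made up for this note, not Spark's actual decision logic. It prefers the vectorized map when all grouping keys and aggregation buffers are fixed-width primitive types and the key arity is small, which is roughly the shape the linked micro-benchmark favours.

{code:scala}
import org.apache.spark.sql.types._

// Illustrative heuristic only; the rule and the "<= 2 keys" threshold are invented
// for this sketch, not Spark's actual decision logic.
def preferVectorizedHashMap(
    groupingKeyTypes: Seq[DataType],
    aggBufferTypes: Seq[DataType]): Boolean = {
  val isFixedWidth: DataType => Boolean = {
    case BooleanType | ByteType | ShortType | IntegerType | LongType |
         FloatType | DoubleType | DateType | TimestampType => true
    case _ => false
  }
  groupingKeyTypes.nonEmpty &&
    groupingKeyTypes.size <= 2 &&
    (groupingKeyTypes ++ aggBufferTypes).forall(isFixedWidth)
}

// Single long key with a long count buffer -> vectorized; string key -> row-based.
preferVectorizedHashMap(Seq(LongType), Seq(LongType))    // true
preferVectorizedHashMap(Seq(StringType), Seq(LongType))  // false
{code}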



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data

2021-04-27 Thread Anthony (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333000#comment-17333000
 ] 

Anthony edited comment on SPARK-18105 at 4/27/21, 7:26 AM:
---

We are seeing similar issues with Spark 3.0.1 as well. It is not reliably 
reproducible, but we see it often:
{code:java}
org.apache.spark.shuffle.FetchFailedException: Stream is corrupted
 at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:748)
 at 
org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:823)
 at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
 at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
 at java.io.DataInputStream.readInt(DataInputStream.java:387)
 at 
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.readSize(UnsafeRowSerializer.scala:113)
 at 
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.<init>(UnsafeRowSerializer.scala:120)
 at 
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.asKeyValueIterator(UnsafeRowSerializer.scala:110)
 at 
org.apache.spark.shuffle.BlockStoreShuffleReader.$anonfun$read$2(BlockStoreShuffleReader.scala:92)
 at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
 at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
 at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
 at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
 at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage72.sort_addToSorter_0$(Unknown
 Source)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage72.processNext(Unknown
 Source)
 at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
 at 
org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
 at 
org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedBufferedToRowWithNullFreeJoinKey(SortMergeJoinExec.scala:860)
 at 
org.apache.spark.sql.execution.joins.SortMergeJoinScanner.<init>(SortMergeJoinExec.scala:730)
 at 
org.apache.spark.sql.execution.joins.SortMergeJoinExec.$anonfun$doExecute$1(SortMergeJoinExec.scala:254)
 at 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
 at org.apache.spark.scheduler.Task.run(Task.scala:127)
 at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
 Caused by: java.io.IOException: Stream is corrupted
 at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:250)
 at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157)
 at 
org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:819)
 ... 35 more
 Caused by: net.jpountz.lz4.LZ4Exception: Error decoding offset 10122 of input 
buffer
 at 
net.jpountz.lz4.LZ4JNIFastDecompressor.decompress(LZ4JNIFastDecompressor.java:39)
 at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:245)
 ... 37 more
{code}


was (Author: jangryeo):
We are seeing similar issues with Spark 3.0.1 as well. Not exactly reproducible 
but often seen

org.apache.spark.shuffle.FetchFailedException: Stream is corrupted
 at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:748)
 at 
org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:823)
 at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
 at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
 at java.io.DataInputStream.readInt(DataInputStream.java:387)
 at 
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.readSize(UnsafeRowSerializer.scala:113)
 at 

[jira] [Commented] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data

2021-04-27 Thread Anthony (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333000#comment-17333000
 ] 

Anthony commented on SPARK-18105:
-

We are seeing similar issues with Spark 3.0.1 as well. It is not reliably 
reproducible, but we see it often:

org.apache.spark.shuffle.FetchFailedException: Stream is corrupted
 at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:748)
 at 
org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:823)
 at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
 at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
 at java.io.DataInputStream.readInt(DataInputStream.java:387)
 at 
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.readSize(UnsafeRowSerializer.scala:113)
 at 
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.<init>(UnsafeRowSerializer.scala:120)
 at 
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.asKeyValueIterator(UnsafeRowSerializer.scala:110)
 at 
org.apache.spark.shuffle.BlockStoreShuffleReader.$anonfun$read$2(BlockStoreShuffleReader.scala:92)
 at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
 at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
 at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
 at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
 at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage72.sort_addToSorter_0$(Unknown
 Source)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage72.processNext(Unknown
 Source)
 at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
 at 
org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
 at 
org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedBufferedToRowWithNullFreeJoinKey(SortMergeJoinExec.scala:860)
 at 
org.apache.spark.sql.execution.joins.SortMergeJoinScanner.<init>(SortMergeJoinExec.scala:730)
 at 
org.apache.spark.sql.execution.joins.SortMergeJoinExec.$anonfun$doExecute$1(SortMergeJoinExec.scala:254)
 at 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
 at org.apache.spark.scheduler.Task.run(Task.scala:127)
 at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Stream is corrupted
 at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:250)
 at net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157)
 at 
org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:819)
 ... 35 more
Caused by: net.jpountz.lz4.LZ4Exception: Error decoding offset 10122 of input 
buffer
 at 
net.jpountz.lz4.LZ4JNIFastDecompressor.decompress(LZ4JNIFastDecompressor.java:39)
 at net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:245)
 ... 37 more

> LZ4 failed to decompress a stream of shuffled data
> --
>
> Key: SPARK-18105
> URL: https://issues.apache.org/jira/browse/SPARK-18105
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Davies Liu
>Priority: Major
>
> When lz4 is used to compress the shuffle files, decompression may fail with 
> "stream is corrupt"
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 92 in stage 5.0 failed 4 times, most recent failure: Lost task 92.3 in 
> stage 5.0 (TID 16616, 10.0.27.18): java.io.IOException: Stream is corrupted
>   at 
> org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:220)
>   at 
> 

[jira] [Commented] (SPARK-35240) Use CheckpointFileManager for checkpoint manipulation

2021-04-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17332996#comment-17332996
 ] 

Apache Spark commented on SPARK-35240:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/32361

> Use CheckpointFileManager for checkpoint manipulation
> -
>
> Key: SPARK-35240
> URL: https://issues.apache.org/jira/browse/SPARK-35240
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> `CheckpointFileManager` is designed to handle checkpoint file manipulation. 
> However, there are a few places that expose a FileSystem obtained from checkpoint 
> files/paths. We should use `CheckpointFileManager` to manipulate checkpoint 
> files. For example, we may want to use a different storage system for checkpoint 
> files. If all checkpoint file manipulation is performed through 
> `CheckpointFileManager`, we only need to implement `CheckpointFileManager` for 
> that storage system, and don't need to implement the FileSystem API for it.
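A sketch of the design intent with simplified, hypothetical signatures; the trait below is illustrative only, and the real `CheckpointFileManager` API in Spark is richer. The point is that checkpoint readers and writers depend on a small abstraction, so a custom checkpoint store only has to provide these operations instead of a full Hadoop FileSystem implementation.

{code:scala}
import java.io.{InputStream, OutputStream}

// Hypothetical, simplified interface used only to illustrate the ticket's point; it is
// not the real org.apache.spark.sql.execution.streaming.CheckpointFileManager.
trait SimpleCheckpointFileManager {
  def createAtomic(path: String): OutputStream  // write, then publish atomically
  def open(path: String): InputStream
  def exists(path: String): Boolean
  def delete(path: String): Unit
}

// A caller written against the abstraction never touches FileSystem directly, so it
// works unchanged whether checkpoints live on HDFS, an object store, or a database.
def writeOffsetLog(fm: SimpleCheckpointFileManager, path: String, batchId: Long): Unit = {
  val out = fm.createAtomic(path)
  try out.write(s"v1\n$batchId\n".getBytes("UTF-8"))
  finally out.close()
}
{code}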



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35240) Use CheckpointFileManager for checkpoint manipulation

2021-04-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35240:


Assignee: Apache Spark  (was: L. C. Hsieh)

> Use CheckpointFileManager for checkpoint manipulation
> -
>
> Key: SPARK-35240
> URL: https://issues.apache.org/jira/browse/SPARK-35240
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Major
>
> `CheckpointFileManager` is designed to handle checkpoint file manipulation. 
> However, there are a few places that expose a FileSystem obtained from checkpoint 
> files/paths. We should use `CheckpointFileManager` to manipulate checkpoint 
> files. For example, we may want to use a different storage system for checkpoint 
> files. If all checkpoint file manipulation is performed through 
> `CheckpointFileManager`, we only need to implement `CheckpointFileManager` for 
> that storage system, and don't need to implement the FileSystem API for it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35240) Use CheckpointFileManager for checkpoint manipulation

2021-04-27 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35240:


Assignee: L. C. Hsieh  (was: Apache Spark)

> Use CheckpointFileManager for checkpoint manipulation
> -
>
> Key: SPARK-35240
> URL: https://issues.apache.org/jira/browse/SPARK-35240
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> `CheckpointFileManager` is designed to handle checkpoint file manipulation. 
> However, there are a few places that expose a FileSystem obtained from checkpoint 
> files/paths. We should use `CheckpointFileManager` to manipulate checkpoint 
> files. For example, we may want to use a different storage system for checkpoint 
> files. If all checkpoint file manipulation is performed through 
> `CheckpointFileManager`, we only need to implement `CheckpointFileManager` for 
> that storage system, and don't need to implement the FileSystem API for it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35240) Use CheckpointFileManager for checkpoint manipulation

2021-04-27 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17332995#comment-17332995
 ] 

Apache Spark commented on SPARK-35240:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/32361

> Use CheckpointFileManager for checkpoint manipulation
> -
>
> Key: SPARK-35240
> URL: https://issues.apache.org/jira/browse/SPARK-35240
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> `CheckpointFileManager` is designed to handle checkpoint file manipulation. 
> However, there are a few places that expose a FileSystem obtained from checkpoint 
> files/paths. We should use `CheckpointFileManager` to manipulate checkpoint 
> files. For example, we may want to use a different storage system for checkpoint 
> files. If all checkpoint file manipulation is performed through 
> `CheckpointFileManager`, we only need to implement `CheckpointFileManager` for 
> that storage system, and don't need to implement the FileSystem API for it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


