[jira] [Comment Edited] (SPARK-38388) Repartition + Stage retries could lead to incorrect data

2022-03-25 Thread Jason Xu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17510162#comment-17510162
 ] 

Jason Xu edited comment on SPARK-38388 at 3/26/22, 5:32 AM:


Thank you [~mridulm80]! Wenchen also suggested propagating the deterministic 
level in the dev email thread: 
[https://lists.apache.org/thread/z5b8qssg51024nmtvk6gr2skxctl6xcm]. I'm looking 
into it.
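
For context, below is a minimal sketch of the existing RDD-level API that such a propagation would build on (this is not the proposed fix itself; the RandomValuesRDD class is hypothetical and only illustrates where DeterministicLevel.INDETERMINATE is declared):
{code:scala}
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.{DeterministicLevel, RDD}

// Hypothetical RDD whose rows contain freshly generated random values:
// a retried task produces different rows, so its output is indeterminate.
class RandomValuesRDD(prev: RDD[Long]) extends RDD[(Long, Double)](prev) {

  override def compute(split: Partition, context: TaskContext): Iterator[(Long, Double)] =
    firstParent[Long].iterator(split, context).map(id => (id, scala.util.Random.nextDouble()))

  override protected def getPartitions: Array[Partition] = firstParent[Long].partitions

  // Declaring INDETERMINATE tells the scheduler that a rerun of an upstream
  // task may regenerate different rows, so dependent stages cannot reuse
  // partial results and have to be rolled back and rerun in full.
  override protected def getOutputDeterministicLevel: DeterministicLevel.Value =
    DeterministicLevel.INDETERMINATE
}
{code}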


was (Author: kings129):
Thank you [~mridulm80] ! Wenchen also suggested to propagate the deterministic 
level in dev email thread: 
[https://lists.apache.org/thread/z5b8qssg51024nmtvk6gr2skxctl6xcm. 
|https://lists.apache.org/thread/z5b8qssg51024nmtvk6gr2skxctl6xcm.]I'm looking 
into it.

> Repartition + Stage retries could lead to incorrect data 
> -
>
> Key: SPARK-38388
> URL: https://issues.apache.org/jira/browse/SPARK-38388
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.1.1
> Environment: Spark 2.4 and 3.1
>Reporter: Jason Xu
>Priority: Major
>  Labels: correctness, data-loss
>
> Spark repartition uses RoundRobinPartitioning, whose output is 
> non-deterministic when the data has some randomness and stage/task retries happen.
> The bug can be triggered when the upstream data has some randomness, a 
> repartition is applied to it, and a result stage follows (there could be more 
> stages).
> The pattern looks like this:
> upstream stage (data with randomness) -> (repartition shuffle) -> result stage
> When an executor goes down during the result stage, some tasks of that stage may 
> have finished while others fail; the shuffle files on that executor are also 
> lost, so some tasks from the previous stages (upstream data generation, 
> repartition) need to rerun to regenerate the shuffle files they depend on.
> Because the data has some randomness, the data regenerated by the retried 
> upstream tasks is slightly different, repartition then produces an inconsistent 
> row ordering, and the retried tasks in the result stage produce different data.
> This is similar to, but different from, 
> https://issues.apache.org/jira/browse/SPARK-23207: the fix there adds an extra 
> local sort to make the row ordering deterministic, and its sorting algorithm 
> simply compares row/record hashes. In this case, however, the upstream data has 
> some randomness, so the sort does not keep the ordering stable, and 
> RoundRobinPartitioning still produces a non-deterministic result.
> The following code returns 986415 instead of 1000000:
> {code:java}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> case class TestObject(id: Long, value: Double)
> val ds = spark.range(0, 1000 * 1000, 1).repartition(100, 
> $"id").withColumn("val", rand()).repartition(100).map { 
>   row => if (TaskContext.get.stageAttemptNumber == 0 && 
> TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId > 97) {
> throw new Exception("pkill -f java".!!)
>   }
>   TestObject(row.getLong(0), row.getDouble(1))
> }
> ds.toDF("id", "value").write.mode("overwrite").saveAsTable("tmp.test_table")
> spark.sql("select count(distinct id) from tmp.test_table").show{code}
> Command: 
> {code:java}
> spark-shell --num-executors 10 (--conf spark.dynamicAllocation.enabled=false 
> --conf spark.shuffle.service.enabled=false){code}
> To simulate the issue, disabling the external shuffle service is needed (if it 
> is enabled by default in your environment); this is what triggers the shuffle 
> file loss and the retries of the previous stages.
> In our production environment, we have the external shuffle service enabled; 
> this data correctness issue happened when there were node losses.
> Although there is some non-deterministic factor in the upstream data, users 
> wouldn't expect to see an incorrect result.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38336) Catalyst changes for DEFAULT column support

2022-03-25 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-38336:
---
Affects Version/s: 3.4.0
   (was: 3.2.1)

> Catalyst changes for DEFAULT column support
> ---
>
> Key: SPARK-38336
> URL: https://issues.apache.org/jira/browse/SPARK-38336
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 3.4.0
>Reporter: Daniel
>Assignee: Daniel
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38336) Catalyst changes for DEFAULT column support

2022-03-25 Thread Gengliang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512670#comment-17512670
 ] 

Gengliang Wang commented on SPARK-38336:


[~dtenedor] the affected version should be 3.4.0 

> Catalyst changes for DEFAULT column support
> ---
>
> Key: SPARK-38336
> URL: https://issues.apache.org/jira/browse/SPARK-38336
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 3.4.0
>Reporter: Daniel
>Assignee: Daniel
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38336) Catalyst changes for DEFAULT column support

2022-03-25 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-38336:
--

Assignee: Daniel

> Catalyst changes for DEFAULT column support
> ---
>
> Key: SPARK-38336
> URL: https://issues.apache.org/jira/browse/SPARK-38336
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 3.2.1
>Reporter: Daniel
>Assignee: Daniel
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38336) Catalyst changes for DEFAULT column support

2022-03-25 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-38336.

Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 35855
[https://github.com/apache/spark/pull/35855]

> Catalyst changes for DEFAULT column support
> ---
>
> Key: SPARK-38336
> URL: https://issues.apache.org/jira/browse/SPARK-38336
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 3.2.1
>Reporter: Daniel
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38652) K8S IT Test DepsTestsSuite blocks with PathIOException in hadoop-aws-3.3.2

2022-03-25 Thread qian (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512657#comment-17512657
 ] 

qian commented on SPARK-38652:
--

[~ste...@apache.org] No. I can do it, which will help us confirm whether the 
cause of the problem is minio or hadoop-aws-3.3.2. I will share the test result 
here.

> K8S IT Test DepsTestsSuite blocks with PathIOException in hadoop-aws-3.3.2
> --
>
> Key: SPARK-38652
> URL: https://issues.apache.org/jira/browse/SPARK-38652
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: qian
>Priority: Major
>
> DepsTestsSuite in the k8s IT tests is blocked with a PathIOException in 
> hadoop-aws-3.3.2. The exception message is as follows:
> {code:java}
> Exception in thread "main" org.apache.spark.SparkException: Uploading file 
> /Users/hengzhen.sq/IdeaProjects/spark/dist/examples/jars/spark-examples_2.12-3.4.0-SNAPSHOT.jar
>  failed...
> at 
> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:332)
> 
> at 
> org.apache.spark.deploy.k8s.KubernetesUtils$.$anonfun$uploadAndTransformFileUris$1(KubernetesUtils.scala:277)
> 
> at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) 
>
> at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)   
>  
> at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)  
>   
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
> at scala.collection.TraversableLike.map(TraversableLike.scala:286)
> at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
> at scala.collection.AbstractTraversable.map(Traversable.scala:108)
> at 
> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadAndTransformFileUris(KubernetesUtils.scala:275)
> 
> at 
> org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.$anonfun$getAdditionalPodSystemProperties$1(BasicDriverFeatureStep.scala:187)
>
> at scala.collection.immutable.List.foreach(List.scala:431)
> at 
> org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.getAdditionalPodSystemProperties(BasicDriverFeatureStep.scala:178)
> 
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$5(KubernetesDriverBuilder.scala:86)
> at 
> scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
> 
> at 
> scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)   
>  
> at scala.collection.immutable.List.foldLeft(List.scala:91)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:84)
> 
> at 
> org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:104)
> 
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$5(KubernetesClientApplication.scala:248)
> 
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$5$adapted(KubernetesClientApplication.scala:242)
> at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2738)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:242)
> 
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:214)
> 
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958)
> 
> at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180) 
>
> at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
> at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
> at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)  
>   
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)Caused by: 
> org.apache.spark.SparkException: Error uploading file 
> spark-examples_2.12-3.4.0-SNAPSHOT.jar
> at 
> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileToHadoopCompatibleFS(KubernetesUtils.scala:355)
> 
> at 
> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:328)
> 
> ... 30 more
> Caused by: org.apache.hadoop.fs.PathIOException: `Cannot get relative path 
> for 
> URI:file:///Users/hengzhen.sq/IdeaProjects/spark/dist/examples/jars/spark-examples_2.12-3.4.0-SNAPSHOT.jar':
>  Input/output error
> at 
> org.apache.hadoop.fs.s3a.impl.CopyFromLocalOperation.getFinalPath(CopyFromLocalOperation.java:365)
>  

[jira] [Updated] (SPARK-38655) OffsetWindowFunctionFrameBase cannot find the offset row whose input is not null

2022-03-25 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-38655:
---
Summary: OffsetWindowFunctionFrameBase cannot find the offset row whose 
input is not null  (was: UnboundedPrecedingOffsetWindowFunctionFrame cannot 
find the offset row whose input is not null)

> OffsetWindowFunctionFrameBase cannot find the offset row whose input is not 
> null
> 
>
> Key: SPARK-38655
> URL: https://issues.apache.org/jira/browse/SPARK-38655
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> select x, nth_value(x, 5) IGNORE NULLS over (order by x rows between 
> unbounded preceding and current row)
> from (select explode(sequence(1, 3)) x)
> returns
> null
> null
> 3
> But it should return
> null
> null
> null



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38662) Spark loses k8s auth after some time

2022-03-25 Thread Alex (Jira)
Alex created SPARK-38662:


 Summary: Spark loses k8s auth after some time
 Key: SPARK-38662
 URL: https://issues.apache.org/jira/browse/SPARK-38662
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 3.2.1
Reporter: Alex


Spark starts to fail with the error listed below after it has been running for some time:
{noformat}
[2022-03-25 17:11:12,706] INFO  (Logging.scala:57) - Adding decommission script to lifecycle
[2022-03-25 17:11:12,712] WARN  (Logging.scala:90) - Exception when notifying snapshot subscriber.
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST 
at: https://cluster_endpoint/api/v1/namespaces/spark/pods. Message: 
Unauthorized! Token may have expired! Please log-in again. Unauthorized.
        at 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:639)
        at 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:576)
        at 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:543)
        at 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:504)
        at 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:292)
 
        at 
io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:893)
        at 
io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:372)
        at 
io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:86)
        at 
org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$requestNewExecutors$1(ExecutorPodsAllocator.scala:400)
        at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
        at 
org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.requestNewExecutors(ExecutorPodsAllocator.scala:382)
        at 
org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36(ExecutorPodsAllocator.scala:346)
        at 
org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$onNewSnapshots$36$adapted(ExecutorPodsAllocator.scala:339)
        at 
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at 
org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.onNewSnapshots(ExecutorPodsAllocator.scala:339)
        at 
org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$start$3(ExecutorPodsAllocator.scala:117)
        at 
org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$start$3$adapted(ExecutorPodsAllocator.scala:117)
        at 
org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl$SnapshotsSubscriber.org$apache$spark$scheduler$cluster$k8s$ExecutorPodsSnapshotsStoreImpl$SnapshotsSubscriber$$processSnapshotsInternal(ExecutorPodsSnapshotsStoreImpl.scala:138)
        at 
org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl$SnapshotsSubscriber.processSnapshots(ExecutorPodsSnapshotsStoreImpl.scala:126)
        at 
org.apache.spark.scheduler.cluster.k8s.ExecutorPodsSnapshotsStoreImpl.$anonfun$addSubscriber$1(ExecutorPodsSnapshotsStoreImpl.scala:81)
        at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at 
java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
        at 
java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834){noformat}

This doesn't reproduce on 3.1.1 with the same configs, environment and workload.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38660) PySpark DeprecationWarning: distutils Version classes are deprecated

2022-03-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512595#comment-17512595
 ] 

Apache Spark commented on SPARK-38660:
--

User 'kianelbo' has created a pull request for this issue:
https://github.com/apache/spark/pull/35977

> PySpark DeprecationWarning: distutils Version classes are deprecated
> 
>
> Key: SPARK-38660
> URL: https://issues.apache.org/jira/browse/SPARK-38660
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Gergely Kalmar
>Priority: Major
>
> When executing spark.read.csv(f'\{gcs_bucket}/\{data_file}', 
> inferSchema=True, header=True) I'm getting the following warning:
> {noformat}
> .../lib/python3.8/site-packages/pyspark/sql/pandas/conversion.py:62: in 
> toPandas
> require_minimum_pandas_version()
> .../lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:35: in 
> require_minimum_pandas_version
> if LooseVersion(pandas.__version__) < 
> LooseVersion(minimum_pandas_version):
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> self = <[AttributeError("'LooseVersion' object has no attribute 'vstring'") 
> raised in repr()] LooseVersion object at 0x7f2319fc0f70>, vstring = '1.4.1'
> def __init__ (self, vstring=None):
> >   warnings.warn(
> "distutils Version classes are deprecated. "
> "Use packaging.version instead.",
> DeprecationWarning,
> stacklevel=2,
> )
> E   DeprecationWarning: distutils Version classes are deprecated. Use 
> packaging.version instead.
> .../lib/python3.8/site-packages/setuptools/_distutils/version.py:53: 
> DeprecationWarning
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38660) PySpark DeprecationWarning: distutils Version classes are deprecated

2022-03-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512594#comment-17512594
 ] 

Apache Spark commented on SPARK-38660:
--

User 'kianelbo' has created a pull request for this issue:
https://github.com/apache/spark/pull/35977

> PySpark DeprecationWarning: distutils Version classes are deprecated
> 
>
> Key: SPARK-38660
> URL: https://issues.apache.org/jira/browse/SPARK-38660
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Gergely Kalmar
>Priority: Major
>
> When executing spark.read.csv(f'\{gcs_bucket}/\{data_file}', 
> inferSchema=True, header=True) I'm getting the following warning:
> {noformat}
> .../lib/python3.8/site-packages/pyspark/sql/pandas/conversion.py:62: in 
> toPandas
> require_minimum_pandas_version()
> .../lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:35: in 
> require_minimum_pandas_version
> if LooseVersion(pandas.__version__) < 
> LooseVersion(minimum_pandas_version):
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> self = <[AttributeError("'LooseVersion' object has no attribute 'vstring'") 
> raised in repr()] LooseVersion object at 0x7f2319fc0f70>, vstring = '1.4.1'
> def __init__ (self, vstring=None):
> >   warnings.warn(
> "distutils Version classes are deprecated. "
> "Use packaging.version instead.",
> DeprecationWarning,
> stacklevel=2,
> )
> E   DeprecationWarning: distutils Version classes are deprecated. Use 
> packaging.version instead.
> .../lib/python3.8/site-packages/setuptools/_distutils/version.py:53: 
> DeprecationWarning
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38660) PySpark DeprecationWarning: distutils Version classes are deprecated

2022-03-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38660:


Assignee: Apache Spark

> PySpark DeprecationWarning: distutils Version classes are deprecated
> 
>
> Key: SPARK-38660
> URL: https://issues.apache.org/jira/browse/SPARK-38660
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Gergely Kalmar
>Assignee: Apache Spark
>Priority: Major
>
> When executing spark.read.csv(f'\{gcs_bucket}/\{data_file}', 
> inferSchema=True, header=True) I'm getting the following warning:
> {noformat}
> .../lib/python3.8/site-packages/pyspark/sql/pandas/conversion.py:62: in 
> toPandas
> require_minimum_pandas_version()
> .../lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:35: in 
> require_minimum_pandas_version
> if LooseVersion(pandas.__version__) < 
> LooseVersion(minimum_pandas_version):
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> self = <[AttributeError("'LooseVersion' object has no attribute 'vstring'") 
> raised in repr()] LooseVersion object at 0x7f2319fc0f70>, vstring = '1.4.1'
> def __init__ (self, vstring=None):
> >   warnings.warn(
> "distutils Version classes are deprecated. "
> "Use packaging.version instead.",
> DeprecationWarning,
> stacklevel=2,
> )
> E   DeprecationWarning: distutils Version classes are deprecated. Use 
> packaging.version instead.
> .../lib/python3.8/site-packages/setuptools/_distutils/version.py:53: 
> DeprecationWarning
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38660) PySpark DeprecationWarning: distutils Version classes are deprecated

2022-03-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38660:


Assignee: (was: Apache Spark)

> PySpark DeprecationWarning: distutils Version classes are deprecated
> 
>
> Key: SPARK-38660
> URL: https://issues.apache.org/jira/browse/SPARK-38660
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1
>Reporter: Gergely Kalmar
>Priority: Major
>
> When executing spark.read.csv(f'\{gcs_bucket}/\{data_file}', 
> inferSchema=True, header=True) I'm getting the following warning:
> {noformat}
> .../lib/python3.8/site-packages/pyspark/sql/pandas/conversion.py:62: in 
> toPandas
> require_minimum_pandas_version()
> .../lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:35: in 
> require_minimum_pandas_version
> if LooseVersion(pandas.__version__) < 
> LooseVersion(minimum_pandas_version):
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> self = <[AttributeError("'LooseVersion' object has no attribute 'vstring'") 
> raised in repr()] LooseVersion object at 0x7f2319fc0f70>, vstring = '1.4.1'
> def __init__ (self, vstring=None):
> >   warnings.warn(
> "distutils Version classes are deprecated. "
> "Use packaging.version instead.",
> DeprecationWarning,
> stacklevel=2,
> )
> E   DeprecationWarning: distutils Version classes are deprecated. Use 
> packaging.version instead.
> .../lib/python3.8/site-packages/setuptools/_distutils/version.py:53: 
> DeprecationWarning
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38661) [TESTS] Replace 'abc & Symbol("abc") symbols with $"abc" in tests

2022-03-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512577#comment-17512577
 ] 

Apache Spark commented on SPARK-38661:
--

User 'martin-g' has created a pull request for this issue:
https://github.com/apache/spark/pull/35976

> [TESTS] Replace 'abc & Symbol("abc") symbols with $"abc" in tests
> -
>
> Key: SPARK-38661
> URL: https://issues.apache.org/jira/browse/SPARK-38661
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.2.1
>Reporter: Martin Tzvetanov Grigorov
>Assignee: Martin Tzvetanov Grigorov
>Priority: Minor
> Fix For: 3.3.0
>
>
> This ticket is a follow up of SPARK-38351.
>  
> When building with Scala 2.13 many test classes produce warnings like:
> {code:java}
> [warn] 
> /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/BaseScriptTransformationSuite.scala:562:11:
>  [deprecation @  | origin= | version=2.13.0] symbol literal is deprecated; 
> use Symbol("d") instead
> [warn]   'd.cast("string"),
> [warn]   ^
> [warn] 
> /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/BaseScriptTransformationSuite.scala:563:11:
>  [deprecation @  | origin= | version=2.13.0] symbol literal is deprecated; 
> use Symbol("e") instead
> [warn]   'e.cast("string")).collect())
>  {code}
> For easier migration to Scala 3.x later, it would be good to fix these warnings!
>  
> Also as suggested by [https://github.com/HeartSaVioR] it would be good to use 
> Spark's $"abc" syntax for columns.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38661) [TESTS] Replace 'abc & Symbol("abc") symbols with $"abc" in tests

2022-03-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38661:


Assignee: Apache Spark  (was: Martin Tzvetanov Grigorov)

> [TESTS] Replace 'abc & Symbol("abc") symbols with $"abc" in tests
> -
>
> Key: SPARK-38661
> URL: https://issues.apache.org/jira/browse/SPARK-38661
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.2.1
>Reporter: Martin Tzvetanov Grigorov
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.3.0
>
>
> This ticket is a follow up of SPARK-38351.
>  
> When building with Scala 2.13 many test classes produce warnings like:
> {code:java}
> [warn] 
> /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/BaseScriptTransformationSuite.scala:562:11:
>  [deprecation @  | origin= | version=2.13.0] symbol literal is deprecated; 
> use Symbol("d") instead
> [warn]   'd.cast("string"),
> [warn]   ^
> [warn] 
> /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/BaseScriptTransformationSuite.scala:563:11:
>  [deprecation @  | origin= | version=2.13.0] symbol literal is deprecated; 
> use Symbol("e") instead
> [warn]   'e.cast("string")).collect())
>  {code}
> For easier migration to Scala 3.x later, it would be good to fix these warnings!
>  
> Also as suggested by [https://github.com/HeartSaVioR] it would be good to use 
> Spark's $"abc" syntax for columns.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38661) [TESTS] Replace 'abc & Symbol("abc") symbols with $"abc" in tests

2022-03-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38661:


Assignee: Martin Tzvetanov Grigorov  (was: Apache Spark)

> [TESTS] Replace 'abc & Symbol("abc") symbols with $"abc" in tests
> -
>
> Key: SPARK-38661
> URL: https://issues.apache.org/jira/browse/SPARK-38661
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.2.1
>Reporter: Martin Tzvetanov Grigorov
>Assignee: Martin Tzvetanov Grigorov
>Priority: Minor
> Fix For: 3.3.0
>
>
> This ticket is a follow up of SPARK-38351.
>  
> When building with Scala 2.13 many test classes produce warnings like:
> {code:java}
> [warn] 
> /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/BaseScriptTransformationSuite.scala:562:11:
>  [deprecation @  | origin= | version=2.13.0] symbol literal is deprecated; 
> use Symbol("d") instead
> [warn]   'd.cast("string"),
> [warn]   ^
> [warn] 
> /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/BaseScriptTransformationSuite.scala:563:11:
>  [deprecation @  | origin= | version=2.13.0] symbol literal is deprecated; 
> use Symbol("e") instead
> [warn]   'e.cast("string")).collect())
>  {code}
> For easier migration to Scala 3.x later, it would be good to fix these warnings!
>  
> Also as suggested by [https://github.com/HeartSaVioR] it would be good to use 
> Spark's $"abc" syntax for columns.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38661) [TESTS] Replace 'abc & Symbol("abc") symbols with $"abc" in tests

2022-03-25 Thread Martin Tzvetanov Grigorov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Tzvetanov Grigorov updated SPARK-38661:
--
Description: 
This ticket is a follow up of SPARK-38351.

 

When building with Scala 2.13 many test classes produce warnings like:
{code:java}
[warn] 
/home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/BaseScriptTransformationSuite.scala:562:11:
 [deprecation @  | origin= | version=2.13.0] symbol literal is deprecated; use 
Symbol("d") instead
[warn]   'd.cast("string"),
[warn]   ^
[warn] 
/home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/BaseScriptTransformationSuite.scala:563:11:
 [deprecation @  | origin= | version=2.13.0] symbol literal is deprecated; use 
Symbol("e") instead
[warn]   'e.cast("string")).collect())
 {code}
For easier migration to Scala 3.x later, it would be good to fix these warnings!

 

Also as suggested by [https://github.com/HeartSaVioR] it would be good to use 
Spark's $"abc" syntax for columns.

  was:
When building with Scala 2.13 many test classes produce warnings like:
{code:java}
[warn] 
/home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/BaseScriptTransformationSuite.scala:562:11:
 [deprecation @  | origin= | version=2.13.0] symbol literal is deprecated; use 
Symbol("d") instead
[warn]   'd.cast("string"),
[warn]   ^
[warn] 
/home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/BaseScriptTransformationSuite.scala:563:11:
 [deprecation @  | origin= | version=2.13.0] symbol literal is deprecated; use 
Symbol("e") instead
[warn]   'e.cast("string")).collect())
 {code}
For easier migration to Scala 3.x later it would be good to fix this warnings!


> [TESTS] Replace 'abc & Symbol("abc") symbols with $"abc" in tests
> -
>
> Key: SPARK-38661
> URL: https://issues.apache.org/jira/browse/SPARK-38661
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.2.1
>Reporter: Martin Tzvetanov Grigorov
>Assignee: Martin Tzvetanov Grigorov
>Priority: Minor
> Fix For: 3.3.0
>
>
> This ticket is a follow up of SPARK-38351.
>  
> When building with Scala 2.13 many test classes produce warnings like:
> {code:java}
> [warn] 
> /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/BaseScriptTransformationSuite.scala:562:11:
>  [deprecation @  | origin= | version=2.13.0] symbol literal is deprecated; 
> use Symbol("d") instead
> [warn]   'd.cast("string"),
> [warn]   ^
> [warn] 
> /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/BaseScriptTransformationSuite.scala:563:11:
>  [deprecation @  | origin= | version=2.13.0] symbol literal is deprecated; 
> use Symbol("e") instead
> [warn]   'e.cast("string")).collect())
>  {code}
> For easier migration to Scala 3.x later, it would be good to fix these warnings!
>  
> Also as suggested by [https://github.com/HeartSaVioR] it would be good to use 
> Spark's $"abc" syntax for columns.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38661) [TESTS] Replace 'abc & Symbol("abc") symbols with $"abc" in tests

2022-03-25 Thread Martin Tzvetanov Grigorov (Jira)
Martin Tzvetanov Grigorov created SPARK-38661:
-

 Summary: [TESTS] Replace 'abc & Symbol("abc") symbols with $"abc" 
in tests
 Key: SPARK-38661
 URL: https://issues.apache.org/jira/browse/SPARK-38661
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Affects Versions: 3.2.1
Reporter: Martin Tzvetanov Grigorov
Assignee: Martin Tzvetanov Grigorov
 Fix For: 3.3.0


When building with Scala 2.13 many test classes produce warnings like:
{code:java}
[warn] 
/home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/BaseScriptTransformationSuite.scala:562:11:
 [deprecation @  | origin= | version=2.13.0] symbol literal is deprecated; use 
Symbol("d") instead
[warn]   'd.cast("string"),
[warn]   ^
[warn] 
/home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/BaseScriptTransformationSuite.scala:563:11:
 [deprecation @  | origin= | version=2.13.0] symbol literal is deprecated; use 
Symbol("e") instead
[warn]   'e.cast("string")).collect())
 {code}
For easier migration to Scala 3.x later, it would be good to fix these warnings!
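
As a small illustration of the proposed replacement (not taken from the actual pull request; df and the column names are placeholders, and spark.implicits._ is assumed to be in scope as it is in most of these test suites):
{code:scala}
import spark.implicits._  // brings the $"..." column interpolator into scope

// Before: Scala symbol literals, deprecated since Scala 2.13
df.select('d.cast("string"), 'e.cast("string"))

// After: Spark's string interpolator syntax for columns
df.select($"d".cast("string"), $"e".cast("string"))
{code}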



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38219) Support ANSI aggregation function percentile_cont as window function

2022-03-25 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-38219:


Assignee: jiaan.geng

> Support ANSI aggregation function percentile_cont as window function
> 
>
> Key: SPARK-38219
> URL: https://issues.apache.org/jira/browse/SPARK-38219
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>
> percentile_cont is an aggregate function; some databases support it as a 
> window function.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38219) Support ANSI aggregation function percentile_cont as window function

2022-03-25 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-38219.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 35531
[https://github.com/apache/spark/pull/35531]

> Support ANSI aggregation function percentile_cont as window function
> 
>
> Key: SPARK-38219
> URL: https://issues.apache.org/jira/browse/SPARK-38219
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.4.0
>
>
> percentile_cont is an aggregate function; some databases support it as a 
> window function.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2868) Support named accumulators in Python

2022-03-25 Thread Rafal Wojdyla (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-2868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512540#comment-17512540
 ] 

Rafal Wojdyla commented on SPARK-2868:
--

Is there a better issue to track the work on named accumulators in pyspark? Is 
it still the case that named accumulators do not work in pyspark and it's not 
possible to see pyspark accumulators in the web UI? Would appreciate your 
feedback [~pwendell] [~holden] [~heathkh], please.

> Support named accumulators in Python
> 
>
> Key: SPARK-2868
> URL: https://issues.apache.org/jira/browse/SPARK-2868
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Patrick Wendell
>Priority: Major
>  Labels: bulk-closed
>
> SPARK-2380 added this for Java/Scala. To allow this in Python we'll need to 
> make some additional changes. One potential path is to have a 1:1 
> correspondence with Scala accumulators (instead of a one-to-many). A 
> challenge is exposing the stringified values of the accumulators to the Scala 
> code.
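> 
> For reference, here is a minimal sketch of what a named accumulator looks like on the Scala side today (using the current SparkContext API; the accumulator name and counts are just examples). The name is what makes the accumulator visible in the web UI, which is the part this ticket asks to expose for Python:
> {code:scala}
> val sc = spark.sparkContext
> 
> // A named accumulator shows up, with its value, in the stage details of the web UI.
> val parsedRecords = sc.longAccumulator("parsedRecords")
> 
> sc.parallelize(1 to 1000).foreach(_ => parsedRecords.add(1))
> println(parsedRecords.value)  // 1000
> {code}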



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37618) Support cleaning up shuffle blocks from external shuffle service

2022-03-25 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-37618.
---
Fix Version/s: 3.3.0
 Assignee: Adam Binford
   Resolution: Fixed

> Support cleaning up shuffle blocks from external shuffle service
> 
>
> Key: SPARK-37618
> URL: https://issues.apache.org/jira/browse/SPARK-37618
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 3.2.0
>Reporter: Adam Binford
>Assignee: Adam Binford
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently shuffle data is not cleaned up when an external shuffle service is 
> used and the associated executor has been deallocated before the shuffle is 
> cleaned up. Shuffle data is only cleaned up once the application ends.
> There have been various issues filed for this:
> https://issues.apache.org/jira/browse/SPARK-26020
> https://issues.apache.org/jira/browse/SPARK-17233
> https://issues.apache.org/jira/browse/SPARK-4236
> But shuffle files will still stick around until an application completes. 
> Dynamic allocation is commonly used for long running jobs (such as structured 
> streaming), so any long running jobs with a large shuffle involved will 
> eventually fill up local disk space. The shuffle service already supports 
> cleaning up shuffle service persisted RDDs, so it should be able to support 
> cleaning up shuffle blocks as well once the shuffle is removed by the 
> ContextCleaner. 
> The current alternative is to use shuffle tracking instead of an external 
> shuffle service, but this is less optimal from a resource perspective as all 
> executors must be kept alive until the shuffle has been fully consumed and 
> cleaned up (and with the default GC interval being 30 minutes this can waste 
> a lot of time with executors held onto but not doing anything).
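> 
> For reference, a rough sketch of that shuffle-tracking alternative (the configuration keys below are the standard dynamic allocation and cleaner settings; the values are only illustrative):
> {code:java}
> spark-submit \
>   --conf spark.dynamicAllocation.enabled=true \
>   --conf spark.shuffle.service.enabled=false \
>   --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
>   --conf spark.cleaner.periodicGC.interval=30min \
>   ...
> {code}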



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26639) The reuse subquery function maybe does not work in SPARK SQL

2022-03-25 Thread Stu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512501#comment-17512501
 ] 

Stu commented on SPARK-26639:
-

Ah, thanks for sharing that [~petertoth] !

> The reuse subquery function maybe does not work in SPARK SQL
> 
>
> Key: SPARK-26639
> URL: https://issues.apache.org/jira/browse/SPARK-26639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Ke Jia
>Priority: Major
>
> The subquery reuse feature was implemented in 
> [https://github.com/apache/spark/pull/14548]
> In my test, I found the visualized plan does show the subquery being executed 
> once, but the stage for the same subquery may execute more than once.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38652) K8S IT Test DepsTestsSuite blocks with PathIOException in hadoop-aws-3.3.2

2022-03-25 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512458#comment-17512458
 ] 

Steve Loughran commented on SPARK-38652:


have you tried running the same suite against an aws s3 endpoint?

> K8S IT Test DepsTestsSuite blocks with PathIOException in hadoop-aws-3.3.2
> --
>
> Key: SPARK-38652
> URL: https://issues.apache.org/jira/browse/SPARK-38652
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.0
>Reporter: qian
>Priority: Major
>
> DepsTestsSuite in the k8s IT tests is blocked with a PathIOException in 
> hadoop-aws-3.3.2. The exception message is as follows:
> {code:java}
> Exception in thread "main" org.apache.spark.SparkException: Uploading file 
> /Users/hengzhen.sq/IdeaProjects/spark/dist/examples/jars/spark-examples_2.12-3.4.0-SNAPSHOT.jar
>  failed...
> at 
> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:332)
> 
> at 
> org.apache.spark.deploy.k8s.KubernetesUtils$.$anonfun$uploadAndTransformFileUris$1(KubernetesUtils.scala:277)
> 
> at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) 
>
> at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)   
>  
> at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)  
>   
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
> at scala.collection.TraversableLike.map(TraversableLike.scala:286)
> at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
> at scala.collection.AbstractTraversable.map(Traversable.scala:108)
> at 
> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadAndTransformFileUris(KubernetesUtils.scala:275)
> 
> at 
> org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.$anonfun$getAdditionalPodSystemProperties$1(BasicDriverFeatureStep.scala:187)
>
> at scala.collection.immutable.List.foreach(List.scala:431)
> at 
> org.apache.spark.deploy.k8s.features.BasicDriverFeatureStep.getAdditionalPodSystemProperties(BasicDriverFeatureStep.scala:178)
> 
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.$anonfun$buildFromFeatures$5(KubernetesDriverBuilder.scala:86)
> at 
> scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
> 
> at 
> scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)   
>  
> at scala.collection.immutable.List.foldLeft(List.scala:91)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesDriverBuilder.buildFromFeatures(KubernetesDriverBuilder.scala:84)
> 
> at 
> org.apache.spark.deploy.k8s.submit.Client.run(KubernetesClientApplication.scala:104)
> 
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$5(KubernetesClientApplication.scala:248)
> 
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.$anonfun$run$5$adapted(KubernetesClientApplication.scala:242)
> at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2738)
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.run(KubernetesClientApplication.scala:242)
> 
> at 
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication.start(KubernetesClientApplication.scala:214)
> 
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958)
> 
> at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180) 
>
> at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
> at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
> at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)  
>   
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)Caused by: 
> org.apache.spark.SparkException: Error uploading file 
> spark-examples_2.12-3.4.0-SNAPSHOT.jar
> at 
> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileToHadoopCompatibleFS(KubernetesUtils.scala:355)
> 
> at 
> org.apache.spark.deploy.k8s.KubernetesUtils$.uploadFileUri(KubernetesUtils.scala:328)
> 
> ... 30 more
> Caused by: org.apache.hadoop.fs.PathIOException: `Cannot get relative path 
> for 
> URI:file:///Users/hengzhen.sq/IdeaProjects/spark/dist/examples/jars/spark-examples_2.12-3.4.0-SNAPSHOT.jar':
>  Input/output error
> at 
> org.apache.hadoop.fs.s3a.impl.CopyFromLocalOperation.getFinalPath(CopyFromLocalOperation.java:365)
> 
> at 
> 

[jira] [Created] (SPARK-38659) PySpark ResourceWarning: unclosed socket

2022-03-25 Thread Gergely Kalmar (Jira)
Gergely Kalmar created SPARK-38659:
--

 Summary: PySpark ResourceWarning: unclosed socket
 Key: SPARK-38659
 URL: https://issues.apache.org/jira/browse/SPARK-38659
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.2.1
Reporter: Gergely Kalmar


Create a file called `spark.py` with the following contents:

```
from pyspark.sql import SparkSession

with SparkSession.builder.getOrCreate() as spark:
    spark.read.csv('test.csv').collect()
```

You can also create a `test.csv` file with whatever data in it. When executing 
`python -Wall spark.py` I get the following warning:

```
/usr/lib/python3.8/socket.py:740: ResourceWarning: unclosed 
  self._sock = None
ResourceWarning: Enable tracemalloc to get the object allocation traceback
```



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38660) PySpark DeprecationWarning: distutils Version classes are deprecated

2022-03-25 Thread Gergely Kalmar (Jira)
Gergely Kalmar created SPARK-38660:
--

 Summary: PySpark DeprecationWarning: distutils Version classes are 
deprecated
 Key: SPARK-38660
 URL: https://issues.apache.org/jira/browse/SPARK-38660
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.2.1
Reporter: Gergely Kalmar


When executing spark.read.csv(f'\{gcs_bucket}/\{data_file}', inferSchema=True, 
header=True) I'm getting the following warning:
{noformat}
.../lib/python3.8/site-packages/pyspark/sql/pandas/conversion.py:62: in toPandas
require_minimum_pandas_version()
.../lib/python3.8/site-packages/pyspark/sql/pandas/utils.py:35: in 
require_minimum_pandas_version
if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <[AttributeError("'LooseVersion' object has no attribute 'vstring'") 
raised in repr()] LooseVersion object at 0x7f2319fc0f70>, vstring = '1.4.1'

def __init__ (self, vstring=None):
>   warnings.warn(
"distutils Version classes are deprecated. "
"Use packaging.version instead.",
DeprecationWarning,
stacklevel=2,
)
E   DeprecationWarning: distutils Version classes are deprecated. Use 
packaging.version instead.

.../lib/python3.8/site-packages/setuptools/_distutils/version.py:53: 
DeprecationWarning
{noformat}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38654) Show default index type in SQL plans for pandas API on Spark

2022-03-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-38654.
--
Fix Version/s: 3.3.0
 Assignee: Hyukjin Kwon
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/35968

> Show default index type in SQL plans for pandas API on Spark
> 
>
> Key: SPARK-38654
> URL: https://issues.apache.org/jira/browse/SPARK-38654
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.3.0
>
>
> Currently, it's difficult for users to tell which plans and expressions are 
> for the default index from the explain API.
> We should mark and show which plan/expression is for the default index in the 
> pandas API on Spark.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28330) ANSI SQL: Top-level <result offset clause> in <query expression>

2022-03-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512350#comment-17512350
 ] 

Apache Spark commented on SPARK-28330:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/35975

> ANSI SQL: Top-level <result offset clause> in <query expression>
> 
>
> Key: SPARK-28330
> URL: https://issues.apache.org/jira/browse/SPARK-28330
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> h2. {{LIMIT}} and {{OFFSET}}
> LIMIT and OFFSET allow you to retrieve just a portion of the rows that are 
> generated by the rest of the query:
> {noformat}
> SELECT select_list
> FROM table_expression
> [ ORDER BY ... ]
> [ LIMIT { number | ALL } ] [ OFFSET number ]
> {noformat}
> If a limit count is given, no more than that many rows will be returned (but 
> possibly fewer, if the query itself yields fewer rows). LIMIT ALL is the same 
> as omitting the LIMIT clause, as is LIMIT with a NULL argument.
> OFFSET says to skip that many rows before beginning to return rows. OFFSET 0 
> is the same as omitting the OFFSET clause, as is OFFSET with a NULL argument.
> If both OFFSET and LIMIT appear, then OFFSET rows are skipped before starting 
> to count the LIMIT rows that are returned.
> https://www.postgresql.org/docs/11/queries-limit.html
> *Feature ID*: F861



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38645) Support `spark.sql.codegen.cleanedSourcePrint` flag to print Codegen cleanedSource

2022-03-25 Thread tonydoen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512357#comment-17512357
 ] 

tonydoen commented on SPARK-38645:
--

[https://github.com/apache/spark/pull/35963]

 

> Support `spark.sql.codegen.cleanedSourcePrint` flag to print Codegen 
> cleanedSource
> --
>
> Key: SPARK-38645
> URL: https://issues.apache.org/jira/browse/SPARK-38645
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: tonydoen
>Priority: Trivial
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> When we use spark-sql and run into problems in the codegen source, we often 
> have to change the log level to DEBUG, but that mode produces too many logs.
>  
> A `spark.sql.codegen.cleanedSourcePrint` flag would ensure that only the 
> codegen source is printed.
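> 
> For comparison, a minimal sketch of how the generated source can already be dumped today without changing log levels (this uses the existing debug helpers; the query itself is just an example):
> {code:scala}
> import org.apache.spark.sql.execution.debug._
> 
> // Prints the generated Java source for each WholeStageCodegen subtree.
> spark.range(10).selectExpr("id + 1 AS x").debugCodegen()
> {code}
> The proposed flag would make similar output available through normal logging instead of an explicit API call.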



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38569) external top-level directory is problematic for bazel

2022-03-25 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-38569:
---

Assignee: Alkis Evlogimenos

> external top-level directory is problematic for bazel
> -
>
> Key: SPARK-38569
> URL: https://issues.apache.org/jira/browse/SPARK-38569
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.1
>Reporter: Alkis Evlogimenos
>Assignee: Alkis Evlogimenos
>Priority: Minor
>  Labels: build
> Fix For: 3.4.0
>
>
> {{external}} is a hardwired special name for top-level directories in 
> [bazel|https://bazel.build/]. This causes all sorts of issues with both 
> native/basic bazel and extensions like 
> [bazel-compile-commands-extractor|https://github.com/hedronvision/bazel-compile-commands-extractor].
>  Spark forks that use bazel to build Spark have to go through hoops to make 
> things work, if they can at all.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38569) external top-level directory is problematic for bazel

2022-03-25 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-38569.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 35874
[https://github.com/apache/spark/pull/35874]

> external top-level directory is problematic for bazel
> -
>
> Key: SPARK-38569
> URL: https://issues.apache.org/jira/browse/SPARK-38569
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.1
>Reporter: Alkis Evlogimenos
>Priority: Minor
>  Labels: build
> Fix For: 3.4.0
>
>
> {{external}} is a hardwired special name for top-level directories in 
> [bazel|https://bazel.build/]. This causes all sorts of issues with both 
> native/basic bazel and extensions like 
> [bazel-compile-commands-extractor|https://github.com/hedronvision/bazel-compile-commands-extractor].
>  Spark forks that use bazel to build Spark have to go through hoops to make 
> things work, if they can at all.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38644) DS V2 topN push-down supports project with alias

2022-03-25 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-38644:

Fix Version/s: 3.3.0
   (was: 3.4.0)

> DS V2 topN push-down supports project with alias
> 
>
> Key: SPARK-38644
> URL: https://issues.apache.org/jira/browse/SPARK-38644
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38644) DS V2 topN push-down supports project with alias

2022-03-25 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-38644.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 35961
[https://github.com/apache/spark/pull/35961]

> DS V2 topN push-down supports project with alias
> 
>
> Key: SPARK-38644
> URL: https://issues.apache.org/jira/browse/SPARK-38644
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38644) DS V2 topN push-down supports project with alias

2022-03-25 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-38644:
---

Assignee: jiaan.geng

> DS V2 topN push-down supports project with alias
> 
>
> Key: SPARK-38644
> URL: https://issues.apache.org/jira/browse/SPARK-38644
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38656) Show options for Pandas API on Spark in UI

2022-03-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38656:


Assignee: Apache Spark

> Show options for Pandas API on Spark in UI
> --
>
> Key: SPARK-38656
> URL: https://issues.apache.org/jira/browse/SPARK-38656
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Web UI
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> Currently, we don't know which options are set for the Pandas API on Spark. We 
> should show them in the UI.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38656) Show options for Pandas API on Spark in UI

2022-03-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38656:


Assignee: (was: Apache Spark)

> Show options for Pandas API on Spark in UI
> --
>
> Key: SPARK-38656
> URL: https://issues.apache.org/jira/browse/SPARK-38656
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Web UI
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, we don't know which options are set for the Pandas API on Spark. We 
> should show them in the UI.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38657) Rename "SQL" to "SQL/DataFrame" in Spark UI

2022-03-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38657:


Assignee: Apache Spark

> Rename "SQL" to "SQL/DataFrame" in Spark UI
> ---
>
> Key: SPARK-38657
> URL: https://issues.apache.org/jira/browse/SPARK-38657
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> "SQL" tab actually includes DataFrame APIs (and also DataFrame-based MLlib 
> API and structured streaming). We should name it something else.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38656) Show options for Pandas API on Spark in UI

2022-03-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512305#comment-17512305
 ] 

Apache Spark commented on SPARK-38656:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/35972

> Show options for Pandas API on Spark in UI
> --
>
> Key: SPARK-38656
> URL: https://issues.apache.org/jira/browse/SPARK-38656
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Web UI
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, we don't know which options are set for the Pandas API on Spark. We 
> should show them in the UI.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38657) Rename "SQL" to "SQL/DataFrame" in Spark UI

2022-03-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512306#comment-17512306
 ] 

Apache Spark commented on SPARK-38657:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/35973

> Rename "SQL" to "SQL/DataFrame" in Spark UI
> ---
>
> Key: SPARK-38657
> URL: https://issues.apache.org/jira/browse/SPARK-38657
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> "SQL" tab actually includes DataFrame APIs (and also DataFrame-based MLlib 
> API and structured streaming). We should name it something else.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38657) Rename "SQL" to "SQL/DataFrame" in Spark UI

2022-03-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38657:


Assignee: (was: Apache Spark)

> Rename "SQL" to "SQL/DataFrame" in Spark UI
> ---
>
> Key: SPARK-38657
> URL: https://issues.apache.org/jira/browse/SPARK-38657
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> "SQL" tab actually includes DataFrame APIs (and also DataFrame-based MLlib 
> API and structured streaming). We should name it something else.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38658) Support PostgreSQL function generate_series

2022-03-25 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng resolved SPARK-38658.

Resolution: Duplicate

> Support PostgreSQL function generate_series
> ---
>
> Key: SPARK-38658
> URL: https://issues.apache.org/jira/browse/SPARK-38658
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38658) Support PostgreSQL function generate_series

2022-03-25 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-38658:
--

 Summary: Support PostgreSQL function generate_series
 Key: SPARK-38658
 URL: https://issues.apache.org/jira/browse/SPARK-38658
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.4.0
Reporter: jiaan.geng
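For reference, a rough sketch of the closest existing Spark SQL equivalent today (sequence plus explode), not the proposed function itself:

{code:java}
// PostgreSQL: SELECT * FROM generate_series(1, 10, 2);
// Closest existing Spark SQL equivalent (a sketch, assuming a running spark session):
spark.sql("SELECT explode(sequence(1, 10, 2)) AS value").show()
// yields the rows 1, 3, 5, 7, 9
{code}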






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38631) Arbitrary shell command injection via Utils.unpack()

2022-03-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-38631:
-
Fix Version/s: 3.1.3
   3.3.0
   3.2.2

> Arbitrary shell command injection via Utils.unpack()
> 
>
> Key: SPARK-38631
> URL: https://issues.apache.org/jira/browse/SPARK-38631
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2, 3.2.1, 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
> Fix For: 3.1.3, 3.3.0, 3.2.2
>
>
> There is a risk of arbitrary shell command injection via {{Utils.unpack}} 
> when the filename is controlled by a malicious user. This is due to an issue 
> in Hadoop's {{unTar}}, which does not properly escape the filename before 
> passing it to a shell command: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileUtil.java#L904
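To illustrate the vulnerability class only (this is neither the Hadoop nor the Spark code, and the file name is a made-up example): a file name interpolated into a string that a shell re-parses can run extra commands, while the same name passed as a separate argument stays literal.

{code:java}
import scala.sys.process._

// Hypothetical attacker-controlled file name (illustration only).
val fileName = "evil.tar; touch /tmp/pwned"

// Unsafe pattern: the whole string is re-parsed by a shell, so the ';' in the
// name starts a second command. (Left commented out on purpose.)
//   Seq("bash", "-c", s"tar -xf $fileName").!

// Safer pattern: the file name is a single argv element and is never re-parsed
// by a shell, so metacharacters in it stay literal.
Seq("tar", "-xf", fileName).!
{code}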



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38631) Arbitrary shell command injection via Utils.unpack()

2022-03-25 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512297#comment-17512297
 ] 

Hyukjin Kwon commented on SPARK-38631:
--

Thanks for pointing this out, [~Qin Yao]

> Arbitrary shell command injection via Utils.unpack()
> 
>
> Key: SPARK-38631
> URL: https://issues.apache.org/jira/browse/SPARK-38631
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2, 3.2.1, 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
> Fix For: 3.1.3, 3.3.0, 3.2.2
>
>
> There is a risk of arbitrary shell command injection via {{Utils.unpack}} 
> when the filename is controlled by a malicious user. This is due to an issue 
> in Hadoop's {{unTar}}, which does not properly escape the filename before 
> passing it to a shell command: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileUtil.java#L904



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38631) Arbitrary shell command injection via Utils.unpack()

2022-03-25 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512296#comment-17512296
 ] 

Kent Yao commented on SPARK-38631:
--

[~hyukjin.kwon] Hi, shall we update the fixed versions?

> Arbitrary shell command injection via Utils.unpack()
> 
>
> Key: SPARK-38631
> URL: https://issues.apache.org/jira/browse/SPARK-38631
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2, 3.2.1, 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
>
> There is a risk of arbitrary shell command injection via {{Utils.unpack}} 
> when the filename is controlled by a malicious user. This is due to an issue 
> in Hadoop's {{unTar}}, which does not properly escape the filename before 
> passing it to a shell command: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileUtil.java#L904



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32838) Connot insert overwite different partition with same table

2022-03-25 Thread CHC (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194644#comment-17194644
 ] 

CHC edited comment on SPARK-32838 at 3/25/22, 10:02 AM:


After spending a long time exploring,

I found that 
[HiveStrategies.scala|https://github.com/apache/spark/blob/v3.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L209-L215]
 converts HiveTableRelation to LogicalRelation,

and this then matches the 
[DataSourceAnalysis|https://github.com/apache/spark/blob/v3.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L166-L228]
 case condition.

(In Spark 2.4.3, HiveTableRelation is not converted to LogicalRelation when the 
table is partitioned;

so if the table is not partitioned, an insert overwrite of the table into itself 
also fails with this error.)

This works when:
{code:java}
set spark.sql.hive.convertInsertingPartitionedTable=false;
insert overwrite table tmp.spark3_snap partition(dt='2020-09-10')
select id from tmp.spark3_snap where dt='2020-09-09';
{code}
I think this is a bug, because this scenario is a normal use case.

However, when `spark.sql.hive.convertInsertingPartitionedTable` is set to false, 
we hit the problem described in 
[SPARK-33144|https://issues.apache.org/jira/browse/SPARK-33144]


was (Author: chenxchen):
After spending a long time exploring,

I found that 
[HiveStrategies.scala|https://github.com/apache/spark/blob/v3.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L209-L215]
 will convert HiveTableRelation to LogicalRelation,

and this will match this case 
[DataSourceAnalysis|https://github.com/apache/spark/blob/v3.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L166-L228]
 condition

(at spark 2.4.3, HiveTableRelation will not to convert to LogicalRelation if 
table is partitioned,

so if table is none partitioned, insert overwirte itselft also will be get an 
error)

This is ok when:
{code:java}
set spark.sql.hive.convertInsertingPartitionedTable=false;
insert overwrite table tmp.spark3_snap partition(dt='2020-09-10')
select id from tmp.spark3_snap where dt='2020-09-09';
{code}
I think this is bug cause this scene is normal demand. 

wehn

> Connot insert overwite different partition with same table
> --
>
> Key: SPARK-32838
> URL: https://issues.apache.org/jira/browse/SPARK-32838
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: hadoop 2.7 + spark 3.0.0
>Reporter: CHC
>Priority: Major
>
> When:
> {code:java}
> CREATE TABLE tmp.spark3_snap (
> id string
> )
> PARTITIONED BY (dt string)
> STORED AS ORC
> ;
> insert overwrite table tmp.spark3_snap partition(dt='2020-09-09')
> select 10;
> insert overwrite table tmp.spark3_snap partition(dt='2020-09-10')
> select 1;
> insert overwrite table tmp.spark3_snap partition(dt='2020-09-10')
> select id from tmp.spark3_snap where dt='2020-09-09';
> {code}
> and it will fail with the error: "Cannot overwrite a path that is also being read 
> from"
> related: https://issues.apache.org/jira/browse/SPARK-24194
> This works on Spark 2.4.3 and does not work on Spark 3.0.0
>   



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32838) Connot insert overwite different partition with same table

2022-03-25 Thread CHC (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194644#comment-17194644
 ] 

CHC edited comment on SPARK-32838 at 3/25/22, 10:02 AM:


After spending a long time exploring,

I found that 
[HiveStrategies.scala|https://github.com/apache/spark/blob/v3.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L209-L215]
 will convert HiveTableRelation to LogicalRelation,

and this will match this case 
[DataSourceAnalysis|https://github.com/apache/spark/blob/v3.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L166-L228]
 condition

(at spark 2.4.3, HiveTableRelation will not to convert to LogicalRelation if 
table is partitioned,

so if table is none partitioned, insert overwirte itselft also will be get an 
error)

This is ok when:
{code:java}
set spark.sql.hive.convertInsertingPartitionedTable=false;
insert overwrite table tmp.spark3_snap partition(dt='2020-09-10')
select id from tmp.spark3_snap where dt='2020-09-09';
{code}
I think this is bug cause this scene is normal demand. 

also, when we set `spark.sql.hive.convertInsertingPartitionedTable`=false
will met this problem SPARK-33144


was (Author: chenxchen):
After spending a long time exploring,

I found that 
[HiveStrategies.scala|https://github.com/apache/spark/blob/v3.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L209-L215]
 will convert HiveTableRelation to LogicalRelation,

and this will match this case 
[DataSourceAnalysis|https://github.com/apache/spark/blob/v3.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L166-L228]
 condition

(at spark 2.4.3, HiveTableRelation will not to convert to LogicalRelation if 
table is partitioned,

so if table is none partitioned, insert overwirte itselft also will be get an 
error)

This is ok when:
{code:java}
set spark.sql.hive.convertInsertingPartitionedTable=false;
insert overwrite table tmp.spark3_snap partition(dt='2020-09-10')
select id from tmp.spark3_snap where dt='2020-09-09';
{code}
I think this is bug cause this scene is normal demand. 

when set `spark.sql.hive.convertInsertingPartitionedTable`=false
will met this problem 
[SPARK-33144|https://issues.apache.org/jira/browse/SPARK-33144]

> Connot insert overwite different partition with same table
> --
>
> Key: SPARK-32838
> URL: https://issues.apache.org/jira/browse/SPARK-32838
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: hadoop 2.7 + spark 3.0.0
>Reporter: CHC
>Priority: Major
>
> When:
> {code:java}
> CREATE TABLE tmp.spark3_snap (
> id string
> )
> PARTITIONED BY (dt string)
> STORED AS ORC
> ;
> insert overwrite table tmp.spark3_snap partition(dt='2020-09-09')
> select 10;
> insert overwrite table tmp.spark3_snap partition(dt='2020-09-10')
> select 1;
> insert overwrite table tmp.spark3_snap partition(dt='2020-09-10')
> select id from tmp.spark3_snap where dt='2020-09-09';
> {code}
> and it will fail with the error: "Cannot overwrite a path that is also being read 
> from"
> related: https://issues.apache.org/jira/browse/SPARK-24194
> This works on Spark 2.4.3 and does not work on Spark 3.0.0
>   



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32838) Connot insert overwite different partition with same table

2022-03-25 Thread CHC (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17194644#comment-17194644
 ] 

CHC edited comment on SPARK-32838 at 3/25/22, 10:01 AM:


After spending a long time exploring,

I found that 
[HiveStrategies.scala|https://github.com/apache/spark/blob/v3.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L209-L215]
 will convert HiveTableRelation to LogicalRelation,

and this will match this case 
[DataSourceAnalysis|https://github.com/apache/spark/blob/v3.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L166-L228]
 condition

(at spark 2.4.3, HiveTableRelation will not to convert to LogicalRelation if 
table is partitioned,

so if table is none partitioned, insert overwirte itselft also will be get an 
error)

This is ok when:
{code:java}
set spark.sql.hive.convertInsertingPartitionedTable=false;
insert overwrite table tmp.spark3_snap partition(dt='2020-09-10')
select id from tmp.spark3_snap where dt='2020-09-09';
{code}
I think this is bug cause this scene is normal demand. 

wehn


was (Author: chenxchen):
After spending a long time exploring,

I found that 
[HiveStrategies.scala|https://github.com/apache/spark/blob/v3.0.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L209-L215]
 will convert HiveTableRelation to LogicalRelation,

and this will match this case 
[DataSourceAnalysis|https://github.com/apache/spark/blob/v3.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L166-L228]
 condition

(at spark 2.4.3, HiveTableRelation will not to convert to LogicalRelation if 
table is partitioned,

so if table is none partitioned, insert overwirte itselft also will be get an 
error)

This is ok when:
{code:java}
set spark.sql.hive.convertInsertingPartitionedTable=false;
insert overwrite table tmp.spark3_snap partition(dt='2020-09-10')
select id from tmp.spark3_snap where dt='2020-09-09';
{code}
I think this is bug cause this scene is normal demand. 

> Connot insert overwite different partition with same table
> --
>
> Key: SPARK-32838
> URL: https://issues.apache.org/jira/browse/SPARK-32838
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: hadoop 2.7 + spark 3.0.0
>Reporter: CHC
>Priority: Major
>
> When:
> {code:java}
> CREATE TABLE tmp.spark3_snap (
> id string
> )
> PARTITIONED BY (dt string)
> STORED AS ORC
> ;
> insert overwrite table tmp.spark3_snap partition(dt='2020-09-09')
> select 10;
> insert overwrite table tmp.spark3_snap partition(dt='2020-09-10')
> select 1;
> insert overwrite table tmp.spark3_snap partition(dt='2020-09-10')
> select id from tmp.spark3_snap where dt='2020-09-09';
> {code}
> and it will fail with the error: "Cannot overwrite a path that is also being read 
> from"
> related: https://issues.apache.org/jira/browse/SPARK-24194
> This works on Spark 2.4.3 and does not work on Spark 3.0.0
>   



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38657) Rename "SQL" to "SQL/DataFrame" in Spark UI

2022-03-25 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-38657:


 Summary: Rename "SQL" to "SQL/DataFrame" in Spark UI
 Key: SPARK-38657
 URL: https://issues.apache.org/jira/browse/SPARK-38657
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 3.4.0
Reporter: Hyukjin Kwon


"SQL" tab actually includes DataFrame APIs (and also DataFrame-based MLlib API 
and structured streaming). We should name it something else.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-33144) Connot insert overwite multiple partition, get exception "get partition: Value for key name is null or empty"

2022-03-25 Thread CHC (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512259#comment-17512259
 ] 

CHC edited comment on SPARK-33144 at 3/25/22, 9:51 AM:
---

Also hit this on Spark 3.2.1
when `spark.sql.hive.convertInsertingPartitionedTable` is set to false:
{code:sql}
create table tmp.spark_multi_partition(
id int
)
partitioned by (name string, version string)
stored as orc
;

set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table tmp.spark_multi_partition partition (name, version)
select
1 as id
, 'hadoop' as name
, '2.7.3' as version
;{code}


was (Author: chenxchen):
also met this at Spark 3.2.1
when set spark.sql.hive.convertInsertingPartitionedTable=false
{code:sql}
create table tmp.spark_multi_partition(
id int
)
partitioned by (name string, version string)
stored as orc
;

set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table tmp.spark_multi_partition partition (name, version)
select
1 as id
, 'hadoop' as name
, '2.7.3' as version
;{code}

> Connot insert overwite multiple partition, get exception "get partition: 
> Value for key name is null or empty"
> -
>
> Key: SPARK-33144
> URL: https://issues.apache.org/jira/browse/SPARK-33144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.2.1
> Environment: hadoop 2.7.3 + spark 3.0.1
> hadoop 2.7.3 + spark 3.2.1
>Reporter: CHC
>Priority: Major
>
> When: 
> {code:sql}
> create table tmp.spark_multi_partition(
> id int
> )
> partitioned by (name string, version string)
> stored as orc
> ;
> set hive.exec.dynamic.partition=true;
> set spark.hadoop.hive.exec.dynamic.partition=true;
>  
> set hive.exec.dynamic.partition.mode=nonstrict;
> set spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict;
> insert overwrite table tmp.spark_multi_partition partition (name, version)
> select
> *
> from (
>   select
>   1 as id
>   , 'hadoop' as name
>   , '2.7.3' as version
>   union
>   select
>   2 as id
>   , 'spark' as name
>   , '3.0.1' as version
>   union
>   select
>   3 as id
>   , 'hive' as name
>   , '2.3.4' as version
> ) as A;
> {code}
> and get exception:
> {code:bash}
> INFO load-dynamic-partitions-0 [hive.ql.metadata.Hive:1919]: New loading path 
> = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/name=spark/version=3.0.1
>  with partSpec {name=spark, version=3.0.1}
> 20/10/14 09:15:33 INFO load-dynamic-partitions-1 
> [hive.ql.metadata.Hive:1919]: New loading path = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/name=hadoop/version=2.7.3
>  with partSpec {name=hadoop, version=2.7.3}
> 20/10/14 09:15:33 INFO load-dynamic-partitions-2 
> [hive.ql.metadata.Hive:1919]: New loading path = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/name=hive/version=2.3.4
>  with partSpec {name=hive, version=2.3.4}
> 20/10/14 09:15:33 INFO load-dynamic-partitions-3 
> [hive.ql.metadata.Hive:1919]: New loading path = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/_temporary/0
>  with partSpec {name=, version=}
> 20/10/14 09:15:33 ERROR load-dynamic-partitions-3 
> [hive.ql.metadata.Hive:1937]: Exception when loading partition with 
> parameters  
> partPath=hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/_temporary/0,
>   table=spark_multi_partition,  partSpec={name=, version=},  replace=true,  
> listBucketingEnabled=false,  isAcid=false,  hasFollowingStatsTask=false
> org.apache.hadoop.hive.ql.metadata.HiveException: get partition: Value for 
> key name is null or empty
>   at org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:2233)
>   at org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:2181)
>   at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1611)
>   at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1922)
>   at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1913)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 20/10/14 09:15:33 INFO Delete-Thread-0 
> 

[jira] [Comment Edited] (SPARK-33144) Connot insert overwite multiple partition, get exception "get partition: Value for key name is null or empty"

2022-03-25 Thread CHC (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512259#comment-17512259
 ] 

CHC edited comment on SPARK-33144 at 3/25/22, 9:51 AM:
---

also met this at Spark 3.2.1
when set spark.sql.hive.convertInsertingPartitionedTable=false
{code:sql}
create table tmp.spark_multi_partition(
id int
)
partitioned by (name string, version string)
stored as orc
;

set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table tmp.spark_multi_partition partition (name, version)
select
1 as id
, 'hadoop' as name
, '2.7.3' as version
;{code}


was (Author: chenxchen):
also met this at Spark 3.2.1
{code:sql}
create table tmp.spark_multi_partition(
id int
)
partitioned by (name string, version string)
stored as orc
;

set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table tmp.spark_multi_partition partition (name, version)
select
1 as id
, 'hadoop' as name
, '2.7.3' as version
;{code}

> Connot insert overwite multiple partition, get exception "get partition: 
> Value for key name is null or empty"
> -
>
> Key: SPARK-33144
> URL: https://issues.apache.org/jira/browse/SPARK-33144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.2.1
> Environment: hadoop 2.7.3 + spark 3.0.1
> hadoop 2.7.3 + spark 3.2.1
>Reporter: CHC
>Priority: Major
>
> When: 
> {code:sql}
> create table tmp.spark_multi_partition(
> id int
> )
> partitioned by (name string, version string)
> stored as orc
> ;
> set hive.exec.dynamic.partition=true;
> set spark.hadoop.hive.exec.dynamic.partition=true;
>  
> set hive.exec.dynamic.partition.mode=nonstrict;
> set spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict;
> insert overwrite table tmp.spark_multi_partition partition (name, version)
> select
> *
> from (
>   select
>   1 as id
>   , 'hadoop' as name
>   , '2.7.3' as version
>   union
>   select
>   2 as id
>   , 'spark' as name
>   , '3.0.1' as version
>   union
>   select
>   3 as id
>   , 'hive' as name
>   , '2.3.4' as version
> ) as A;
> {code}
> and get exception:
> {code:bash}
> INFO load-dynamic-partitions-0 [hive.ql.metadata.Hive:1919]: New loading path 
> = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/name=spark/version=3.0.1
>  with partSpec {name=spark, version=3.0.1}
> 20/10/14 09:15:33 INFO load-dynamic-partitions-1 
> [hive.ql.metadata.Hive:1919]: New loading path = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/name=hadoop/version=2.7.3
>  with partSpec {name=hadoop, version=2.7.3}
> 20/10/14 09:15:33 INFO load-dynamic-partitions-2 
> [hive.ql.metadata.Hive:1919]: New loading path = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/name=hive/version=2.3.4
>  with partSpec {name=hive, version=2.3.4}
> 20/10/14 09:15:33 INFO load-dynamic-partitions-3 
> [hive.ql.metadata.Hive:1919]: New loading path = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/_temporary/0
>  with partSpec {name=, version=}
> 20/10/14 09:15:33 ERROR load-dynamic-partitions-3 
> [hive.ql.metadata.Hive:1937]: Exception when loading partition with 
> parameters  
> partPath=hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/_temporary/0,
>   table=spark_multi_partition,  partSpec={name=, version=},  replace=true,  
> listBucketingEnabled=false,  isAcid=false,  hasFollowingStatsTask=false
> org.apache.hadoop.hive.ql.metadata.HiveException: get partition: Value for 
> key name is null or empty
>   at org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:2233)
>   at org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:2181)
>   at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1611)
>   at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1922)
>   at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1913)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 20/10/14 09:15:33 INFO Delete-Thread-0 
> [org.apache.hadoop.fs.TrashPolicyDefault:168]: Moved: 
> 

[jira] [Created] (SPARK-38656) Show options for Pandas API on Spark in UI

2022-03-25 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-38656:


 Summary: Show options for Pandas API on Spark in UI
 Key: SPARK-38656
 URL: https://issues.apache.org/jira/browse/SPARK-38656
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Web UI
Affects Versions: 3.3.0
Reporter: Hyukjin Kwon


Currently, we don't know which options are set for the Pandas API on Spark. We 
should show them in the UI.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33144) Connot insert overwite multiple partition, get exception "get partition: Value for key name is null or empty"

2022-03-25 Thread CHC (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

CHC updated SPARK-33144:

Priority: Major  (was: Critical)

> Connot insert overwite multiple partition, get exception "get partition: 
> Value for key name is null or empty"
> -
>
> Key: SPARK-33144
> URL: https://issues.apache.org/jira/browse/SPARK-33144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.2.1
> Environment: hadoop 2.7.3 + spark 3.0.1
> hadoop 2.7.3 + spark 3.2.1
>Reporter: CHC
>Priority: Major
>
> When: 
> {code:sql}
> create table tmp.spark_multi_partition(
> id int
> )
> partitioned by (name string, version string)
> stored as orc
> ;
> set hive.exec.dynamic.partition=true;
> set spark.hadoop.hive.exec.dynamic.partition=true;
>  
> set hive.exec.dynamic.partition.mode=nonstrict;
> set spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict;
> insert overwrite table tmp.spark_multi_partition partition (name, version)
> select
> *
> from (
>   select
>   1 as id
>   , 'hadoop' as name
>   , '2.7.3' as version
>   union
>   select
>   2 as id
>   , 'spark' as name
>   , '3.0.1' as version
>   union
>   select
>   3 as id
>   , 'hive' as name
>   , '2.3.4' as version
> ) as A;
> {code}
> and get exception:
> {code:bash}
> INFO load-dynamic-partitions-0 [hive.ql.metadata.Hive:1919]: New loading path 
> = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/name=spark/version=3.0.1
>  with partSpec {name=spark, version=3.0.1}
> 20/10/14 09:15:33 INFO load-dynamic-partitions-1 
> [hive.ql.metadata.Hive:1919]: New loading path = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/name=hadoop/version=2.7.3
>  with partSpec {name=hadoop, version=2.7.3}
> 20/10/14 09:15:33 INFO load-dynamic-partitions-2 
> [hive.ql.metadata.Hive:1919]: New loading path = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/name=hive/version=2.3.4
>  with partSpec {name=hive, version=2.3.4}
> 20/10/14 09:15:33 INFO load-dynamic-partitions-3 
> [hive.ql.metadata.Hive:1919]: New loading path = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/_temporary/0
>  with partSpec {name=, version=}
> 20/10/14 09:15:33 ERROR load-dynamic-partitions-3 
> [hive.ql.metadata.Hive:1937]: Exception when loading partition with 
> parameters  
> partPath=hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/_temporary/0,
>   table=spark_multi_partition,  partSpec={name=, version=},  replace=true,  
> listBucketingEnabled=false,  isAcid=false,  hasFollowingStatsTask=false
> org.apache.hadoop.hive.ql.metadata.HiveException: get partition: Value for 
> key name is null or empty
>   at org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:2233)
>   at org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:2181)
>   at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1611)
>   at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1922)
>   at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1913)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 20/10/14 09:15:33 INFO Delete-Thread-0 
> [org.apache.hadoop.fs.TrashPolicyDefault:168]: Moved: 
> 'hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/name=spark/version=3.0.1/part-1-b745147b-600f-4c79-8ba2-12a99283b0a9.c000'
>  to trash at: 
> hdfs://namespace/user/hive/.Trash/Current/apps/hive/warehouse/tmp.db/spark_multi_partition/name=spark/version=3.0.1/part-1-b745147b-600f-4c79-8ba2-12a99283b0a9.c000
> 20/10/14 09:15:33 INFO load-dynamic-partitions-0 
> [org.apache.hadoop.hive.common.FileUtils:520]: Creating directory if it 
> doesn't exist: 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/name=spark/version=3.0.1
> 20/10/14 09:15:33 INFO Delete-Thread-0 
> [org.apache.hadoop.fs.TrashPolicyDefault:168]: Moved: 
> 'hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/name=hive/version=2.3.4/part-2-b745147b-600f-4c79-8ba2-12a99283b0a9.c000'
>  to trash at: 
> 

[jira] [Comment Edited] (SPARK-33144) Connot insert overwite multiple partition, get exception "get partition: Value for key name is null or empty"

2022-03-25 Thread CHC (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512259#comment-17512259
 ] 

CHC edited comment on SPARK-33144 at 3/25/22, 9:45 AM:
---

also met this at Spark 3.2.1
{code:sql}
create table tmp.spark_multi_partition(
id int
)
partitioned by (name string, version string)
stored as orc
;

set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table tmp.spark_multi_partition partition (name, version)
select
1 as id
, 'hadoop' as name
, '2.7.3' as version
;{code}


was (Author: chenxchen):
also met this at Spark 3.2.1
{code:sql}
set hive.exec.dynamic.partition.mode=nonstrict;
create table tmp.spark_multi_partition(
id int
)
partitioned by (name string, version string)
stored as orc
;

insert overwrite table tmp.spark_multi_partition partition (name, version)
select
1 as id
, 'hadoop' as name
, '2.7.3' as version
;{code}

> Connot insert overwite multiple partition, get exception "get partition: 
> Value for key name is null or empty"
> -
>
> Key: SPARK-33144
> URL: https://issues.apache.org/jira/browse/SPARK-33144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.2.1
> Environment: hadoop 2.7.3 + spark 3.0.1
> hadoop 2.7.3 + spark 3.2.1
>Reporter: CHC
>Priority: Critical
>
> When: 
> {code:sql}
> create table tmp.spark_multi_partition(
> id int
> )
> partitioned by (name string, version string)
> stored as orc
> ;
> set hive.exec.dynamic.partition=true;
> set spark.hadoop.hive.exec.dynamic.partition=true;
>  
> set hive.exec.dynamic.partition.mode=nonstrict;
> set spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict;
> insert overwrite table tmp.spark_multi_partition partition (name, version)
> select
> *
> from (
>   select
>   1 as id
>   , 'hadoop' as name
>   , '2.7.3' as version
>   union
>   select
>   2 as id
>   , 'spark' as name
>   , '3.0.1' as version
>   union
>   select
>   3 as id
>   , 'hive' as name
>   , '2.3.4' as version
> ) as A;
> {code}
> and get exception:
> {code:bash}
> INFO load-dynamic-partitions-0 [hive.ql.metadata.Hive:1919]: New loading path 
> = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/name=spark/version=3.0.1
>  with partSpec {name=spark, version=3.0.1}
> 20/10/14 09:15:33 INFO load-dynamic-partitions-1 
> [hive.ql.metadata.Hive:1919]: New loading path = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/name=hadoop/version=2.7.3
>  with partSpec {name=hadoop, version=2.7.3}
> 20/10/14 09:15:33 INFO load-dynamic-partitions-2 
> [hive.ql.metadata.Hive:1919]: New loading path = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/name=hive/version=2.3.4
>  with partSpec {name=hive, version=2.3.4}
> 20/10/14 09:15:33 INFO load-dynamic-partitions-3 
> [hive.ql.metadata.Hive:1919]: New loading path = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/_temporary/0
>  with partSpec {name=, version=}
> 20/10/14 09:15:33 ERROR load-dynamic-partitions-3 
> [hive.ql.metadata.Hive:1937]: Exception when loading partition with 
> parameters  
> partPath=hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/_temporary/0,
>   table=spark_multi_partition,  partSpec={name=, version=},  replace=true,  
> listBucketingEnabled=false,  isAcid=false,  hasFollowingStatsTask=false
> org.apache.hadoop.hive.ql.metadata.HiveException: get partition: Value for 
> key name is null or empty
>   at org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:2233)
>   at org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:2181)
>   at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1611)
>   at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1922)
>   at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1913)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 20/10/14 09:15:33 INFO Delete-Thread-0 
> [org.apache.hadoop.fs.TrashPolicyDefault:168]: Moved: 
> 

[jira] [Commented] (SPARK-38655) UnboundedPrecedingOffsetWindowFunctionFrame cannot find the offset row whose input is not null

2022-03-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512276#comment-17512276
 ] 

Apache Spark commented on SPARK-38655:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/35971

> UnboundedPrecedingOffsetWindowFunctionFrame cannot find the offset row whose 
> input is not null
> --
>
> Key: SPARK-38655
> URL: https://issues.apache.org/jira/browse/SPARK-38655
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> select x, nth_value(x, 5) IGNORE NULLS over (order by x rows between 
> unbounded preceding and current row)
> from (select explode(sequence(1, 3)) x)
> returns
> null
> null
> 3
> But it should return
> null
> null
> null



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38655) UnboundedPrecedingOffsetWindowFunctionFrame cannot find the offset row whose input is not null

2022-03-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38655:


Assignee: (was: Apache Spark)

> UnboundedPrecedingOffsetWindowFunctionFrame cannot find the offset row whose 
> input is not null
> --
>
> Key: SPARK-38655
> URL: https://issues.apache.org/jira/browse/SPARK-38655
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> select x, nth_value(x, 5) IGNORE NULLS over (order by x rows between 
> unbounded preceding and current row)
> from (select explode(sequence(1, 3)) x)
> returns
> null
> null
> 3
> But it should return
> null
> null
> null



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38655) UnboundedPrecedingOffsetWindowFunctionFrame cannot find the offset row whose input is not null

2022-03-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38655:


Assignee: Apache Spark

> UnboundedPrecedingOffsetWindowFunctionFrame cannot find the offset row whose 
> input is not null
> --
>
> Key: SPARK-38655
> URL: https://issues.apache.org/jira/browse/SPARK-38655
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>
> select x, nth_value(x, 5) IGNORE NULLS over (order by x rows between 
> unbounded preceding and current row)
> from (select explode(sequence(1, 3)) x)
> returns
> null
> null
> 3
> But it should return
> null
> null
> null



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26639) The reuse subquery function maybe does not work in SPARK SQL

2022-03-25 Thread Peter Toth (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512270#comment-17512270
 ] 

Peter Toth commented on SPARK-26639:


[~stubartmess], that's a different issue but it is fixed in SPARK-36447.

> The reuse subquery function maybe does not work in SPARK SQL
> 
>
> Key: SPARK-26639
> URL: https://issues.apache.org/jira/browse/SPARK-26639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Ke Jia
>Priority: Major
>
> The subquery reuse feature was implemented in 
> [https://github.com/apache/spark/pull/14548]
> In my test, I found that the visualized plan does show the subquery executed 
> once, but the stages of the same subquery may execute more than once.
>  
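As a rough way to check this (a sketch, assuming a running `spark` session and access to the Spark UI): run a query that references the same uncorrelated scalar subquery twice, then compare the SQL tab's plan, which should show the subquery reused, with the stages actually submitted for the job.

{code:java}
// The same scalar subquery appears twice; with subquery reuse the SQL-tab plan
// should show a single subquery node serving both references.
spark.sql(
  """SELECT
    |  (SELECT max(id) FROM range(10000000)) AS a,
    |  (SELECT max(id) FROM range(10000000)) AS b
    |""".stripMargin).collect()

// Then check in the UI whether the stages behind that subquery ran once or twice.
{code}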



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38655) UnboundedPrecedingOffsetWindowFunctionFrame cannot find the offset row whose input is not null

2022-03-25 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-38655:
--

 Summary: UnboundedPrecedingOffsetWindowFunctionFrame cannot find 
the offset row whose input is not null
 Key: SPARK-38655
 URL: https://issues.apache.org/jira/browse/SPARK-38655
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: jiaan.geng


select x, nth_value(x, 5) IGNORE NULLS over (order by x rows between unbounded 
preceding and current row)
from (select explode(sequence(1, 3)) x)

returns
null
null
3

But it should return
null
null
null
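A minimal sketch to reproduce this from spark-shell (assuming a running `spark` session), using the query from the description:

{code:java}
// With only 3 rows and no nulls, a 5th value never exists in any frame, so every
// row should yield null; the reported bug makes the last row return 3 instead.
spark.sql(
  """SELECT x,
    |       nth_value(x, 5) IGNORE NULLS OVER (
    |         ORDER BY x ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS nth
    |FROM (SELECT explode(sequence(1, 3)) AS x)
    |""".stripMargin).show()
{code}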



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-33144) Connot insert overwite multiple partition, get exception "get partition: Value for key name is null or empty"

2022-03-25 Thread CHC (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512259#comment-17512259
 ] 

CHC edited comment on SPARK-33144 at 3/25/22, 8:50 AM:
---

also met this at Spark 3.2.1
{code:sql}
set hive.exec.dynamic.partition.mode=nonstrict;
create table tmp.spark_multi_partition(
id int
)
partitioned by (name string, version string)
stored as orc
;

insert overwrite table tmp.spark_multi_partition partition (name, version)
select
1 as id
, 'hadoop' as name
, '2.7.3' as version
;{code}


was (Author: chenxchen):
also met this at Spark 3.2.1
{code:sql}
create table tmp.spark_multi_partition(
id int
)
partitioned by (name string, version string)
stored as orc
;

insert overwrite table tmp.spark_multi_partition partition (name, version)
select
1 as id
, 'hadoop' as name
, '2.7.3' as version
;{code}

> Connot insert overwite multiple partition, get exception "get partition: 
> Value for key name is null or empty"
> -
>
> Key: SPARK-33144
> URL: https://issues.apache.org/jira/browse/SPARK-33144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.2.1
> Environment: hadoop 2.7.3 + spark 3.0.1
> hadoop 2.7.3 + spark 3.2.1
>Reporter: CHC
>Priority: Critical
>
> When: 
> {code:sql}
> create table tmp.spark_multi_partition(
> id int
> )
> partitioned by (name string, version string)
> stored as orc
> ;
> set hive.exec.dynamic.partition=true;
> set spark.hadoop.hive.exec.dynamic.partition=true;
>  
> set hive.exec.dynamic.partition.mode=nonstrict;
> set spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict;
> insert overwrite table tmp.spark_multi_partition partition (name, version)
> select
> *
> from (
>   select
>   1 as id
>   , 'hadoop' as name
>   , '2.7.3' as version
>   union
>   select
>   2 as id
>   , 'spark' as name
>   , '3.0.1' as version
>   union
>   select
>   3 as id
>   , 'hive' as name
>   , '2.3.4' as version
> ) as A;
> {code}
> and get exception:
> {code:bash}
> INFO load-dynamic-partitions-0 [hive.ql.metadata.Hive:1919]: New loading path 
> = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/name=spark/version=3.0.1
>  with partSpec {name=spark, version=3.0.1}
> 20/10/14 09:15:33 INFO load-dynamic-partitions-1 
> [hive.ql.metadata.Hive:1919]: New loading path = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/name=hadoop/version=2.7.3
>  with partSpec {name=hadoop, version=2.7.3}
> 20/10/14 09:15:33 INFO load-dynamic-partitions-2 
> [hive.ql.metadata.Hive:1919]: New loading path = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/name=hive/version=2.3.4
>  with partSpec {name=hive, version=2.3.4}
> 20/10/14 09:15:33 INFO load-dynamic-partitions-3 
> [hive.ql.metadata.Hive:1919]: New loading path = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/_temporary/0
>  with partSpec {name=, version=}
> 20/10/14 09:15:33 ERROR load-dynamic-partitions-3 
> [hive.ql.metadata.Hive:1937]: Exception when loading partition with 
> parameters  
> partPath=hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/_temporary/0,
>   table=spark_multi_partition,  partSpec={name=, version=},  replace=true,  
> listBucketingEnabled=false,  isAcid=false,  hasFollowingStatsTask=false
> org.apache.hadoop.hive.ql.metadata.HiveException: get partition: Value for 
> key name is null or empty
>   at org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:2233)
>   at org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:2181)
>   at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1611)
>   at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1922)
>   at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1913)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 20/10/14 09:15:33 INFO Delete-Thread-0 
> [org.apache.hadoop.fs.TrashPolicyDefault:168]: Moved: 
> 

[jira] [Updated] (SPARK-33144) Connot insert overwite multiple partition, get exception "get partition: Value for key name is null or empty"

2022-03-25 Thread CHC (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

CHC updated SPARK-33144:

Environment: 
hadoop 2.7.3 + spark 3.0.1
hadoop 2.7.3 + spark 3.2.1

  was:hadoop 2.7.3 + spark 3.0.1


> Connot insert overwite multiple partition, get exception "get partition: 
> Value for key name is null or empty"
> -
>
> Key: SPARK-33144
> URL: https://issues.apache.org/jira/browse/SPARK-33144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.2.1
> Environment: hadoop 2.7.3 + spark 3.0.1
> hadoop 2.7.3 + spark 3.2.1
>Reporter: CHC
>Priority: Critical
>
> When: 
> {code:sql}
> create table tmp.spark_multi_partition(
> id int
> )
> partitioned by (name string, version string)
> stored as orc
> ;
> set hive.exec.dynamic.partition=true;
> set spark.hadoop.hive.exec.dynamic.partition=true;
>  
> set hive.exec.dynamic.partition.mode=nonstrict;
> set spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict;
> insert overwrite table tmp.spark_multi_partition partition (name, version)
> select
> *
> from (
>   select
>   1 as id
>   , 'hadoop' as name
>   , '2.7.3' as version
>   union
>   select
>   2 as id
>   , 'spark' as name
>   , '3.0.1' as version
>   union
>   select
>   3 as id
>   , 'hive' as name
>   , '2.3.4' as version
> ) as A;
> {code}
> and get exception:
> {code:bash}
> INFO load-dynamic-partitions-0 [hive.ql.metadata.Hive:1919]: New loading path 
> = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/name=spark/version=3.0.1
>  with partSpec {name=spark, version=3.0.1}
> 20/10/14 09:15:33 INFO load-dynamic-partitions-1 
> [hive.ql.metadata.Hive:1919]: New loading path = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/name=hadoop/version=2.7.3
>  with partSpec {name=hadoop, version=2.7.3}
> 20/10/14 09:15:33 INFO load-dynamic-partitions-2 
> [hive.ql.metadata.Hive:1919]: New loading path = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/name=hive/version=2.3.4
>  with partSpec {name=hive, version=2.3.4}
> 20/10/14 09:15:33 INFO load-dynamic-partitions-3 
> [hive.ql.metadata.Hive:1919]: New loading path = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/_temporary/0
>  with partSpec {name=, version=}
> 20/10/14 09:15:33 ERROR load-dynamic-partitions-3 
> [hive.ql.metadata.Hive:1937]: Exception when loading partition with 
> parameters  
> partPath=hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/_temporary/0,
>   table=spark_multi_partition,  partSpec={name=, version=},  replace=true,  
> listBucketingEnabled=false,  isAcid=false,  hasFollowingStatsTask=false
> org.apache.hadoop.hive.ql.metadata.HiveException: get partition: Value for 
> key name is null or empty
>   at org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:2233)
>   at org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:2181)
>   at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1611)
>   at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1922)
>   at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1913)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 20/10/14 09:15:33 INFO Delete-Thread-0 
> [org.apache.hadoop.fs.TrashPolicyDefault:168]: Moved: 
> 'hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/name=spark/version=3.0.1/part-1-b745147b-600f-4c79-8ba2-12a99283b0a9.c000'
>  to trash at: 
> hdfs://namespace/user/hive/.Trash/Current/apps/hive/warehouse/tmp.db/spark_multi_partition/name=spark/version=3.0.1/part-1-b745147b-600f-4c79-8ba2-12a99283b0a9.c000
> 20/10/14 09:15:33 INFO load-dynamic-partitions-0 
> [org.apache.hadoop.hive.common.FileUtils:520]: Creating directory if it 
> doesn't exist: 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/name=spark/version=3.0.1
> 20/10/14 09:15:33 INFO Delete-Thread-0 
> [org.apache.hadoop.fs.TrashPolicyDefault:168]: Moved: 
> 
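
The report above exercises the failure purely through SQL. As a minimal DataFrame-API illustration of the same pattern (an assumed equivalent, not taken from the report), the following sketch assumes a Hive-enabled SparkSession (e.g. spark-shell) and the reporter's dynamic-partition settings and table name:

{code:scala}
// Minimal sketch, assuming spark-shell or a Hive-enabled SparkSession named `spark`.
// The DataFrame-based write is an illustrative equivalent of the reported SQL,
// not something taken from the report itself.
import spark.implicits._

spark.sql("set hive.exec.dynamic.partition=true")
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")

// Same three rows as the reported repro.
val df = Seq(
  (1, "hadoop", "2.7.3"),
  (2, "spark", "3.0.1"),
  (3, "hive", "2.3.4")
).toDF("id", "name", "version")

// Dynamic-partition insert overwrite into the partitioned ORC table
// tmp.spark_multi_partition created in the report; on the Hive load path
// this is where the "Value for key name is null or empty" HiveException
// is reported to surface.
df.write.mode("overwrite").insertInto("tmp.spark_multi_partition")
{code}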

[jira] [Updated] (SPARK-33144) Cannot insert overwrite multiple partitions, get exception "get partition: Value for key name is null or empty"

2022-03-25 Thread CHC (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

CHC updated SPARK-33144:

Priority: Critical  (was: Major)

> Cannot insert overwrite multiple partitions, get exception "get partition: 
> Value for key name is null or empty"
> -
>
> Key: SPARK-33144
> URL: https://issues.apache.org/jira/browse/SPARK-33144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.2.1
> Environment: hadoop 2.7.3 + spark 3.0.1
>Reporter: CHC
>Priority: Critical
>
> When: 
> {code:sql}
> create table tmp.spark_multi_partition(
> id int
> )
> partitioned by (name string, version string)
> stored as orc
> ;
> set hive.exec.dynamic.partition=true;
> set spark.hadoop.hive.exec.dynamic.partition=true;
>  
> set hive.exec.dynamic.partition.mode=nonstrict;
> set spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict;
> insert overwrite table tmp.spark_multi_partition partition (name, version)
> select
> *
> from (
>   select
>   1 as id
>   , 'hadoop' as name
>   , '2.7.3' as version
>   union
>   select
>   2 as id
>   , 'spark' as name
>   , '3.0.1' as version
>   union
>   select
>   3 as id
>   , 'hive' as name
>   , '2.3.4' as version
> ) as A;
> {code}
> and get exception:
> {code:bash}
> INFO load-dynamic-partitions-0 [hive.ql.metadata.Hive:1919]: New loading path 
> = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/name=spark/version=3.0.1
>  with partSpec {name=spark, version=3.0.1}
> 20/10/14 09:15:33 INFO load-dynamic-partitions-1 
> [hive.ql.metadata.Hive:1919]: New loading path = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/name=hadoop/version=2.7.3
>  with partSpec {name=hadoop, version=2.7.3}
> 20/10/14 09:15:33 INFO load-dynamic-partitions-2 
> [hive.ql.metadata.Hive:1919]: New loading path = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/name=hive/version=2.3.4
>  with partSpec {name=hive, version=2.3.4}
> 20/10/14 09:15:33 INFO load-dynamic-partitions-3 
> [hive.ql.metadata.Hive:1919]: New loading path = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/_temporary/0
>  with partSpec {name=, version=}
> 20/10/14 09:15:33 ERROR load-dynamic-partitions-3 
> [hive.ql.metadata.Hive:1937]: Exception when loading partition with 
> parameters  
> partPath=hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/_temporary/0,
>   table=spark_multi_partition,  partSpec={name=, version=},  replace=true,  
> listBucketingEnabled=false,  isAcid=false,  hasFollowingStatsTask=false
> org.apache.hadoop.hive.ql.metadata.HiveException: get partition: Value for 
> key name is null or empty
>   at org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:2233)
>   at org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:2181)
>   at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1611)
>   at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1922)
>   at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1913)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 20/10/14 09:15:33 INFO Delete-Thread-0 
> [org.apache.hadoop.fs.TrashPolicyDefault:168]: Moved: 
> 'hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/name=spark/version=3.0.1/part-1-b745147b-600f-4c79-8ba2-12a99283b0a9.c000'
>  to trash at: 
> hdfs://namespace/user/hive/.Trash/Current/apps/hive/warehouse/tmp.db/spark_multi_partition/name=spark/version=3.0.1/part-1-b745147b-600f-4c79-8ba2-12a99283b0a9.c000
> 20/10/14 09:15:33 INFO load-dynamic-partitions-0 
> [org.apache.hadoop.hive.common.FileUtils:520]: Creating directory if it 
> doesn't exist: 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/name=spark/version=3.0.1
> 20/10/14 09:15:33 INFO Delete-Thread-0 
> [org.apache.hadoop.fs.TrashPolicyDefault:168]: Moved: 
> 'hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/name=hive/version=2.3.4/part-2-b745147b-600f-4c79-8ba2-12a99283b0a9.c000'
>  to trash at: 
> 

[jira] [Updated] (SPARK-33144) Cannot insert overwrite multiple partitions, get exception "get partition: Value for key name is null or empty"

2022-03-25 Thread CHC (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

CHC updated SPARK-33144:

Affects Version/s: 3.2.1

> Cannot insert overwrite multiple partitions, get exception "get partition: 
> Value for key name is null or empty"
> -
>
> Key: SPARK-33144
> URL: https://issues.apache.org/jira/browse/SPARK-33144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.2.1
> Environment: hadoop 2.7.3 + spark 3.0.1
>Reporter: CHC
>Priority: Major
>
> When: 
> {code:sql}
> create table tmp.spark_multi_partition(
> id int
> )
> partitioned by (name string, version string)
> stored as orc
> ;
> set hive.exec.dynamic.partition=true;
> set spark.hadoop.hive.exec.dynamic.partition=true;
>  
> set hive.exec.dynamic.partition.mode=nonstrict;
> set spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict;
> insert overwrite table tmp.spark_multi_partition partition (name, version)
> select
> *
> from (
>   select
>   1 as id
>   , 'hadoop' as name
>   , '2.7.3' as version
>   union
>   select
>   2 as id
>   , 'spark' as name
>   , '3.0.1' as version
>   union
>   select
>   3 as id
>   , 'hive' as name
>   , '2.3.4' as version
> ) as A;
> {code}
> and get exception:
> {code:bash}
> INFO load-dynamic-partitions-0 [hive.ql.metadata.Hive:1919]: New loading path 
> = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/name=spark/version=3.0.1
>  with partSpec {name=spark, version=3.0.1}
> 20/10/14 09:15:33 INFO load-dynamic-partitions-1 
> [hive.ql.metadata.Hive:1919]: New loading path = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/name=hadoop/version=2.7.3
>  with partSpec {name=hadoop, version=2.7.3}
> 20/10/14 09:15:33 INFO load-dynamic-partitions-2 
> [hive.ql.metadata.Hive:1919]: New loading path = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/name=hive/version=2.3.4
>  with partSpec {name=hive, version=2.3.4}
> 20/10/14 09:15:33 INFO load-dynamic-partitions-3 
> [hive.ql.metadata.Hive:1919]: New loading path = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/_temporary/0
>  with partSpec {name=, version=}
> 20/10/14 09:15:33 ERROR load-dynamic-partitions-3 
> [hive.ql.metadata.Hive:1937]: Exception when loading partition with 
> parameters  
> partPath=hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/_temporary/0,
>   table=spark_multi_partition,  partSpec={name=, version=},  replace=true,  
> listBucketingEnabled=false,  isAcid=false,  hasFollowingStatsTask=false
> org.apache.hadoop.hive.ql.metadata.HiveException: get partition: Value for 
> key name is null or empty
>   at org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:2233)
>   at org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:2181)
>   at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1611)
>   at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1922)
>   at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1913)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 20/10/14 09:15:33 INFO Delete-Thread-0 
> [org.apache.hadoop.fs.TrashPolicyDefault:168]: Moved: 
> 'hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/name=spark/version=3.0.1/part-1-b745147b-600f-4c79-8ba2-12a99283b0a9.c000'
>  to trash at: 
> hdfs://namespace/user/hive/.Trash/Current/apps/hive/warehouse/tmp.db/spark_multi_partition/name=spark/version=3.0.1/part-1-b745147b-600f-4c79-8ba2-12a99283b0a9.c000
> 20/10/14 09:15:33 INFO load-dynamic-partitions-0 
> [org.apache.hadoop.hive.common.FileUtils:520]: Creating directory if it 
> doesn't exist: 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/name=spark/version=3.0.1
> 20/10/14 09:15:33 INFO Delete-Thread-0 
> [org.apache.hadoop.fs.TrashPolicyDefault:168]: Moved: 
> 'hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/name=hive/version=2.3.4/part-2-b745147b-600f-4c79-8ba2-12a99283b0a9.c000'
>  to trash at: 
> 

[jira] [Comment Edited] (SPARK-33144) Cannot insert overwrite multiple partitions, get exception "get partition: Value for key name is null or empty"

2022-03-25 Thread CHC (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512259#comment-17512259
 ] 

CHC edited comment on SPARK-33144 at 3/25/22, 8:47 AM:
---

Also hit this on Spark 3.2.1:
{code:sql}
create table tmp.spark_multi_partition(
id int
)
partitioned by (name string, version string)
stored as orc
;

insert overwrite table tmp.spark_multi_partition partition (name, version)
select
1 as id
, 'hadoop' as name
, '2.7.3' as version
;{code}
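
For comparison (an assumption, not something stated in this thread): the same single-row insert written with a static partition spec bypasses Hive's dynamic-partition load path altogether, so it is one way to probe whether only the dynamic case is affected:

{code:scala}
// Hypothetical comparison, not from the report: the same single-row insert
// with a static partition spec, which bypasses dynamic partition resolution.
spark.sql("""
  insert overwrite table tmp.spark_multi_partition
  partition (name = 'hadoop', version = '2.7.3')
  select 1 as id
""")
{code}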


was (Author: chenxchen):
also met this at Spark 3.2.1
{code:java}
create table tmp.spark_multi_partition(
id int
)
partitioned by (name string, version string)
stored as orc
;

insert overwrite table tmp.spark_multi_partition partition (name, version)
select
1 as id
, 'hadoop' as name
, '2.7.3' as version
;{code}

> Cannot insert overwrite multiple partitions, get exception "get partition: 
> Value for key name is null or empty"
> -
>
> Key: SPARK-33144
> URL: https://issues.apache.org/jira/browse/SPARK-33144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
> Environment: hadoop 2.7.3 + spark 3.0.1
>Reporter: CHC
>Priority: Major
>
> When: 
> {code:sql}
> create table tmp.spark_multi_partition(
> id int
> )
> partitioned by (name string, version string)
> stored as orc
> ;
> set hive.exec.dynamic.partition=true;
> set spark.hadoop.hive.exec.dynamic.partition=true;
>  
> set hive.exec.dynamic.partition.mode=nonstrict;
> set spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict;
> insert overwrite table tmp.spark_multi_partition partition (name, version)
> select
> *
> from (
>   select
>   1 as id
>   , 'hadoop' as name
>   , '2.7.3' as version
>   union
>   select
>   2 as id
>   , 'spark' as name
>   , '3.0.1' as version
>   union
>   select
>   3 as id
>   , 'hive' as name
>   , '2.3.4' as version
> ) as A;
> {code}
> and get exception:
> {code:bash}
> INFO load-dynamic-partitions-0 [hive.ql.metadata.Hive:1919]: New loading path 
> = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/name=spark/version=3.0.1
>  with partSpec {name=spark, version=3.0.1}
> 20/10/14 09:15:33 INFO load-dynamic-partitions-1 
> [hive.ql.metadata.Hive:1919]: New loading path = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/name=hadoop/version=2.7.3
>  with partSpec {name=hadoop, version=2.7.3}
> 20/10/14 09:15:33 INFO load-dynamic-partitions-2 
> [hive.ql.metadata.Hive:1919]: New loading path = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/name=hive/version=2.3.4
>  with partSpec {name=hive, version=2.3.4}
> 20/10/14 09:15:33 INFO load-dynamic-partitions-3 
> [hive.ql.metadata.Hive:1919]: New loading path = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/_temporary/0
>  with partSpec {name=, version=}
> 20/10/14 09:15:33 ERROR load-dynamic-partitions-3 
> [hive.ql.metadata.Hive:1937]: Exception when loading partition with 
> parameters  
> partPath=hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/_temporary/0,
>   table=spark_multi_partition,  partSpec={name=, version=},  replace=true,  
> listBucketingEnabled=false,  isAcid=false,  hasFollowingStatsTask=false
> org.apache.hadoop.hive.ql.metadata.HiveException: get partition: Value for 
> key name is null or empty
>   at org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:2233)
>   at org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:2181)
>   at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1611)
>   at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1922)
>   at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1913)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 20/10/14 09:15:33 INFO Delete-Thread-0 
> [org.apache.hadoop.fs.TrashPolicyDefault:168]: Moved: 
> 'hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/name=spark/version=3.0.1/part-1-b745147b-600f-4c79-8ba2-12a99283b0a9.c000'
>  to trash at: 
> 

[jira] [Commented] (SPARK-33144) Cannot insert overwrite multiple partitions, get exception "get partition: Value for key name is null or empty"

2022-03-25 Thread CHC (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512259#comment-17512259
 ] 

CHC commented on SPARK-33144:
-

Also hit this on Spark 3.2.1:
{code:java}
create table tmp.spark_multi_partition(
id int
)
partitioned by (name string, version string)
stored as orc
;

insert overwrite table tmp.spark_multi_partition partition (name, version)
select
1 as id
, 'hadoop' as name
, '2.7.3' as version
;{code}

> Cannot insert overwrite multiple partitions, get exception "get partition: 
> Value for key name is null or empty"
> -
>
> Key: SPARK-33144
> URL: https://issues.apache.org/jira/browse/SPARK-33144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
> Environment: hadoop 2.7.3 + spark 3.0.1
>Reporter: CHC
>Priority: Major
>
> When: 
> {code:sql}
> create table tmp.spark_multi_partition(
> id int
> )
> partitioned by (name string, version string)
> stored as orc
> ;
> set hive.exec.dynamic.partition=true;
> set spark.hadoop.hive.exec.dynamic.partition=true;
>  
> set hive.exec.dynamic.partition.mode=nonstrict;
> set spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict;
> insert overwrite table tmp.spark_multi_partition partition (name, version)
> select
> *
> from (
>   select
>   1 as id
>   , 'hadoop' as name
>   , '2.7.3' as version
>   union
>   select
>   2 as id
>   , 'spark' as name
>   , '3.0.1' as version
>   union
>   select
>   3 as id
>   , 'hive' as name
>   , '2.3.4' as version
> ) as A;
> {code}
> and get exception:
> {code:bash}
> INFO load-dynamic-partitions-0 [hive.ql.metadata.Hive:1919]: New loading path 
> = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/name=spark/version=3.0.1
>  with partSpec {name=spark, version=3.0.1}
> 20/10/14 09:15:33 INFO load-dynamic-partitions-1 
> [hive.ql.metadata.Hive:1919]: New loading path = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/name=hadoop/version=2.7.3
>  with partSpec {name=hadoop, version=2.7.3}
> 20/10/14 09:15:33 INFO load-dynamic-partitions-2 
> [hive.ql.metadata.Hive:1919]: New loading path = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/name=hive/version=2.3.4
>  with partSpec {name=hive, version=2.3.4}
> 20/10/14 09:15:33 INFO load-dynamic-partitions-3 
> [hive.ql.metadata.Hive:1919]: New loading path = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/_temporary/0
>  with partSpec {name=, version=}
> 20/10/14 09:15:33 ERROR load-dynamic-partitions-3 
> [hive.ql.metadata.Hive:1937]: Exception when loading partition with 
> parameters  
> partPath=hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/_temporary/0,
>   table=spark_multi_partition,  partSpec={name=, version=},  replace=true,  
> listBucketingEnabled=false,  isAcid=false,  hasFollowingStatsTask=false
> org.apache.hadoop.hive.ql.metadata.HiveException: get partition: Value for 
> key name is null or empty
>   at org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:2233)
>   at org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:2181)
>   at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1611)
>   at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1922)
>   at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1913)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 20/10/14 09:15:33 INFO Delete-Thread-0 
> [org.apache.hadoop.fs.TrashPolicyDefault:168]: Moved: 
> 'hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/name=spark/version=3.0.1/part-1-b745147b-600f-4c79-8ba2-12a99283b0a9.c000'
>  to trash at: 
> hdfs://namespace/user/hive/.Trash/Current/apps/hive/warehouse/tmp.db/spark_multi_partition/name=spark/version=3.0.1/part-1-b745147b-600f-4c79-8ba2-12a99283b0a9.c000
> 20/10/14 09:15:33 INFO load-dynamic-partitions-0 
> [org.apache.hadoop.hive.common.FileUtils:520]: Creating directory if it 
> doesn't exist: 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/name=spark/version=3.0.1
> 20/10/14 

[jira] (SPARK-33144) Cannot insert overwrite multiple partitions, get exception "get partition: Value for key name is null or empty"

2022-03-25 Thread CHC (Jira)


[ https://issues.apache.org/jira/browse/SPARK-33144 ]


CHC deleted comment on SPARK-33144:
-

was (Author: chenxchen):
I hit SPARK-32838 and changed this configuration:
{code:sql}
set spark.sql.hive.convertInsertingPartitionedTable=false;
{code}
After changing this, insert overwrite into multiple partitions fails with this exception.
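
A minimal sketch of the corresponding mitigation, assuming (as the deleted comment implies, but does not confirm) that the exception only appears once this setting is turned off: leave spark.sql.hive.convertInsertingPartitionedTable at its default of true, so the partitioned insert overwrite stays on Spark's built-in writer instead of Hive's dynamic-partition load path:

{code:scala}
// Sketch inferred from the deleted comment above, not a confirmed fix:
// keep spark.sql.hive.convertInsertingPartitionedTable at its default (true),
// which (together with spark.sql.hive.convertMetastoreOrc) lets Spark's
// built-in writer handle the insert instead of Hive.loadDynamicPartitions.
spark.sql("set spark.sql.hive.convertInsertingPartitionedTable=true")

spark.sql("""
  insert overwrite table tmp.spark_multi_partition partition (name, version)
  select 1 as id, 'hadoop' as name, '2.7.3' as version
""")
{code}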
  
 

> Cannot insert overwrite multiple partitions, get exception "get partition: 
> Value for key name is null or empty"
> -
>
> Key: SPARK-33144
> URL: https://issues.apache.org/jira/browse/SPARK-33144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
> Environment: hadoop 2.7.3 + spark 3.0.1
>Reporter: CHC
>Priority: Major
>
> When: 
> {code:sql}
> create table tmp.spark_multi_partition(
> id int
> )
> partitioned by (name string, version string)
> stored as orc
> ;
> set hive.exec.dynamic.partition=true;
> set spark.hadoop.hive.exec.dynamic.partition=true;
>  
> set hive.exec.dynamic.partition.mode=nonstrict;
> set spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict;
> insert overwrite table tmp.spark_multi_partition partition (name, version)
> select
> *
> from (
>   select
>   1 as id
>   , 'hadoop' as name
>   , '2.7.3' as version
>   union
>   select
>   2 as id
>   , 'spark' as name
>   , '3.0.1' as version
>   union
>   select
>   3 as id
>   , 'hive' as name
>   , '2.3.4' as version
> ) as A;
> {code}
> and get exception:
> {code:bash}
> INFO load-dynamic-partitions-0 [hive.ql.metadata.Hive:1919]: New loading path 
> = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/name=spark/version=3.0.1
>  with partSpec {name=spark, version=3.0.1}
> 20/10/14 09:15:33 INFO load-dynamic-partitions-1 
> [hive.ql.metadata.Hive:1919]: New loading path = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/name=hadoop/version=2.7.3
>  with partSpec {name=hadoop, version=2.7.3}
> 20/10/14 09:15:33 INFO load-dynamic-partitions-2 
> [hive.ql.metadata.Hive:1919]: New loading path = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/name=hive/version=2.3.4
>  with partSpec {name=hive, version=2.3.4}
> 20/10/14 09:15:33 INFO load-dynamic-partitions-3 
> [hive.ql.metadata.Hive:1919]: New loading path = 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/_temporary/0
>  with partSpec {name=, version=}
> 20/10/14 09:15:33 ERROR load-dynamic-partitions-3 
> [hive.ql.metadata.Hive:1937]: Exception when loading partition with 
> parameters  
> partPath=hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/.hive-staging_hive_2020-10-14_09-15-27_718_4118806337003279343-1/-ext-1/_temporary/0,
>   table=spark_multi_partition,  partSpec={name=, version=},  replace=true,  
> listBucketingEnabled=false,  isAcid=false,  hasFollowingStatsTask=false
> org.apache.hadoop.hive.ql.metadata.HiveException: get partition: Value for 
> key name is null or empty
>   at org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:2233)
>   at org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:2181)
>   at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1611)
>   at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1922)
>   at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1913)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 20/10/14 09:15:33 INFO Delete-Thread-0 
> [org.apache.hadoop.fs.TrashPolicyDefault:168]: Moved: 
> 'hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/name=spark/version=3.0.1/part-1-b745147b-600f-4c79-8ba2-12a99283b0a9.c000'
>  to trash at: 
> hdfs://namespace/user/hive/.Trash/Current/apps/hive/warehouse/tmp.db/spark_multi_partition/name=spark/version=3.0.1/part-1-b745147b-600f-4c79-8ba2-12a99283b0a9.c000
> 20/10/14 09:15:33 INFO load-dynamic-partitions-0 
> [org.apache.hadoop.hive.common.FileUtils:520]: Creating directory if it 
> doesn't exist: 
> hdfs://namespace/apps/hive/warehouse/tmp.db/spark_multi_partition/name=spark/version=3.0.1
> 20/10/14 09:15:33 INFO Delete-Thread-0 
> [org.apache.hadoop.fs.TrashPolicyDefault:168]: Moved: 
> 

[jira] [Resolved] (SPARK-38643) Validate input dataset of ml.regression

2022-03-25 Thread Huaxin Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao resolved SPARK-38643.

  Assignee: zhengruifeng
Resolution: Fixed

> Validate input dataset of ml.regression
> ---
>
> Key: SPARK-38643
> URL: https://issues.apache.org/jira/browse/SPARK-38643
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.4.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org