[jira] [Updated] (SPARK-36832) Add LauncherProtocol to K8s resource manager client

2024-02-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-36832:
---
Labels: pull-request-available  (was: )

> Add LauncherProtocol to K8s resource manager client
> ---
>
> Key: SPARK-36832
> URL: https://issues.apache.org/jira/browse/SPARK-36832
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 3.1.2
>Reporter: timothy65535
>Priority: Major
>  Labels: pull-request-available
>
> Refer to: org.apache.spark.deploy.yarn.Client
> {code:java}
> private val launcherBackend = new LauncherBackend() {
>   override protected def conf: SparkConf = sparkConf
>
>   override def onStopRequest(): Unit = {
>     if (isClusterMode && appId != null) {
>       yarnClient.killApplication(appId)
>     } else {
>       setState(SparkAppHandle.State.KILLED)
>       stop()
>     }
>   }
> }
> {code}
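For the K8s resource manager client, the idea is to wire the same LauncherBackend hook into the Kubernetes submission client. The sketch below is illustrative only: it reuses the LauncherBackend API shown above, but the class it would live in, `driverPodName`, and the `deleteDriverResources()` call used to kill the driver pod are hypothetical placeholders, not existing Spark APIs.

{code:java}
// Illustrative sketch of a K8s analog of the YARN hook above (not an existing implementation).
private val launcherBackend = new LauncherBackend() {
  override protected def conf: SparkConf = sparkConf

  override def onStopRequest(): Unit = {
    if (isClusterMode && driverPodName != null) {
      // Hypothetical helper: delete the driver pod and its resources via the K8s API client.
      deleteDriverResources(driverPodName)
    } else {
      setState(SparkAppHandle.State.KILLED)
      stop()
    }
  }
}
{code}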



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46976) Implement `DataFrameGroupBy.corr`

2024-02-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46976:
---
Labels: pull-request-available  (was: )

> Implement `DataFrameGroupBy.corr`
> -
>
> Key: SPARK-46976
> URL: https://issues.apache.org/jira/browse/SPARK-46976
> Project: Spark
>  Issue Type: Sub-task
>  Components: PS
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46976) Implement `DataFrameGroupBy.corr`

2024-02-04 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-46976:
-

 Summary: Implement `DataFrameGroupBy.corr`
 Key: SPARK-46976
 URL: https://issues.apache.org/jira/browse/SPARK-46976
 Project: Spark
  Issue Type: Sub-task
  Components: PS
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46975) Move to_{hdf, feather, stata} to the fallback list

2024-02-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46975:
---
Labels: pull-request-available  (was: )

> Move to_{hdf, feather, stata} to the fallback list
> --
>
> Key: SPARK-46975
> URL: https://issues.apache.org/jira/browse/SPARK-46975
> Project: Spark
>  Issue Type: Sub-task
>  Components: PS
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46975) Move to_{hdf, feather, stata} to the fallback list

2024-02-04 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-46975:
-

 Summary: Move to_{hdf, feather, stata} to the fallback list
 Key: SPARK-46975
 URL: https://issues.apache.org/jira/browse/SPARK-46975
 Project: Spark
  Issue Type: Sub-task
  Components: PS
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-46934) Unable to create Hive View from certain Spark Dataframe StructType

2024-02-04 Thread Yu-Ting LIN (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17814184#comment-17814184
 ] 

Yu-Ting LIN commented on SPARK-46934:
-

[~yao] Thanks. One more question about this hotfix: will this issue be fixed in both 3.3.4 and 3.5.0?

> Unable to create Hive View from certain Spark Dataframe StructType
> --
>
> Key: SPARK-46934
> URL: https://issues.apache.org/jira/browse/SPARK-46934
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.3.2, 3.3.4
> Environment: Tested in Spark 3.3.0, 3.3.2.
>Reporter: Yu-Ting LIN
>Priority: Blocker
>
> We are trying to create a Hive view using the following SQL command: "CREATE OR 
> REPLACE VIEW yuting AS SELECT INFO_ANN FROM table_2611810".
> Our table_2611810 has certain columns that contain special characters such as "/". 
> Here is the schema of this table.
> {code:java}
> contigName              string
> start                   bigint
> end                     bigint
> names                   array
> referenceAllele         string
> alternateAlleles        array
> qual                    double
> filters                 array
> splitFromMultiAllelic    boolean
> INFO_NCAMP              int
> INFO_ODDRATIO           double
> INFO_NM                 double
> INFO_DBSNP_CAF          array
> INFO_SPANPAIR           int
> INFO_TLAMP              int
> INFO_PSTD               double
> INFO_QSTD               double
> INFO_SBF                double
> INFO_AF                 array
> INFO_QUAL               double
> INFO_SHIFT3             int
> INFO_VARBIAS            string
> INFO_HICOV              int
> INFO_PMEAN              double
> INFO_MSI                double
> INFO_VD                 int
> INFO_DP                 int
> INFO_HICNT              int
> INFO_ADJAF              double
> INFO_SVLEN              int
> INFO_RSEQ               string
> INFO_MSigDb             array
> INFO_NMD                array
> INFO_ANN                
> array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>>
> INFO_BIAS               string
> INFO_MQ                 double
> INFO_HIAF               double
> INFO_END                int
> INFO_SPLITREAD          int
> INFO_GDAMP              int
> INFO_LSEQ               string
> INFO_LOF                array
> INFO_SAMPLE             string
> INFO_AMPFLAG            int
> INFO_SN                 double
> INFO_SVTYPE             string
> INFO_TYPE               string
> INFO_MSILEN             double
> INFO_DUPRATE            double
> INFO_DBSNP_COMMON       int
> INFO_REFBIAS            string
> genotypes               
> array,ALD:array,AF:array,phased:boolean,calls:array,VD:int,depth:int,RD:array>>
>  {code}
> You can see that column INFO_ANN is an array of structs, and it contains fields 
> whose names include "/", such as "cDNA_pos/cDNA_length". 
> We believe this is the root cause of the following SparkException:
> {code:java}
> scala> val schema = spark.sql("CREATE OR REPLACE VIEW yuting AS SELECT 
> INFO_ANN FROM table_2611810")
> 24/01/31 07:50:02.658 [main] WARN  o.a.spark.sql.catalyst.util.package - 
> Truncated the string representation of a plan since it was too large. This 
> behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>>,
>  column: INFO_ANN
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.cannotRecognizeHiveTypeError(QueryExecutionErrors.scala:1455)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.getSparkSQLDataType(HiveClientImpl.scala:1022)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.$anonfun$verifyColumnDataType$1(HiveClientImpl.scala:1037)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:102)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.org$apache$spark$sql$hive$client$HiveClientImpl$$verifyColumnDataType(Hiv
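To make the root cause concrete, here is a small, untested sketch (plain Spark SQL types, no Hive table required) showing how a nested field name containing "/" produces exactly the kind of type string that HiveClientImpl's getSparkSQLDataType fails to parse; the field name is taken from the schema above.

{code:java}
import org.apache.spark.sql.types._

// A schema with a nested struct field whose name contains "/" (as in INFO_ANN above).
val schema = StructType(Seq(
  StructField("INFO_ANN", ArrayType(StructType(Seq(
    StructField("cDNA_pos/cDNA_length", StringType)))))))

// The catalog (Hive-facing) type string keeps the raw "/":
//   array<struct<cDNA_pos/cDNA_length:string>>
// which is the shape of the string rejected with "Cannot recognize hive type string".
println(schema("INFO_ANN").dataType.catalogString)
{code}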

[jira] [Updated] (SPARK-28346) clone the query plan between analyzer, optimizer and planner

2024-02-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-28346:
---
Labels: pull-request-available  (was: )

> clone the query plan between analyzer, optimizer and planner
> 
>
> Key: SPARK-28346
> URL: https://issues.apache.org/jira/browse/SPARK-28346
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-46938) Migrate jetty 10 to jetty 11

2024-02-04 Thread HiuFung Kwok (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17814173#comment-17814173
 ] 

HiuFung Kwok commented on SPARK-46938:
--

[~LuciferYang] 

Hi, 

it turns out that libthrift will also need to be bumped to v0.19, as there are a few 
Servlet references on their end.

Do we need a separate ticket for this? I don't think a separate MR is possible, though: 
the Jersey, libthrift and Jetty 11 upgrades all need to happen within one MR due to the 
Jakarta package name change.

https://issues.apache.org/jira/browse/THRIFT-5700?jql=project%20%3D%20THRIFT%20AND%20text%20~%20%22Jakarta%22%20ORDER%20BY%20created%20DESC

> Migrate jetty 10 to jetty 11
> 
>
> Key: SPARK-46938
> URL: https://issues.apache.org/jira/browse/SPARK-46938
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-46934) Unable to create Hive View from certain Spark Dataframe StructType

2024-02-04 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17814166#comment-17814166
 ] 

Kent Yao commented on SPARK-46934:
--

I will fix this 

> Unable to create Hive View from certain Spark Dataframe StructType
> --
>
> Key: SPARK-46934
> URL: https://issues.apache.org/jira/browse/SPARK-46934
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.3.2, 3.3.4
> Environment: Tested in Spark 3.3.0, 3.3.2.
>Reporter: Yu-Ting LIN
>Priority: Blocker
>
> We are trying to create a Hive view using the following SQL command: "CREATE OR 
> REPLACE VIEW yuting AS SELECT INFO_ANN FROM table_2611810".
> Our table_2611810 has certain columns that contain special characters such as "/". 
> Here is the schema of this table.
> {code:java}
> contigName              string
> start                   bigint
> end                     bigint
> names                   array
> referenceAllele         string
> alternateAlleles        array
> qual                    double
> filters                 array
> splitFromMultiAllelic    boolean
> INFO_NCAMP              int
> INFO_ODDRATIO           double
> INFO_NM                 double
> INFO_DBSNP_CAF          array
> INFO_SPANPAIR           int
> INFO_TLAMP              int
> INFO_PSTD               double
> INFO_QSTD               double
> INFO_SBF                double
> INFO_AF                 array
> INFO_QUAL               double
> INFO_SHIFT3             int
> INFO_VARBIAS            string
> INFO_HICOV              int
> INFO_PMEAN              double
> INFO_MSI                double
> INFO_VD                 int
> INFO_DP                 int
> INFO_HICNT              int
> INFO_ADJAF              double
> INFO_SVLEN              int
> INFO_RSEQ               string
> INFO_MSigDb             array
> INFO_NMD                array
> INFO_ANN                
> array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>>
> INFO_BIAS               string
> INFO_MQ                 double
> INFO_HIAF               double
> INFO_END                int
> INFO_SPLITREAD          int
> INFO_GDAMP              int
> INFO_LSEQ               string
> INFO_LOF                array
> INFO_SAMPLE             string
> INFO_AMPFLAG            int
> INFO_SN                 double
> INFO_SVTYPE             string
> INFO_TYPE               string
> INFO_MSILEN             double
> INFO_DUPRATE            double
> INFO_DBSNP_COMMON       int
> INFO_REFBIAS            string
> genotypes               
> array,ALD:array,AF:array,phased:boolean,calls:array,VD:int,depth:int,RD:array>>
>  {code}
> You can see that column INFO_ANN is an array of structs, and it contains fields 
> whose names include "/", such as "cDNA_pos/cDNA_length". 
> We believe this is the root cause of the following SparkException:
> {code:java}
> scala> val schema = spark.sql("CREATE OR REPLACE VIEW yuting AS SELECT 
> INFO_ANN FROM table_2611810")
> 24/01/31 07:50:02.658 [main] WARN  o.a.spark.sql.catalyst.util.package - 
> Truncated the string representation of a plan since it was too large. This 
> behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
> org.apache.spark.SparkException: Cannot recognize hive type string: 
> array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>>,
>  column: INFO_ANN
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.cannotRecognizeHiveTypeError(QueryExecutionErrors.scala:1455)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.getSparkSQLDataType(HiveClientImpl.scala:1022)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.$anonfun$verifyColumnDataType$1(HiveClientImpl.scala:1037)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:102)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl$.org$apache$spark$sql$hive$client$HiveClientImpl$$verifyColumnDataType(HiveClientImpl.scala:1037)
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$cr

[jira] [Assigned] (SPARK-46974) Recover a test case for day-of-year 2-letter 'DD' pattern parsing 3-digit values

2024-02-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-46974:


Assignee: Kent Yao

> Recover a test case for day-of-year 2-letter 'DD' pattern parsing 3-digit 
> values
> 
>
> Key: SPARK-46974
> URL: https://issues.apache.org/jira/browse/SPARK-46974
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46974) Recover a test case for day-of-year 2-letter 'DD' pattern parsing 3-digit values

2024-02-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-46974.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45024
[https://github.com/apache/spark/pull/45024]

> Recover a test case for day-of-year 2-letter 'DD' pattern parsing 3-digit 
> values
> 
>
> Key: SPARK-46974
> URL: https://issues.apache.org/jira/browse/SPARK-46974
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46954) XML: Perf optimizations

2024-02-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-46954.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44997
[https://github.com/apache/spark/pull/44997]

> XML: Perf optimizations
> ---
>
> Key: SPARK-46954
> URL: https://issues.apache.org/jira/browse/SPARK-46954
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Sandip Agarwala
>Assignee: Sandip Agarwala
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46954) XML: Perf optimizations

2024-02-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-46954:


Assignee: Sandip Agarwala

> XML: Perf optimizations
> ---
>
> Key: SPARK-46954
> URL: https://issues.apache.org/jira/browse/SPARK-46954
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Sandip Agarwala
>Assignee: Sandip Agarwala
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46974) Recover a test case for day-of-year 2-letter 'DD' pattern parsing 3-digit values

2024-02-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46974:
---
Labels: pull-request-available  (was: )

> Recover a test case for day-of-year 2-letter 'DD' pattern parsing 3-digit 
> values
> 
>
> Key: SPARK-46974
> URL: https://issues.apache.org/jira/browse/SPARK-46974
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-45387) Partition key filter cannot be pushed down when using cast

2024-02-04 Thread Jie Han (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17814150#comment-17814150
 ] 

Jie Han edited comment on SPARK-45387 at 2/5/24 3:49 AM:
-

I can't reproduce it in Spark 3.5.0.

I tried creating a partitioned CSV table on HDFS as follows:
{code:java}
create external table noaa (column0 string, column1 int, column2 string, 
column3 int, column4 string, column5 string, column6 string, column7 string) 
PARTITIONED BY (year string) ROW FORMAT SERDE 
'org.apache.hadoop.hive.serde2.OpenCSVSerde' LOCATION '/tmp/noaa'; 

alter table noaa add partition (year = '2019') LOCATION '/tmp/noaa/year=2019';

alter table noaa add partition (year = '2020') LOCATION 
'/tmp/noaa/year=2020';{code}
and the spark plan is 
{code:java}
scala> spark.sql("select * from noaa where year=2019 limit 10").explain(true)
== Parsed Logical Plan ==
'GlobalLimit 10
+- 'LocalLimit 10
   +- 'Project [*]
      +- 'Filter ('year = 2019)
         +- 'UnresolvedRelation [noaa], [], false== Analyzed Logical Plan ==
column0: string, column1: string, column2: string, column3: string, column4: 
string, column5: string, column6: string, column7: string, year: string
GlobalLimit 10
+- LocalLimit 10
   +- Project [column0#55, column1#56, column2#57, column3#58, column4#59, 
column5#60, column6#61, column7#62, year#63]
      +- Filter (cast(year#63 as int) = 2019)
         +- SubqueryAlias spark_catalog.default.noaa
            +- HiveTableRelation [`spark_catalog`.`default`.`noaa`, 
org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55, column1#56, 
column2#57, column3#58, column4#59, column5#60, column6#61, column7#62], 
Partition Cols: [year#63]]== Optimized Logical Plan ==
GlobalLimit 10
+- LocalLimit 10
   +- Filter (isnotnull(year#63) AND (cast(year#63 as int) = 2019))
      +- HiveTableRelation [`spark_catalog`.`default`.`noaa`, 
org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55, column1#56, 
column2#57, column3#58, column4#59, column5#60, column6#61, column7#62], 
Partition Cols: [year#63], Pruned Partitions: [(year=2019)]]== Physical Plan ==
CollectLimit 10
+- Scan hive spark_catalog.default.noaa [column0#55, column1#56, column2#57, 
column3#58, column4#59, column5#60, column6#61, column7#62, year#63], 
HiveTableRelation [`spark_catalog`.`default`.`noaa`, 
org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55, column1#56, 
column2#57, column3#58, column4#59, column5#60, column6#61, column7#62], 
Partition Cols: [year#63], Pruned Partitions: [(year=2019)]], 
[isnotnull(year#63), (cast(year#63 as int) = 2019)]{code}
It seems that the filter has been pushed down.


was (Author: JIRAUSER285788):
I can't reproduce it at spark 3.5.0.

I tried create a partitioned csv table on hdfs like follow:
{code:java}
create external table noaa (column0 string, column1 int, column2 string, 
column3 int, column4 string, column5 string, column6 string, column7 string) 
PARTITIONED BY (year string) ROW FORMAT SERDE 
'org.apache.hadoop.hive.serde2.OpenCSVSerde' LOCATION '/tmp/noaa'; 

alter table noaa add partition (year = '2019') LOCATION '/tmp/noaa/year=2019';

alter table noaa add partition (year = '2020') LOCATION 
'/tmp/noaa/year=2020';{code}
and the spark plan is 
{code:java}
scala> spark.sql("select * from noaa where year=2019 limit 10").explain(true)
== Parsed Logical Plan ==
'GlobalLimit 10
+- 'LocalLimit 10
   +- 'Project [*]
      +- 'Filter ('year = 2019)
         +- 'UnresolvedRelation [noaa], [], false== Analyzed Logical Plan ==
column0: string, column1: string, column2: string, column3: string, column4: 
string, column5: string, column6: string, column7: string, year: string
GlobalLimit 10
+- LocalLimit 10
   +- Project [column0#55, column1#56, column2#57, column3#58, column4#59, 
column5#60, column6#61, column7#62, year#63]
      +- Filter (cast(year#63 as int) = 2019)
         +- SubqueryAlias spark_catalog.default.noaa
            +- HiveTableRelation [`spark_catalog`.`default`.`noaa`, 
org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55, column1#56, 
column2#57, column3#58, column4#59, column5#60, column6#61, column7#62], 
Partition Cols: [year#63]]== Optimized Logical Plan ==
GlobalLimit 10
+- LocalLimit 10
   +- Filter (isnotnull(year#63) AND (cast(year#63 as int) = 2019))
      +- HiveTableRelation [`spark_catalog`.`default`.`noaa`, 
org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55, column1#56, 
column2#57, column3#58, column4#59, column5#60, column6#61, column7#62], 
Partition Cols: [year#63], Pruned Partitions: [(year=2019)]]== Physical Plan ==
CollectLimit 10
+- Scan hive spark_catalog.default.noaa [column0#55, column1#56, column2#57, 
column3#58, column4#59, column5#60, column6#61, column7#62, year#63], 
HiveTableRelation [`spark_catalog`.`default`.`noaa`, 
org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55, 

[jira] [Comment Edited] (SPARK-45387) Partition key filter cannot be pushed down when using cast

2024-02-04 Thread Jie Han (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17814150#comment-17814150
 ] 

Jie Han edited comment on SPARK-45387 at 2/5/24 3:49 AM:
-

I can't reproduce it in Spark 3.5.0.

I tried creating a partitioned CSV table on HDFS as follows:
{code:java}
create external table noaa (column0 string, column1 int, column2 string, 
column3 int, column4 string, column5 string, column6 string, column7 string) 
PARTITIONED BY (year string) ROW FORMAT SERDE 
'org.apache.hadoop.hive.serde2.OpenCSVSerde' LOCATION '/tmp/noaa'; 

alter table noaa add partition (year = '2019') LOCATION '/tmp/noaa/year=2019';

alter table noaa add partition (year = '2020') LOCATION 
'/tmp/noaa/year=2020';{code}
and the spark plan is 
{code:java}
scala> spark.sql("select * from noaa where year=2019 limit 10").explain(true)
== Parsed Logical Plan ==
'GlobalLimit 10
+- 'LocalLimit 10
   +- 'Project [*]
      +- 'Filter ('year = 2019)
         +- 'UnresolvedRelation [noaa], [], false== Analyzed Logical Plan ==
column0: string, column1: string, column2: string, column3: string, column4: 
string, column5: string, column6: string, column7: string, year: string
GlobalLimit 10
+- LocalLimit 10
   +- Project [column0#55, column1#56, column2#57, column3#58, column4#59, 
column5#60, column6#61, column7#62, year#63]
      +- Filter (cast(year#63 as int) = 2019)
         +- SubqueryAlias spark_catalog.default.noaa
            +- HiveTableRelation [`spark_catalog`.`default`.`noaa`, 
org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55, column1#56, 
column2#57, column3#58, column4#59, column5#60, column6#61, column7#62], 
Partition Cols: [year#63]]== Optimized Logical Plan ==
GlobalLimit 10
+- LocalLimit 10
   +- Filter (isnotnull(year#63) AND (cast(year#63 as int) = 2019))
      +- HiveTableRelation [`spark_catalog`.`default`.`noaa`, 
org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55, column1#56, 
column2#57, column3#58, column4#59, column5#60, column6#61, column7#62], 
Partition Cols: [year#63], Pruned Partitions: [(year=2019)]]== Physical Plan ==
CollectLimit 10
+- Scan hive spark_catalog.default.noaa [column0#55, column1#56, column2#57, 
column3#58, column4#59, column5#60, column6#61, column7#62, year#63], 
HiveTableRelation [`spark_catalog`.`default`.`noaa`, 
org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55, column1#56, 
column2#57, column3#58, column4#59, column5#60, column6#61, column7#62], 
Partition Cols: [year#63], Pruned Partitions: [(year=2019)]], 
[isnotnull(year#63), (cast(year#63 as int) = 2019)]{code}
The filter has been pushed down.


was (Author: JIRAUSER285788):
I can't reproduce it in spark 3.5.0.

I tried create a partitioned csv table on hdfs like follow:
{code:java}
create external table noaa (column0 string, column1 int, column2 string, 
column3 int, column4 string, column5 string, column6 string, column7 string) 
PARTITIONED BY (year string) ROW FORMAT SERDE 
'org.apache.hadoop.hive.serde2.OpenCSVSerde' LOCATION '/tmp/noaa'; 

alter table noaa add partition (year = '2019') LOCATION '/tmp/noaa/year=2019';

alter table noaa add partition (year = '2020') LOCATION 
'/tmp/noaa/year=2020';{code}
and the spark plan is 
{code:java}
scala> spark.sql("select * from noaa where year=2019 limit 10").explain(true)
== Parsed Logical Plan ==
'GlobalLimit 10
+- 'LocalLimit 10
   +- 'Project [*]
      +- 'Filter ('year = 2019)
         +- 'UnresolvedRelation [noaa], [], false== Analyzed Logical Plan ==
column0: string, column1: string, column2: string, column3: string, column4: 
string, column5: string, column6: string, column7: string, year: string
GlobalLimit 10
+- LocalLimit 10
   +- Project [column0#55, column1#56, column2#57, column3#58, column4#59, 
column5#60, column6#61, column7#62, year#63]
      +- Filter (cast(year#63 as int) = 2019)
         +- SubqueryAlias spark_catalog.default.noaa
            +- HiveTableRelation [`spark_catalog`.`default`.`noaa`, 
org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55, column1#56, 
column2#57, column3#58, column4#59, column5#60, column6#61, column7#62], 
Partition Cols: [year#63]]== Optimized Logical Plan ==
GlobalLimit 10
+- LocalLimit 10
   +- Filter (isnotnull(year#63) AND (cast(year#63 as int) = 2019))
      +- HiveTableRelation [`spark_catalog`.`default`.`noaa`, 
org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55, column1#56, 
column2#57, column3#58, column4#59, column5#60, column6#61, column7#62], 
Partition Cols: [year#63], Pruned Partitions: [(year=2019)]]== Physical Plan ==
CollectLimit 10
+- Scan hive spark_catalog.default.noaa [column0#55, column1#56, column2#57, 
column3#58, column4#59, column5#60, column6#61, column7#62, year#63], 
HiveTableRelation [`spark_catalog`.`default`.`noaa`, 
org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55, column1#56

[jira] [Comment Edited] (SPARK-45387) Partition key filter cannot be pushed down when using cast

2024-02-04 Thread Jie Han (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17814150#comment-17814150
 ] 

Jie Han edited comment on SPARK-45387 at 2/5/24 3:49 AM:
-

I can't reproduce it in Spark 3.5.0.

I tried creating a partitioned CSV table on HDFS as follows:
{code:java}
create external table noaa (column0 string, column1 int, column2 string, 
column3 int, column4 string, column5 string, column6 string, column7 string) 
PARTITIONED BY (year string) ROW FORMAT SERDE 
'org.apache.hadoop.hive.serde2.OpenCSVSerde' LOCATION '/tmp/noaa'; 

alter table noaa add partition (year = '2019') LOCATION '/tmp/noaa/year=2019';

alter table noaa add partition (year = '2020') LOCATION 
'/tmp/noaa/year=2020';{code}
and the spark plan is 
{code:java}
scala> spark.sql("select * from noaa where year=2019 limit 10").explain(true)
== Parsed Logical Plan ==
'GlobalLimit 10
+- 'LocalLimit 10
   +- 'Project [*]
      +- 'Filter ('year = 2019)
         +- 'UnresolvedRelation [noaa], [], false== Analyzed Logical Plan ==
column0: string, column1: string, column2: string, column3: string, column4: 
string, column5: string, column6: string, column7: string, year: string
GlobalLimit 10
+- LocalLimit 10
   +- Project [column0#55, column1#56, column2#57, column3#58, column4#59, 
column5#60, column6#61, column7#62, year#63]
      +- Filter (cast(year#63 as int) = 2019)
         +- SubqueryAlias spark_catalog.default.noaa
            +- HiveTableRelation [`spark_catalog`.`default`.`noaa`, 
org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55, column1#56, 
column2#57, column3#58, column4#59, column5#60, column6#61, column7#62], 
Partition Cols: [year#63]]== Optimized Logical Plan ==
GlobalLimit 10
+- LocalLimit 10
   +- Filter (isnotnull(year#63) AND (cast(year#63 as int) = 2019))
      +- HiveTableRelation [`spark_catalog`.`default`.`noaa`, 
org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55, column1#56, 
column2#57, column3#58, column4#59, column5#60, column6#61, column7#62], 
Partition Cols: [year#63], Pruned Partitions: [(year=2019)]]== Physical Plan ==
CollectLimit 10
+- Scan hive spark_catalog.default.noaa [column0#55, column1#56, column2#57, 
column3#58, column4#59, column5#60, column6#61, column7#62, year#63], 
HiveTableRelation [`spark_catalog`.`default`.`noaa`, 
org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55, column1#56, 
column2#57, column3#58, column4#59, column5#60, column6#61, column7#62], 
Partition Cols: [year#63], Pruned Partitions: [(year=2019)]], 
[isnotnull(year#63), (cast(year#63 as int) = 2019)]{code}
It seems that the filter has been pushed down.


was (Author: JIRAUSER285788):
I can't reproduce it.

I tried create a partitioned csv table on hdfs like follow:
{code:java}
create external table noaa (column0 string, column1 int, column2 string, 
column3 int, column4 string, column5 string, column6 string, column7 string) 
PARTITIONED BY (year string) ROW FORMAT SERDE 
'org.apache.hadoop.hive.serde2.OpenCSVSerde' LOCATION '/tmp/noaa'; 

alter table noaa add partition (year = '2019') LOCATION '/tmp/noaa/year=2019';

alter table noaa add partition (year = '2020') LOCATION 
'/tmp/noaa/year=2020';{code}
and the spark plan is 
{code:java}
scala> spark.sql("select * from noaa where year=2019 limit 10").explain(true)
== Parsed Logical Plan ==
'GlobalLimit 10
+- 'LocalLimit 10
   +- 'Project [*]
      +- 'Filter ('year = 2019)
         +- 'UnresolvedRelation [noaa], [], false== Analyzed Logical Plan ==
column0: string, column1: string, column2: string, column3: string, column4: 
string, column5: string, column6: string, column7: string, year: string
GlobalLimit 10
+- LocalLimit 10
   +- Project [column0#55, column1#56, column2#57, column3#58, column4#59, 
column5#60, column6#61, column7#62, year#63]
      +- Filter (cast(year#63 as int) = 2019)
         +- SubqueryAlias spark_catalog.default.noaa
            +- HiveTableRelation [`spark_catalog`.`default`.`noaa`, 
org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55, column1#56, 
column2#57, column3#58, column4#59, column5#60, column6#61, column7#62], 
Partition Cols: [year#63]]== Optimized Logical Plan ==
GlobalLimit 10
+- LocalLimit 10
   +- Filter (isnotnull(year#63) AND (cast(year#63 as int) = 2019))
      +- HiveTableRelation [`spark_catalog`.`default`.`noaa`, 
org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55, column1#56, 
column2#57, column3#58, column4#59, column5#60, column6#61, column7#62], 
Partition Cols: [year#63], Pruned Partitions: [(year=2019)]]== Physical Plan ==
CollectLimit 10
+- Scan hive spark_catalog.default.noaa [column0#55, column1#56, column2#57, 
column3#58, column4#59, column5#60, column6#61, column7#62, year#63], 
HiveTableRelation [`spark_catalog`.`default`.`noaa`, 
org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55, column1#56, 
co

[jira] [Comment Edited] (SPARK-45387) Partition key filter cannot be pushed down when using cast

2024-02-04 Thread Jie Han (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17814150#comment-17814150
 ] 

Jie Han edited comment on SPARK-45387 at 2/5/24 3:48 AM:
-

I can't reproduce it.

I tried creating a partitioned CSV table on HDFS as follows:
{code:java}
create external table noaa (column0 string, column1 int, column2 string, 
column3 int, column4 string, column5 string, column6 string, column7 string) 
PARTITIONED BY (year string) ROW FORMAT SERDE 
'org.apache.hadoop.hive.serde2.OpenCSVSerde' LOCATION '/tmp/noaa'; 

alter table noaa add partition (year = '2019') LOCATION '/tmp/noaa/year=2019';

alter table noaa add partition (year = '2020') LOCATION 
'/tmp/noaa/year=2020';{code}
and the spark plan is 
{code:java}
scala> spark.sql("select * from noaa where year=2019 limit 10").explain(true)
== Parsed Logical Plan ==
'GlobalLimit 10
+- 'LocalLimit 10
   +- 'Project [*]
      +- 'Filter ('year = 2019)
         +- 'UnresolvedRelation [noaa], [], false== Analyzed Logical Plan ==
column0: string, column1: string, column2: string, column3: string, column4: 
string, column5: string, column6: string, column7: string, year: string
GlobalLimit 10
+- LocalLimit 10
   +- Project [column0#55, column1#56, column2#57, column3#58, column4#59, 
column5#60, column6#61, column7#62, year#63]
      +- Filter (cast(year#63 as int) = 2019)
         +- SubqueryAlias spark_catalog.default.noaa
            +- HiveTableRelation [`spark_catalog`.`default`.`noaa`, 
org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55, column1#56, 
column2#57, column3#58, column4#59, column5#60, column6#61, column7#62], 
Partition Cols: [year#63]]== Optimized Logical Plan ==
GlobalLimit 10
+- LocalLimit 10
   +- Filter (isnotnull(year#63) AND (cast(year#63 as int) = 2019))
      +- HiveTableRelation [`spark_catalog`.`default`.`noaa`, 
org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55, column1#56, 
column2#57, column3#58, column4#59, column5#60, column6#61, column7#62], 
Partition Cols: [year#63], Pruned Partitions: [(year=2019)]]== Physical Plan ==
CollectLimit 10
+- Scan hive spark_catalog.default.noaa [column0#55, column1#56, column2#57, 
column3#58, column4#59, column5#60, column6#61, column7#62, year#63], 
HiveTableRelation [`spark_catalog`.`default`.`noaa`, 
org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55, column1#56, 
column2#57, column3#58, column4#59, column5#60, column6#61, column7#62], 
Partition Cols: [year#63], Pruned Partitions: [(year=2019)]], 
[isnotnull(year#63), (cast(year#63 as int) = 2019)]{code}
It seems that the filter has been pushed down.

Can you give me a short reproduction?


was (Author: JIRAUSER285788):
I can't reproduce it.

I tried create a partitioned csv table on hdfs like follow:
{code:java}
create external table noaa (column0 string, column1 int, column2 string, 
column3 int, column4 string, column5 string, column6 string, column7 string) 
PARTITIONED BY (year string) ROW FORMAT SERDE 
'org.apache.hadoop.hive.serde2.OpenCSVSerde' LOCATION '/tmp/noaa'; {code}
and the spark plan is 
{code:java}
scala> spark.sql("select * from noaa where year=2019 limit 10").explain(true)
== Parsed Logical Plan ==
'GlobalLimit 10
+- 'LocalLimit 10
   +- 'Project [*]
      +- 'Filter ('year = 2019)
         +- 'UnresolvedRelation [noaa], [], false== Analyzed Logical Plan ==
column0: string, column1: string, column2: string, column3: string, column4: 
string, column5: string, column6: string, column7: string, year: string
GlobalLimit 10
+- LocalLimit 10
   +- Project [column0#55, column1#56, column2#57, column3#58, column4#59, 
column5#60, column6#61, column7#62, year#63]
      +- Filter (cast(year#63 as int) = 2019)
         +- SubqueryAlias spark_catalog.default.noaa
            +- HiveTableRelation [`spark_catalog`.`default`.`noaa`, 
org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55, column1#56, 
column2#57, column3#58, column4#59, column5#60, column6#61, column7#62], 
Partition Cols: [year#63]]== Optimized Logical Plan ==
GlobalLimit 10
+- LocalLimit 10
   +- Filter (isnotnull(year#63) AND (cast(year#63 as int) = 2019))
      +- HiveTableRelation [`spark_catalog`.`default`.`noaa`, 
org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55, column1#56, 
column2#57, column3#58, column4#59, column5#60, column6#61, column7#62], 
Partition Cols: [year#63], Pruned Partitions: [(year=2019)]]== Physical Plan ==
CollectLimit 10
+- Scan hive spark_catalog.default.noaa [column0#55, column1#56, column2#57, 
column3#58, column4#59, column5#60, column6#61, column7#62, year#63], 
HiveTableRelation [`spark_catalog`.`default`.`noaa`, 
org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55, column1#56, 
column2#57, column3#58, column4#59, column5#60, column6#61, column7#62], 
Partition Cols: [year#63], Pruned Partitions: [(year=2019)]], 
[i

[jira] [Comment Edited] (SPARK-45387) Partition key filter cannot be pushed down when using cast

2024-02-04 Thread Jie Han (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17814150#comment-17814150
 ] 

Jie Han edited comment on SPARK-45387 at 2/5/24 3:46 AM:
-

I can't reproduce it.

I tried creating a partitioned CSV table on HDFS as follows:
{code:java}
create external table noaa (column0 string, column1 int, column2 string, 
column3 int, column4 string, column5 string, column6 string, column7 string) 
PARTITIONED BY (year string) ROW FORMAT SERDE 
'org.apache.hadoop.hive.serde2.OpenCSVSerde' LOCATION '/tmp/noaa'; {code}
and the spark plan is 
{code:java}
scala> spark.sql("select * from noaa where year=2019 limit 10").explain(true)
== Parsed Logical Plan ==
'GlobalLimit 10
+- 'LocalLimit 10
   +- 'Project [*]
      +- 'Filter ('year = 2019)
         +- 'UnresolvedRelation [noaa], [], false== Analyzed Logical Plan ==
column0: string, column1: string, column2: string, column3: string, column4: 
string, column5: string, column6: string, column7: string, year: string
GlobalLimit 10
+- LocalLimit 10
   +- Project [column0#55, column1#56, column2#57, column3#58, column4#59, 
column5#60, column6#61, column7#62, year#63]
      +- Filter (cast(year#63 as int) = 2019)
         +- SubqueryAlias spark_catalog.default.noaa
            +- HiveTableRelation [`spark_catalog`.`default`.`noaa`, 
org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55, column1#56, 
column2#57, column3#58, column4#59, column5#60, column6#61, column7#62], 
Partition Cols: [year#63]]== Optimized Logical Plan ==
GlobalLimit 10
+- LocalLimit 10
   +- Filter (isnotnull(year#63) AND (cast(year#63 as int) = 2019))
      +- HiveTableRelation [`spark_catalog`.`default`.`noaa`, 
org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55, column1#56, 
column2#57, column3#58, column4#59, column5#60, column6#61, column7#62], 
Partition Cols: [year#63], Pruned Partitions: [(year=2019)]]== Physical Plan ==
CollectLimit 10
+- Scan hive spark_catalog.default.noaa [column0#55, column1#56, column2#57, 
column3#58, column4#59, column5#60, column6#61, column7#62, year#63], 
HiveTableRelation [`spark_catalog`.`default`.`noaa`, 
org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55, column1#56, 
column2#57, column3#58, column4#59, column5#60, column6#61, column7#62], 
Partition Cols: [year#63], Pruned Partitions: [(year=2019)]], 
[isnotnull(year#63), (cast(year#63 as int) = 2019)]{code}
It seems that the filter has been pushed down.

Can you give me a short reproduction?


was (Author: JIRAUSER285788):
I can't reproduce, can you give me a short reproduction?

> Partition key filter cannot be pushed down when using cast
> --
>
> Key: SPARK-45387
> URL: https://issues.apache.org/jira/browse/SPARK-45387
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1, 3.1.2, 3.3.0, 3.4.0
>Reporter: TianyiMa
>Priority: Critical
> Attachments: PruneFileSourcePartitions.diff
>
>
> Suppose we have a partitioned table `table_pt` with a partition column `dt` of 
> StringType, and the table metadata is managed by the Hive Metastore. If we filter 
> partitions by dt = '123', the filter can be pushed down to the data source, but if 
> the filter condition is a number, e.g. dt = 123, it cannot be pushed down. This 
> causes Spark to pull all of that table's partition metadata to the client, which 
> performs poorly if the table has thousands of partitions and increases the risk of 
> a Hive Metastore OOM.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46974) Recover a test case for day-of-year 2-letter 'DD' pattern parsing 3-digit values

2024-02-04 Thread Kent Yao (Jira)
Kent Yao created SPARK-46974:


 Summary: Recover a test case for day-of-year 2-letter 'DD' pattern 
parsing 3-digit values
 Key: SPARK-46974
 URL: https://issues.apache.org/jira/browse/SPARK-46974
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45387) Partition key filter cannot be pushed down when using cast

2024-02-04 Thread Jie Han (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17814150#comment-17814150
 ] 

Jie Han commented on SPARK-45387:
-

I can't reproduce this. Can you give me a short reproduction?

> Partition key filter cannot be pushed down when using cast
> --
>
> Key: SPARK-45387
> URL: https://issues.apache.org/jira/browse/SPARK-45387
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1, 3.1.2, 3.3.0, 3.4.0
>Reporter: TianyiMa
>Priority: Critical
> Attachments: PruneFileSourcePartitions.diff
>
>
> Suppose we have a partitioned table `table_pt` with a partition column `dt` of 
> StringType, and the table metadata is managed by the Hive Metastore. If we filter 
> partitions by dt = '123', the filter can be pushed down to the data source, but if 
> the filter condition is a number, e.g. dt = 123, it cannot be pushed down. This 
> causes Spark to pull all of that table's partition metadata to the client, which 
> performs poorly if the table has thousands of partitions and increases the risk of 
> a Hive Metastore OOM.
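A minimal way to compare the two cases in the description (table and column names are taken from the description above; whether pruning actually happens may depend on the Spark version, per the comments):

{code:java}
// Only the string-literal comparison can be pushed down as a partition filter;
// the numeric one is analyzed as cast(dt as int) = 123 over the string partition column.
spark.sql("SELECT * FROM table_pt WHERE dt = '123'").explain(true)
spark.sql("SELECT * FROM table_pt WHERE dt = 123").explain(true)
{code}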



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46955) Implement `Frame.to_stata`

2024-02-04 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-46955.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44996
[https://github.com/apache/spark/pull/44996]

> Implement `Frame.to_stata`
> --
>
> Key: SPARK-46955
> URL: https://issues.apache.org/jira/browse/SPARK-46955
> Project: Spark
>  Issue Type: Sub-task
>  Components: PS
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46955) Implement `Frame.to_stata`

2024-02-04 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-46955:
-

Assignee: Ruifeng Zheng

> Implement `Frame.to_stata`
> --
>
> Key: SPARK-46955
> URL: https://issues.apache.org/jira/browse/SPARK-46955
> Project: Spark
>  Issue Type: Sub-task
>  Components: PS
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-46512) Optimize shuffle reading when both sort and combine are used.

2024-02-04 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-46512.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 44512
[https://github.com/apache/spark/pull/44512]

> Optimize shuffle reading when both sort and combine are used.
> -
>
> Key: SPARK-46512
> URL: https://issues.apache.org/jira/browse/SPARK-46512
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 4.0.0
>Reporter: Chenyu Zheng
>Assignee: Chenyu Zheng
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> After the shuffle reader obtains the block, it will first perform a combine 
> operation, and then perform a sort operation. It is known that both combine 
> and sort may generate temporary files, so the performance may be poor when 
> both sort and combine are used. In fact, combine operations can be performed 
> during the sort process, and we can avoid the combine spill file.
>  
> I did not find any direct API to construct a shuffle in which both sort and 
> combine are used, but it can be done with code like the following: a word count 
> whose output words are sorted.
> {code:java}
> sc.textFile(input).flatMap(_.split(" ")).map(w => (w, 1)).
> reduceByKey(_ + _, 1).
> asInstanceOf[ShuffledRDD[String, Int, Int]].setKeyOrdering(Ordering.String).
> collect().foreach(println) {code}
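For reference, a self-contained version of the snippet above (an untested sketch; it only adds the import the cast needs and a couple of comments):

{code:java}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.ShuffledRDD

// Word count whose shuffle has both an aggregator (from reduceByKey) and a key ordering,
// so the shuffle reader must both combine and sort, which is the case this issue optimizes.
def sortedWordCount(sc: SparkContext, input: String): Unit = {
  sc.textFile(input)
    .flatMap(_.split(" "))
    .map(w => (w, 1))
    .reduceByKey(_ + _, 1)
    .asInstanceOf[ShuffledRDD[String, Int, Int]]
    .setKeyOrdering(Ordering.String)
    .collect()
    .foreach(println)
}
{code}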



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-46512) Optimize shuffle reading when both sort and combine are used.

2024-02-04 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-46512:
---

Assignee: Chenyu Zheng

> Optimize shuffle reading when both sort and combine are used.
> -
>
> Key: SPARK-46512
> URL: https://issues.apache.org/jira/browse/SPARK-46512
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 4.0.0
>Reporter: Chenyu Zheng
>Assignee: Chenyu Zheng
>Priority: Minor
>  Labels: pull-request-available
>
> After the shuffle reader obtains the block, it will first perform a combine 
> operation, and then perform a sort operation. It is known that both combine 
> and sort may generate temporary files, so the performance may be poor when 
> both sort and combine are used. In fact, combine operations can be performed 
> during the sort process, and we can avoid the combine spill file.
>  
> I did not find any direct API to construct a shuffle in which both sort and 
> combine are used, but it can be done with code like the following: a word count 
> whose output words are sorted.
> {code:java}
> sc.textFile(input).flatMap(_.split(" ")).map(w => (w, 1)).
> reduceByKey(_ + _, 1).
> asInstanceOf[ShuffledRDD[String, Int, Int]].setKeyOrdering(Ordering.String).
> collect().foreach(println) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46962) Implement python worker to run python streaming data source

2024-02-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46962:
---
Labels: pull-request-available  (was: )

> Implement python worker to run python streaming data source
> ---
>
> Key: SPARK-46962
> URL: https://issues.apache.org/jira/browse/SPARK-46962
> Project: Spark
>  Issue Type: Improvement
>  Components: SS
>Affects Versions: 4.0.0
>Reporter: Chaoqin Li
>Priority: Major
>  Labels: pull-request-available
>
> Implement a Python worker to run the Python streaming data source and communicate 
> with the JVM through a socket. Create a PythonMicrobatchStream to invoke the RPC 
> function calls.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46973) Add table cache for V2 tables

2024-02-04 Thread Allison Wang (Jira)
Allison Wang created SPARK-46973:


 Summary: Add table cache for V2 tables
 Key: SPARK-46973
 URL: https://issues.apache.org/jira/browse/SPARK-46973
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Allison Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46972) Asymmetrical replacement for char/varchar in V2SessionCatalog.createTable

2024-02-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46972:
---
Labels: pull-request-available  (was: )

> Asymmetrical replacement for char/varchar in V2SessionCatalog.createTable
> -
>
> Key: SPARK-46972
> URL: https://issues.apache.org/jira/browse/SPARK-46972
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46972) Asymmetrical replacement for char/varchar in V2SessionCatalog.createTable

2024-02-04 Thread Kent Yao (Jira)
Kent Yao created SPARK-46972:


 Summary: Asymmetrical replacement for char/varchar in 
V2SessionCatalog.createTable
 Key: SPARK-46972
 URL: https://issues.apache.org/jira/browse/SPARK-46972
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org