[jira] [Updated] (SPARK-36832) Add LauncherProtocol to K8s resource manager client
[ https://issues.apache.org/jira/browse/SPARK-36832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-36832:
---
    Labels: pull-request-available  (was: )

> Add LauncherProtocol to K8s resource manager client
> ---
>
> Key: SPARK-36832
> URL: https://issues.apache.org/jira/browse/SPARK-36832
> Project: Spark
> Issue Type: Improvement
> Components: Spark Submit
> Affects Versions: 3.1.2
> Reporter: timothy65535
> Priority: Major
> Labels: pull-request-available
>
> Refer to: org.apache.spark.deploy.yarn.Client
> {code:java}
> private val launcherBackend = new LauncherBackend() {
>   override protected def conf: SparkConf = sparkConf
>
>   override def onStopRequest(): Unit = {
>     if (isClusterMode && appId != null) {
>       yarnClient.killApplication(appId)
>     } else {
>       setState(SparkAppHandle.State.KILLED)
>       stop()
>     }
>   }
> }
> {code}
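A minimal sketch of what the corresponding hook could look like in the K8s submit-side client, modeled on the YARN snippet above; the names kubernetesClient and driverPodName are illustrative assumptions, not the actual patch:

{code:java}
private val launcherBackend = new LauncherBackend() {
  override protected def conf: SparkConf = sparkConf

  override def onStopRequest(): Unit = {
    if (isClusterMode && driverPodName != null) {
      // Deleting the driver pod tears down the whole application
      // (fabric8 Kubernetes client).
      kubernetesClient.pods().withName(driverPodName).delete()
    } else {
      setState(SparkAppHandle.State.KILLED)
      stop()
    }
  }
}
{code}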
[jira] [Updated] (SPARK-46976) Implement `DataFrameGroupBy.corr`
[ https://issues.apache.org/jira/browse/SPARK-46976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46976: --- Labels: pull-request-available (was: ) > Implement `DataFrameGroupBy.corr` > - > > Key: SPARK-46976 > URL: https://issues.apache.org/jira/browse/SPARK-46976 > Project: Spark > Issue Type: Sub-task > Components: PS >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > Labels: pull-request-available
[jira] [Created] (SPARK-46976) Implement `DataFrameGroupBy.corr`
Ruifeng Zheng created SPARK-46976: - Summary: Implement `DataFrameGroupBy.corr` Key: SPARK-46976 URL: https://issues.apache.org/jira/browse/SPARK-46976 Project: Spark Issue Type: Sub-task Components: PS Affects Versions: 4.0.0 Reporter: Ruifeng Zheng
[jira] [Updated] (SPARK-46975) Move to_{hdf, feather, stata} to the fallback list
[ https://issues.apache.org/jira/browse/SPARK-46975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46975: --- Labels: pull-request-available (was: ) > Move to_{hdf, feather, stata} to the fallback list > -- > > Key: SPARK-46975 > URL: https://issues.apache.org/jira/browse/SPARK-46975 > Project: Spark > Issue Type: Sub-task > Components: PS >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > Labels: pull-request-available
[jira] [Created] (SPARK-46975) Move to_{hdf, feather, stata} to the fallback list
Ruifeng Zheng created SPARK-46975: - Summary: Move to_{hdf, feather, stata} to the fallback list Key: SPARK-46975 URL: https://issues.apache.org/jira/browse/SPARK-46975 Project: Spark Issue Type: Sub-task Components: PS Affects Versions: 4.0.0 Reporter: Ruifeng Zheng
[jira] [Commented] (SPARK-46934) Unable to create Hive View from certain Spark Dataframe StructType
[ https://issues.apache.org/jira/browse/SPARK-46934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17814184#comment-17814184 ]

Yu-Ting LIN commented on SPARK-46934:
-

[~yao] Thanks. One more question about this hotfix: will this issue be fixed in both 3.3.4 and 3.5.0?

> Unable to create Hive View from certain Spark Dataframe StructType
> --
>
> Key: SPARK-46934
> URL: https://issues.apache.org/jira/browse/SPARK-46934
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.0, 3.3.2, 3.3.4
> Environment: Tested in Spark 3.3.0, 3.3.2.
> Reporter: Yu-Ting LIN
> Priority: Blocker
>
> We are trying to create a Hive View using the following SQL command "CREATE OR REPLACE VIEW yuting AS SELECT INFO_ANN FROM table_2611810".
> Our table_2611810 has certain columns that contain special characters such as "/". Here is the schema of this table.
> {code:java}
> contigName string
> start bigint
> end bigint
> names array
> referenceAllele string
> alternateAlleles array
> qual double
> filters array
> splitFromMultiAllelic boolean
> INFO_NCAMP int
> INFO_ODDRATIO double
> INFO_NM double
> INFO_DBSNP_CAF array
> INFO_SPANPAIR int
> INFO_TLAMP int
> INFO_PSTD double
> INFO_QSTD double
> INFO_SBF double
> INFO_AF array
> INFO_QUAL double
> INFO_SHIFT3 int
> INFO_VARBIAS string
> INFO_HICOV int
> INFO_PMEAN double
> INFO_MSI double
> INFO_VD int
> INFO_DP int
> INFO_HICNT int
> INFO_ADJAF double
> INFO_SVLEN int
> INFO_RSEQ string
> INFO_MSigDb array
> INFO_NMD array
> INFO_ANN array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>>
> INFO_BIAS string
> INFO_MQ double
> INFO_HIAF double
> INFO_END int
> INFO_SPLITREAD int
> INFO_GDAMP int
> INFO_LSEQ string
> INFO_LOF array
> INFO_SAMPLE string
> INFO_AMPFLAG int
> INFO_SN double
> INFO_SVTYPE string
> INFO_TYPE string
> INFO_MSILEN double
> INFO_DUPRATE double
> INFO_DBSNP_COMMON int
> INFO_REFBIAS string
> genotypes array,ALD:array,AF:array,phased:boolean,calls:array,VD:int,depth:int,RD:array>>
> {code}
> You can see that column INFO_ANN is an array of structs, and it contains fields whose names have "/" inside, such as "cDNA_pos/cDNA_length".
> We believe this is the root cause of the following SparkException:
> {code:java}
> scala> val schema = spark.sql("CREATE OR REPLACE VIEW yuting AS SELECT INFO_ANN FROM table_2611810")
> 24/01/31 07:50:02.658 [main] WARN o.a.spark.sql.catalyst.util.package - Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
> org.apache.spark.SparkException: Cannot recognize hive type string: array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>>, column: INFO_ANN
>   at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotRecognizeHiveTypeError(QueryExecutionErrors.scala:1455)
>   at org.apache.spark.sql.hive.client.HiveClientImpl$.getSparkSQLDataType(HiveClientImpl.scala:1022)
>   at org.apache.spark.sql.hive.client.HiveClientImpl$.$anonfun$verifyColumnDataType$1(HiveClientImpl.scala:1037)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:102)
>   at org.apache.spark.sql.hive.client.HiveClientImpl$.org$apache$spark$sql$hive$client$HiveClientImpl$$verifyColumnDataType(Hiv
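A minimal, hypothetical reproduction of the report (table, view and field names taken from the ticket, data made up); per the stack trace, the failure is expected to come from Hive's type-string verification when the view is created:

{code:java}
// Build a table with a nested struct field whose name contains "/".
val df = spark.sql(
  """SELECT array(named_struct(
    |         'cDNA_pos/cDNA_length', named_struct('pos', 1, 'length', 2)
    |       )) AS INFO_ANN""".stripMargin)
df.write.saveAsTable("table_2611810")

// Expected (per the ticket) to fail with:
//   org.apache.spark.SparkException: Cannot recognize hive type string: ...
spark.sql("CREATE OR REPLACE VIEW yuting AS SELECT INFO_ANN FROM table_2611810")
{code}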
[jira] [Updated] (SPARK-28346) clone the query plan between analyzer, optimizer and planner
[ https://issues.apache.org/jira/browse/SPARK-28346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-28346: --- Labels: pull-request-available (was: ) > clone the query plan between analyzer, optimizer and planner > > > Key: SPARK-28346 > URL: https://issues.apache.org/jira/browse/SPARK-28346 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Labels: pull-request-available > Fix For: 3.0.0
[jira] [Commented] (SPARK-46938) Migrate jetty 10 to jetty 11
[ https://issues.apache.org/jira/browse/SPARK-46938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17814173#comment-17814173 ]

HiuFung Kwok commented on SPARK-46938:
--

[~LuciferYang] Hi, it turns out that libthrift will also need to be bumped to v0.19, as there are a few Servlet references on their end. Do we need a separate ticket for this? I don't think a separate MR is possible, though: the Jersey, libthrift and Jetty 11 upgrades all need to happen within one MR due to the Jakarta package name change.

https://issues.apache.org/jira/browse/THRIFT-5700?jql=project%20%3D%20THRIFT%20AND%20text%20~%20%22Jakarta%22%20ORDER%20BY%20created%20DESC

> Migrate jetty 10 to jetty 11
>
> Key: SPARK-46938
> URL: https://issues.apache.org/jira/browse/SPARK-46938
> Project: Spark
> Issue Type: Sub-task
> Components: Build
> Affects Versions: 4.0.0
> Reporter: Yang Jie
> Priority: Major
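The coupling described above comes from the Jakarta package rename: Jetty 11 implements Servlet 5 (Jakarta EE 9), which moves the entire javax.servlet namespace, so every servlet-API consumer has to switch in the same change. A purely illustrative sketch:

{code:java}
// Jetty 10 stack (Servlet 4, Java EE 8):
// import javax.servlet.http.HttpServlet

// Jetty 11 stack (Servlet 5, Jakarta EE 9) -- needs Jersey 3.x and a
// libthrift release that has also moved to the jakarta namespace:
import jakarta.servlet.http.HttpServlet
{code}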
[jira] [Commented] (SPARK-46934) Unable to create Hive View from certain Spark Dataframe StructType
[ https://issues.apache.org/jira/browse/SPARK-46934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17814166#comment-17814166 ]

Kent Yao commented on SPARK-46934:
--

I will fix this

> Unable to create Hive View from certain Spark Dataframe StructType
> --
>
> Key: SPARK-46934
> URL: https://issues.apache.org/jira/browse/SPARK-46934
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.0, 3.3.2, 3.3.4
> Environment: Tested in Spark 3.3.0, 3.3.2.
> Reporter: Yu-Ting LIN
> Priority: Blocker
>
> We are trying to create a Hive View using the following SQL command "CREATE OR REPLACE VIEW yuting AS SELECT INFO_ANN FROM table_2611810".
> Our table_2611810 has certain columns that contain special characters such as "/". Here is the schema of this table.
> {code:java}
> contigName string
> start bigint
> end bigint
> names array
> referenceAllele string
> alternateAlleles array
> qual double
> filters array
> splitFromMultiAllelic boolean
> INFO_NCAMP int
> INFO_ODDRATIO double
> INFO_NM double
> INFO_DBSNP_CAF array
> INFO_SPANPAIR int
> INFO_TLAMP int
> INFO_PSTD double
> INFO_QSTD double
> INFO_SBF double
> INFO_AF array
> INFO_QUAL double
> INFO_SHIFT3 int
> INFO_VARBIAS string
> INFO_HICOV int
> INFO_PMEAN double
> INFO_MSI double
> INFO_VD int
> INFO_DP int
> INFO_HICNT int
> INFO_ADJAF double
> INFO_SVLEN int
> INFO_RSEQ string
> INFO_MSigDb array
> INFO_NMD array
> INFO_ANN array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>>
> INFO_BIAS string
> INFO_MQ double
> INFO_HIAF double
> INFO_END int
> INFO_SPLITREAD int
> INFO_GDAMP int
> INFO_LSEQ string
> INFO_LOF array
> INFO_SAMPLE string
> INFO_AMPFLAG int
> INFO_SN double
> INFO_SVTYPE string
> INFO_TYPE string
> INFO_MSILEN double
> INFO_DUPRATE double
> INFO_DBSNP_COMMON int
> INFO_REFBIAS string
> genotypes array,ALD:array,AF:array,phased:boolean,calls:array,VD:int,depth:int,RD:array>>
> {code}
> You can see that column INFO_ANN is an array of structs, and it contains fields whose names have "/" inside, such as "cDNA_pos/cDNA_length".
> We believe this is the root cause of the following SparkException:
> {code:java}
> scala> val schema = spark.sql("CREATE OR REPLACE VIEW yuting AS SELECT INFO_ANN FROM table_2611810")
> 24/01/31 07:50:02.658 [main] WARN o.a.spark.sql.catalyst.util.package - Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
> org.apache.spark.SparkException: Cannot recognize hive type string: array,Annotation_Impact:string,Gene_Name:string,Gene_ID:string,Feature_Type:string,Feature_ID:string,Transcript_BioType:string,Rank:struct,HGVS_c:string,HGVS_p:string,cDNA_pos/cDNA_length:struct,CDS_pos/CDS_length:struct,AA_pos/AA_length:struct,Distance:int,ERRORS/WARNINGS/INFO:string>>, column: INFO_ANN
>   at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotRecognizeHiveTypeError(QueryExecutionErrors.scala:1455)
>   at org.apache.spark.sql.hive.client.HiveClientImpl$.getSparkSQLDataType(HiveClientImpl.scala:1022)
>   at org.apache.spark.sql.hive.client.HiveClientImpl$.$anonfun$verifyColumnDataType$1(HiveClientImpl.scala:1037)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:102)
>   at org.apache.spark.sql.hive.client.HiveClientImpl$.org$apache$spark$sql$hive$client$HiveClientImpl$$verifyColumnDataType(HiveClientImpl.scala:1037)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$cr
[jira] [Assigned] (SPARK-46974) Recover a test case for day-of-year 2-letter 'DD' pattern parsing 3-digit values
[ https://issues.apache.org/jira/browse/SPARK-46974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-46974: Assignee: Kent Yao > Recover a test case for day-of-year 2-letter 'DD' pattern parsing 3-digit > values > > > Key: SPARK-46974 > URL: https://issues.apache.org/jira/browse/SPARK-46974 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available
[jira] [Resolved] (SPARK-46974) Recover a test case for day-of-year 2-letter 'DD' pattern parsing 3-digit values
[ https://issues.apache.org/jira/browse/SPARK-46974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-46974. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45024 [https://github.com/apache/spark/pull/45024] > Recover a test case for day-of-year 2-letter 'DD' pattern parsing 3-digit > values > > > Key: SPARK-46974 > URL: https://issues.apache.org/jira/browse/SPARK-46974 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0
[jira] [Resolved] (SPARK-46954) XML: Perf optimizations
[ https://issues.apache.org/jira/browse/SPARK-46954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-46954. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44997 [https://github.com/apache/spark/pull/44997] > XML: Perf optimizations > --- > > Key: SPARK-46954 > URL: https://issues.apache.org/jira/browse/SPARK-46954 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Sandip Agarwala >Assignee: Sandip Agarwala >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0
[jira] [Assigned] (SPARK-46954) XML: Perf optimizations
[ https://issues.apache.org/jira/browse/SPARK-46954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-46954: Assignee: Sandip Agarwala > XML: Perf optimizations > --- > > Key: SPARK-46954 > URL: https://issues.apache.org/jira/browse/SPARK-46954 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Sandip Agarwala >Assignee: Sandip Agarwala >Priority: Major > Labels: pull-request-available
[jira] [Updated] (SPARK-46974) Recover a test case for day-of-year 2-letter 'DD' pattern parsing 3-digit values
[ https://issues.apache.org/jira/browse/SPARK-46974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46974: --- Labels: pull-request-available (was: ) > Recover a test case for day-of-year 2-letter 'DD' pattern parsing 3-digit > values > > > Key: SPARK-46974 > URL: https://issues.apache.org/jira/browse/SPARK-46974 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Priority: Major > Labels: pull-request-available
[jira] [Comment Edited] (SPARK-45387) Partition key filter cannot be pushed down when using cast
[ https://issues.apache.org/jira/browse/SPARK-45387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17814150#comment-17814150 ]

Jie Han edited comment on SPARK-45387 at 2/5/24 3:49 AM:
-

I can't reproduce it in Spark 3.5.0. I tried creating a partitioned CSV table on HDFS as follows:
{code:java}
create external table noaa (column0 string, column1 int, column2 string, column3 int, column4 string, column5 string, column6 string, column7 string)
PARTITIONED BY (year string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION '/tmp/noaa';

alter table noaa add partition (year = '2019') LOCATION '/tmp/noaa/year=2019';
alter table noaa add partition (year = '2020') LOCATION '/tmp/noaa/year=2020';{code}
and the spark plan is
{code:java}
scala> spark.sql("select * from noaa where year=2019 limit 10").explain(true)
== Parsed Logical Plan ==
'GlobalLimit 10
+- 'LocalLimit 10
   +- 'Project [*]
      +- 'Filter ('year = 2019)
         +- 'UnresolvedRelation [noaa], [], false

== Analyzed Logical Plan ==
column0: string, column1: string, column2: string, column3: string, column4: string, column5: string, column6: string, column7: string, year: string
GlobalLimit 10
+- LocalLimit 10
   +- Project [column0#55, column1#56, column2#57, column3#58, column4#59, column5#60, column6#61, column7#62, year#63]
      +- Filter (cast(year#63 as int) = 2019)
         +- SubqueryAlias spark_catalog.default.noaa
            +- HiveTableRelation [`spark_catalog`.`default`.`noaa`, org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55, column1#56, column2#57, column3#58, column4#59, column5#60, column6#61, column7#62], Partition Cols: [year#63]]

== Optimized Logical Plan ==
GlobalLimit 10
+- LocalLimit 10
   +- Filter (isnotnull(year#63) AND (cast(year#63 as int) = 2019))
      +- HiveTableRelation [`spark_catalog`.`default`.`noaa`, org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55, column1#56, column2#57, column3#58, column4#59, column5#60, column6#61, column7#62], Partition Cols: [year#63], Pruned Partitions: [(year=2019)]]

== Physical Plan ==
CollectLimit 10
+- Scan hive spark_catalog.default.noaa [column0#55, column1#56, column2#57, column3#58, column4#59, column5#60, column6#61, column7#62, year#63], HiveTableRelation [`spark_catalog`.`default`.`noaa`, org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55, column1#56, column2#57, column3#58, column4#59, column5#60, column6#61, column7#62], Partition Cols: [year#63], Pruned Partitions: [(year=2019)]], [isnotnull(year#63), (cast(year#63 as int) = 2019)]{code}
It seems that the filter has been pushed down.


was (Author: JIRAUSER285788):
I can't reproduce it at spark 3.5.0.
I tried create a partitioned csv table on hdfs like follow:
{code:java}
create external table noaa (column0 string, column1 int, column2 string, column3 int, column4 string, column5 string, column6 string, column7 string)
PARTITIONED BY (year string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION '/tmp/noaa';

alter table noaa add partition (year = '2019') LOCATION '/tmp/noaa/year=2019';
alter table noaa add partition (year = '2020') LOCATION '/tmp/noaa/year=2020';{code}
and the spark plan is
{code:java}
scala> spark.sql("select * from noaa where year=2019 limit 10").explain(true)
== Parsed Logical Plan ==
'GlobalLimit 10
+- 'LocalLimit 10
   +- 'Project [*]
      +- 'Filter ('year = 2019)
         +- 'UnresolvedRelation [noaa], [], false

== Analyzed Logical Plan ==
column0: string, column1: string, column2: string, column3: string, column4: string, column5: string, column6: string, column7: string, year: string
GlobalLimit 10
+- LocalLimit 10
   +- Project [column0#55, column1#56, column2#57, column3#58, column4#59, column5#60, column6#61, column7#62, year#63]
      +- Filter (cast(year#63 as int) = 2019)
         +- SubqueryAlias spark_catalog.default.noaa
            +- HiveTableRelation [`spark_catalog`.`default`.`noaa`, org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55, column1#56, column2#57, column3#58, column4#59, column5#60, column6#61, column7#62], Partition Cols: [year#63]]

== Optimized Logical Plan ==
GlobalLimit 10
+- LocalLimit 10
   +- Filter (isnotnull(year#63) AND (cast(year#63 as int) = 2019))
      +- HiveTableRelation [`spark_catalog`.`default`.`noaa`, org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55, column1#56, column2#57, column3#58, column4#59, column5#60, column6#61, column7#62], Partition Cols: [year#63], Pruned Partitions: [(year=2019)]]

== Physical Plan ==
CollectLimit 10
+- Scan hive spark_catalog.default.noaa [column0#55, column1#56, column2#57, column3#58, column4#59, column5#60, column6#61, column7#62, year#63], HiveTableRelation [`spark_catalog`.`default`.`noaa`, org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55,
[jira] [Created] (SPARK-46974) Recover a test case for day-of-year 2-letter 'DD' pattern parsing 3-digit values
Kent Yao created SPARK-46974: Summary: Recover a test case for day-of-year 2-letter 'DD' pattern parsing 3-digit values Key: SPARK-46974 URL: https://issues.apache.org/jira/browse/SPARK-46974 Project: Spark Issue Type: Test Components: SQL Affects Versions: 4.0.0 Reporter: Kent Yao
[jira] [Commented] (SPARK-45387) Partition key filter cannot be pushed down when using cast
[ https://issues.apache.org/jira/browse/SPARK-45387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17814150#comment-17814150 ]

Jie Han commented on SPARK-45387:
-

I can't reproduce this. Can you give me a short reproduction?

> Partition key filter cannot be pushed down when using cast
> --
>
> Key: SPARK-45387
> URL: https://issues.apache.org/jira/browse/SPARK-45387
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.1, 3.1.2, 3.3.0, 3.4.0
> Reporter: TianyiMa
> Priority: Critical
> Attachments: PruneFileSourcePartitions.diff
>
> Suppose we have a partitioned table `table_pt` with partition column `dt`, which is StringType, and the table metadata is managed by the Hive Metastore. If we filter partitions by dt = '123', this filter can be pushed down to the data source, but if the filter condition is a number, e.g. dt = 123, it cannot be pushed down, causing Spark to pull all of that table's partition metadata to the client. This performs poorly if the table has thousands of partitions and increases the risk of a Hive Metastore OOM.
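For illustration, the behavior described in the ticket, using its table and column names (a Hive table table_pt partitioned by a string column dt); a hypothetical sketch, not output from the reporter's environment:

{code:java}
// String literal: the partition filter is pushed down and partitions are pruned.
spark.sql("SELECT * FROM table_pt WHERE dt = '123'").explain(true)

// Integer literal: the predicate becomes cast(dt as int) = 123; per the report,
// the cast on the partition column prevents pushdown, so all partition metadata
// is fetched from the Hive Metastore.
spark.sql("SELECT * FROM table_pt WHERE dt = 123").explain(true)
{code}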
[jira] [Resolved] (SPARK-46955) Implement `Frame.to_stata`
[ https://issues.apache.org/jira/browse/SPARK-46955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-46955. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44996 [https://github.com/apache/spark/pull/44996] > Implement `Frame.to_stata` > -- > > Key: SPARK-46955 > URL: https://issues.apache.org/jira/browse/SPARK-46955 > Project: Spark > Issue Type: Sub-task > Components: PS >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0
[jira] [Assigned] (SPARK-46955) Implement `Frame.to_stata`
[ https://issues.apache.org/jira/browse/SPARK-46955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-46955: - Assignee: Ruifeng Zheng > Implement `Frame.to_stata` > -- > > Key: SPARK-46955 > URL: https://issues.apache.org/jira/browse/SPARK-46955 > Project: Spark > Issue Type: Sub-task > Components: PS >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available
[jira] [Resolved] (SPARK-46512) Optimize shuffle reading when both sort and combine are used.
[ https://issues.apache.org/jira/browse/SPARK-46512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mridul Muralidharan resolved SPARK-46512.
-
Fix Version/s: 4.0.0
Resolution: Fixed

Issue resolved by pull request 44512
[https://github.com/apache/spark/pull/44512]

> Optimize shuffle reading when both sort and combine are used.
> -
>
> Key: SPARK-46512
> URL: https://issues.apache.org/jira/browse/SPARK-46512
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle, Spark Core
> Affects Versions: 4.0.0
> Reporter: Chenyu Zheng
> Assignee: Chenyu Zheng
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.0.0
>
> After the shuffle reader obtains the block, it will first perform a combine operation and then a sort operation. Both combine and sort may generate temporary files, so performance may be poor when both are used. In fact, the combine can be performed during the sort process, and we can avoid the combine spill file.
>
> I did not find any direct API to construct a shuffle that uses both sort and combine, but it can be done as in the following code: a word count whose output words are sorted.
> {code:java}
> import org.apache.spark.rdd.ShuffledRDD
>
> sc.textFile(input).flatMap(_.split(" ")).map(w => (w, 1)).
>   reduceByKey(_ + _, 1).
>   asInstanceOf[ShuffledRDD[String, Int, Int]].setKeyOrdering(Ordering.String).
>   collect().foreach(println) {code}
[jira] [Assigned] (SPARK-46512) Optimize shuffle reading when both sort and combine are used.
[ https://issues.apache.org/jira/browse/SPARK-46512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mridul Muralidharan reassigned SPARK-46512:
---
Assignee: Chenyu Zheng

> Optimize shuffle reading when both sort and combine are used.
> -
>
> Key: SPARK-46512
> URL: https://issues.apache.org/jira/browse/SPARK-46512
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle, Spark Core
> Affects Versions: 4.0.0
> Reporter: Chenyu Zheng
> Assignee: Chenyu Zheng
> Priority: Minor
> Labels: pull-request-available
>
> After the shuffle reader obtains the block, it will first perform a combine operation and then a sort operation. Both combine and sort may generate temporary files, so performance may be poor when both are used. In fact, the combine can be performed during the sort process, and we can avoid the combine spill file.
>
> I did not find any direct API to construct a shuffle that uses both sort and combine, but it can be done as in the following code: a word count whose output words are sorted.
> {code:java}
> import org.apache.spark.rdd.ShuffledRDD
>
> sc.textFile(input).flatMap(_.split(" ")).map(w => (w, 1)).
>   reduceByKey(_ + _, 1).
>   asInstanceOf[ShuffledRDD[String, Int, Int]].setKeyOrdering(Ordering.String).
>   collect().foreach(println) {code}
[jira] [Updated] (SPARK-46962) Implement python worker to run python streaming data source
[ https://issues.apache.org/jira/browse/SPARK-46962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-46962:
---
    Labels: pull-request-available  (was: )

> Implement python worker to run python streaming data source
> ---
>
> Key: SPARK-46962
> URL: https://issues.apache.org/jira/browse/SPARK-46962
> Project: Spark
> Issue Type: Improvement
> Components: SS
> Affects Versions: 4.0.0
> Reporter: Chaoqin Li
> Priority: Major
> Labels: pull-request-available
>
> Implement a Python worker to run Python streaming data sources and communicate with the JVM through a socket. Create a PythonMicrobatchStream to invoke RPC function calls.
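The description sketches a socket-based RPC between PythonMicrobatchStream on the JVM and a long-running Python worker. A purely illustrative sketch of that pattern, with an assumed wire format (function-id int, length-prefixed payload); this is not Spark's actual protocol:

{code:java}
import java.io.{DataInputStream, DataOutputStream}
import java.net.Socket

class StreamingWorkerClient(socket: Socket) {
  private val out = new DataOutputStream(socket.getOutputStream)
  private val in  = new DataInputStream(socket.getInputStream)

  // Send a function-call id (e.g. one representing latestOffset) and read
  // back a length-prefixed payload produced by the Python worker.
  def call(functionId: Int): Array[Byte] = {
    out.writeInt(functionId)
    out.flush()
    val len = in.readInt()
    val buf = new Array[Byte](len)
    in.readFully(buf)
    buf
  }
}
{code}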
[jira] [Created] (SPARK-46973) Add table cache for V2 tables
Allison Wang created SPARK-46973: Summary: Add table cache for V2 tables Key: SPARK-46973 URL: https://issues.apache.org/jira/browse/SPARK-46973 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Allison Wang
[jira] [Updated] (SPARK-46972) Asymmetrical replacement for char/varchar in V2SessionCatalog.createTable
[ https://issues.apache.org/jira/browse/SPARK-46972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46972: --- Labels: pull-request-available (was: ) > Asymmetrical replacement for char/varchar in V2SessionCatalog.createTable > - > > Key: SPARK-46972 > URL: https://issues.apache.org/jira/browse/SPARK-46972 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Priority: Major > Labels: pull-request-available
[jira] [Created] (SPARK-46972) Asymmetrical replacement for char/varchar in V2SessionCatalog.createTable
Kent Yao created SPARK-46972: Summary: Asymmetrical replacement for char/varchar in V2SessionCatalog.createTable Key: SPARK-46972 URL: https://issues.apache.org/jira/browse/SPARK-46972 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 4.0.0 Reporter: Kent Yao
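The ticket has no description yet, but for background: Spark stores CHAR/VARCHAR columns as plain StringType in catalogs and tracks the original declared type separately, and the title points at V2SessionCatalog.createTable performing that replacement asymmetrically. A self-contained sketch of what the replacement itself does (illustrative only, not the V2SessionCatalog code):

{code:java}
import org.apache.spark.sql.types._

// Recursively replace CHAR(n)/VARCHAR(n) with StringType, including inside
// nested struct, array and map types.
def replaceCharVarchar(dt: DataType): DataType = dt match {
  case CharType(_) | VarcharType(_) => StringType
  case StructType(fields) =>
    StructType(fields.map(f => f.copy(dataType = replaceCharVarchar(f.dataType))))
  case ArrayType(et, containsNull) => ArrayType(replaceCharVarchar(et), containsNull)
  case MapType(kt, vt, valueContainsNull) =>
    MapType(replaceCharVarchar(kt), replaceCharVarchar(vt), valueContainsNull)
  case other => other
}
{code}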