[jira] [Resolved] (SPARK-31515) Canonicalize Cast should consider the value of needTimeZone
[ https://issues.apache.org/jira/browse/SPARK-31515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-31515. -- Fix Version/s: 3.0.0 Assignee: Yuanjian Li Resolution: Fixed Resolved by https://github.com/apache/spark/pull/28288 > Canonicalize Cast should consider the value of needTimeZone > --- > > Key: SPARK-31515 > URL: https://issues.apache.org/jira/browse/SPARK-31515 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.0.0 > Environment: If the `timeZone` information is not needed when casting the `from` type to the `to` type, then the timeZoneId should not influence the canonicalized result. > Reporter: Yuanjian Li > Assignee: Yuanjian Li > Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
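The fix can be illustrated with a small, Spark-free sketch. The class and field names below are hypothetical stand-ins, not Spark's actual internals: the point is that canonicalization should only fold the timezone ID into the comparison when the cast actually needs timezone information.

```python
# Hypothetical sketch of the canonicalization rule; not Spark's real Cast class.
class Cast:
    def __init__(self, child, to_type, needs_time_zone, time_zone_id=None):
        self.child = child
        self.to_type = to_type
        self.needs_time_zone = needs_time_zone
        self.time_zone_id = time_zone_id

    def canonicalized(self):
        # Drop the timezone ID when the cast does not use it, so two casts that
        # differ only in an irrelevant timeZoneId canonicalize to the same value.
        tz = self.time_zone_id if self.needs_time_zone else None
        return (self.child, self.to_type, self.needs_time_zone, tz)

# A timezone-independent cast: the session timezone must not matter.
a = Cast("col", "int", needs_time_zone=False, time_zone_id="UTC")
b = Cast("col", "int", needs_time_zone=False, time_zone_id="PST")
```

Under this rule, `a` and `b` canonicalize identically, while two timezone-dependent casts with different timezone IDs still canonicalize differently.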
[jira] [Resolved] (SPARK-31465) Document Literal in SQL Reference
[ https://issues.apache.org/jira/browse/SPARK-31465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-31465. -- Fix Version/s: 3.0.0 Assignee: Huaxin Gao Resolution: Fixed Resolved by https://github.com/apache/spark/pull/28237 > Document Literal in SQL Reference > - > > Key: SPARK-31465 > URL: https://issues.apache.org/jira/browse/SPARK-31465 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL > Affects Versions: 3.0.0 > Reporter: Huaxin Gao > Assignee: Huaxin Gao > Priority: Major > Fix For: 3.0.0 > > > Document Literal in SQL Reference
[jira] [Commented] (SPARK-29185) Add new SaveMode types for Spark SQL jdbc datasource
[ https://issues.apache.org/jira/browse/SPARK-29185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090252#comment-17090252 ] Shashanka Balakuntala Srinivasa commented on SPARK-29185: - Hi, is there anyone working on this issue? If not, I will take this up. > Add new SaveMode types for Spark SQL jdbc datasource > > > Key: SPARK-29185 > URL: https://issues.apache.org/jira/browse/SPARK-29185 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 3.1.0 > Reporter: Timothy Zhang > Priority: Major > > It is necessary to add new SaveMode types for Delete, Update, and Upsert, such as: > * SaveMode.Delete > * SaveMode.Update > * SaveMode.Upsert > This way Spark SQL could support legacy RDBMSs such as Oracle, DB2, and MySQL much better. The code implementation of the current SaveMode.Append type is very flexible: all types could share the same savePartition function, adding only new getStatement functions for Delete, Update, and Upsert with the SQL statements DELETE FROM, UPDATE, and MERGE INTO respectively.
> We have an initial implementation for them:
> {code:java}
def getDeleteStatement(table: String, rddSchema: StructType, dialect: JdbcDialect): String = {
  val columns = rddSchema.fields.map(x => dialect.quoteIdentifier(x.name) + "=?").mkString(" AND ")
  s"DELETE FROM ${table.toUpperCase} WHERE $columns"
}

def getUpdateStatement(table: String, rddSchema: StructType, priKeys: Seq[String], dialect: JdbcDialect): String = {
  val fullCols = rddSchema.fields.map(x => dialect.quoteIdentifier(x.name))
  val priCols = priKeys.map(dialect.quoteIdentifier(_))
  val columns = (fullCols diff priCols).map(_ + "=?").mkString(",")
  val cnditns = priCols.map(_ + "=?").mkString(" AND ")
  s"UPDATE ${table.toUpperCase} SET $columns WHERE $cnditns"
}

def getMergeStatement(table: String, rddSchema: StructType, priKeys: Seq[String], dialect: JdbcDialect): String = {
  val fullCols = rddSchema.fields.map(x => dialect.quoteIdentifier(x.name))
  val priCols = priKeys.map(dialect.quoteIdentifier(_))
  val nrmCols = fullCols diff priCols
  val fullPart = fullCols.map(c => s"${dialect.quoteIdentifier("SRC")}.$c").mkString(",")
  val priPart = priCols.map(c => s"${dialect.quoteIdentifier("TGT")}.$c=${dialect.quoteIdentifier("SRC")}.$c").mkString(" AND ")
  val nrmPart = nrmCols.map(c => s"$c=${dialect.quoteIdentifier("SRC")}.$c").mkString(",")
  val columns = fullCols.mkString(",")
  val placeholders = fullCols.map(_ => "?").mkString(",")
  s"MERGE INTO ${table.toUpperCase} AS ${dialect.quoteIdentifier("TGT")} " +
    s"USING TABLE(VALUES($placeholders)) " +
    s"AS ${dialect.quoteIdentifier("SRC")}($columns) " +
    s"ON $priPart " +
    s"WHEN NOT MATCHED THEN INSERT ($columns) VALUES ($fullPart) " +
    s"WHEN MATCHED THEN UPDATE SET $nrmPart"
}
{code}
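As a sanity check, the string-building logic of the proposed statement generators can be exercised outside Spark. The sketch below is a plain-Python translation for illustration only; `quote`, `update_statement`, and `delete_statement` are made-up stand-ins, with `quote()` playing the role of `JdbcDialect.quoteIdentifier`:

```python
# Plain-Python translation of the proposed getUpdateStatement/getDeleteStatement
# logic, for illustration only; these are not Spark APIs.
def quote(name: str) -> str:
    # Stand-in for JdbcDialect.quoteIdentifier (double-quote style).
    return f'"{name}"'

def update_statement(table, columns, pri_keys):
    full_cols = [quote(c) for c in columns]
    pri_cols = [quote(k) for k in pri_keys]
    # Non-key columns go in SET, key columns in WHERE.
    setters = ",".join(c + "=?" for c in full_cols if c not in pri_cols)
    conds = " AND ".join(k + "=?" for k in pri_cols)
    return f"UPDATE {table.upper()} SET {setters} WHERE {conds}"

def delete_statement(table, columns):
    conds = " AND ".join(quote(c) + "=?" for c in columns)
    return f"DELETE FROM {table.upper()} WHERE {conds}"
```

For a table `users(id, name, age)` with primary key `id`, `update_statement` yields `UPDATE USERS SET "name"=?,"age"=? WHERE "id"=?`, matching the shape the Scala version would produce.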
[jira] [Resolved] (SPARK-31508) string type compare with numberic cause data inaccurate
[ https://issues.apache.org/jira/browse/SPARK-31508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31508. -- Resolution: Duplicate > string type compare with numberic cause data inaccurate > --- > > Key: SPARK-31508 > URL: https://issues.apache.org/jira/browse/SPARK-31508 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.4.5 > Environment: hadoop2.7 > spark2.4.5 > Reporter: philipse > Priority: Major > Attachments: image-2020-04-22-20-00-09-821.png > > > Hi all > > Spark SQL should perhaps convert values to double when a string type is compared with a numeric type. The case is shown below. > 1. Create a table: > create table test1(id string); > 2. Insert data into the table: > insert into test1 select 'avc'; > insert into test1 select '2'; > insert into test1 select '0a'; > insert into test1 select ''; > insert into test1 select > '22'; > 3. Let's check what happens: > select * from test_gf13871.test1 where id > 0 > The result shows only: > *2* > ** > Surprisingly, the big number 222... cannot be selected, while the same query in Hive returns it normally. > 4. Explaining the command shows what happened: if the value is bigger than max_int_value, it is not selected; we may need to convert to double instead. > !image-2020-04-21-18-49-58-850.png! > I would like to know whether this is fixed or planned for 3.0 or a later version. Please feel free to give any advice. > > Many Thanks
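The reported behaviour can be mimicked without Spark: if the engine casts the string side of the comparison to a 32-bit integer, out-of-range values become null and silently drop out of the filter, whereas casting to double keeps them. This is an illustrative sketch only; `cast_to_int` and `cast_to_double` are stand-ins, not Spark functions, and `"2" * 22` stands in for the long number in the report:

```python
# Illustrative only: mimics casting a string column for the predicate `id > 0`.
INT_MIN, INT_MAX = -2**31, 2**31 - 1

def cast_to_int(s):
    # Non-numeric or out-of-range strings become None, like a failed SQL cast.
    try:
        v = int(s)
    except ValueError:
        return None
    return v if INT_MIN <= v <= INT_MAX else None

def cast_to_double(s):
    try:
        return float(s)
    except ValueError:
        return None

rows = ["avc", "2", "0a", "", "2" * 22]  # the last row is the "big number"
kept_int = [r for r in rows if (v := cast_to_int(r)) is not None and v > 0]
kept_double = [r for r in rows if (v := cast_to_double(r)) is not None and v > 0]
```

With the int cast only `"2"` survives the filter; with the double cast the big number survives too, which matches the behaviour the reporter observed in Hive.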
[jira] [Created] (SPARK-31525) Inconsistent result of df.head(1) and df.head()
Joshua Hendinata created SPARK-31525: Summary: Inconsistent result of df.head(1) and df.head() Key: SPARK-31525 URL: https://issues.apache.org/jira/browse/SPARK-31525 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 2.4.6, 3.0.0 Reporter: Joshua Hendinata In this line [https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L1339], if you call `df.head()` on an empty dataframe it returns *None*, but if you call `df.head(1)` on an empty dataframe it returns an *empty list* instead. This behaviour is inconsistent and can create confusion, especially when calling `len(df.head())`, which throws an exception for an empty dataframe
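The asymmetry is easy to reproduce with a minimal stand-in. `FakeDataFrame` below is illustrative, not PySpark; its `head` follows the semantics described in the report (`n=None` returns a single row or `None`, an explicit `n` returns a list):

```python
# Minimal stand-in (not PySpark itself) reproducing the reported asymmetry.
class FakeDataFrame:
    def __init__(self, rows):
        self._rows = rows

    def take(self, n):
        return self._rows[:n]

    def head(self, n=None):
        # Mirrors the described behaviour: no argument unwraps the first row,
        # an explicit argument always returns a (possibly empty) list.
        if n is None:
            rs = self.head(1)
            return rs[0] if rs else None
        return self.take(n)

empty = FakeDataFrame([])
one = FakeDataFrame([("a",)])
```

So `empty.head()` is `None` while `empty.head(1)` is `[]`, and `len(empty.head())` would raise a `TypeError`, which is the confusion the ticket describes.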
[jira] [Updated] (SPARK-31518) Expose filterByRange in JavaPairRDD
[ https://issues.apache.org/jira/browse/SPARK-31518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31518: -- Labels: (was: patch) > Expose filterByRange in JavaPairRDD > --- > > Key: SPARK-31518 > URL: https://issues.apache.org/jira/browse/SPARK-31518 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 3.1.0 > Reporter: Antonin Delpeuch > Assignee: Antonin Delpeuch > Priority: Major > Fix For: 3.1.0 > > > The `filterByRange` method makes it possible to efficiently filter a sorted > RDD using bounds on its keys. It prunes out partitions instead of scanning > them if a RangePartitioner is available in the RDD. > This method is part of the Scala API, defined in OrderedRDDFunctions, but is > not exposed in the Java API as far as I can tell. All other methods defined > in OrderedRDDFunctions are exposed in JavaPairRDD, therefore it seems natural > to expose `filterByRange` there too.
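The partition-pruning idea behind `filterByRange` can be sketched in a few lines. The layout below, with explicit per-partition key bounds, is a simplification of what a RangePartitioner provides; none of these names are Spark APIs:

```python
# Sketch of the idea behind filterByRange: with range-partitioned data, whole
# partitions outside [lower, upper] can be skipped instead of scanned.
def filter_by_range(partitions, bounds, lower, upper):
    """partitions[i] holds keys in [bounds[i], bounds[i+1]); bounds is sorted."""
    kept = []
    for i, part in enumerate(partitions):
        lo, hi = bounds[i], bounds[i + 1]
        if hi <= lower or lo > upper:
            # Partition lies entirely outside the requested range: prune it
            # without scanning a single element.
            continue
        kept.extend(k for k in part if lower <= k <= upper)
    return kept

partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
bounds = [1, 4, 7, 10]
```

Filtering for keys in `[4, 6]` scans only the middle partition; the first and last are pruned outright, which is the efficiency win the ticket wants exposed to Java users.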
[jira] [Updated] (SPARK-31518) Expose filterByRange in JavaPairRDD
[ https://issues.apache.org/jira/browse/SPARK-31518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31518: -- Affects Version/s: (was: 2.4.5) 3.1.0 > Expose filterByRange in JavaPairRDD > --- > > Key: SPARK-31518 > URL: https://issues.apache.org/jira/browse/SPARK-31518 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 3.1.0 > Reporter: Antonin Delpeuch > Assignee: Antonin Delpeuch > Priority: Major > Labels: patch > Fix For: 3.1.0 > > > The `filterByRange` method makes it possible to efficiently filter a sorted > RDD using bounds on its keys. It prunes out partitions instead of scanning > them if a RangePartitioner is available in the RDD. > This method is part of the Scala API, defined in OrderedRDDFunctions, but is > not exposed in the Java API as far as I can tell. All other methods defined > in OrderedRDDFunctions are exposed in JavaPairRDD, therefore it seems natural > to expose `filterByRange` there too.
[jira] [Resolved] (SPARK-31518) Expose filterByRange in JavaPairRDD
[ https://issues.apache.org/jira/browse/SPARK-31518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31518. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28293 [https://github.com/apache/spark/pull/28293] > Expose filterByRange in JavaPairRDD > --- > > Key: SPARK-31518 > URL: https://issues.apache.org/jira/browse/SPARK-31518 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 2.4.5 > Reporter: Antonin Delpeuch > Assignee: Antonin Delpeuch > Priority: Major > Labels: patch > Fix For: 3.1.0 > > > The `filterByRange` method makes it possible to efficiently filter a sorted > RDD using bounds on its keys. It prunes out partitions instead of scanning > them if a RangePartitioner is available in the RDD. > This method is part of the Scala API, defined in OrderedRDDFunctions, but is > not exposed in the Java API as far as I can tell. All other methods defined > in OrderedRDDFunctions are exposed in JavaPairRDD, therefore it seems natural > to expose `filterByRange` there too.
[jira] [Assigned] (SPARK-31518) Expose filterByRange in JavaPairRDD
[ https://issues.apache.org/jira/browse/SPARK-31518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-31518: - Assignee: Antonin Delpeuch > Expose filterByRange in JavaPairRDD > --- > > Key: SPARK-31518 > URL: https://issues.apache.org/jira/browse/SPARK-31518 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 2.4.5 > Reporter: Antonin Delpeuch > Assignee: Antonin Delpeuch > Priority: Major > Labels: patch > > The `filterByRange` method makes it possible to efficiently filter a sorted > RDD using bounds on its keys. It prunes out partitions instead of scanning > them if a RangePartitioner is available in the RDD. > This method is part of the Scala API, defined in OrderedRDDFunctions, but is > not exposed in the Java API as far as I can tell. All other methods defined > in OrderedRDDFunctions are exposed in JavaPairRDD, therefore it seems natural > to expose `filterByRange` there too.
[jira] [Updated] (SPARK-31523) LogicalPlan doCanonicalize should throw exception if not resolved
[ https://issues.apache.org/jira/browse/SPARK-31523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ulysses you updated SPARK-31523: Summary: LogicalPlan doCanonicalize should throw exception if not resolved (was: LogicalPlan doCanonicalize should throw exception if expression is not resolved) > LogicalPlan doCanonicalize should throw exception if not resolved > - > > Key: SPARK-31523 > URL: https://issues.apache.org/jira/browse/SPARK-31523 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.1.0 > Reporter: ulysses you > Priority: Minor >
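The ticket has no description, but the title suggests a fail-fast guard. A minimal sketch of that idea, using hypothetical names rather than Spark's actual LogicalPlan internals:

```python
# Hypothetical sketch of the guard the ticket's title proposes: canonicalization
# should fail fast on an unresolved plan instead of producing a bogus result.
class LogicalPlan:
    def __init__(self, resolved):
        self.resolved = resolved

    def do_canonicalize(self):
        if not self.resolved:
            raise RuntimeError("doCanonicalize requires a resolved plan")
        # A real implementation would normalize expression IDs, ordering, etc.
        return self
```

Canonicalizing a resolved plan succeeds; an unresolved one raises immediately rather than comparing garbage canonical forms downstream.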
[jira] [Updated] (SPARK-31523) LogicalPlan doCanonicalize should throw exception if expression is not resolved
[ https://issues.apache.org/jira/browse/SPARK-31523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ulysses you updated SPARK-31523: Summary: LogicalPlan doCanonicalize should throw exception if expression is not resolved (was: QueryPlan doCanonicalize should throw exception if expression is not resolved) > LogicalPlan doCanonicalize should throw exception if expression is not > resolved > --- > > Key: SPARK-31523 > URL: https://issues.apache.org/jira/browse/SPARK-31523 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.1.0 > Reporter: ulysses you > Priority: Minor >
[jira] [Created] (SPARK-31524) Add metric to the split number for skew partition when enable AQE
Ke Jia created SPARK-31524: -- Summary: Add metric to the split number for skew partition when enable AQE Key: SPARK-31524 URL: https://issues.apache.org/jira/browse/SPARK-31524 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Ke Jia Add detailed metrics for the number of splits in skewed partitions when AQE and skew join optimization are enabled.
[jira] [Updated] (SPARK-31523) QueryPlan doCanonicalize should throw exception if expression is not resolved
[ https://issues.apache.org/jira/browse/SPARK-31523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ulysses you updated SPARK-31523: Summary: QueryPlan doCanonicalize should throw exception if expression is not resolved (was: QueryPlan doCanonicalize should throw exception when expression is not resolved) > QueryPlan doCanonicalize should throw exception if expression is not resolved > - > > Key: SPARK-31523 > URL: https://issues.apache.org/jira/browse/SPARK-31523 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.1.0 > Reporter: ulysses you > Priority: Minor >
[jira] [Created] (SPARK-31523) QueryPlan doCanonicalize should throw exception when expression is not resolved
ulysses you created SPARK-31523: --- Summary: QueryPlan doCanonicalize should throw exception when expression is not resolved Key: SPARK-31523 URL: https://issues.apache.org/jira/browse/SPARK-31523 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: ulysses you
[jira] [Assigned] (SPARK-29641) Stage Level Sched: Add python api's and test
[ https://issues.apache.org/jira/browse/SPARK-29641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-29641: Assignee: Thomas Graves > Stage Level Sched: Add python api's and test > > > Key: SPARK-29641 > URL: https://issues.apache.org/jira/browse/SPARK-29641 > Project: Spark > Issue Type: Story > Components: PySpark > Affects Versions: 3.0.0 > Reporter: Thomas Graves > Assignee: Thomas Graves > Priority: Major > > For the Stage Level scheduling feature, add any missing python api and test > it.
[jira] [Resolved] (SPARK-29641) Stage Level Sched: Add python api's and test
[ https://issues.apache.org/jira/browse/SPARK-29641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29641. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28085 [https://github.com/apache/spark/pull/28085] > Stage Level Sched: Add python api's and test > > > Key: SPARK-29641 > URL: https://issues.apache.org/jira/browse/SPARK-29641 > Project: Spark > Issue Type: Story > Components: PySpark > Affects Versions: 3.0.0 > Reporter: Thomas Graves > Assignee: Thomas Graves > Priority: Major > Fix For: 3.1.0 > > > For the Stage Level scheduling feature, add any missing python api and test > it.
[jira] [Resolved] (SPARK-31272) Support DB2 Kerberos login in JDBC connector
[ https://issues.apache.org/jira/browse/SPARK-31272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin resolved SPARK-31272. Fix Version/s: 3.1.0 Assignee: Gabor Somogyi Resolution: Fixed > Support DB2 Kerberos login in JDBC connector > > > Key: SPARK-31272 > URL: https://issues.apache.org/jira/browse/SPARK-31272 > Project: Spark > Issue Type: Sub-task > Components: SQL > Affects Versions: 3.1.0 > Reporter: Gabor Somogyi > Assignee: Gabor Somogyi > Priority: Major > Fix For: 3.1.0 > >
[jira] [Comment Edited] (SPARK-10925) Exception when joining DataFrames
[ https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088714#comment-17088714 ] Kushal Yellam edited comment on SPARK-10925 at 4/22/20, 7:21 PM: - Hi Team I'm facing following issue with joining two dataframes. For example, in your test case, you can try this: val df = sqlContext.createDataFrame(rdd) val df1 = df; val df2 = df1; val df3 = df1.join(df2, df1("M_ID") === df2("M_ID")) val df4 = df3.join(df2, df3("M_ID") === df2("M_ID")) df4.show() in operator !Join LeftOuter, (((cast(M_ID#4443 as bigint) = M_ID#3056L) && (r_ym#1213= r_ym#4352)) && (country_code#6654 = country_code#8884)). Attribute(s) with the same name appear in the operation: r_ym. Please check if the right attribute(s) are used. was (Author: kyellam): Hi Team I'm facing following issue with joining two dataframes. For example, in your test case, you can try this: val df = sqlContext.createDataFrame(rdd) val df1 = df; val df2 = df1; val df3 = df1.join(df2, df1("name") === df2("name")) val df4 = df3.join(df2, df3("name") === df2("name")) df4.show() in operator !Join LeftOuter, (((cast(M_ID#4443 as bigint) = M_ID#3056L) && (r_ym#1213= r_ym#4352)) && (country_code#6654 = country_code#8884)). Attribute(s) with the same name appear in the operation: r_ym. Please check if the right attribute(s) are used. > Exception when joining DataFrames > - > > Key: SPARK-10925 > URL: https://issues.apache.org/jira/browse/SPARK-10925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 > Environment: Tested with Spark 1.5.0 and Spark 1.5.1 >Reporter: Alexis Seigneurin >Priority: Major > Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase.scala, > TestCase2.scala > > > I get an exception when joining a DataFrame with another DataFrame. The > second DataFrame was created by performing an aggregation on the first > DataFrame. 
> My complete workflow is: > # read the DataFrame > # apply an UDF on column "name" > # apply an UDF on column "surname" > # apply an UDF on column "birthDate" > # aggregate on "name" and re-join with the DF > # aggregate on "surname" and re-join with the DF > If I remove one step, the process completes normally. > Here is the exception: > {code} > Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved > attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in > operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS > birthDate_cleaned#8]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at 
scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at
[jira] [Comment Edited] (SPARK-10925) Exception when joining DataFrames
[ https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088714#comment-17088714 ] Kushal Yellam edited comment on SPARK-10925 at 4/22/20, 7:21 PM: - Hi Team I'm facing following issue with joining two dataframes. For example, in your test case, you can try this: val df = sqlContext.createDataFrame(rdd) val df1 = df; val df2 = df1; val df3 = df1.join(df2, df1("M_ID") === df2("M_ID")) val df4 = df3.join(df2, df3("M_ID") === df2("M_ID")) df4.show() in operator !Join LeftOuter, (((cast(M_ID#4443 as bigint) = M_ID#3056L) Attribute(s) with the same name appear in the operation: M_ID. Please check if the right attribute(s) are used. was (Author: kyellam): Hi Team I'm facing following issue with joining two dataframes. For example, in your test case, you can try this: val df = sqlContext.createDataFrame(rdd) val df1 = df; val df2 = df1; val df3 = df1.join(df2, df1("M_ID") === df2("M_ID")) val df4 = df3.join(df2, df3("M_ID") === df2("M_ID")) df4.show() in operator !Join LeftOuter, (((cast(M_ID#4443 as bigint) = M_ID#3056L) && (r_ym#1213= r_ym#4352)) && (country_code#6654 = country_code#8884)). Attribute(s) with the same name appear in the operation: r_ym. Please check if the right attribute(s) are used. > Exception when joining DataFrames > - > > Key: SPARK-10925 > URL: https://issues.apache.org/jira/browse/SPARK-10925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 > Environment: Tested with Spark 1.5.0 and Spark 1.5.1 >Reporter: Alexis Seigneurin >Priority: Major > Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase.scala, > TestCase2.scala > > > I get an exception when joining a DataFrame with another DataFrame. The > second DataFrame was created by performing an aggregation on the first > DataFrame. 
> My complete workflow is: > # read the DataFrame > # apply an UDF on column "name" > # apply an UDF on column "surname" > # apply an UDF on column "birthDate" > # aggregate on "name" and re-join with the DF > # aggregate on "surname" and re-join with the DF > If I remove one step, the process completes normally. > Here is the exception: > {code} > Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved > attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in > operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS > birthDate_cleaned#8]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at 
scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at >
[jira] [Comment Edited] (SPARK-10925) Exception when joining DataFrames
[ https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088714#comment-17088714 ] Kushal Yellam edited comment on SPARK-10925 at 4/22/20, 7:20 PM: - Hi Team I'm facing following issue with joining two dataframes. For example, in your test case, you can try this: val df = sqlContext.createDataFrame(rdd) val df1 = df; val df2 = df1; val df3 = df1.join(df2, df1("name") === df2("name")) val df4 = df3.join(df2, df3("name") === df2("name")) df4.show() in operator !Join LeftOuter, (((cast(M_ID#4443 as bigint) = M_ID#3056L) && (r_ym#1213= r_ym#4352)) && (country_code#6654 = country_code#8884)). Attribute(s) with the same name appear in the operation: r_ym. Please check if the right attribute(s) are used. was (Author: kyellam): Hi Team I'm facing following issue with joining two dataframes. in operator !Join LeftOuter, (((cast(M_ID#4443 as bigint) = M_ID#3056L) && (r_ym#1213= r_ym#4352)) && (country_code#6654 = country_code#8884)). Attribute(s) with the same name appear in the operation: r_ym. Please check if the right attribute(s) are used. > Exception when joining DataFrames > - > > Key: SPARK-10925 > URL: https://issues.apache.org/jira/browse/SPARK-10925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 > Environment: Tested with Spark 1.5.0 and Spark 1.5.1 >Reporter: Alexis Seigneurin >Priority: Major > Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase.scala, > TestCase2.scala > > > I get an exception when joining a DataFrame with another DataFrame. The > second DataFrame was created by performing an aggregation on the first > DataFrame. > My complete workflow is: > # read the DataFrame > # apply an UDF on column "name" > # apply an UDF on column "surname" > # apply an UDF on column "birthDate" > # aggregate on "name" and re-join with the DF > # aggregate on "surname" and re-join with the DF > If I remove one step, the process completes normally. 
> Here is the exception: > {code} > Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved > attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in > operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS > birthDate_cleaned#8]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) >
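The duplicate-attribute failure reported above comes from joining a DataFrame with itself, so both sides of the join expose columns with identical names. A toy Python model of the usual workaround (plain lists of dicts, not Spark code): rename one side's columns before the self-join, as one would with `withColumnRenamed` or `alias` in Spark:

```python
# Toy model of a self-join, not Spark code: each "DataFrame" is a list of dicts.
def join(left, right, key_l, key_r):
    # Inner join on two differently named key columns; no name collisions remain.
    return [{**l, **r} for l in left for r in right if l[key_l] == r[key_r]]

people = [{"name": "ada", "age": 36}, {"name": "bob", "age": 41}]
# Rename the right side's columns first: this is the workaround for the
# "Attribute(s) with the same name appear in the operation" error above.
renamed = [{f"r_{k}": v for k, v in row.items()} for row in people]
joined = join(people, renamed, "name", "r_name")
```

In Spark itself the analogous fix is joining against `df.withColumnRenamed("name", "r_name")` (or an aliased copy), which leaves the analyzer nothing ambiguous to resolve.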
[jira] [Comment Edited] (SPARK-10925) Exception when joining DataFrames
[ https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088714#comment-17088714 ] Kushal Yellam edited comment on SPARK-10925 at 4/22/20, 7:17 PM: - Hi Team I'm facing following issue with joining two dataframes. in operator !Join LeftOuter, (((cast(M_ID#4443 as bigint) = M_ID#3056L) && (r_ym#1213= r_ym#4352)) && (country_code#6654 = country_code#8884)). Attribute(s) with the same name appear in the operation: r_ym. Please check if the right attribute(s) are used. was (Author: kyellam): Hi Team I'm facing following issue with joining two dataframes. in operator !Join LeftOuter, (((cast(Contract_ID#7212 as bigint) = contract_id#5029L) && (record_yearmonth#9867 = record_yearmonth#5103)) && (Admin_System#7659 = admin_system#5098)). Attribute(s) with the same name appear in the operation: record_yearmonth. Please check if the right attribute(s) are used.;; > Exception when joining DataFrames > - > > Key: SPARK-10925 > URL: https://issues.apache.org/jira/browse/SPARK-10925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 > Environment: Tested with Spark 1.5.0 and Spark 1.5.1 >Reporter: Alexis Seigneurin >Priority: Major > Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase.scala, > TestCase2.scala > > > I get an exception when joining a DataFrame with another DataFrame. The > second DataFrame was created by performing an aggregation on the first > DataFrame. > My complete workflow is: > # read the DataFrame > # apply an UDF on column "name" > # apply an UDF on column "surname" > # apply an UDF on column "birthDate" > # aggregate on "name" and re-join with the DF > # aggregate on "surname" and re-join with the DF > If I remove one step, the process completes normally. 
> Here is the exception: > {code} > Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved > attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in > operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS > birthDate_cleaned#8]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at >
[jira] [Commented] (SPARK-27623) Provider org.apache.spark.sql.avro.AvroFileFormat could not be instantiated
[ https://issues.apache.org/jira/browse/SPARK-27623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089917#comment-17089917 ] Nicholas Chammas commented on SPARK-27623: -- Is this perhaps just a documentation issue? i.e. The documentation [here|http://spark.apache.org/docs/2.4.5/sql-data-sources-avro.html#deploying] should use 2.11 instead of 2.12, or at least clarify how to identify what to use. The [downloads page|http://spark.apache.org/downloads.html] does say: {quote}Note that, Spark is pre-built with Scala 2.11 except version 2.4.2, which is pre-built with Scala 2.12. {quote} Which I think explains the behavior people are reporting here. > Provider org.apache.spark.sql.avro.AvroFileFormat could not be instantiated > --- > > Key: SPARK-27623 > URL: https://issues.apache.org/jira/browse/SPARK-27623 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.2 >Reporter: Alexandru Barbulescu >Priority: Major > > After updating to spark 2.4.2 when using the > {code:java} > spark.read.format().options().load() > {code} > > chain of methods, regardless of what parameter is passed to "format" we get > the following error related to avro: > > {code:java} > - .options(**load_options) > - File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line > 172, in load > - File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line > 1257, in __call__ > - File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in > deco > - File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line > 328, in get_return_value > - py4j.protocol.Py4JJavaError: An error occurred while calling o69.load. 
> - : java.util.ServiceConfigurationError: > org.apache.spark.sql.sources.DataSourceRegister: Provider > org.apache.spark.sql.avro.AvroFileFormat could not be instantiated > - at java.util.ServiceLoader.fail(ServiceLoader.java:232) > - at java.util.ServiceLoader.access$100(ServiceLoader.java:185) > - at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384) > - at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404) > - at java.util.ServiceLoader$1.next(ServiceLoader.java:480) > - at > scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:44) > - at scala.collection.Iterator.foreach(Iterator.scala:941) > - at scala.collection.Iterator.foreach$(Iterator.scala:941) > - at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) > - at scala.collection.IterableLike.foreach(IterableLike.scala:74) > - at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > - at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > - at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:250) > - at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:248) > - at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108) > - at scala.collection.TraversableLike.filter(TraversableLike.scala:262) > - at scala.collection.TraversableLike.filter$(TraversableLike.scala:262) > - at scala.collection.AbstractTraversable.filter(Traversable.scala:108) > - at > org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:630) > - at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194) > - at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167) > - at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > - at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > - at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > - at 
java.lang.reflect.Method.invoke(Method.java:498) > - at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > - at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > - at py4j.Gateway.invoke(Gateway.java:282) > - at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > - at py4j.commands.CallCommand.execute(CallCommand.java:79) > - at py4j.GatewayConnection.run(GatewayConnection.java:238) > - at java.lang.Thread.run(Thread.java:748) > - Caused by: java.lang.NoClassDefFoundError: > org/apache/spark/sql/execution/datasources/FileFormat$class > - at org.apache.spark.sql.avro.AvroFileFormat.(AvroFileFormat.scala:44) > - at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > - at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > - at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > - at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > - at java.lang.Class.newInstance(Class.java:442) > - at
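If the root cause is the Scala-suffix mismatch suggested above, the user-side fix is to pick the spark-avro artifact whose `_2.11`/`_2.12` suffix matches the Scala version the Spark distribution was built with. A hypothetical invocation for a 2.4.5 download (pre-built with Scala 2.11; the application file name is a placeholder):

```bash
# Assumed example: artifact suffix must match the distribution's Scala version.
spark-submit \
  --packages org.apache.spark:spark-avro_2.11:2.4.5 \
  my_app.py
```

For Spark 2.4.2, the only 2.4.x release pre-built with Scala 2.12 per the downloads-page note quoted above, the suffix would be `_2.12` instead.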
[jira] [Updated] (SPARK-31522) Hive metastore client initialization related configurations should be static
[ https://issues.apache.org/jira/browse/SPARK-31522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-31522: - Summary: Hive metastore client initialization related configurations should be static (was: Make immutable configs in HiveUtils static) > Hive metastore client initialization related configurations should be static > - > > Key: SPARK-31522 > URL: https://issues.apache.org/jira/browse/SPARK-31522 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5, 3.0.0, 3.1.0 >Reporter: Kent Yao >Priority: Major > > The following configurations defined in HiveUtils should be considered static: > # spark.sql.hive.metastore.version - used to determine the hive version in > Spark > # spark.sql.hive.version - the fake of the above > # spark.sql.hive.metastore.jars - hive metastore related jars location which > is used by spark to create hive client > # spark.sql.hive.metastore.sharedPrefixes and > spark.sql.hive.metastore.barrierPrefixes - packages of classes that are > shared or separated between SparkContextLoader and hive client class loader > Those are used only once when creating the hive metastore client. They should > be static in SQLConf for retrieving them correctly. We should avoid them > being changed by users with SET/RESET command. They should -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
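The static-vs-runtime distinction argued for above can be sketched with a toy configuration holder (an illustration of the intended semantics, not Spark's SQLConf implementation): static entries are fixed when the session is created, and SET on them is rejected:

```python
# Toy sketch, not Spark code: static configs reject modification after startup.
class Conf:
    STATIC = {
        "spark.sql.hive.metastore.version",
        "spark.sql.hive.metastore.jars",
        "spark.sql.hive.metastore.sharedPrefixes",
        "spark.sql.hive.metastore.barrierPrefixes",
    }

    def __init__(self, initial):
        self._values = dict(initial)  # fixed at "session" creation time

    def set(self, key, value):
        # Mimics rejecting SET/RESET on a static entry.
        if key in self.STATIC:
            raise ValueError(f"Cannot modify the value of a static config: {key}")
        self._values[key] = value

conf = Conf({"spark.sql.hive.metastore.version": "2.3.7"})
conf.set("spark.sql.shuffle.partitions", "64")  # runtime conf: allowed
```

Setting `spark.sql.hive.metastore.version` on `conf` after construction would raise, which is exactly the behavior the issue asks for.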
[jira] [Updated] (SPARK-31503) fix the SQL string of the TRIM functions
[ https://issues.apache.org/jira/browse/SPARK-31503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31503: -- Affects Version/s: 2.3.4 2.4.5 > fix the SQL string of the TRIM functions > > > Key: SPARK-31503 > URL: https://issues.apache.org/jira/browse/SPARK-31503 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.5, 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.4.6, 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31503) fix the SQL string of the TRIM functions
[ https://issues.apache.org/jira/browse/SPARK-31503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089850#comment-17089850 ] Dongjoon Hyun commented on SPARK-31503: --- This is backported to branch-2.4 via https://github.com/apache/spark/pull/28299 . > fix the SQL string of the TRIM functions > > > Key: SPARK-31503 > URL: https://issues.apache.org/jira/browse/SPARK-31503 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.4.6, 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31503) fix the SQL string of the TRIM functions
[ https://issues.apache.org/jira/browse/SPARK-31503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31503: -- Fix Version/s: 2.4.6 > fix the SQL string of the TRIM functions > > > Key: SPARK-31503 > URL: https://issues.apache.org/jira/browse/SPARK-31503 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.4.6, 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31512) make window function using order by optional
[ https://issues.apache.org/jira/browse/SPARK-31512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31512. -- Resolution: Duplicate > make window function using order by optional > > > Key: SPARK-31512 > URL: https://issues.apache.org/jira/browse/SPARK-31512 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.5 > Environment: spark2.4.5 > hadoop2.7.7 >Reporter: philipse >Priority: Minor > > Hi all > In other sql dialect ,order by is not the must when using window function,we > may make it pararmteterized > the below is the case : > *select row_number()over() from test1* > Error: org.apache.spark.sql.AnalysisException: Window function row_number() > requires window to be ordered, please add ORDER BY clause. For example SELECT > row_number()(value_expr) OVER (PARTITION BY window_partition ORDER BY > window_ordering) from table; (state=,code=0) > > So. i suggest make it as a choice,or we will meet the error when migrate sql > from other dialect,such as hive -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
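Until the ORDER BY requirement is relaxed, the analyzer error quoted above points at the workaround when migrating such queries: supply an explicit ordering. A sketch against the reporter's table (the ordering column is an assumption for illustration):

```sql
-- Any orderable column satisfies the analyzer; if no meaningful order exists,
-- the assigned row numbers are arbitrary but the query runs.
SELECT row_number() OVER (ORDER BY id) AS rn
FROM test1;
```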
[jira] [Updated] (SPARK-31522) Make immutable configs in HiveUtils static
[ https://issues.apache.org/jira/browse/SPARK-31522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-31522: - Affects Version/s: 3.0.0 2.0.2 2.1.3 2.2.3 2.3.4 2.4.5 > Make immutable configs in HiveUtils static > -- > > Key: SPARK-31522 > URL: https://issues.apache.org/jira/browse/SPARK-31522 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5, 3.0.0, 3.1.0 >Reporter: Kent Yao >Priority: Major > > The following configurations defined in HiveUtils should be considered static: > # spark.sql.hive.metastore.version - used to determine the hive version in > Spark > # spark.sql.hive.version - the fake of the above > # spark.sql.hive.metastore.jars - hive metastore related jars location which > is used by spark to create hive client > # spark.sql.hive.metastore.sharedPrefixes and > spark.sql.hive.metastore.barrierPrefixes - packages of classes that are > shared or separated between SparkContextLoader and hive client class loader > Those are used only once when creating the hive metastore client. They should > be static in SQLConf for retrieving them correctly. We should avoid them > being changed by users with SET/RESET command. They should -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31522) Make immutable configs in HiveUtils static
Kent Yao created SPARK-31522: Summary: Make immutable configs in HiveUtils static Key: SPARK-31522 URL: https://issues.apache.org/jira/browse/SPARK-31522 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.0 Reporter: Kent Yao The following configurations defined in HiveUtils should be considered static: # spark.sql.hive.metastore.version - used to determine the hive version in Spark # spark.sql.hive.version - the fake of the above # spark.sql.hive.metastore.jars - hive metastore related jars location which is used by spark to create hive client # spark.sql.hive.metastore.sharedPrefixes and spark.sql.hive.metastore.barrierPrefixes - packages of classes that are shared or separated between SparkContextLoader and hive client class loader Those are used only once when creating the hive metastore client. They should be static in SQLConf for retrieving them correctly. We should avoid them being changed by users with SET/RESET command. They should -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31521) The fetch size is not correct when merging blocks into a merged block
wuyi created SPARK-31521: Summary: The fetch size is not correct when merging blocks into a merged block Key: SPARK-31521 URL: https://issues.apache.org/jira/browse/SPARK-31521 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: wuyi When merging blocks into a merged block, we should count the size of that merged block as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
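A minimal sketch of the size-accounting pitfall described above (an illustration of the stated intent, not Spark's ShuffleBlockFetcherIterator code): when coalescing small blocks into one merged fetch request, the running total must include the block currently being merged, otherwise requests overshoot the target size:

```python
def split_requests(block_sizes, target):
    """Group block sizes into fetch requests of roughly `target` bytes."""
    requests, current, current_size = [], [], 0
    for size in block_sizes:
        current.append(size)
        current_size += size  # count the block being merged as well
        if current_size >= target:
            requests.append(current)
            current, current_size = [], 0
    if current:
        requests.append(current)
    return requests
```

Dropping the `current_size += size` accounting for the in-progress block is the kind of undercount the issue describes: the request keeps absorbing blocks past the target.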
[jira] [Comment Edited] (SPARK-28367) Kafka connector infinite wait because metadata never updated
[ https://issues.apache.org/jira/browse/SPARK-28367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088514#comment-17088514 ] Gabor Somogyi edited comment on SPARK-28367 at 4/22/20, 1:35 PM: - I've taken a look at the possibilities given by the new API in [KIP-396|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=97551484]. I've found the following problems: * Consumer properties don't match 100% with AdminClient, so Consumer properties can't be used for instantiation (at first glance I think adding this to the Spark API would be overkill) * With the new API, by using AdminClient Spark loses the ability to use the assign, subscribe and subscribePattern APIs (implementing this logic would be feasible since the Kafka consumer does this on the client side as well, but it would be ugly). My main conclusion is that adding AdminClient and using Consumer in a parallel way would be super hacky. I would use either the Consumer (which doesn't provide metadata-only access at the moment) or the AdminClient (where it must be checked whether all existing features can be covered + how to add properties). was (Author: gsomogyi): I've taken a look at the possibilities given by the new API in [KIP-396|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=97551484]. I've found the following problems: * Consumer properties don't match 100% with AdminClient so Consumer properties can't be used for instantiation (at the first glance I think adding this to the API would be an overkill) * With the new API by using AdminClient Spark looses the possibility to use the assign, subscribe and subscribePattern APIs (implementing this logic would be feasible since Kafka consumer does this on client side as well but would be ugly). My main conclusion is that adding AdminClient and using Consumer in a parallel way would be super hacky. 
I would use either Consumer (which doesn't provide metadata only at the moment) or AdminClient (where it must be checked whether all existing features can be filled + how to add properties). > Kafka connector infinite wait because metadata never updated > > > Key: SPARK-28367 > URL: https://issues.apache.org/jira/browse/SPARK-28367 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.3, 2.2.3, 2.3.3, 2.4.3, 3.0.0 >Reporter: Gabor Somogyi >Priority: Critical > > Spark uses an old and deprecated API named poll(long) which never returns and > stays in live lock if metadata is not updated (for instance when broker > disappears at consumer creation). > I've created a small standalone application to test it and the alternatives: > https://github.com/gaborgsomogyi/kafka-get-assignment -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31508) string type compare with numberic cause data inaccurate
[ https://issues.apache.org/jira/browse/SPARK-31508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089663#comment-17089663 ] philipse commented on SPARK-31508: -- Haha, but normally it will be a little more complex: we will migrate many HQL queries to Spark SQL, so I suggest this would be better dealt with in the code. Can you help review the PR? > string type compare with numberic cause data inaccurate > --- > > Key: SPARK-31508 > URL: https://issues.apache.org/jira/browse/SPARK-31508 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 > Environment: hadoop2.7 > spark2.4.5 >Reporter: philipse >Priority: Major > Attachments: image-2020-04-22-20-00-09-821.png > > > Hi all > > Sparksql may should convert values to double if string type compare with > number type.the cases shows as below > 1, create table > create table test1(id string); > > 2,insert data into table > insert into test1 select 'avc'; > insert into test1 select '2'; > insert into test1 select '0a'; > insert into test1 select ''; > insert into test1 select > '22'; > 3.Let's check what's happening > select * from test_gf13871.test1 where id > 0 > the results shows below > *2* > ** > Really amazing,the big number 222...cannot be selected. > while when i check in hive,the 222...shows normal. > 4.try to explain the command,we may know what happened,if the data is big > enough than max_int_value,it will not selected,we may need to convert to > double instand. > !image-2020-04-21-18-49-58-850.png! > I wanna know if we have fixed or planned in 3.0 or later version.,please feel > free to give any advice, > > Many Thanks -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
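The disappearing rows can be modeled in a few lines of plain Python (an assumption about the described 2.4 behavior, not Spark's actual cast rules): if the string side is cast to a 32-bit int, values that fail the cast or overflow become NULL and the filter drops them, while casting both sides to double keeps them:

```python
INT_MAX = 2**31 - 1  # assumed 32-bit bound, matching "max_int_value" above

def compare_as_int(s, n):
    # Model of the reported behavior: failed or overflowing casts yield NULL.
    try:
        v = int(s)
    except ValueError:
        return None
    return v > n if -INT_MAX - 1 <= v <= INT_MAX else None

def compare_as_double(s, n):
    # The reporter's suggestion: compare both sides as doubles instead.
    try:
        return float(s) > float(n)
    except ValueError:
        return None

rows = ["avc", "2", "0a", "", "2" * 21]  # the inserted test values
kept_int = [r for r in rows if compare_as_int(r, 0)]
kept_double = [r for r in rows if compare_as_double(r, 0)]
```

Under the int model only "2" survives the `id > 0` filter; under the double model the 21-digit value survives as well, which matches the Hive result the reporter expected.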
[jira] [Resolved] (SPARK-31495) Support formatted explain for Adaptive Query Execution
[ https://issues.apache.org/jira/browse/SPARK-31495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31495. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28271 [https://github.com/apache/spark/pull/28271] > Support formatted explain for Adaptive Query Execution > -- > > Key: SPARK-31495 > URL: https://issues.apache.org/jira/browse/SPARK-31495 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Fix For: 3.0.0 > > > Support formatted explain for AQE -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31495) Support formatted explain for Adaptive Query Execution
[ https://issues.apache.org/jira/browse/SPARK-31495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31495: --- Assignee: wuyi > Support formatted explain for Adaptive Query Execution > -- > > Key: SPARK-31495 > URL: https://issues.apache.org/jira/browse/SPARK-31495 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > > Support formatted explain for AQE -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12312) JDBC connection to Kerberos secured databases fails on remote executors
[ https://issues.apache.org/jira/browse/SPARK-12312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089617#comment-17089617 ] Gabor Somogyi commented on SPARK-12312: --- Soon the MS SQL and Oracle server is coming which is not possible to dockerize (fully featuring active directory setup is needed which I can't dockerize). I'm going to test it manually on cluster but if somebody wants to contribute by double checking just ping me. > JDBC connection to Kerberos secured databases fails on remote executors > --- > > Key: SPARK-12312 > URL: https://issues.apache.org/jira/browse/SPARK-12312 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 2.4.2 >Reporter: nabacg >Priority: Minor > > When loading DataFrames from JDBC datasource with Kerberos authentication, > remote executors (yarn-client/cluster etc. modes) fail to establish a > connection due to lack of Kerberos ticket or ability to generate it. > This is a real issue when trying to ingest data from kerberized data sources > (SQL Server, Oracle) in enterprise environment where exposing simple > authentication access is not an option due to IT policy issues. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31508) string type compare with numberic cause data inaccurate
[ https://issues.apache.org/jira/browse/SPARK-31508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089596#comment-17089596 ] JinxinTang edited comment on SPARK-31508 at 4/22/20, 12:00 PM: --- For example: select * from default.test1 where id > 3D. Adding 'D' after the number 3 makes the literal a double and gives a correct result; this may work: !image-2020-04-22-20-00-09-821.png! Have fun~ was (Author: jinxintang): for example: select * from default.test1 where id > 3D add 'D' after number 3,this will not convert to long and int may alse work, have fun~ > string type compare with numberic cause data inaccurate > --- > > Key: SPARK-31508 > URL: https://issues.apache.org/jira/browse/SPARK-31508 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 > Environment: hadoop2.7 > spark2.4.5 >Reporter: philipse >Priority: Major > Attachments: image-2020-04-22-20-00-09-821.png > > > Hi all > > Sparksql may should convert values to double if string type compare with > number type.the cases shows as below > 1, create table > create table test1(id string); > > 2,insert data into table > insert into test1 select 'avc'; > insert into test1 select '2'; > insert into test1 select '0a'; > insert into test1 select ''; > insert into test1 select > '22'; > 3.Let's check what's happening > select * from test_gf13871.test1 where id > 0 > the results shows below > *2* > ** > Really amazing,the big number 222...cannot be selected. > while when i check in hive,the 222...shows normal. > 4.try to explain the command,we may know what happened,if the data is big > enough than max_int_value,it will not selected,we may need to convert to > double instand. > !image-2020-04-21-18-49-58-850.png! 
> I wanna know if we have fixed or planned in 3.0 or later version.,please feel > free to give any advice, > > Many Thanks -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31520) Add a readiness probe to spark pod definitions
Dale Richardson created SPARK-31520: --- Summary: Add a readiness probe to spark pod definitions Key: SPARK-31520 URL: https://issues.apache.org/jira/browse/SPARK-31520 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 2.4.5 Reporter: Dale Richardson Add a readiness probe so that Kubernetes can communicate basic Spark pod state to the end user via get/describe pod commands. A basic TCP/SYN probe of the RPC port should be enough to indicate that a Spark process is running. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
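A hedged sketch of what such a probe could look like in the driver pod spec (hand-written illustration, not output generated by Spark; the port is an assumption and should match whatever `spark.driver.port` resolves to):

```yaml
readinessProbe:
  tcpSocket:
    port: 7078        # assumed driver RPC port
  initialDelaySeconds: 5
  periodSeconds: 10
```

A TCP probe only checks that the port accepts connections, which is exactly the "a Spark process is running" signal the issue asks for, without requiring an HTTP endpoint.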
[jira] [Issue Comment Deleted] (SPARK-31508) string type compare with numberic cause data inaccurate
[ https://issues.apache.org/jira/browse/SPARK-31508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JinxinTang updated SPARK-31508: --- Comment: was deleted (was: such '2...' is far larger than long, I think we should not covert to long.) > string type compare with numberic cause data inaccurate > --- > > Key: SPARK-31508 > URL: https://issues.apache.org/jira/browse/SPARK-31508 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 > Environment: hadoop2.7 > spark2.4.5 >Reporter: philipse >Priority: Major > > Hi all > > Sparksql may should convert values to double if string type compare with > number type.the cases shows as below > 1, create table > create table test1(id string); > > 2,insert data into table > insert into test1 select 'avc'; > insert into test1 select '2'; > insert into test1 select '0a'; > insert into test1 select ''; > insert into test1 select > '22'; > 3.Let's check what's happening > select * from test_gf13871.test1 where id > 0 > the results shows below > *2* > ** > Really amazing,the big number 222...cannot be selected. > while when i check in hive,the 222...shows normal. > 4.try to explain the command,we may know what happened,if the data is big > enough than max_int_value,it will not selected,we may need to convert to > double instand. > !image-2020-04-21-18-49-58-850.png! > I wanna know if we have fixed or planned in 3.0 or later version.,please feel > free to give any advice, > > Many Thanks -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31508) string type compare with numberic cause data inaccurate
[ https://issues.apache.org/jira/browse/SPARK-31508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089598#comment-17089598 ] JinxinTang edited comment on SPARK-31508 at 4/22/20, 11:49 AM: --- such '2...' is far larger than long, I think we should not covert to long. was (Author: jinxintang): such '2...' is far larger than big, I think we should not covert to long. > string type compare with numberic cause data inaccurate > --- > > Key: SPARK-31508 > URL: https://issues.apache.org/jira/browse/SPARK-31508 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 > Environment: hadoop2.7 > spark2.4.5 >Reporter: philipse >Priority: Major > > Hi all > > Sparksql may should convert values to double if string type compare with > number type.the cases shows as below > 1, create table > create table test1(id string); > > 2,insert data into table > insert into test1 select 'avc'; > insert into test1 select '2'; > insert into test1 select '0a'; > insert into test1 select ''; > insert into test1 select > '22'; > 3.Let's check what's happening > select * from test_gf13871.test1 where id > 0 > the results shows below > *2* > ** > Really amazing,the big number 222...cannot be selected. > while when i check in hive,the 222...shows normal. > 4.try to explain the command,we may know what happened,if the data is big > enough than max_int_value,it will not selected,we may need to convert to > double instand. > !image-2020-04-21-18-49-58-850.png! > I wanna know if we have fixed or planned in 3.0 or later version.,please feel > free to give any advice, > > Many Thanks -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31508) string type compared with numeric causes inaccurate data
[ https://issues.apache.org/jira/browse/SPARK-31508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089598#comment-17089598 ] JinxinTang commented on SPARK-31508: Such a value '2...' is far larger than a long can hold; I think we should not convert it to long.
> string type compared with numeric causes inaccurate data
> ---
>
> Key: SPARK-31508
> URL: https://issues.apache.org/jira/browse/SPARK-31508
[jira] [Commented] (SPARK-31508) string type compared with numeric causes inaccurate data
[ https://issues.apache.org/jira/browse/SPARK-31508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089596#comment-17089596 ] JinxinTang commented on SPARK-31508: For example:
select * from default.test1 where id > 3D
Add 'D' after the number 3; this will not convert to long, and int may also work. Have fun~
> string type compared with numeric causes inaccurate data
> ---
>
> Key: SPARK-31508
> URL: https://issues.apache.org/jira/browse/SPARK-31508
[jira] [Commented] (SPARK-31508) string type compared with numeric causes inaccurate data
[ https://issues.apache.org/jira/browse/SPARK-31508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089583#comment-17089583 ] JinxinTang commented on SPARK-31508: You are right. We can see from the generated code in branch 2.4 that the conversion uses the org.apache.spark.unsafe.types.UTF8String#toInt method.
> string type compared with numeric causes inaccurate data
> ---
>
> Key: SPARK-31508
> URL: https://issues.apache.org/jira/browse/SPARK-31508
[jira] [Issue Comment Deleted] (SPARK-31508) string type compared with numeric causes inaccurate data
[ https://issues.apache.org/jira/browse/SPARK-31508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JinxinTang updated SPARK-31508: --- Comment: was deleted (was: You are right. We can see from the generated code in branch 2.4 that the conversion uses the org.apache.spark.unsafe.types.UTF8String#toInt method.)
> string type compared with numeric causes inaccurate data
> ---
>
> Key: SPARK-31508
> URL: https://issues.apache.org/jira/browse/SPARK-31508
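The SPARK-31508 thread above hinges on how the string side of `id > 0` is cast before the comparison. A minimal Python sketch of the mechanism (illustrative only, not Spark code; the helper names and the INT_MAX bound are assumptions for the demo): comparing as a fixed-width integer silently drops values that overflow or fail to parse, while comparing as double keeps the huge value, matching the Hive behaviour the reporter expects.

```python
# Illustrative sketch (not Spark code) of the reported bug mechanism.

INT_MAX = 2**31 - 1

def compare_as_int(s, threshold):
    """Mimic a cast-to-int comparison: unparseable or overflowing values
    return None, i.e. the predicate is null and the row is filtered out."""
    try:
        v = int(s)
    except ValueError:
        return None              # 'avc', '0a', '' drop out
    if not (-INT_MAX - 1 <= v <= INT_MAX):
        return None              # the huge number overflows and drops out
    return v > threshold

def compare_as_double(s, threshold):
    """Mimic a cast-to-double comparison: huge values survive."""
    try:
        return float(s) > threshold
    except ValueError:
        return None

rows = ["avc", "2", "0a", "", "2" + "2" * 30]   # last entry: a 31-digit number
kept_int = [r for r in rows if compare_as_int(r, 0)]
kept_double = [r for r in rows if compare_as_double(r, 0)]
# kept_int == ["2"]: the 31-digit value vanishes, as reported.
# kept_double keeps both "2" and the 31-digit value, as Hive does.
```

This also explains the `id > 3D` workaround suggested in the thread: a double literal forces the double-style comparison.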
[jira] [Assigned] (SPARK-31507) Remove millennium, century, decade, millisecond, microsecond and epoch from extract function
[ https://issues.apache.org/jira/browse/SPARK-31507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31507: --- Assignee: Kent Yao
> Remove millennium, century, decade, millisecond, microsecond and epoch from extract function
> ---
>
> Key: SPARK-31507
> URL: https://issues.apache.org/jira/browse/SPARK-31507
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Kent Yao
> Assignee: Kent Yao
> Priority: Major
>
> Extracting millennium, century, decade, millisecond, microsecond and epoch from datetime is neither ANSI standard nor quite common in modern SQL platforms. None of the systems listed below support these, except PostgreSQL.
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
> https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions050.htm
> https://prestodb.io/docs/current/functions/datetime.html
> https://docs.cloudera.com/documentation/enterprise/5-8-x/topics/impala_datetime_functions.html
> https://docs.snowflake.com/en/sql-reference/functions-date-time.html#label-supported-date-time-parts
> https://www.postgresql.org/docs/9.1/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT
> https://docs.teradata.com/reader/kmuOwjp1zEYg98JsB8fu_A/SIkE2wnHyQBnU4AGWRZSRw
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31507) Remove millennium, century, decade, millisecond, microsecond and epoch from extract function
[ https://issues.apache.org/jira/browse/SPARK-31507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31507. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28284 [https://github.com/apache/spark/pull/28284]
> Remove millennium, century, decade, millisecond, microsecond and epoch from extract function
> ---
>
> Key: SPARK-31507
> URL: https://issues.apache.org/jira/browse/SPARK-31507
[jira] [Resolved] (SPARK-31498) Dump public static sql configurations through doc generation
[ https://issues.apache.org/jira/browse/SPARK-31498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31498. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28274 [https://github.com/apache/spark/pull/28274]
> Dump public static sql configurations through doc generation
> ---
>
> Key: SPARK-31498
> URL: https://issues.apache.org/jira/browse/SPARK-31498
> Project: Spark
> Issue Type: Improvement
> Components: Documentation, SQL
> Affects Versions: 3.0.0, 3.1.0
> Reporter: Kent Yao
> Assignee: Kent Yao
> Priority: Major
> Fix For: 3.0.0
>
> Currently, only the non-static public SQL configurations are dumped to the public doc; we'd better also add the static public ones, as the command {{set -v}} does.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31498) Dump public static sql configurations through doc generation
[ https://issues.apache.org/jira/browse/SPARK-31498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31498: --- Assignee: Kent Yao
> Dump public static sql configurations through doc generation
> ---
>
> Key: SPARK-31498
> URL: https://issues.apache.org/jira/browse/SPARK-31498
[jira] [Resolved] (SPARK-30467) On Federal Information Processing Standard (FIPS) enabled cluster, Spark Workers are not able to connect to Remote Master.
[ https://issues.apache.org/jira/browse/SPARK-30467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma resolved SPARK-30467. - Resolution: Cannot Reproduce
> On Federal Information Processing Standard (FIPS) enabled cluster, Spark Workers are not able to connect to Remote Master.
> ---
>
> Key: SPARK-30467
> URL: https://issues.apache.org/jira/browse/SPARK-30467
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Affects Versions: 2.3.3, 2.3.4, 2.4.4
> Reporter: SHOBHIT SHUKLA
> Priority: Major
> Labels: security
>
> On _*Federal Information Processing Standard*_ (FIPS) enabled clusters, if we configure *spark.network.crypto.enabled true*, Spark Workers are not able to create a SparkContext because communication between the Spark Worker and the Spark Master fails.
> The default algorithm ( *_spark.network.crypto.keyFactoryAlgorithm_* ) is set to *_PBKDF2WithHmacSHA1_*, which is a non-approved cryptographic algorithm. We tried many values from the FIPS-approved cryptographic algorithms, but those values do not work either.
> *Error logs:*
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
> *fips.c(145): OpenSSL internal error, assertion failed: FATAL FIPS SELFTEST FAILURE*
> JVMDUMP039I Processing dump event "abort", detail "" at 2020/01/09 06:41:50 - please wait.
> JVMDUMP032I JVM requested System dump using '/bin/core.20200109.064150.283.0001.dmp' in response to an event
> JVMDUMP030W Cannot write dump to file /bin/core.20200109.064150.283.0001.dmp: Permission denied
> JVMDUMP012E Error in System dump: The core file created by child process with pid = 375 was not found.
Expected to find core file with name > "/var/cores/core-netty-rpc-conne-sig11-user1000320999-group0-pid375-time*" > JVMDUMP030W Cannot write dump to file > /bin/javacore.20200109.064150.283.0002.txt: Permission denied > JVMDUMP032I JVM requested Java dump using > '/tmp/javacore.20200109.064150.283.0002.txt' in response to an event > JVMDUMP010I Java dump written to /tmp/javacore.20200109.064150.283.0002.txt > JVMDUMP032I JVM requested Snap dump using > '/bin/Snap.20200109.064150.283.0003.trc' in response to an event > JVMDUMP030W Cannot write dump to file > /bin/Snap.20200109.064150.283.0003.trc: Permission denied > JVMDUMP010I Snap dump written to /tmp/Snap.20200109.064150.283.0003.trc > JVMDUMP030W Cannot write dump to file > /bin/jitdump.20200109.064150.283.0004.dmp: Permission denied > JVMDUMP007I JVM Requesting JIT dump using > '/tmp/jitdump.20200109.064150.283.0004.dmp' > JVMDUMP010I JIT dump written to /tmp/jitdump.20200109.064150.283.0004.dmp > JVMDUMP013I Processed dump event "abort", detail "". -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30467) On Federal Information Processing Standard (FIPS) enabled cluster, Spark Workers are not able to connect to Remote Master.
[ https://issues.apache.org/jira/browse/SPARK-30467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089517#comment-17089517 ] Prashant Sharma commented on SPARK-30467: - As already mentioned here, this is not a Spark issue; it is related to the JVM in use, and I was able to run various FIPS-compliant configurations [Blog link|https://github.com/ScrapCodes/FIPS-compliance/blob/master/blogs/spark-meets-fips.md]. Based on this, I am closing this issue as Cannot Reproduce. Feel free to reopen if you can provide more detail and a way to reproduce.
> On Federal Information Processing Standard (FIPS) enabled cluster, Spark Workers are not able to connect to Remote Master.
> ---
>
> Key: SPARK-30467
> URL: https://issues.apache.org/jira/browse/SPARK-30467
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31519) Cast in having aggregate expressions returns the wrong result
Yuanjian Li created SPARK-31519: --- Summary: Cast in having aggregate expressions returns the wrong result Key: SPARK-31519 URL: https://issues.apache.org/jira/browse/SPARK-31519 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Yuanjian Li
Cast in having aggregate expressions returns the wrong result. See the tests below:
{code:java}
scala> spark.sql("create temp view t(a, b) as values (1,10), (2, 20)")
res0: org.apache.spark.sql.DataFrame = []

scala> val query = """
 | select sum(a) as b, '2020-01-01' as fake
 | from t
 | group by b
 | having b > 10;"""

scala> spark.sql(query).show()
+---+----------+
|  b|      fake|
+---+----------+
|  2|2020-01-01|
+---+----------+

scala> val query = """
 | select sum(a) as b, cast('2020-01-01' as date) as fake
 | from t
 | group by b
 | having b > 10;"""

scala> spark.sql(query).show()
+---+----+
|  b|fake|
+---+----+
+---+----+
{code}
The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING query, and Spark has a special analyzer rule ResolveAggregateFunctions to resolve the aggregate functions and grouping columns in the Filter operator. It works for simple cases in a very tricky way, as it relies on rule execution order:
1. Rule ResolveReferences hits the Aggregate operator and resolves attributes inside aggregate functions, but the function itself is still unresolved as it's an UnresolvedFunction. This stops resolving the Filter operator as the child Aggregate operator is still unresolved.
2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the Aggregate operator resolved.
3. Rule ResolveAggregateFunctions resolves the Filter operator if its child is a resolved Aggregate. This rule can correctly resolve the grouping columns.
In the example query, I put a CAST, which needs to be resolved by rule ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step 3, as the Aggregate operator is unresolved at that time.
Then the analyzer starts the next round, and the Filter operator is resolved by ResolveReferences, which wrongly resolves the grouping columns. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
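The rule-ordering hazard SPARK-31519 describes can be sketched in miniature. The following Python toy is a hypothetical rule engine, not Spark's actual analyzer: `resolve_having`, the rule names used as plain strings, and the two lookup tables are all illustrative assumptions. It shows how the same name `b` in the HAVING filter binds differently depending on which rule reaches it first.

```python
# Hypothetical miniature of the analyzer-ordering problem (NOT Spark code).

def resolve_having(filter_name, rules):
    """Resolve a name appearing in HAVING against an Aggregate that has a
    grouping column `b` (t.b) and an output alias `b` for sum(a)."""
    grouping_columns = {"b": "t.b (grouping column)"}         # what HAVING should see
    output_aliases = {"b": "sum(a) AS b (aggregate output)"}  # what it must not see
    for rule in rules:
        # The special rule knows a HAVING filter may reference grouping columns.
        if rule == "ResolveAggregateFunctions" and filter_name in grouping_columns:
            return grouping_columns[filter_name]
        # The generic rule only sees the child's output attributes.
        if rule == "ResolveReferences" and filter_name in output_aliases:
            return output_aliases[filter_name]
    return None

# Simple query: the Aggregate is resolved in time, so the special rule runs
# first and binds `b` correctly.
good = resolve_having("b", ["ResolveAggregateFunctions", "ResolveReferences"])

# With the CAST, the Aggregate is still unresolved when the special rule
# runs; in the next analyzer round effectively only the generic rule fires,
# so `b` wrongly binds to the aggregate output and `having b > 10` filters
# out every row.
bad = resolve_having("b", ["ResolveReferences"])
```

This mirrors the observed results: the first query keeps the `b = 2` group, while the CAST variant returns an empty result.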
[jira] [Commented] (SPARK-31518) Expose filterByRange in JavaPairRDD
[ https://issues.apache.org/jira/browse/SPARK-31518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089514#comment-17089514 ] Antonin Delpeuch commented on SPARK-31518: -- Corresponding pull request: https://github.com/apache/spark/pull/28293 > Expose filterByRange in JavaPairRDD > --- > > Key: SPARK-31518 > URL: https://issues.apache.org/jira/browse/SPARK-31518 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.5 >Reporter: Antonin Delpeuch >Priority: Major > Labels: patch > > The `filterByRange` method makes it possible to efficiently filter a sorted > RDD using bounds on its keys. It prunes out partitions instead of scanning > them if a RangePartitioner is available in the RDD. > This method is part of the Scala API, defined in OrderedRDDFunctions, but is > not exposed in the Java API as far as I can tell. All other methods defined > in OrderedRDDFunctions are exposed in JavaPairRDD, therefore it seems natural > to expose `filterByRange` there too. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31516) Non-existent metric hiveClientCalls.count of CodeGenerator in Monitoring Doc
[ https://issues.apache.org/jira/browse/SPARK-31516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ZHANG Wei updated SPARK-31516: -- External issue URL: (was: https://github.com/apache/spark/pull/28292) > Non-existed metric hiveClientCalls.count of CodeGenerator in Monitoring Doc > --- > > Key: SPARK-31516 > URL: https://issues.apache.org/jira/browse/SPARK-31516 > Project: Spark > Issue Type: Improvement > Components: Documentation, Spark Core >Affects Versions: 3.0.0 >Reporter: ZHANG Wei >Priority: Minor > > There is a duplicated `hiveClientCalls.count` metric in both > `namespace=HiveExternalCatalog` and `namespace=CodeGenerator` bullet lists > of [Spark Monitoring > doc|https://spark.apache.org/docs/3.0.0-preview2/monitoring.html#component-instance--executor], > but there is only one inside object HiveCatalogMetrics in [source > code|https://github.com/apache/spark/blob/6550d0d5283efdbbd838f3aeaf0476c7f52a0fb1/core/src/main/scala/org/apache/spark/metrics/source/StaticSources.scala#L85]. > {quote} * namespace=HiveExternalCatalog > ** *note:*: these metrics are conditional to a configuration parameter: > {{spark.metrics.staticSources.enabled}} (default is true) > ** fileCacheHits.count > ** filesDiscovered.count > ** +{color:#ff}*hiveClientCalls.count*{color}+ > ** parallelListingJobCount.count > ** partitionsFetched.count > * namespace=CodeGenerator > ** *note:*: these metrics are conditional to a configuration parameter: > {{spark.metrics.staticSources.enabled}} (default is true) > ** compilationTime (histogram) > ** generatedClassSize (histogram) > ** generatedMethodSize (histogram) > ** *{color:#ff}+hiveClientCalls.count+{color}* > ** sourceCodeSize (histogram){quote} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31516) Non-existent metric hiveClientCalls.count of CodeGenerator in Monitoring Doc
[ https://issues.apache.org/jira/browse/SPARK-31516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ZHANG Wei updated SPARK-31516: -- External issue URL: https://github.com/apache/spark/pull/28292
> Non-existent metric hiveClientCalls.count of CodeGenerator in Monitoring Doc
> ---
>
> Key: SPARK-31516
> URL: https://issues.apache.org/jira/browse/SPARK-31516
[jira] [Updated] (SPARK-31516) Non-existent metric hiveClientCalls.count of CodeGenerator in Monitoring Doc
[ https://issues.apache.org/jira/browse/SPARK-31516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ZHANG Wei updated SPARK-31516: -- Description: There is a duplicated `hiveClientCalls.count` metric in both `namespace=HiveExternalCatalog` and `namespace=CodeGenerator` bullet lists of [Spark Monitoring doc|https://spark.apache.org/docs/3.0.0-preview2/monitoring.html#component-instance--executor], but there is only one inside object HiveCatalogMetrics in [source code|https://github.com/apache/spark/blob/6550d0d5283efdbbd838f3aeaf0476c7f52a0fb1/core/src/main/scala/org/apache/spark/metrics/source/StaticSources.scala#L85].
{quote} * namespace=HiveExternalCatalog
 ** *note:* these metrics are conditional on a configuration parameter: {{spark.metrics.staticSources.enabled}} (default is true)
 ** fileCacheHits.count
 ** filesDiscovered.count
 ** +{color:#ff}*hiveClientCalls.count*{color}+
 ** parallelListingJobCount.count
 ** partitionsFetched.count
 * namespace=CodeGenerator
 ** *note:* these metrics are conditional on a configuration parameter: {{spark.metrics.staticSources.enabled}} (default is true)
 ** compilationTime (histogram)
 ** generatedClassSize (histogram)
 ** generatedMethodSize (histogram)
 ** *{color:#ff}+hiveClientCalls.count+{color}*
 ** sourceCodeSize (histogram){quote}
was: There is a duplicated `hiveClientCalls.count` metric in both `namespace=HiveExternalCatalog` and `namespace=CodeGenerator` bullet lists of [Spark Monitoring doc|https://spark.apache.org/docs/3.0.0-preview2/monitoring.html#component-instance--executor], but there is only one of object HiveCatalogMetrics in [source code|https://github.com/apache/spark/blob/6550d0d5283efdbbd838f3aeaf0476c7f52a0fb1/core/src/main/scala/org/apache/spark/metrics/source/StaticSources.scala#L85].
> Non-existent metric hiveClientCalls.count of CodeGenerator in Monitoring Doc
> ---
>
> Key: SPARK-31516
> URL: https://issues.apache.org/jira/browse/SPARK-31516
[jira] [Created] (SPARK-31518) Expose filterByRange in JavaPairRDD
Antonin Delpeuch created SPARK-31518: Summary: Expose filterByRange in JavaPairRDD Key: SPARK-31518 URL: https://issues.apache.org/jira/browse/SPARK-31518 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.5 Reporter: Antonin Delpeuch The `filterByRange` method makes it possible to efficiently filter a sorted RDD using bounds on its keys. It prunes out partitions instead of scanning them if a RangePartitioner is available in the RDD. This method is part of the Scala API, defined in OrderedRDDFunctions, but is not exposed in the Java API as far as I can tell. All other methods defined in OrderedRDDFunctions are exposed in JavaPairRDD, therefore it seems natural to expose `filterByRange` there too. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
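The partition pruning that `filterByRange` enables can be sketched outside Spark. This Python toy is an assumed model, not Spark's implementation or API: `filter_by_range` and the list-of-lists partition layout are illustrative. Each partition is a key-sorted list of (key, value) pairs, and partitions are ordered by key range, as a RangePartitioner guarantees; partitions whose range cannot intersect the requested bounds are skipped instead of scanned.

```python
# Illustrative sketch (not Spark code) of range-based partition pruning.

def filter_by_range(partitions, lower, upper):
    """Return pairs with lower <= key <= upper, plus how many partitions
    were actually scanned (the rest are pruned by their min/max keys)."""
    kept, scanned = [], 0
    for part in partitions:
        if not part:
            continue
        lo, hi = part[0][0], part[-1][0]
        if hi < lower or lo > upper:
            continue  # whole partition outside the bounds: no scan needed
        scanned += 1
        kept.extend((k, v) for k, v in part if lower <= k <= upper)
    return kept, scanned

parts = [
    [(1, "a"), (3, "b")],    # key range [1, 3]  -> pruned
    [(5, "c"), (7, "d")],    # key range [5, 7]  -> scanned
    [(9, "e"), (11, "f")],   # key range [9, 11] -> pruned
]
result, scanned = filter_by_range(parts, 5, 8)
# Only the middle partition is scanned.
```

The same idea is why exposing the Scala method in JavaPairRDD matters: without it, a Java caller must filter every partition.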
[jira] [Updated] (SPARK-31517) SparkR::orderBy with multiple columns descending produces error
[ https://issues.apache.org/jira/browse/SPARK-31517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ross Bowen updated SPARK-31517: --- Description: When specifying two columns within an `orderBy()` function, to attempt to get an ordering by two columns in descending order, an error is returned. {code:java} library(magrittr) library(SparkR) cars <- cbind(model = rownames(mtcars), mtcars) carsDF <- createDataFrame(cars) carsDF %>% mutate(rank = over(rank(), orderBy(windowPartitionBy(column("cyl")), desc(column("mpg")), desc(column("disp"))))) %>% head() {code} This returns an error: {code:java} Error in ns[[i]] : subscript out of bounds{code} This seems to be related to the more general issue that the following code, excluding the use of the `desc()` function, also fails: {code:java} carsDF %>% mutate(rank = over(rank(), orderBy(windowPartitionBy(column("cyl")), column("mpg"), column("disp")))) %>% head(){code} was: When specifying two columns within an `orderBy()` function, to attempt to get an ordering by two columns in descending order, an error is returned. 
{code:java} library(magrittr) library(SparkR) cars <- cbind(model = rownames(mtcars), mtcars) carsDF <- createDataFrame(cars) carsDF %>% mutate(rank = over(rank(), orderBy(windowPartitionBy(column("cyl")), desc(column("mpg")), desc(column("disp"))))) %>% head() {code} This returns an error: {code:java} Error in ns[[i]] : subscript out of bounds{code} This seems to be related to the more general issue that the following code, excluding the use of the `desc()` function, also fails: {code:java} carsDF %>% mutate(rank = over(rank(), orderBy(windowPartitionBy(column("cyl")), column("mpg"), column("disp")))) %>% head(){code} > SparkR::orderBy with multiple columns descending produces error > --- > > Key: SPARK-31517 > URL: https://issues.apache.org/jira/browse/SPARK-31517 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.5 > Environment: Databricks Runtime 6.5 >Reporter: Ross Bowen >Priority: Major > > When specifying two columns within an `orderBy()` function, to attempt to get > an ordering by two columns in descending order, an error is returned. > {code:java} > library(magrittr) > library(SparkR) > cars <- cbind(model = rownames(mtcars), mtcars) > carsDF <- createDataFrame(cars) > carsDF %>% > mutate(rank = over(rank(), orderBy(windowPartitionBy(column("cyl")), > desc(column("mpg")), desc(column("disp"))))) %>% > head() {code} > This returns an error: > {code:java} > Error in ns[[i]] : subscript out of bounds{code} > This seems to be related to the more general issue that the following code, > excluding the use of the `desc()` function, also fails: > {code:java} > carsDF %>% > mutate(rank = over(rank(), orderBy(windowPartitionBy(column("cyl")), > column("mpg"), column("disp")))) %>% > head(){code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31517) SparkR::orderBy with multiple columns descending produces error
[ https://issues.apache.org/jira/browse/SPARK-31517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ross Bowen updated SPARK-31517: --- Summary: SparkR::orderBy with multiple columns descending produces error (was: orderBy with multiple columns descending does not work properly) > SparkR::orderBy with multiple columns descending produces error > --- > > Key: SPARK-31517 > URL: https://issues.apache.org/jira/browse/SPARK-31517 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.5 > Environment: Databricks Runtime 6.5 >Reporter: Ross Bowen >Priority: Major > > When specifying two columns within an `orderBy()` function, to attempt to get > an ordering by two columns in descending order, an error is returned. > {code:java} > library(magrittr) library(SparkR) cars <- cbind(model = rownames(mtcars), > mtcars) carsDF <- createDataFrame(cars) carsDF %>% mutate(rank = over(rank(), > orderBy(windowPartitionBy(column("cyl")), desc(column("mpg")), > desc(column("disp"))))) %>% head() {code} > This returns an error: > {code:java} > Error in ns[[i]] : subscript out of bounds{code} > This seems to be related to the more general issue that the following code, > excluding the use of the `desc()` function, also fails: > {code:java} > carsDF %>% mutate(rank = over(rank(), > orderBy(windowPartitionBy(column("cyl")), column("mpg"), column("disp")))) > %>% head(){code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31508) string type compare with numberic cause data inaccurate
[ https://issues.apache.org/jira/browse/SPARK-31508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090252#comment-17089449 ] JinxinTang edited comment on SPARK-31508 at 4/22/20, 8:33 AM: -- select * from test1 where id > '0'; may seem more normal. Thanks for your issue, I will inspect the code too. was (Author: jinxintang): select id from test1 where id > '0'; may seem more normal. Thanks for your issue, I will inspect the code too. > string type compare with numberic cause data inaccurate > --- > > Key: SPARK-31508 > URL: https://issues.apache.org/jira/browse/SPARK-31508 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 > Environment: hadoop2.7 > spark2.4.5 >Reporter: philipse >Priority: Major > > Hi all > > Spark SQL should probably convert values to double when a string type is > compared with a number type. The cases are shown below. > 1. Create the table: > create table test1(id string); > > 2. Insert data into the table: > insert into test1 select 'avc'; > insert into test1 select '2'; > insert into test1 select '0a'; > insert into test1 select ''; > insert into test1 select > '22'; > 3. Let's check what's happening: > select * from test_gf13871.test1 where id > 0 > The results are shown below: > *2* > ** > Really amazing: the big number 222... cannot be selected, > while when I check in Hive, the 222... shows up normally. > 4. Trying to explain the command, we may see what happened: if the data is > bigger than max_int_value, it will not be selected; we may need to convert to > double instead. > !image-2020-04-21-18-49-58-850.png! > I want to know whether this has been fixed or is planned for 3.0 or a later > version. Please feel free to give any advice. > > Many Thanks -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31517) orderBy with multiple columns descending does not work properly
Ross Bowen created SPARK-31517: -- Summary: orderBy with multiple columns descending does not work properly Key: SPARK-31517 URL: https://issues.apache.org/jira/browse/SPARK-31517 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.4.5 Environment: Databricks Runtime 6.5 Reporter: Ross Bowen When specifying two columns within an `orderBy()` function, to attempt to get an ordering by two columns in descending order, an error is returned. {code:java} library(magrittr) library(SparkR) cars <- cbind(model = rownames(mtcars), mtcars) carsDF <- createDataFrame(cars) carsDF %>% mutate(rank = over(rank(), orderBy(windowPartitionBy(column("cyl")), desc(column("mpg")), desc(column("disp"))))) %>% head() {code} This returns an error: {code:java} Error in ns[[i]] : subscript out of bounds{code} This seems to be related to the more general issue that the following code (excluding the use of the `desc()` function) also fails: {code:java} carsDF %>% mutate(rank = over(rank(), orderBy(windowPartitionBy(column("cyl")), column("mpg"), column("disp")))) %>% head(){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31508) string type compare with numberic cause data inaccurate
[ https://issues.apache.org/jira/browse/SPARK-31508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089449#comment-17089449 ] JinxinTang commented on SPARK-31508: select id from test1 where id > '0'; may seem more normal. Thanks for your issue, I will inspect the code too. > string type compare with numberic cause data inaccurate > --- > > Key: SPARK-31508 > URL: https://issues.apache.org/jira/browse/SPARK-31508 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 > Environment: hadoop2.7 > spark2.4.5 >Reporter: philipse >Priority: Major > > Hi all > > Spark SQL should probably convert values to double when a string type is > compared with a number type. The cases are shown below. > 1. Create the table: > create table test1(id string); > > 2. Insert data into the table: > insert into test1 select 'avc'; > insert into test1 select '2'; > insert into test1 select '0a'; > insert into test1 select ''; > insert into test1 select > '22'; > 3. Let's check what's happening: > select * from test_gf13871.test1 where id > 0 > The results are shown below: > *2* > ** > Really amazing: the big number 222... cannot be selected, > while when I check in Hive, the 222... shows up normally. > 4. Trying to explain the command, we may see what happened: if the data is > bigger than max_int_value, it will not be selected; we may need to convert to > double instead. > !image-2020-04-21-18-49-58-850.png! > I want to know whether this has been fixed or is planned for 3.0 or a later > version. Please feel free to give any advice. > > Many Thanks -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
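To make the reported behavior concrete, here is a toy model (plain Python; the cast helpers and `where_gt_zero` are hypothetical stand-ins, not Spark's implementation) of the reporter's hypothesis: when a string column is compared with an integer literal, casting the string side to int turns out-of-range values into null, and the null predicate silently drops those rows, whereas a double cast would keep them.

```python
# Toy model of `WHERE id > 0` on a string column under two cast strategies.
INT_MIN, INT_MAX = -2**31, 2**31 - 1

def cast_string_to_int(s):
    """Non-ANSI cast semantics: unparseable or out-of-range -> None (null)."""
    try:
        v = int(s)
    except ValueError:
        return None
    return v if INT_MIN <= v <= INT_MAX else None

def cast_string_to_double(s):
    try:
        return float(s)
    except ValueError:
        return None

def where_gt_zero(ids, cast):
    out = []
    for s in ids:
        v = cast(s)
        if v is not None and v > 0:  # a null comparison acts like false
            out.append(s)
    return out

big = "2" * 21  # far beyond the 32-bit int range
ids = ["avc", "2", "0a", "", big]
assert where_gt_zero(ids, cast_string_to_int) == ["2"]          # big value dropped
assert where_gt_zero(ids, cast_string_to_double) == ["2", big]  # big value kept
```

This matches the issue's observation: Hive promotes the string side to double for such comparisons, so the 21-digit value survives there but not in Spark 2.4.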
[jira] [Created] (SPARK-31516) Non-existed metric hiveClientCalls.count of CodeGenerator in Monitoring Doc
ZHANG Wei created SPARK-31516: - Summary: Non-existed metric hiveClientCalls.count of CodeGenerator in Monitoring Doc Key: SPARK-31516 URL: https://issues.apache.org/jira/browse/SPARK-31516 Project: Spark Issue Type: Improvement Components: Documentation, Spark Core Affects Versions: 3.0.0 Reporter: ZHANG Wei There is a duplicated `hiveClientCalls.count` metric in both `namespace=HiveExternalCatalog` and `namespace=CodeGenerator` bullet lists of [Spark Monitoring doc|https://spark.apache.org/docs/3.0.0-preview2/monitoring.html#component-instance--executor], but there is only one inside object HiveCatalogMetrics in [source code|https://github.com/apache/spark/blob/6550d0d5283efdbbd838f3aeaf0476c7f52a0fb1/core/src/main/scala/org/apache/spark/metrics/source/StaticSources.scala#L85]. {quote} * namespace=HiveExternalCatalog ** *note:*: these metrics are conditional to a configuration parameter: {{spark.metrics.staticSources.enabled}} (default is true) ** fileCacheHits.count ** filesDiscovered.count ** +{color:#ff}*hiveClientCalls.count*{color}+ ** parallelListingJobCount.count ** partitionsFetched.count * namespace=CodeGenerator ** *note:*: these metrics are conditional to a configuration parameter: {{spark.metrics.staticSources.enabled}} (default is true) ** compilationTime (histogram) ** generatedClassSize (histogram) ** generatedMethodSize (histogram) ** *{color:#ff}+hiveClientCalls.count+{color}* ** sourceCodeSize (histogram){quote} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31515) Canonicalize Cast should consider the value of needTimeZone
[ https://issues.apache.org/jira/browse/SPARK-31515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089372#comment-17089372 ] JinxinTang commented on SPARK-31515: Could you please provide a code snippet or a picture to illustrate this? > Canonicalize Cast should consider the value of needTimeZone > --- > > Key: SPARK-31515 > URL: https://issues.apache.org/jira/browse/SPARK-31515 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: If we don't need to use `timeZone` information casting > `from` type to `to` type, then the timeZoneId should not influence the > canonicalize result. >Reporter: Yuanjian Li >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
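Pending the requested code piece, here is a minimal sketch of the idea in the issue title (plain Python with invented names — `Cast`, `TIMEZONE_AWARE_CASTS` — not Spark's Catalyst code): canonicalization should drop `timeZoneId` whenever the cast does not actually need timezone information, so that two otherwise-identical casts canonicalize to the same expression.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical subset of (from, to) type pairs whose cast result depends on
# the session timezone; the real condition is Cast.needsTimeZone in Catalyst.
TIMEZONE_AWARE_CASTS = {("string", "timestamp"), ("timestamp", "string"), ("date", "timestamp")}

@dataclass(frozen=True)
class Cast:
    child: str
    from_type: str
    to_type: str
    timezone_id: Optional[str] = None

    def needs_timezone(self) -> bool:
        return (self.from_type, self.to_type) in TIMEZONE_AWARE_CASTS

    def canonicalized(self) -> "Cast":
        # Drop the timezone when the result cannot depend on it, so two
        # otherwise-identical casts compare equal after canonicalization.
        tz = self.timezone_id if self.needs_timezone() else None
        return Cast(self.child, self.from_type, self.to_type, tz)

# int -> long never consults the timezone, so these should canonicalize equal:
a = Cast("col", "int", "long", timezone_id="UTC")
b = Cast("col", "int", "long", timezone_id="Asia/Shanghai")
assert a.canonicalized() == b.canonicalized()
```

Without this rule, the stray `timeZoneId` makes semantically identical casts look different, defeating downstream comparisons such as plan caching or exchange reuse.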
[jira] [Commented] (SPARK-31400) The catalogString doesn't distinguish Vectors in ml and mllib
[ https://issues.apache.org/jira/browse/SPARK-31400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089317#comment-17089317 ] JinxinTang commented on SPARK-31400: Thanks for reporting this issue; a fix is in progress: [PR|https://github.com/apache/spark/pull/28291] > The catalogString doesn't distinguish Vectors in ml and mllib > - > > Key: SPARK-31400 > URL: https://issues.apache.org/jira/browse/SPARK-31400 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.4.5 > Environment: Ubuntu 16.04 >Reporter: Junpei Zhou >Priority: Major > > h2. Bug Description > The `catalogString` is not detailed enough to distinguish > pyspark.ml.linalg.Vectors and pyspark.mllib.linalg.Vectors. > h2. How to reproduce the bug > [Here|https://spark.apache.org/docs/latest/ml-features#minmaxscaler] is an > example from the official document (Python code). If I keep all other lines > untouched, and only modify the Vectors import line, which means: > {code:java} > # from pyspark.ml.linalg import Vectors > from pyspark.mllib.linalg import Vectors > {code} > Or you can directly execute the following code snippet: > {code:java} > from pyspark.ml.feature import MinMaxScaler > # from pyspark.ml.linalg import Vectors > from pyspark.mllib.linalg import Vectors > dataFrame = spark.createDataFrame([ > (0, Vectors.dense([1.0, 0.1, -1.0]),), > (1, Vectors.dense([2.0, 1.1, 1.0]),), > (2, Vectors.dense([3.0, 10.1, 3.0]),) > ], ["id", "features"]) > scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures") > scalerModel = scaler.fit(dataFrame) > {code} > It will raise an error: > {code:java} > IllegalArgumentException: 'requirement failed: Column features must be of > type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> > but was actually > struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.' > {code} > However, the actual struct and the desired struct are exactly the same > string, which cannot provide useful information to the programmer. 
I would > suggest making the catalogString distinguish pyspark.ml.linalg.Vectors and > pyspark.mllib.linalg.Vectors. > Thanks! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
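The core of the complaint above is that two distinct types can render the same catalog string, making the error message look self-contradictory. Here is a minimal illustration (plain Python; the class names `MLVectorUDT`/`MLlibVectorUDT` and the `require_type` helper are hypothetical stand-ins, not Spark's API) of the suggested fix: name the classes in the error, not just the identical string renderings.

```python
class MLVectorUDT:
    """Hypothetical stand-in for pyspark.ml.linalg.VectorUDT."""
    def catalog_string(self):
        return "struct<type:tinyint,size:int,indices:array<int>,values:array<double>>"

class MLlibVectorUDT:
    """Hypothetical stand-in for pyspark.mllib.linalg.VectorUDT."""
    def catalog_string(self):
        return "struct<type:tinyint,size:int,indices:array<int>,values:array<double>>"

def require_type(actual, expected):
    """Raise an error that names the classes, since the catalog strings collide."""
    if type(actual) is not type(expected):
        raise TypeError(
            f"expected {expected.catalog_string()} ({type(expected).__qualname__}) "
            f"but got {actual.catalog_string()} ({type(actual).__qualname__})"
        )

# The two renderings are identical, which is exactly the reported confusion:
assert MLVectorUDT().catalog_string() == MLlibVectorUDT().catalog_string()
```

With the class names included, the message from the reproduction above would read "expected ... (MLVectorUDT) but got ... (MLlibVectorUDT)" instead of comparing two identical strings.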