[jira] [Resolved] (SPARK-31515) Canonicalize Cast should consider the value of needTimeZone
[ https://issues.apache.org/jira/browse/SPARK-31515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-31515. -- Fix Version/s: 3.0.0 Assignee: Yuanjian Li Resolution: Fixed Resolved by https://github.com/apache/spark/pull/28288 > Canonicalize Cast should consider the value of needTimeZone > --- > > Key: SPARK-31515 > URL: https://issues.apache.org/jira/browse/SPARK-31515 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.0.0 > Environment: If the `timeZone` information is not needed when casting the `from` type to the `to` type, then the timeZoneId should not influence the canonicalized result. > Reporter: Yuanjian Li > Assignee: Yuanjian Li > Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
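The fix can be illustrated with a small, Spark-free sketch. The class and field names below are hypothetical stand-ins, not Spark's actual internals: the point is that canonicalization should only fold the timezone ID into the comparison when the cast actually needs timezone information.

```python
# Hypothetical sketch of the canonicalization rule; not Spark's real Cast class.
class Cast:
    def __init__(self, child, to_type, needs_time_zone, time_zone_id=None):
        self.child = child
        self.to_type = to_type
        self.needs_time_zone = needs_time_zone
        self.time_zone_id = time_zone_id

    def canonicalized(self):
        # Drop the timezone ID when the cast does not use it, so two casts that
        # differ only in an irrelevant timeZoneId canonicalize to the same value.
        tz = self.time_zone_id if self.needs_time_zone else None
        return (self.child, self.to_type, self.needs_time_zone, tz)

# A timezone-independent cast: the session timezone must not matter.
a = Cast("col", "int", needs_time_zone=False, time_zone_id="UTC")
b = Cast("col", "int", needs_time_zone=False, time_zone_id="PST")
```

Under this rule, `a` and `b` canonicalize identically, while two timezone-dependent casts with different timezone IDs still canonicalize differently.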
[jira] [Resolved] (SPARK-31465) Document Literal in SQL Reference
[ https://issues.apache.org/jira/browse/SPARK-31465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-31465. -- Fix Version/s: 3.0.0 Assignee: Huaxin Gao Resolution: Fixed Resolved by https://github.com/apache/spark/pull/28237 > Document Literal in SQL Reference > - > > Key: SPARK-31465 > URL: https://issues.apache.org/jira/browse/SPARK-31465 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL > Affects Versions: 3.0.0 > Reporter: Huaxin Gao > Assignee: Huaxin Gao > Priority: Major > Fix For: 3.0.0 > > > Document Literal in SQL Reference
[jira] [Commented] (SPARK-29185) Add new SaveMode types for Spark SQL jdbc datasource
[ https://issues.apache.org/jira/browse/SPARK-29185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090252#comment-17090252 ] Shashanka Balakuntala Srinivasa commented on SPARK-29185: - Hi, is there anyone working on this issue? If not, I will take this up. > Add new SaveMode types for Spark SQL jdbc datasource > > > Key: SPARK-29185 > URL: https://issues.apache.org/jira/browse/SPARK-29185 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 3.1.0 > Reporter: Timothy Zhang > Priority: Major > > It is necessary to add new SaveMode types for Delete, Update, and Upsert, such as: > * SaveMode.Delete > * SaveMode.Update > * SaveMode.Upsert > This way Spark SQL could support legacy RDBMSs such as Oracle, DB2, and MySQL much better. The code implementation of the current SaveMode.Append type is very flexible: all types could share the same savePartition function, adding only new getStatement functions for Delete, Update, and Upsert with the SQL statements DELETE FROM, UPDATE, and MERGE INTO respectively.
> We have an initial implementation for them:
> {code:java}
def getDeleteStatement(table: String, rddSchema: StructType, dialect: JdbcDialect): String = {
  val columns = rddSchema.fields.map(x => dialect.quoteIdentifier(x.name) + "=?").mkString(" AND ")
  s"DELETE FROM ${table.toUpperCase} WHERE $columns"
}

def getUpdateStatement(table: String, rddSchema: StructType, priKeys: Seq[String], dialect: JdbcDialect): String = {
  val fullCols = rddSchema.fields.map(x => dialect.quoteIdentifier(x.name))
  val priCols = priKeys.map(dialect.quoteIdentifier(_))
  val columns = (fullCols diff priCols).map(_ + "=?").mkString(",")
  val cnditns = priCols.map(_ + "=?").mkString(" AND ")
  s"UPDATE ${table.toUpperCase} SET $columns WHERE $cnditns"
}

def getMergeStatement(table: String, rddSchema: StructType, priKeys: Seq[String], dialect: JdbcDialect): String = {
  val fullCols = rddSchema.fields.map(x => dialect.quoteIdentifier(x.name))
  val priCols = priKeys.map(dialect.quoteIdentifier(_))
  val nrmCols = fullCols diff priCols
  val fullPart = fullCols.map(c => s"${dialect.quoteIdentifier("SRC")}.$c").mkString(",")
  val priPart = priCols.map(c => s"${dialect.quoteIdentifier("TGT")}.$c=${dialect.quoteIdentifier("SRC")}.$c").mkString(" AND ")
  val nrmPart = nrmCols.map(c => s"$c=${dialect.quoteIdentifier("SRC")}.$c").mkString(",")
  val columns = fullCols.mkString(",")
  val placeholders = fullCols.map(_ => "?").mkString(",")
  s"MERGE INTO ${table.toUpperCase} AS ${dialect.quoteIdentifier("TGT")} " +
    s"USING TABLE(VALUES($placeholders)) " +
    s"AS ${dialect.quoteIdentifier("SRC")}($columns) " +
    s"ON $priPart " +
    s"WHEN NOT MATCHED THEN INSERT ($columns) VALUES ($fullPart) " +
    s"WHEN MATCHED THEN UPDATE SET $nrmPart"
}
{code}
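As a sanity check, the string-building logic of the proposed statement generators can be exercised outside Spark. The sketch below is a plain-Python translation for illustration only; `quote`, `update_statement`, and `delete_statement` are made-up stand-ins, with `quote()` playing the role of `JdbcDialect.quoteIdentifier`:

```python
# Plain-Python translation of the proposed getUpdateStatement/getDeleteStatement
# logic, for illustration only; these are not Spark APIs.
def quote(name: str) -> str:
    # Stand-in for JdbcDialect.quoteIdentifier (double-quote style).
    return f'"{name}"'

def update_statement(table, columns, pri_keys):
    full_cols = [quote(c) for c in columns]
    pri_cols = [quote(k) for k in pri_keys]
    # Non-key columns go in SET, key columns in WHERE.
    setters = ",".join(c + "=?" for c in full_cols if c not in pri_cols)
    conds = " AND ".join(k + "=?" for k in pri_cols)
    return f"UPDATE {table.upper()} SET {setters} WHERE {conds}"

def delete_statement(table, columns):
    conds = " AND ".join(quote(c) + "=?" for c in columns)
    return f"DELETE FROM {table.upper()} WHERE {conds}"
```

For a table `users(id, name, age)` with primary key `id`, `update_statement` yields `UPDATE USERS SET "name"=?,"age"=? WHERE "id"=?`, matching the shape the Scala version would produce.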
[jira] [Resolved] (SPARK-31508) string type compare with numberic cause data inaccurate
[ https://issues.apache.org/jira/browse/SPARK-31508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31508. -- Resolution: Duplicate > string type compare with numberic cause data inaccurate > --- > > Key: SPARK-31508 > URL: https://issues.apache.org/jira/browse/SPARK-31508 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.4.5 > Environment: hadoop2.7 > spark2.4.5 > Reporter: philipse > Priority: Major > Attachments: image-2020-04-22-20-00-09-821.png > > > Hi all > > Spark SQL should perhaps convert values to double when a string type is compared with a numeric type. The case is shown below. > 1. Create a table: > create table test1(id string); > 2. Insert data into the table: > insert into test1 select 'avc'; > insert into test1 select '2'; > insert into test1 select '0a'; > insert into test1 select ''; > insert into test1 select > '22'; > 3. Let's check what happens: > select * from test_gf13871.test1 where id > 0 > The result shows only: > *2* > ** > Surprisingly, the big number 222... cannot be selected, while the same query in Hive returns it normally. > 4. Explaining the command shows what happened: if the value is bigger than max_int_value, it is not selected; we may need to convert to double instead. > !image-2020-04-21-18-49-58-850.png! > I would like to know whether this is fixed or planned for 3.0 or a later version. Please feel free to give any advice. > > Many Thanks
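The reported behaviour can be mimicked without Spark: if the engine casts the string side of the comparison to a 32-bit integer, out-of-range values become null and silently drop out of the filter, whereas casting to double keeps them. This is an illustrative sketch only; `cast_to_int` and `cast_to_double` are stand-ins, not Spark functions, and `"2" * 22` stands in for the long number in the report:

```python
# Illustrative only: mimics casting a string column for the predicate `id > 0`.
INT_MIN, INT_MAX = -2**31, 2**31 - 1

def cast_to_int(s):
    # Non-numeric or out-of-range strings become None, like a failed SQL cast.
    try:
        v = int(s)
    except ValueError:
        return None
    return v if INT_MIN <= v <= INT_MAX else None

def cast_to_double(s):
    try:
        return float(s)
    except ValueError:
        return None

rows = ["avc", "2", "0a", "", "2" * 22]  # the last row is the "big number"
kept_int = [r for r in rows if (v := cast_to_int(r)) is not None and v > 0]
kept_double = [r for r in rows if (v := cast_to_double(r)) is not None and v > 0]
```

With the int cast only `"2"` survives the filter; with the double cast the big number survives too, which matches the behaviour the reporter observed in Hive.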
[jira] [Created] (SPARK-31525) Inconsistent result of df.head(1) and df.head()
Joshua Hendinata created SPARK-31525: Summary: Inconsistent result of df.head(1) and df.head() Key: SPARK-31525 URL: https://issues.apache.org/jira/browse/SPARK-31525 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 2.4.6, 3.0.0 Reporter: Joshua Hendinata In this line [https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L1339], if you call `df.head()` on an empty dataframe it returns *None*, but if you call `df.head(1)` on an empty dataframe it returns an *empty list* instead. This behaviour is inconsistent and can create confusion, especially when calling `len(df.head())`, which throws an exception for an empty dataframe
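The asymmetry is easy to reproduce with a minimal stand-in. `FakeDataFrame` below is illustrative, not PySpark; its `head` follows the semantics described in the report (`n=None` returns a single row or `None`, an explicit `n` returns a list):

```python
# Minimal stand-in (not PySpark itself) reproducing the reported asymmetry.
class FakeDataFrame:
    def __init__(self, rows):
        self._rows = rows

    def take(self, n):
        return self._rows[:n]

    def head(self, n=None):
        # Mirrors the described behaviour: no argument unwraps the first row,
        # an explicit argument always returns a (possibly empty) list.
        if n is None:
            rs = self.head(1)
            return rs[0] if rs else None
        return self.take(n)

empty = FakeDataFrame([])
one = FakeDataFrame([("a",)])
```

So `empty.head()` is `None` while `empty.head(1)` is `[]`, and `len(empty.head())` would raise a `TypeError`, which is the confusion the ticket describes.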
[jira] [Updated] (SPARK-31518) Expose filterByRange in JavaPairRDD
[ https://issues.apache.org/jira/browse/SPARK-31518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31518: -- Labels: (was: patch) > Expose filterByRange in JavaPairRDD > --- > > Key: SPARK-31518 > URL: https://issues.apache.org/jira/browse/SPARK-31518 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 3.1.0 > Reporter: Antonin Delpeuch > Assignee: Antonin Delpeuch > Priority: Major > Fix For: 3.1.0 > > > The `filterByRange` method makes it possible to efficiently filter a sorted > RDD using bounds on its keys. It prunes out partitions instead of scanning > them if a RangePartitioner is available in the RDD. > This method is part of the Scala API, defined in OrderedRDDFunctions, but is > not exposed in the Java API as far as I can tell. All other methods defined > in OrderedRDDFunctions are exposed in JavaPairRDD, therefore it seems natural > to expose `filterByRange` there too.
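The partition-pruning idea behind `filterByRange` can be sketched in a few lines. The layout below, with explicit per-partition key bounds, is a simplification of what a RangePartitioner provides; none of these names are Spark APIs:

```python
# Sketch of the idea behind filterByRange: with range-partitioned data, whole
# partitions outside [lower, upper] can be skipped instead of scanned.
def filter_by_range(partitions, bounds, lower, upper):
    """partitions[i] holds keys in [bounds[i], bounds[i+1]); bounds is sorted."""
    kept = []
    for i, part in enumerate(partitions):
        lo, hi = bounds[i], bounds[i + 1]
        if hi <= lower or lo > upper:
            # Partition lies entirely outside the requested range: prune it
            # without scanning a single element.
            continue
        kept.extend(k for k in part if lower <= k <= upper)
    return kept

partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
bounds = [1, 4, 7, 10]
```

Filtering for keys in `[4, 6]` scans only the middle partition; the first and last are pruned outright, which is the efficiency win the ticket wants exposed to Java users.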
[jira] [Updated] (SPARK-31518) Expose filterByRange in JavaPairRDD
[ https://issues.apache.org/jira/browse/SPARK-31518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31518: -- Affects Version/s: (was: 2.4.5) 3.1.0 > Expose filterByRange in JavaPairRDD > --- > > Key: SPARK-31518 > URL: https://issues.apache.org/jira/browse/SPARK-31518 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 3.1.0 > Reporter: Antonin Delpeuch > Assignee: Antonin Delpeuch > Priority: Major > Labels: patch > Fix For: 3.1.0 > > > The `filterByRange` method makes it possible to efficiently filter a sorted > RDD using bounds on its keys. It prunes out partitions instead of scanning > them if a RangePartitioner is available in the RDD. > This method is part of the Scala API, defined in OrderedRDDFunctions, but is > not exposed in the Java API as far as I can tell. All other methods defined > in OrderedRDDFunctions are exposed in JavaPairRDD, therefore it seems natural > to expose `filterByRange` there too.
[jira] [Resolved] (SPARK-31518) Expose filterByRange in JavaPairRDD
[ https://issues.apache.org/jira/browse/SPARK-31518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31518. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28293 [https://github.com/apache/spark/pull/28293] > Expose filterByRange in JavaPairRDD > --- > > Key: SPARK-31518 > URL: https://issues.apache.org/jira/browse/SPARK-31518 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 2.4.5 > Reporter: Antonin Delpeuch > Assignee: Antonin Delpeuch > Priority: Major > Labels: patch > Fix For: 3.1.0 > > > The `filterByRange` method makes it possible to efficiently filter a sorted > RDD using bounds on its keys. It prunes out partitions instead of scanning > them if a RangePartitioner is available in the RDD. > This method is part of the Scala API, defined in OrderedRDDFunctions, but is > not exposed in the Java API as far as I can tell. All other methods defined > in OrderedRDDFunctions are exposed in JavaPairRDD, therefore it seems natural > to expose `filterByRange` there too.
[jira] [Assigned] (SPARK-31518) Expose filterByRange in JavaPairRDD
[ https://issues.apache.org/jira/browse/SPARK-31518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-31518: - Assignee: Antonin Delpeuch > Expose filterByRange in JavaPairRDD > --- > > Key: SPARK-31518 > URL: https://issues.apache.org/jira/browse/SPARK-31518 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 2.4.5 > Reporter: Antonin Delpeuch > Assignee: Antonin Delpeuch > Priority: Major > Labels: patch > > The `filterByRange` method makes it possible to efficiently filter a sorted > RDD using bounds on its keys. It prunes out partitions instead of scanning > them if a RangePartitioner is available in the RDD. > This method is part of the Scala API, defined in OrderedRDDFunctions, but is > not exposed in the Java API as far as I can tell. All other methods defined > in OrderedRDDFunctions are exposed in JavaPairRDD, therefore it seems natural > to expose `filterByRange` there too.
[jira] [Updated] (SPARK-31523) LogicalPlan doCanonicalize should throw exception if not resolved
[ https://issues.apache.org/jira/browse/SPARK-31523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ulysses you updated SPARK-31523: Summary: LogicalPlan doCanonicalize should throw exception if not resolved (was: LogicalPlan doCanonicalize should throw exception if expression is not resolved) > LogicalPlan doCanonicalize should throw exception if not resolved > - > > Key: SPARK-31523 > URL: https://issues.apache.org/jira/browse/SPARK-31523 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.1.0 > Reporter: ulysses you > Priority: Minor >
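The ticket has no description, but the title suggests a fail-fast guard. A minimal sketch of that idea, using hypothetical names rather than Spark's actual LogicalPlan internals:

```python
# Hypothetical sketch of the guard the ticket's title proposes: canonicalization
# should fail fast on an unresolved plan instead of producing a bogus result.
class LogicalPlan:
    def __init__(self, resolved):
        self.resolved = resolved

    def do_canonicalize(self):
        if not self.resolved:
            raise RuntimeError("doCanonicalize requires a resolved plan")
        # A real implementation would normalize expression IDs, ordering, etc.
        return self
```

Canonicalizing a resolved plan succeeds; an unresolved one raises immediately rather than comparing garbage canonical forms downstream.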
[jira] [Updated] (SPARK-31523) LogicalPlan doCanonicalize should throw exception if expression is not resolved
[ https://issues.apache.org/jira/browse/SPARK-31523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ulysses you updated SPARK-31523: Summary: LogicalPlan doCanonicalize should throw exception if expression is not resolved (was: QueryPlan doCanonicalize should throw exception if expression is not resolved) > LogicalPlan doCanonicalize should throw exception if expression is not > resolved > --- > > Key: SPARK-31523 > URL: https://issues.apache.org/jira/browse/SPARK-31523 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.1.0 > Reporter: ulysses you > Priority: Minor >
[jira] [Created] (SPARK-31524) Add metric to the split number for skew partition when enable AQE
Ke Jia created SPARK-31524: -- Summary: Add metric to the split number for skew partition when enable AQE Key: SPARK-31524 URL: https://issues.apache.org/jira/browse/SPARK-31524 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Ke Jia Add detailed metrics for the number of splits in skewed partitions when AQE and skew join optimization are enabled.
[jira] [Updated] (SPARK-31523) QueryPlan doCanonicalize should throw exception if expression is not resolved
[ https://issues.apache.org/jira/browse/SPARK-31523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ulysses you updated SPARK-31523: Summary: QueryPlan doCanonicalize should throw exception if expression is not resolved (was: QueryPlan doCanonicalize should throw exception when expression is not resolved) > QueryPlan doCanonicalize should throw exception if expression is not resolved > - > > Key: SPARK-31523 > URL: https://issues.apache.org/jira/browse/SPARK-31523 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.1.0 > Reporter: ulysses you > Priority: Minor >
[jira] [Created] (SPARK-31523) QueryPlan doCanonicalize should throw exception when expression is not resolved
ulysses you created SPARK-31523: --- Summary: QueryPlan doCanonicalize should throw exception when expression is not resolved Key: SPARK-31523 URL: https://issues.apache.org/jira/browse/SPARK-31523 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: ulysses you
[jira] [Assigned] (SPARK-29641) Stage Level Sched: Add python api's and test
[ https://issues.apache.org/jira/browse/SPARK-29641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-29641: Assignee: Thomas Graves > Stage Level Sched: Add python api's and test > > > Key: SPARK-29641 > URL: https://issues.apache.org/jira/browse/SPARK-29641 > Project: Spark > Issue Type: Story > Components: PySpark > Affects Versions: 3.0.0 > Reporter: Thomas Graves > Assignee: Thomas Graves > Priority: Major > > For the Stage Level scheduling feature, add any missing python api and test > it.
[jira] [Resolved] (SPARK-29641) Stage Level Sched: Add python api's and test
[ https://issues.apache.org/jira/browse/SPARK-29641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29641. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28085 [https://github.com/apache/spark/pull/28085] > Stage Level Sched: Add python api's and test > > > Key: SPARK-29641 > URL: https://issues.apache.org/jira/browse/SPARK-29641 > Project: Spark > Issue Type: Story > Components: PySpark > Affects Versions: 3.0.0 > Reporter: Thomas Graves > Assignee: Thomas Graves > Priority: Major > Fix For: 3.1.0 > > > For the Stage Level scheduling feature, add any missing python api and test > it.
[jira] [Resolved] (SPARK-31272) Support DB2 Kerberos login in JDBC connector
[ https://issues.apache.org/jira/browse/SPARK-31272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin resolved SPARK-31272. Fix Version/s: 3.1.0 Assignee: Gabor Somogyi Resolution: Fixed > Support DB2 Kerberos login in JDBC connector > > > Key: SPARK-31272 > URL: https://issues.apache.org/jira/browse/SPARK-31272 > Project: Spark > Issue Type: Sub-task > Components: SQL > Affects Versions: 3.1.0 > Reporter: Gabor Somogyi > Assignee: Gabor Somogyi > Priority: Major > Fix For: 3.1.0 > >
[jira] [Comment Edited] (SPARK-10925) Exception when joining DataFrames
[ https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088714#comment-17088714 ] Kushal Yellam edited comment on SPARK-10925 at 4/22/20, 7:21 PM: - Hi Team I'm facing following issue with joining two dataframes. For example, in your test case, you can try this: val df = sqlContext.createDataFrame(rdd) val df1 = df; val df2 = df1; val df3 = df1.join(df2, df1("M_ID") === df2("M_ID")) val df4 = df3.join(df2, df3("M_ID") === df2("M_ID")) df4.show() in operator !Join LeftOuter, (((cast(M_ID#4443 as bigint) = M_ID#3056L) && (r_ym#1213= r_ym#4352)) && (country_code#6654 = country_code#8884)). Attribute(s) with the same name appear in the operation: r_ym. Please check if the right attribute(s) are used. was (Author: kyellam): Hi Team I'm facing following issue with joining two dataframes. For example, in your test case, you can try this: val df = sqlContext.createDataFrame(rdd) val df1 = df; val df2 = df1; val df3 = df1.join(df2, df1("name") === df2("name")) val df4 = df3.join(df2, df3("name") === df2("name")) df4.show() in operator !Join LeftOuter, (((cast(M_ID#4443 as bigint) = M_ID#3056L) && (r_ym#1213= r_ym#4352)) && (country_code#6654 = country_code#8884)). Attribute(s) with the same name appear in the operation: r_ym. Please check if the right attribute(s) are used. > Exception when joining DataFrames > - > > Key: SPARK-10925 > URL: https://issues.apache.org/jira/browse/SPARK-10925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 > Environment: Tested with Spark 1.5.0 and Spark 1.5.1 >Reporter: Alexis Seigneurin >Priority: Major > Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase.scala, > TestCase2.scala > > > I get an exception when joining a DataFrame with another DataFrame. The > second DataFrame was created by performing an aggregation on the first > DataFrame. 
> My complete workflow is: > # read the DataFrame > # apply an UDF on column "name" > # apply an UDF on column "surname" > # apply an UDF on column "birthDate" > # aggregate on "name" and re-join with the DF > # aggregate on "surname" and re-join with the DF > If I remove one step, the process completes normally. > Here is the exception: > {code} > Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved > attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in > operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS > birthDate_cleaned#8]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at 
scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at
[jira] [Comment Edited] (SPARK-10925) Exception when joining DataFrames
[ https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088714#comment-17088714 ] Kushal Yellam edited comment on SPARK-10925 at 4/22/20, 7:21 PM: - Hi Team I'm facing following issue with joining two dataframes. For example, in your test case, you can try this: val df = sqlContext.createDataFrame(rdd) val df1 = df; val df2 = df1; val df3 = df1.join(df2, df1("M_ID") === df2("M_ID")) val df4 = df3.join(df2, df3("M_ID") === df2("M_ID")) df4.show() in operator !Join LeftOuter, (((cast(M_ID#4443 as bigint) = M_ID#3056L) Attribute(s) with the same name appear in the operation: M_ID. Please check if the right attribute(s) are used. was (Author: kyellam): Hi Team I'm facing following issue with joining two dataframes. For example, in your test case, you can try this: val df = sqlContext.createDataFrame(rdd) val df1 = df; val df2 = df1; val df3 = df1.join(df2, df1("M_ID") === df2("M_ID")) val df4 = df3.join(df2, df3("M_ID") === df2("M_ID")) df4.show() in operator !Join LeftOuter, (((cast(M_ID#4443 as bigint) = M_ID#3056L) && (r_ym#1213= r_ym#4352)) && (country_code#6654 = country_code#8884)). Attribute(s) with the same name appear in the operation: r_ym. Please check if the right attribute(s) are used. > Exception when joining DataFrames > - > > Key: SPARK-10925 > URL: https://issues.apache.org/jira/browse/SPARK-10925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 > Environment: Tested with Spark 1.5.0 and Spark 1.5.1 >Reporter: Alexis Seigneurin >Priority: Major > Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase.scala, > TestCase2.scala > > > I get an exception when joining a DataFrame with another DataFrame. The > second DataFrame was created by performing an aggregation on the first > DataFrame. 
> My complete workflow is: > # read the DataFrame > # apply an UDF on column "name" > # apply an UDF on column "surname" > # apply an UDF on column "birthDate" > # aggregate on "name" and re-join with the DF > # aggregate on "surname" and re-join with the DF > If I remove one step, the process completes normally. > Here is the exception: > {code} > Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved > attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in > operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS > birthDate_cleaned#8]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at 
scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at >
[jira] [Comment Edited] (SPARK-10925) Exception when joining DataFrames
[ https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088714#comment-17088714 ] Kushal Yellam edited comment on SPARK-10925 at 4/22/20, 7:20 PM: - Hi Team I'm facing following issue with joining two dataframes. For example, in your test case, you can try this: val df = sqlContext.createDataFrame(rdd) val df1 = df; val df2 = df1; val df3 = df1.join(df2, df1("name") === df2("name")) val df4 = df3.join(df2, df3("name") === df2("name")) df4.show() in operator !Join LeftOuter, (((cast(M_ID#4443 as bigint) = M_ID#3056L) && (r_ym#1213= r_ym#4352)) && (country_code#6654 = country_code#8884)). Attribute(s) with the same name appear in the operation: r_ym. Please check if the right attribute(s) are used. was (Author: kyellam): Hi Team I'm facing following issue with joining two dataframes. in operator !Join LeftOuter, (((cast(M_ID#4443 as bigint) = M_ID#3056L) && (r_ym#1213= r_ym#4352)) && (country_code#6654 = country_code#8884)). Attribute(s) with the same name appear in the operation: r_ym. Please check if the right attribute(s) are used. > Exception when joining DataFrames > - > > Key: SPARK-10925 > URL: https://issues.apache.org/jira/browse/SPARK-10925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 > Environment: Tested with Spark 1.5.0 and Spark 1.5.1 >Reporter: Alexis Seigneurin >Priority: Major > Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase.scala, > TestCase2.scala > > > I get an exception when joining a DataFrame with another DataFrame. The > second DataFrame was created by performing an aggregation on the first > DataFrame. > My complete workflow is: > # read the DataFrame > # apply an UDF on column "name" > # apply an UDF on column "surname" > # apply an UDF on column "birthDate" > # aggregate on "name" and re-join with the DF > # aggregate on "surname" and re-join with the DF > If I remove one step, the process completes normally. 
> Here is the exception: > {code} > Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved > attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in > operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS > birthDate_cleaned#8]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) >
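The duplicate-attribute failure reported above comes from joining a DataFrame with itself, so both sides of the join expose columns with identical names. A toy Python model of the usual workaround (plain lists of dicts, not Spark code): rename one side's columns before the self-join, as one would with `withColumnRenamed` or `alias` in Spark:

```python
# Toy model of a self-join, not Spark code: each "DataFrame" is a list of dicts.
def join(left, right, key_l, key_r):
    # Inner join on two differently named key columns; no name collisions remain.
    return [{**l, **r} for l in left for r in right if l[key_l] == r[key_r]]

people = [{"name": "ada", "age": 36}, {"name": "bob", "age": 41}]
# Rename the right side's columns first: this is the workaround for the
# "Attribute(s) with the same name appear in the operation" error above.
renamed = [{f"r_{k}": v for k, v in row.items()} for row in people]
joined = join(people, renamed, "name", "r_name")
```

In Spark itself the analogous fix is joining against `df.withColumnRenamed("name", "r_name")` (or an aliased copy), which leaves the analyzer nothing ambiguous to resolve.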
[jira] [Comment Edited] (SPARK-10925) Exception when joining DataFrames
[ https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088714#comment-17088714 ] Kushal Yellam edited comment on SPARK-10925 at 4/22/20, 7:17 PM: - Hi Team I'm facing following issue with joining two dataframes. in operator !Join LeftOuter, (((cast(M_ID#4443 as bigint) = M_ID#3056L) && (r_ym#1213= r_ym#4352)) && (country_code#6654 = country_code#8884)). Attribute(s) with the same name appear in the operation: r_ym. Please check if the right attribute(s) are used. was (Author: kyellam): Hi Team I'm facing following issue with joining two dataframes. in operator !Join LeftOuter, (((cast(Contract_ID#7212 as bigint) = contract_id#5029L) && (record_yearmonth#9867 = record_yearmonth#5103)) && (Admin_System#7659 = admin_system#5098)). Attribute(s) with the same name appear in the operation: record_yearmonth. Please check if the right attribute(s) are used.;; > Exception when joining DataFrames > - > > Key: SPARK-10925 > URL: https://issues.apache.org/jira/browse/SPARK-10925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 > Environment: Tested with Spark 1.5.0 and Spark 1.5.1 >Reporter: Alexis Seigneurin >Priority: Major > Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase.scala, > TestCase2.scala > > > I get an exception when joining a DataFrame with another DataFrame. The > second DataFrame was created by performing an aggregation on the first > DataFrame. > My complete workflow is: > # read the DataFrame > # apply an UDF on column "name" > # apply an UDF on column "surname" > # apply an UDF on column "birthDate" > # aggregate on "name" and re-join with the DF > # aggregate on "surname" and re-join with the DF > If I remove one step, the process completes normally. 
> Here is the exception: > {code} > Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved > attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in > operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS > birthDate_cleaned#8]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at >
[jira] [Commented] (SPARK-27623) Provider org.apache.spark.sql.avro.AvroFileFormat could not be instantiated
[ https://issues.apache.org/jira/browse/SPARK-27623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089917#comment-17089917 ] Nicholas Chammas commented on SPARK-27623: -- Is this perhaps just a documentation issue? i.e. The documentation [here|http://spark.apache.org/docs/2.4.5/sql-data-sources-avro.html#deploying] should use 2.11 instead of 2.12, or at least clarify how to identify what to use. The [downloads page|http://spark.apache.org/downloads.html] does say: {quote}Note that, Spark is pre-built with Scala 2.11 except version 2.4.2, which is pre-built with Scala 2.12. {quote} Which I think explains the behavior people are reporting here. > Provider org.apache.spark.sql.avro.AvroFileFormat could not be instantiated > --- > > Key: SPARK-27623 > URL: https://issues.apache.org/jira/browse/SPARK-27623 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.2 >Reporter: Alexandru Barbulescu >Priority: Major > > After updating to spark 2.4.2 when using the > {code:java} > spark.read.format().options().load() > {code} > > chain of methods, regardless of what parameter is passed to "format" we get > the following error related to avro: > > {code:java} > - .options(**load_options) > - File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line > 172, in load > - File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line > 1257, in __call__ > - File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in > deco > - File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line > 328, in get_return_value > - py4j.protocol.Py4JJavaError: An error occurred while calling o69.load. 
> - : java.util.ServiceConfigurationError: > org.apache.spark.sql.sources.DataSourceRegister: Provider > org.apache.spark.sql.avro.AvroFileFormat could not be instantiated > - at java.util.ServiceLoader.fail(ServiceLoader.java:232) > - at java.util.ServiceLoader.access$100(ServiceLoader.java:185) > - at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384) > - at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404) > - at java.util.ServiceLoader$1.next(ServiceLoader.java:480) > - at > scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:44) > - at scala.collection.Iterator.foreach(Iterator.scala:941) > - at scala.collection.Iterator.foreach$(Iterator.scala:941) > - at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) > - at scala.collection.IterableLike.foreach(IterableLike.scala:74) > - at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > - at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > - at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:250) > - at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:248) > - at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108) > - at scala.collection.TraversableLike.filter(TraversableLike.scala:262) > - at scala.collection.TraversableLike.filter$(TraversableLike.scala:262) > - at scala.collection.AbstractTraversable.filter(Traversable.scala:108) > - at > org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:630) > - at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194) > - at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167) > - at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > - at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > - at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > - at 
java.lang.reflect.Method.invoke(Method.java:498) > - at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > - at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > - at py4j.Gateway.invoke(Gateway.java:282) > - at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > - at py4j.commands.CallCommand.execute(CallCommand.java:79) > - at py4j.GatewayConnection.run(GatewayConnection.java:238) > - at java.lang.Thread.run(Thread.java:748) > - Caused by: java.lang.NoClassDefFoundError: > org/apache/spark/sql/execution/datasources/FileFormat$class > - at org.apache.spark.sql.avro.AvroFileFormat.(AvroFileFormat.scala:44) > - at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > - at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > - at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > - at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > - at java.lang.Class.newInstance(Class.java:442) > - at
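If the root cause is the Scala-suffix mismatch suggested above, the user-side fix is to pick the spark-avro artifact whose `_2.11`/`_2.12` suffix matches the Scala version the Spark distribution was built with. A hypothetical invocation for a 2.4.5 download (pre-built with Scala 2.11; the application file name is a placeholder):

```bash
# Assumed example: artifact suffix must match the distribution's Scala version.
spark-submit \
  --packages org.apache.spark:spark-avro_2.11:2.4.5 \
  my_app.py
```

For Spark 2.4.2, the only 2.4.x release pre-built with Scala 2.12 per the downloads-page note quoted above, the suffix would be `_2.12` instead.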
[jira] [Updated] (SPARK-31522) Hive metastore client initialization related configurations should be static
[ https://issues.apache.org/jira/browse/SPARK-31522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-31522: - Summary: Hive metastore client initialization related configurations should be static (was: Make immutable configs in HiveUtils static) > Hive metastore client initialization related configurations should be static > - > > Key: SPARK-31522 > URL: https://issues.apache.org/jira/browse/SPARK-31522 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5, 3.0.0, 3.1.0 >Reporter: Kent Yao >Priority: Major > > The following configurations defined in HiveUtils should be considered static: > # spark.sql.hive.metastore.version - used to determine the hive version in > Spark > # spark.sql.hive.version - the fake of the above > # spark.sql.hive.metastore.jars - hive metastore related jars location which > is used by spark to create hive client > # spark.sql.hive.metastore.sharedPrefixes and > spark.sql.hive.metastore.barrierPrefixes - packages of classes that are > shared or separated between SparkContextLoader and hive client class loader > Those are used only once when creating the hive metastore client. They should > be static in SQLConf for retrieving them correctly. We should avoid them > being changed by users with SET/RESET command. They should -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
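The static-vs-runtime distinction argued for above can be sketched with a toy configuration holder (an illustration of the intended semantics, not Spark's SQLConf implementation): static entries are fixed when the session is created, and SET on them is rejected:

```python
# Toy sketch, not Spark code: static configs reject modification after startup.
class Conf:
    STATIC = {
        "spark.sql.hive.metastore.version",
        "spark.sql.hive.metastore.jars",
        "spark.sql.hive.metastore.sharedPrefixes",
        "spark.sql.hive.metastore.barrierPrefixes",
    }

    def __init__(self, initial):
        self._values = dict(initial)  # fixed at "session" creation time

    def set(self, key, value):
        # Mimics rejecting SET/RESET on a static entry.
        if key in self.STATIC:
            raise ValueError(f"Cannot modify the value of a static config: {key}")
        self._values[key] = value

conf = Conf({"spark.sql.hive.metastore.version": "2.3.7"})
conf.set("spark.sql.shuffle.partitions", "64")  # runtime conf: allowed
```

Setting `spark.sql.hive.metastore.version` on `conf` after construction would raise, which is exactly the behavior the issue asks for.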
[jira] [Updated] (SPARK-31503) fix the SQL string of the TRIM functions
[ https://issues.apache.org/jira/browse/SPARK-31503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31503: -- Affects Version/s: 2.3.4 2.4.5 > fix the SQL string of the TRIM functions > > > Key: SPARK-31503 > URL: https://issues.apache.org/jira/browse/SPARK-31503 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.5, 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.4.6, 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31503) fix the SQL string of the TRIM functions
[ https://issues.apache.org/jira/browse/SPARK-31503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089850#comment-17089850 ] Dongjoon Hyun commented on SPARK-31503: --- This is backported to branch-2.4 via https://github.com/apache/spark/pull/28299 . > fix the SQL string of the TRIM functions > > > Key: SPARK-31503 > URL: https://issues.apache.org/jira/browse/SPARK-31503 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.4.6, 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31503) fix the SQL string of the TRIM functions
[ https://issues.apache.org/jira/browse/SPARK-31503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31503: -- Fix Version/s: 2.4.6 > fix the SQL string of the TRIM functions > > > Key: SPARK-31503 > URL: https://issues.apache.org/jira/browse/SPARK-31503 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.4.6, 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31512) make window function using order by optional
[ https://issues.apache.org/jira/browse/SPARK-31512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31512. -- Resolution: Duplicate > make window function using order by optional > > > Key: SPARK-31512 > URL: https://issues.apache.org/jira/browse/SPARK-31512 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.5 > Environment: spark2.4.5 > hadoop2.7.7 >Reporter: philipse >Priority: Minor > > Hi all > In other sql dialect ,order by is not the must when using window function,we > may make it pararmteterized > the below is the case : > *select row_number()over() from test1* > Error: org.apache.spark.sql.AnalysisException: Window function row_number() > requires window to be ordered, please add ORDER BY clause. For example SELECT > row_number()(value_expr) OVER (PARTITION BY window_partition ORDER BY > window_ordering) from table; (state=,code=0) > > So. i suggest make it as a choice,or we will meet the error when migrate sql > from other dialect,such as hive -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
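Until the ORDER BY requirement is relaxed, the analyzer error quoted above points at the workaround when migrating such queries: supply an explicit ordering. A sketch against the reporter's table (the ordering column is an assumption for illustration):

```sql
-- Any orderable column satisfies the analyzer; if no meaningful order exists,
-- the assigned row numbers are arbitrary but the query runs.
SELECT row_number() OVER (ORDER BY id) AS rn
FROM test1;
```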
[jira] [Updated] (SPARK-31522) Make immutable configs in HiveUtils static
[ https://issues.apache.org/jira/browse/SPARK-31522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-31522: - Affects Version/s: 3.0.0 2.0.2 2.1.3 2.2.3 2.3.4 2.4.5 > Make immutable configs in HiveUtils static > -- > > Key: SPARK-31522 > URL: https://issues.apache.org/jira/browse/SPARK-31522 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5, 3.0.0, 3.1.0 >Reporter: Kent Yao >Priority: Major > > The following configurations defined in HiveUtils should be considered static: > # spark.sql.hive.metastore.version - used to determine the hive version in > Spark > # spark.sql.hive.version - the fake of the above > # spark.sql.hive.metastore.jars - hive metastore related jars location which > is used by spark to create hive client > # spark.sql.hive.metastore.sharedPrefixes and > spark.sql.hive.metastore.barrierPrefixes - packages of classes that are > shared or separated between SparkContextLoader and hive client class loader > Those are used only once when creating the hive metastore client. They should > be static in SQLConf for retrieving them correctly. We should avoid them > being changed by users with SET/RESET command. They should -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31522) Make immutable configs in HiveUtils static
Kent Yao created SPARK-31522: Summary: Make immutable configs in HiveUtils static Key: SPARK-31522 URL: https://issues.apache.org/jira/browse/SPARK-31522 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.0 Reporter: Kent Yao The following configurations defined in HiveUtils should be considered static: # spark.sql.hive.metastore.version - used to determine the hive version in Spark # spark.sql.hive.version - the fake of the above # spark.sql.hive.metastore.jars - hive metastore related jars location which is used by spark to create hive client # spark.sql.hive.metastore.sharedPrefixes and spark.sql.hive.metastore.barrierPrefixes - packages of classes that are shared or separated between SparkContextLoader and hive client class loader Those are used only once when creating the hive metastore client. They should be static in SQLConf for retrieving them correctly. We should avoid them being changed by users with SET/RESET command. They should -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31521) The fetch size is not correct when merging blocks into a merged block
wuyi created SPARK-31521: Summary: The fetch size is not correct when merging blocks into a merged block Key: SPARK-31521 URL: https://issues.apache.org/jira/browse/SPARK-31521 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: wuyi When merging blocks into a merged block, we should count the size of that merged block as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
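A minimal sketch of the size-accounting pitfall described above (an illustration of the stated intent, not Spark's ShuffleBlockFetcherIterator code): when coalescing small blocks into one merged fetch request, the running total must include the block currently being merged, otherwise requests overshoot the target size:

```python
def split_requests(block_sizes, target):
    """Group block sizes into fetch requests of roughly `target` bytes."""
    requests, current, current_size = [], [], 0
    for size in block_sizes:
        current.append(size)
        current_size += size  # count the block being merged as well
        if current_size >= target:
            requests.append(current)
            current, current_size = [], 0
    if current:
        requests.append(current)
    return requests
```

Dropping the `current_size += size` accounting for the in-progress block is the kind of undercount the issue describes: the request keeps absorbing blocks past the target.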
[jira] [Comment Edited] (SPARK-28367) Kafka connector infinite wait because metadata never updated
[ https://issues.apache.org/jira/browse/SPARK-28367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088514#comment-17088514 ] Gabor Somogyi edited comment on SPARK-28367 at 4/22/20, 1:35 PM: - I've taken a look at the possibilities given by the new API in [KIP-396|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=97551484]. I've found the following problems: * Consumer properties don't match 100% with AdminClient, so Consumer properties can't be used for instantiation (at first glance I think adding this to the Spark API would be overkill) * With the new API, by using AdminClient Spark loses the ability to use the assign, subscribe and subscribePattern APIs (implementing this logic would be feasible since the Kafka consumer does this on the client side as well, but it would be ugly). My main conclusion is that adding AdminClient and using Consumer in a parallel way would be super hacky. I would use either the Consumer (which doesn't provide metadata-only access at the moment) or the AdminClient (where it must be checked whether all existing features can be covered + how to add properties). was (Author: gsomogyi): I've taken a look at the possibilities given by the new API in [KIP-396|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=97551484]. I've found the following problems: * Consumer properties don't match 100% with AdminClient so Consumer properties can't be used for instantiation (at the first glance I think adding this to the API would be an overkill) * With the new API by using AdminClient Spark looses the possibility to use the assign, subscribe and subscribePattern APIs (implementing this logic would be feasible since Kafka consumer does this on client side as well but would be ugly). My main conclusion is that adding AdminClient and using Consumer in a parallel way would be super hacky. 
I would use either Consumer (which doesn't provide metadata only at the moment) or AdminClient (where it must be checked whether all existing features can be filled + how to add properties). > Kafka connector infinite wait because metadata never updated > > > Key: SPARK-28367 > URL: https://issues.apache.org/jira/browse/SPARK-28367 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.3, 2.2.3, 2.3.3, 2.4.3, 3.0.0 >Reporter: Gabor Somogyi >Priority: Critical > > Spark uses an old and deprecated API named poll(long) which never returns and > stays in live lock if metadata is not updated (for instance when broker > disappears at consumer creation). > I've created a small standalone application to test it and the alternatives: > https://github.com/gaborgsomogyi/kafka-get-assignment -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31508) string type compare with numberic cause data inaccurate
[ https://issues.apache.org/jira/browse/SPARK-31508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089663#comment-17089663 ] philipse commented on SPARK-31508: -- Haha, but normally it will be a little more complex: we will migrate many HQL queries to Spark SQL, so I suggest this would be better dealt with in the code. Can you help review the PR? > string type compare with numberic cause data inaccurate > --- > > Key: SPARK-31508 > URL: https://issues.apache.org/jira/browse/SPARK-31508 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 > Environment: hadoop2.7 > spark2.4.5 >Reporter: philipse >Priority: Major > Attachments: image-2020-04-22-20-00-09-821.png > > > Hi all > > Sparksql may should convert values to double if string type compare with > number type.the cases shows as below > 1, create table > create table test1(id string); > > 2,insert data into table > insert into test1 select 'avc'; > insert into test1 select '2'; > insert into test1 select '0a'; > insert into test1 select ''; > insert into test1 select > '22'; > 3.Let's check what's happening > select * from test_gf13871.test1 where id > 0 > the results shows below > *2* > ** > Really amazing,the big number 222...cannot be selected. > while when i check in hive,the 222...shows normal. > 4.try to explain the command,we may know what happened,if the data is big > enough than max_int_value,it will not selected,we may need to convert to > double instand. > !image-2020-04-21-18-49-58-850.png! > I wanna know if we have fixed or planned in 3.0 or later version.,please feel > free to give any advice, > > Many Thanks -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
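The disappearing rows can be modeled in a few lines of plain Python (an assumption about the described 2.4 behavior, not Spark's actual cast rules): if the string side is cast to a 32-bit int, values that fail the cast or overflow become NULL and the filter drops them, while casting both sides to double keeps them:

```python
INT_MAX = 2**31 - 1  # assumed 32-bit bound, matching "max_int_value" above

def compare_as_int(s, n):
    # Model of the reported behavior: failed or overflowing casts yield NULL.
    try:
        v = int(s)
    except ValueError:
        return None
    return v > n if -INT_MAX - 1 <= v <= INT_MAX else None

def compare_as_double(s, n):
    # The reporter's suggestion: compare both sides as doubles instead.
    try:
        return float(s) > float(n)
    except ValueError:
        return None

rows = ["avc", "2", "0a", "", "2" * 21]  # the inserted test values
kept_int = [r for r in rows if compare_as_int(r, 0)]
kept_double = [r for r in rows if compare_as_double(r, 0)]
```

Under the int model only "2" survives the `id > 0` filter; under the double model the 21-digit value survives as well, which matches the Hive result the reporter expected.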
[jira] [Resolved] (SPARK-31495) Support formatted explain for Adaptive Query Execution
[ https://issues.apache.org/jira/browse/SPARK-31495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31495. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28271 [https://github.com/apache/spark/pull/28271] > Support formatted explain for Adaptive Query Execution > -- > > Key: SPARK-31495 > URL: https://issues.apache.org/jira/browse/SPARK-31495 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Fix For: 3.0.0 > > > Support formatted explain for AQE -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31495) Support formatted explain for Adaptive Query Execution
[ https://issues.apache.org/jira/browse/SPARK-31495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31495: --- Assignee: wuyi > Support formatted explain for Adaptive Query Execution > -- > > Key: SPARK-31495 > URL: https://issues.apache.org/jira/browse/SPARK-31495 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > > Support formatted explain for AQE -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12312) JDBC connection to Kerberos secured databases fails on remote executors
[ https://issues.apache.org/jira/browse/SPARK-12312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089617#comment-17089617 ] Gabor Somogyi commented on SPARK-12312: --- Soon the MS SQL and Oracle server is coming which is not possible to dockerize (fully featuring active directory setup is needed which I can't dockerize). I'm going to test it manually on cluster but if somebody wants to contribute by double checking just ping me. > JDBC connection to Kerberos secured databases fails on remote executors > --- > > Key: SPARK-12312 > URL: https://issues.apache.org/jira/browse/SPARK-12312 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 2.4.2 >Reporter: nabacg >Priority: Minor > > When loading DataFrames from JDBC datasource with Kerberos authentication, > remote executors (yarn-client/cluster etc. modes) fail to establish a > connection due to lack of Kerberos ticket or ability to generate it. > This is a real issue when trying to ingest data from kerberized data sources > (SQL Server, Oracle) in enterprise environment where exposing simple > authentication access is not an option due to IT policy issues. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31508) string type compare with numberic cause data inaccurate
[ https://issues.apache.org/jira/browse/SPARK-31508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089596#comment-17089596 ] JinxinTang edited comment on SPARK-31508 at 4/22/20, 12:00 PM: --- For example: select * from default.test1 where id > 3D. Adding 'D' after the number 3 makes the literal a double and gives a correct result; this may work: !image-2020-04-22-20-00-09-821.png! Have fun~ was (Author: jinxintang): for example: select * from default.test1 where id > 3D add 'D' after number 3,this will not convert to long and int may alse work, have fun~ > string type compare with numberic cause data inaccurate > --- > > Key: SPARK-31508 > URL: https://issues.apache.org/jira/browse/SPARK-31508 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 > Environment: hadoop2.7 > spark2.4.5 >Reporter: philipse >Priority: Major > Attachments: image-2020-04-22-20-00-09-821.png > > > Hi all > > Sparksql may should convert values to double if string type compare with > number type.the cases shows as below > 1, create table > create table test1(id string); > > 2,insert data into table > insert into test1 select 'avc'; > insert into test1 select '2'; > insert into test1 select '0a'; > insert into test1 select ''; > insert into test1 select > '22'; > 3.Let's check what's happening > select * from test_gf13871.test1 where id > 0 > the results shows below > *2* > ** > Really amazing,the big number 222...cannot be selected. > while when i check in hive,the 222...shows normal. > 4.try to explain the command,we may know what happened,if the data is big > enough than max_int_value,it will not selected,we may need to convert to > double instand. > !image-2020-04-21-18-49-58-850.png! 
> I wanna know if we have fixed or planned in 3.0 or later version.,please feel > free to give any advice, > > Many Thanks -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31520) Add a readiness probe to spark pod definitions
Dale Richardson created SPARK-31520: --- Summary: Add a readiness probe to spark pod definitions Key: SPARK-31520 URL: https://issues.apache.org/jira/browse/SPARK-31520 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 2.4.5 Reporter: Dale Richardson Add a readiness probe so that Kubernetes can communicate basic Spark pod state to the end user via get/describe pod commands. A basic TCP/SYN probe of the RPC port should be enough to indicate that a Spark process is running. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
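A hedged sketch of what such a probe could look like in the driver pod spec (hand-written illustration, not output generated by Spark; the port is an assumption and should match whatever `spark.driver.port` resolves to):

```yaml
readinessProbe:
  tcpSocket:
    port: 7078        # assumed driver RPC port
  initialDelaySeconds: 5
  periodSeconds: 10
```

A TCP probe only checks that the port accepts connections, which is exactly the "a Spark process is running" signal the issue asks for, without requiring an HTTP endpoint.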
[jira] [Issue Comment Deleted] (SPARK-31508) string type compare with numberic cause data inaccurate
[ https://issues.apache.org/jira/browse/SPARK-31508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JinxinTang updated SPARK-31508: --- Comment: was deleted (was: such '2...' is far larger than long, I think we should not covert to long.) > string type compare with numberic cause data inaccurate > --- > > Key: SPARK-31508 > URL: https://issues.apache.org/jira/browse/SPARK-31508 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 > Environment: hadoop2.7 > spark2.4.5 >Reporter: philipse >Priority: Major > > Hi all > > Sparksql may should convert values to double if string type compare with > number type.the cases shows as below > 1, create table > create table test1(id string); > > 2,insert data into table > insert into test1 select 'avc'; > insert into test1 select '2'; > insert into test1 select '0a'; > insert into test1 select ''; > insert into test1 select > '22'; > 3.Let's check what's happening > select * from test_gf13871.test1 where id > 0 > the results shows below > *2* > ** > Really amazing,the big number 222...cannot be selected. > while when i check in hive,the 222...shows normal. > 4.try to explain the command,we may know what happened,if the data is big > enough than max_int_value,it will not selected,we may need to convert to > double instand. > !image-2020-04-21-18-49-58-850.png! > I wanna know if we have fixed or planned in 3.0 or later version.,please feel > free to give any advice, > > Many Thanks -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31508) string type compare with numberic cause data inaccurate
[ https://issues.apache.org/jira/browse/SPARK-31508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089598#comment-17089598 ] JinxinTang edited comment on SPARK-31508 at 4/22/20, 11:49 AM: --- such '2...' is far larger than long, I think we should not covert to long. was (Author: jinxintang): such '2...' is far larger than big, I think we should not covert to long. > string type compare with numberic cause data inaccurate > --- > > Key: SPARK-31508 > URL: https://issues.apache.org/jira/browse/SPARK-31508 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 > Environment: hadoop2.7 > spark2.4.5 >Reporter: philipse >Priority: Major > > Hi all > > Sparksql may should convert values to double if string type compare with > number type.the cases shows as below > 1, create table > create table test1(id string); > > 2,insert data into table > insert into test1 select 'avc'; > insert into test1 select '2'; > insert into test1 select '0a'; > insert into test1 select ''; > insert into test1 select > '22'; > 3.Let's check what's happening > select * from test_gf13871.test1 where id > 0 > the results shows below > *2* > ** > Really amazing,the big number 222...cannot be selected. > while when i check in hive,the 222...shows normal. > 4.try to explain the command,we may know what happened,if the data is big > enough than max_int_value,it will not selected,we may need to convert to > double instand. > !image-2020-04-21-18-49-58-850.png! > I wanna know if we have fixed or planned in 3.0 or later version.,please feel > free to give any advice, > > Many Thanks -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31508) string type compared with numeric causes inaccurate data
[ https://issues.apache.org/jira/browse/SPARK-31508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089598#comment-17089598 ] JinxinTang commented on SPARK-31508: Such a value '2...' is far larger than a long can hold; I think we should not convert it to long.
> string type compared with numeric causes inaccurate data
> ---
>
> Key: SPARK-31508
> URL: https://issues.apache.org/jira/browse/SPARK-31508
[jira] [Commented] (SPARK-31508) string type compared with numeric causes inaccurate data
[ https://issues.apache.org/jira/browse/SPARK-31508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089596#comment-17089596 ] JinxinTang commented on SPARK-31508: For example:
select * from default.test1 where id > 3D
Add 'D' after the number 3; this will not convert to long, and int may also work. Have fun~
> string type compared with numeric causes inaccurate data
> ---
>
> Key: SPARK-31508
> URL: https://issues.apache.org/jira/browse/SPARK-31508
[jira] [Commented] (SPARK-31508) string type compared with numeric causes inaccurate data
[ https://issues.apache.org/jira/browse/SPARK-31508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089583#comment-17089583 ] JinxinTang commented on SPARK-31508: You are right. We can see from the generated code in branch 2.4 that the conversion uses the org.apache.spark.unsafe.types.UTF8String#toInt method.
> string type compared with numeric causes inaccurate data
> ---
>
> Key: SPARK-31508
> URL: https://issues.apache.org/jira/browse/SPARK-31508
[jira] [Issue Comment Deleted] (SPARK-31508) string type compared with numeric causes inaccurate data
[ https://issues.apache.org/jira/browse/SPARK-31508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JinxinTang updated SPARK-31508: --- Comment: was deleted (was: You are right. We can see from the generated code in branch 2.4 that the conversion uses the org.apache.spark.unsafe.types.UTF8String#toInt method.)
> string type compared with numeric causes inaccurate data
> ---
>
> Key: SPARK-31508
> URL: https://issues.apache.org/jira/browse/SPARK-31508
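The SPARK-31508 thread above hinges on how the string side of `id > 0` is cast before the comparison. A minimal Python sketch of the mechanism (illustrative only, not Spark code; the helper names and the INT_MAX bound are assumptions for the demo): comparing as a fixed-width integer silently drops values that overflow or fail to parse, while comparing as double keeps the huge value, matching the Hive behaviour the reporter expects.

```python
# Illustrative sketch (not Spark code) of the reported bug mechanism.

INT_MAX = 2**31 - 1

def compare_as_int(s, threshold):
    """Mimic a cast-to-int comparison: unparseable or overflowing values
    return None, i.e. the predicate is null and the row is filtered out."""
    try:
        v = int(s)
    except ValueError:
        return None              # 'avc', '0a', '' drop out
    if not (-INT_MAX - 1 <= v <= INT_MAX):
        return None              # the huge number overflows and drops out
    return v > threshold

def compare_as_double(s, threshold):
    """Mimic a cast-to-double comparison: huge values survive."""
    try:
        return float(s) > threshold
    except ValueError:
        return None

rows = ["avc", "2", "0a", "", "2" + "2" * 30]   # last entry: a 31-digit number
kept_int = [r for r in rows if compare_as_int(r, 0)]
kept_double = [r for r in rows if compare_as_double(r, 0)]
# kept_int == ["2"]: the 31-digit value vanishes, as reported.
# kept_double keeps both "2" and the 31-digit value, as Hive does.
```

This also explains the `id > 3D` workaround suggested in the thread: a double literal forces the double-style comparison.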
[jira] [Assigned] (SPARK-31507) Remove millennium, century, decade, millisecond, microsecond and epoch from extract function
[ https://issues.apache.org/jira/browse/SPARK-31507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31507: --- Assignee: Kent Yao
> Remove millennium, century, decade, millisecond, microsecond and epoch from extract function
> ---
>
> Key: SPARK-31507
> URL: https://issues.apache.org/jira/browse/SPARK-31507
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Kent Yao
> Assignee: Kent Yao
> Priority: Major
>
> Extracting millennium, century, decade, millisecond, microsecond and epoch from datetime is neither ANSI standard nor quite common in modern SQL platforms. None of the systems listed below support these, except PostgreSQL.
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
> https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions050.htm
> https://prestodb.io/docs/current/functions/datetime.html
> https://docs.cloudera.com/documentation/enterprise/5-8-x/topics/impala_datetime_functions.html
> https://docs.snowflake.com/en/sql-reference/functions-date-time.html#label-supported-date-time-parts
> https://www.postgresql.org/docs/9.1/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT
> https://docs.teradata.com/reader/kmuOwjp1zEYg98JsB8fu_A/SIkE2wnHyQBnU4AGWRZSRw
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31507) Remove millennium, century, decade, millisecond, microsecond and epoch from extract function
[ https://issues.apache.org/jira/browse/SPARK-31507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31507. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28284 [https://github.com/apache/spark/pull/28284]
> Remove millennium, century, decade, millisecond, microsecond and epoch from extract function
> ---
>
> Key: SPARK-31507
> URL: https://issues.apache.org/jira/browse/SPARK-31507
[jira] [Resolved] (SPARK-31498) Dump public static sql configurations through doc generation
[ https://issues.apache.org/jira/browse/SPARK-31498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31498. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28274 [https://github.com/apache/spark/pull/28274]
> Dump public static sql configurations through doc generation
> ---
>
> Key: SPARK-31498
> URL: https://issues.apache.org/jira/browse/SPARK-31498
> Project: Spark
> Issue Type: Improvement
> Components: Documentation, SQL
> Affects Versions: 3.0.0, 3.1.0
> Reporter: Kent Yao
> Assignee: Kent Yao
> Priority: Major
> Fix For: 3.0.0
>
> Currently, only the non-static public SQL configurations are dumped to the public doc; we'd better also add the static public ones, as the command {{set -v}} does.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31498) Dump public static sql configurations through doc generation
[ https://issues.apache.org/jira/browse/SPARK-31498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31498: --- Assignee: Kent Yao
> Dump public static sql configurations through doc generation
> ---
>
> Key: SPARK-31498
> URL: https://issues.apache.org/jira/browse/SPARK-31498
[jira] [Resolved] (SPARK-30467) On Federal Information Processing Standard (FIPS) enabled cluster, Spark Workers are not able to connect to Remote Master.
[ https://issues.apache.org/jira/browse/SPARK-30467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Sharma resolved SPARK-30467. - Resolution: Cannot Reproduce
> On Federal Information Processing Standard (FIPS) enabled cluster, Spark Workers are not able to connect to Remote Master.
> ---
>
> Key: SPARK-30467
> URL: https://issues.apache.org/jira/browse/SPARK-30467
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Affects Versions: 2.3.3, 2.3.4, 2.4.4
> Reporter: SHOBHIT SHUKLA
> Priority: Major
> Labels: security
>
> On _*Federal Information Processing Standard*_ (FIPS) enabled clusters, if we configure *spark.network.crypto.enabled true*, Spark Workers are not able to create a SparkContext because communication between the Spark Worker and the Spark Master fails.
> The default algorithm ( *_spark.network.crypto.keyFactoryAlgorithm_* ) is set to *_PBKDF2WithHmacSHA1_*, which is a non-approved cryptographic algorithm. We tried many values from the FIPS-approved cryptographic algorithms, but those values do not work either.
> *Error logs:*
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
> *fips.c(145): OpenSSL internal error, assertion failed: FATAL FIPS SELFTEST FAILURE*
> JVMDUMP039I Processing dump event "abort", detail "" at 2020/01/09 06:41:50 - please wait.
> JVMDUMP032I JVM requested System dump using '/bin/core.20200109.064150.283.0001.dmp' in response to an event
> JVMDUMP030W Cannot write dump to file /bin/core.20200109.064150.283.0001.dmp: Permission denied
> JVMDUMP012E Error in System dump: The core file created by child process with pid = 375 was not found.
Expected to find core file with name > "/var/cores/core-netty-rpc-conne-sig11-user1000320999-group0-pid375-time*" > JVMDUMP030W Cannot write dump to file > /bin/javacore.20200109.064150.283.0002.txt: Permission denied > JVMDUMP032I JVM requested Java dump using > '/tmp/javacore.20200109.064150.283.0002.txt' in response to an event > JVMDUMP010I Java dump written to /tmp/javacore.20200109.064150.283.0002.txt > JVMDUMP032I JVM requested Snap dump using > '/bin/Snap.20200109.064150.283.0003.trc' in response to an event > JVMDUMP030W Cannot write dump to file > /bin/Snap.20200109.064150.283.0003.trc: Permission denied > JVMDUMP010I Snap dump written to /tmp/Snap.20200109.064150.283.0003.trc > JVMDUMP030W Cannot write dump to file > /bin/jitdump.20200109.064150.283.0004.dmp: Permission denied > JVMDUMP007I JVM Requesting JIT dump using > '/tmp/jitdump.20200109.064150.283.0004.dmp' > JVMDUMP010I JIT dump written to /tmp/jitdump.20200109.064150.283.0004.dmp > JVMDUMP013I Processed dump event "abort", detail "". -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30467) On Federal Information Processing Standard (FIPS) enabled cluster, Spark Workers are not able to connect to Remote Master.
[ https://issues.apache.org/jira/browse/SPARK-30467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089517#comment-17089517 ] Prashant Sharma commented on SPARK-30467: - As already mentioned here, this is not a Spark issue; it is related to the JVM in use, and I was able to run various FIPS-compliant configurations [Blog link|https://github.com/ScrapCodes/FIPS-compliance/blob/master/blogs/spark-meets-fips.md]. Based on this, I am closing this issue as Cannot Reproduce. Feel free to reopen if you can provide more detail and a way to reproduce.
> On Federal Information Processing Standard (FIPS) enabled cluster, Spark Workers are not able to connect to Remote Master.
> ---
>
> Key: SPARK-30467
> URL: https://issues.apache.org/jira/browse/SPARK-30467
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31519) Cast in having aggregate expressions returns the wrong result
Yuanjian Li created SPARK-31519: --- Summary: Cast in having aggregate expressions returns the wrong result Key: SPARK-31519 URL: https://issues.apache.org/jira/browse/SPARK-31519 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Yuanjian Li
Cast in having aggregate expressions returns the wrong result. See the tests below:
{code:java}
scala> spark.sql("create temp view t(a, b) as values (1,10), (2, 20)")
res0: org.apache.spark.sql.DataFrame = []

scala> val query = """
 | select sum(a) as b, '2020-01-01' as fake
 | from t
 | group by b
 | having b > 10;"""

scala> spark.sql(query).show()
+---+----------+
|  b|      fake|
+---+----------+
|  2|2020-01-01|
+---+----------+

scala> val query = """
 | select sum(a) as b, cast('2020-01-01' as date) as fake
 | from t
 | group by b
 | having b > 10;"""

scala> spark.sql(query).show()
+---+----+
|  b|fake|
+---+----+
+---+----+
{code}
The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING query, and Spark has a special analyzer rule ResolveAggregateFunctions to resolve the aggregate functions and grouping columns in the Filter operator. It works for simple cases in a very tricky way, as it relies on rule execution order:
1. Rule ResolveReferences hits the Aggregate operator and resolves attributes inside aggregate functions, but the function itself is still unresolved as it's an UnresolvedFunction. This stops resolving the Filter operator as the child Aggregate operator is still unresolved.
2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the Aggregate operator resolved.
3. Rule ResolveAggregateFunctions resolves the Filter operator if its child is a resolved Aggregate. This rule can correctly resolve the grouping columns.
In the example query, I put a CAST, which needs to be resolved by rule ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step 3, as the Aggregate operator is unresolved at that time.
Then the analyzer starts the next round, and the Filter operator is resolved by ResolveReferences, which wrongly resolves the grouping columns. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
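The rule-ordering hazard SPARK-31519 describes can be sketched in miniature. The following Python toy is a hypothetical rule engine, not Spark's actual analyzer: `resolve_having`, the rule names used as plain strings, and the two lookup tables are all illustrative assumptions. It shows how the same name `b` in the HAVING filter binds differently depending on which rule reaches it first.

```python
# Hypothetical miniature of the analyzer-ordering problem (NOT Spark code).

def resolve_having(filter_name, rules):
    """Resolve a name appearing in HAVING against an Aggregate that has a
    grouping column `b` (t.b) and an output alias `b` for sum(a)."""
    grouping_columns = {"b": "t.b (grouping column)"}         # what HAVING should see
    output_aliases = {"b": "sum(a) AS b (aggregate output)"}  # what it must not see
    for rule in rules:
        # The special rule knows a HAVING filter may reference grouping columns.
        if rule == "ResolveAggregateFunctions" and filter_name in grouping_columns:
            return grouping_columns[filter_name]
        # The generic rule only sees the child's output attributes.
        if rule == "ResolveReferences" and filter_name in output_aliases:
            return output_aliases[filter_name]
    return None

# Simple query: the Aggregate is resolved in time, so the special rule runs
# first and binds `b` correctly.
good = resolve_having("b", ["ResolveAggregateFunctions", "ResolveReferences"])

# With the CAST, the Aggregate is still unresolved when the special rule
# runs; in the next analyzer round effectively only the generic rule fires,
# so `b` wrongly binds to the aggregate output and `having b > 10` filters
# out every row.
bad = resolve_having("b", ["ResolveReferences"])
```

This mirrors the observed results: the first query keeps the `b = 2` group, while the CAST variant returns an empty result.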
[jira] [Commented] (SPARK-31518) Expose filterByRange in JavaPairRDD
[ https://issues.apache.org/jira/browse/SPARK-31518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089514#comment-17089514 ] Antonin Delpeuch commented on SPARK-31518: -- Corresponding pull request: https://github.com/apache/spark/pull/28293 > Expose filterByRange in JavaPairRDD > --- > > Key: SPARK-31518 > URL: https://issues.apache.org/jira/browse/SPARK-31518 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.5 >Reporter: Antonin Delpeuch >Priority: Major > Labels: patch > > The `filterByRange` method makes it possible to efficiently filter a sorted > RDD using bounds on its keys. It prunes out partitions instead of scanning > them if a RangePartitioner is available in the RDD. > This method is part of the Scala API, defined in OrderedRDDFunctions, but is > not exposed in the Java API as far as I can tell. All other methods defined > in OrderedRDDFunctions are exposed in JavaPairRDD, therefore it seems natural > to expose `filterByRange` there too. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31516) Non-existent metric hiveClientCalls.count of CodeGenerator in Monitoring Doc
[ https://issues.apache.org/jira/browse/SPARK-31516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ZHANG Wei updated SPARK-31516: -- External issue URL: (was: https://github.com/apache/spark/pull/28292) > Non-existed metric hiveClientCalls.count of CodeGenerator in Monitoring Doc > --- > > Key: SPARK-31516 > URL: https://issues.apache.org/jira/browse/SPARK-31516 > Project: Spark > Issue Type: Improvement > Components: Documentation, Spark Core >Affects Versions: 3.0.0 >Reporter: ZHANG Wei >Priority: Minor > > There is a duplicated `hiveClientCalls.count` metric in both > `namespace=HiveExternalCatalog` and `namespace=CodeGenerator` bullet lists > of [Spark Monitoring > doc|https://spark.apache.org/docs/3.0.0-preview2/monitoring.html#component-instance--executor], > but there is only one inside object HiveCatalogMetrics in [source > code|https://github.com/apache/spark/blob/6550d0d5283efdbbd838f3aeaf0476c7f52a0fb1/core/src/main/scala/org/apache/spark/metrics/source/StaticSources.scala#L85]. > {quote} * namespace=HiveExternalCatalog > ** *note:*: these metrics are conditional to a configuration parameter: > {{spark.metrics.staticSources.enabled}} (default is true) > ** fileCacheHits.count > ** filesDiscovered.count > ** +{color:#ff}*hiveClientCalls.count*{color}+ > ** parallelListingJobCount.count > ** partitionsFetched.count > * namespace=CodeGenerator > ** *note:*: these metrics are conditional to a configuration parameter: > {{spark.metrics.staticSources.enabled}} (default is true) > ** compilationTime (histogram) > ** generatedClassSize (histogram) > ** generatedMethodSize (histogram) > ** *{color:#ff}+hiveClientCalls.count+{color}* > ** sourceCodeSize (histogram){quote} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31516) Non-existent metric hiveClientCalls.count of CodeGenerator in Monitoring Doc
[ https://issues.apache.org/jira/browse/SPARK-31516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ZHANG Wei updated SPARK-31516: -- External issue URL: https://github.com/apache/spark/pull/28292
> Non-existent metric hiveClientCalls.count of CodeGenerator in Monitoring Doc
> ---
>
> Key: SPARK-31516
> URL: https://issues.apache.org/jira/browse/SPARK-31516
[jira] [Updated] (SPARK-31516) Non-existent metric hiveClientCalls.count of CodeGenerator in Monitoring Doc
[ https://issues.apache.org/jira/browse/SPARK-31516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ZHANG Wei updated SPARK-31516: -- Description: There is a duplicated `hiveClientCalls.count` metric in both `namespace=HiveExternalCatalog` and `namespace=CodeGenerator` bullet lists of [Spark Monitoring doc|https://spark.apache.org/docs/3.0.0-preview2/monitoring.html#component-instance--executor], but there is only one inside object HiveCatalogMetrics in [source code|https://github.com/apache/spark/blob/6550d0d5283efdbbd838f3aeaf0476c7f52a0fb1/core/src/main/scala/org/apache/spark/metrics/source/StaticSources.scala#L85].
{quote} * namespace=HiveExternalCatalog
 ** *note:* these metrics are conditional on a configuration parameter: {{spark.metrics.staticSources.enabled}} (default is true)
 ** fileCacheHits.count
 ** filesDiscovered.count
 ** +{color:#ff}*hiveClientCalls.count*{color}+
 ** parallelListingJobCount.count
 ** partitionsFetched.count
 * namespace=CodeGenerator
 ** *note:* these metrics are conditional on a configuration parameter: {{spark.metrics.staticSources.enabled}} (default is true)
 ** compilationTime (histogram)
 ** generatedClassSize (histogram)
 ** generatedMethodSize (histogram)
 ** *{color:#ff}+hiveClientCalls.count+{color}*
 ** sourceCodeSize (histogram){quote}
was: There is a duplicated `hiveClientCalls.count` metric in both `namespace=HiveExternalCatalog` and `namespace=CodeGenerator` bullet lists of [Spark Monitoring doc|https://spark.apache.org/docs/3.0.0-preview2/monitoring.html#component-instance--executor], but there is only one of object HiveCatalogMetrics in [source code|https://github.com/apache/spark/blob/6550d0d5283efdbbd838f3aeaf0476c7f52a0fb1/core/src/main/scala/org/apache/spark/metrics/source/StaticSources.scala#L85].
> Non-existent metric hiveClientCalls.count of CodeGenerator in Monitoring Doc
> ---
>
> Key: SPARK-31516
> URL: https://issues.apache.org/jira/browse/SPARK-31516
[jira] [Created] (SPARK-31518) Expose filterByRange in JavaPairRDD
Antonin Delpeuch created SPARK-31518: Summary: Expose filterByRange in JavaPairRDD Key: SPARK-31518 URL: https://issues.apache.org/jira/browse/SPARK-31518 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.5 Reporter: Antonin Delpeuch The `filterByRange` method makes it possible to efficiently filter a sorted RDD using bounds on its keys. It prunes out partitions instead of scanning them if a RangePartitioner is available in the RDD. This method is part of the Scala API, defined in OrderedRDDFunctions, but is not exposed in the Java API as far as I can tell. All other methods defined in OrderedRDDFunctions are exposed in JavaPairRDD, therefore it seems natural to expose `filterByRange` there too. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
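The partition pruning that `filterByRange` enables can be sketched outside Spark. This Python toy is an assumed model, not Spark's implementation or API: `filter_by_range` and the list-of-lists partition layout are illustrative. Each partition is a key-sorted list of (key, value) pairs, and partitions are ordered by key range, as a RangePartitioner guarantees; partitions whose range cannot intersect the requested bounds are skipped instead of scanned.

```python
# Illustrative sketch (not Spark code) of range-based partition pruning.

def filter_by_range(partitions, lower, upper):
    """Return pairs with lower <= key <= upper, plus how many partitions
    were actually scanned (the rest are pruned by their min/max keys)."""
    kept, scanned = [], 0
    for part in partitions:
        if not part:
            continue
        lo, hi = part[0][0], part[-1][0]
        if hi < lower or lo > upper:
            continue  # whole partition outside the bounds: no scan needed
        scanned += 1
        kept.extend((k, v) for k, v in part if lower <= k <= upper)
    return kept, scanned

parts = [
    [(1, "a"), (3, "b")],    # key range [1, 3]  -> pruned
    [(5, "c"), (7, "d")],    # key range [5, 7]  -> scanned
    [(9, "e"), (11, "f")],   # key range [9, 11] -> pruned
]
result, scanned = filter_by_range(parts, 5, 8)
# Only the middle partition is scanned.
```

The same idea is why exposing the Scala method in JavaPairRDD matters: without it, a Java caller must filter every partition.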
[jira] [Updated] (SPARK-31517) SparkR::orderBy with multiple columns descending produces error
[ https://issues.apache.org/jira/browse/SPARK-31517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ross Bowen updated SPARK-31517: --- Description: When specifying two columns within an `orderBy()` function, to attempt to get an ordering by two columns in descending order, an error is returned. {code:java} library(magrittr) library(SparkR) cars <- cbind(model = rownames(mtcars), mtcars) carsDF <- createDataFrame(cars) carsDF %>% mutate(rank = over(rank(), orderBy(windowPartitionBy(column("cyl")), desc(column("mpg")), desc(column("disp"))))) %>% head() {code} This returns an error: {code:java} Error in ns[[i]] : subscript out of bounds{code} This seems to be related to the more general issue that the following code, excluding the use of the `desc()` function, also fails: {code:java} carsDF %>% mutate(rank = over(rank(), orderBy(windowPartitionBy(column("cyl")), column("mpg"), column("disp")))) %>% head(){code} was: When specifying two columns within an `orderBy()` function, to attempt to get an ordering by two columns in descending order, an error is returned. 
{code:java} library(magrittr) library(SparkR) cars <- cbind(model = rownames(mtcars), mtcars) carsDF <- createDataFrame(cars) carsDF %>% mutate(rank = over(rank(), orderBy(windowPartitionBy(column("cyl")), desc(column("mpg")), desc(column("disp"))))) %>% head() {code} This returns an error: {code:java} Error in ns[[i]] : subscript out of bounds{code} This seems to be related to the more general issue that the following code, excluding the use of the `desc()` function, also fails: {code:java} carsDF %>% mutate(rank = over(rank(), orderBy(windowPartitionBy(column("cyl")), column("mpg"), column("disp")))) %>% head(){code} > SparkR::orderBy with multiple columns descending produces error > --- > > Key: SPARK-31517 > URL: https://issues.apache.org/jira/browse/SPARK-31517 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.5 > Environment: Databricks Runtime 6.5 >Reporter: Ross Bowen >Priority: Major > > When specifying two columns within an `orderBy()` function, to attempt to get > an ordering by two columns in descending order, an error is returned. > {code:java} > library(magrittr) > library(SparkR) > cars <- cbind(model = rownames(mtcars), mtcars) > carsDF <- createDataFrame(cars) > carsDF %>% > mutate(rank = over(rank(), orderBy(windowPartitionBy(column("cyl")), > desc(column("mpg")), desc(column("disp"))))) %>% > head() {code} > This returns an error: > {code:java} > Error in ns[[i]] : subscript out of bounds{code} > This seems to be related to the more general issue that the following code, > excluding the use of the `desc()` function, also fails: > {code:java} > carsDF %>% > mutate(rank = over(rank(), orderBy(windowPartitionBy(column("cyl")), > column("mpg"), column("disp")))) %>% > head(){code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31517) SparkR::orderBy with multiple columns descending produces error
[ https://issues.apache.org/jira/browse/SPARK-31517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ross Bowen updated SPARK-31517: --- Summary: SparkR::orderBy with multiple columns descending produces error (was: orderBy with multiple columns descending does not work properly) > SparkR::orderBy with multiple columns descending produces error > --- > > Key: SPARK-31517 > URL: https://issues.apache.org/jira/browse/SPARK-31517 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.5 > Environment: Databricks Runtime 6.5 >Reporter: Ross Bowen >Priority: Major > > When specifying two columns within an `orderBy()` function, to attempt to get > an ordering by two columns in descending order, an error is returned. > {code:java} > library(magrittr) library(SparkR) cars <- cbind(model = rownames(mtcars), > mtcars) carsDF <- createDataFrame(cars) carsDF %>% mutate(rank = over(rank(), > orderBy(windowPartitionBy(column("cyl")), desc(column("mpg")), > desc(column("disp"))))) %>% head() {code} > This returns an error: > {code:java} > Error in ns[[i]] : subscript out of bounds{code} > This seems to be related to the more general issue that the following code, > excluding the use of the `desc()` function, also fails: > {code:java} > carsDF %>% mutate(rank = over(rank(), > orderBy(windowPartitionBy(column("cyl")), column("mpg"), column("disp")))) > %>% head(){code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31508) string type compare with numberic cause data inaccurate
[ https://issues.apache.org/jira/browse/SPARK-31508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090252#comment-17089449 ] JinxinTang edited comment on SPARK-31508 at 4/22/20, 8:33 AM: -- select * from test1 where id > '0'; may seem more normal. Thanks for your issue, I will inspect the code too. was (Author: jinxintang): select id from test1 where id > '0'; may seem more normal. Thanks for your issue, I will inspect the code too. > string type compare with numberic cause data inaccurate > --- > > Key: SPARK-31508 > URL: https://issues.apache.org/jira/browse/SPARK-31508 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 > Environment: hadoop2.7 > spark2.4.5 >Reporter: philipse >Priority: Major > > Hi all > > Spark SQL should probably convert values to double when a string type is > compared with a number type. The cases are shown below. > 1. Create the table: > create table test1(id string); > > 2. Insert data into the table: > insert into test1 select 'avc'; > insert into test1 select '2'; > insert into test1 select '0a'; > insert into test1 select ''; > insert into test1 select > '22'; > 3. Let's check what's happening: > select * from test_gf13871.test1 where id > 0 > The results are shown below: > *2* > ** > Really amazing: the big number 222... cannot be selected, > while when I check in Hive, the 222... shows up normally. > 4. Trying to explain the command, we may see what happened: if the data is > bigger than max_int_value, it will not be selected; we may need to convert to > double instead. > !image-2020-04-21-18-49-58-850.png! > I want to know whether this has been fixed or is planned for 3.0 or a later > version. Please feel free to give any advice. > > Many Thanks -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31517) orderBy with multiple columns descending does not work properly
Ross Bowen created SPARK-31517: -- Summary: orderBy with multiple columns descending does not work properly Key: SPARK-31517 URL: https://issues.apache.org/jira/browse/SPARK-31517 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.4.5 Environment: Databricks Runtime 6.5 Reporter: Ross Bowen When specifying two columns within an `orderBy()` function, to attempt to get an ordering by two columns in descending order, an error is returned. {code:java} library(magrittr) library(SparkR) cars <- cbind(model = rownames(mtcars), mtcars) carsDF <- createDataFrame(cars) carsDF %>% mutate(rank = over(rank(), orderBy(windowPartitionBy(column("cyl")), desc(column("mpg")), desc(column("disp"))))) %>% head() {code} This returns an error: {code:java} Error in ns[[i]] : subscript out of bounds{code} This seems to be related to the more general issue that the following code (excluding the use of the `desc()` function) also fails: {code:java} carsDF %>% mutate(rank = over(rank(), orderBy(windowPartitionBy(column("cyl")), column("mpg"), column("disp")))) %>% head(){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31508) string type compare with numberic cause data inaccurate
[ https://issues.apache.org/jira/browse/SPARK-31508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089449#comment-17089449 ] JinxinTang commented on SPARK-31508: select id from test1 where id > '0'; may seem more normal. Thanks for your issue, I will inspect the code too. > string type compare with numberic cause data inaccurate > --- > > Key: SPARK-31508 > URL: https://issues.apache.org/jira/browse/SPARK-31508 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 > Environment: hadoop2.7 > spark2.4.5 >Reporter: philipse >Priority: Major > > Hi all > > Spark SQL should probably convert values to double when a string type is > compared with a number type. The cases are shown below. > 1. Create the table: > create table test1(id string); > > 2. Insert data into the table: > insert into test1 select 'avc'; > insert into test1 select '2'; > insert into test1 select '0a'; > insert into test1 select ''; > insert into test1 select > '22'; > 3. Let's check what's happening: > select * from test_gf13871.test1 where id > 0 > The results are shown below: > *2* > ** > Really amazing: the big number 222... cannot be selected, > while when I check in Hive, the 222... shows up normally. > 4. Trying to explain the command, we may see what happened: if the data is > bigger than max_int_value, it will not be selected; we may need to convert to > double instead. > !image-2020-04-21-18-49-58-850.png! > I want to know whether this has been fixed or is planned for 3.0 or a later > version. Please feel free to give any advice. > > Many Thanks -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
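To make the reported behavior concrete, here is a toy model (plain Python; the cast helpers and `where_gt_zero` are hypothetical stand-ins, not Spark's implementation) of the reporter's hypothesis: when a string column is compared with an integer literal, casting the string side to int turns out-of-range values into null, and the null predicate silently drops those rows, whereas a double cast would keep them.

```python
# Toy model of `WHERE id > 0` on a string column under two cast strategies.
INT_MIN, INT_MAX = -2**31, 2**31 - 1

def cast_string_to_int(s):
    """Non-ANSI cast semantics: unparseable or out-of-range -> None (null)."""
    try:
        v = int(s)
    except ValueError:
        return None
    return v if INT_MIN <= v <= INT_MAX else None

def cast_string_to_double(s):
    try:
        return float(s)
    except ValueError:
        return None

def where_gt_zero(ids, cast):
    out = []
    for s in ids:
        v = cast(s)
        if v is not None and v > 0:  # a null comparison acts like false
            out.append(s)
    return out

big = "2" * 21  # far beyond the 32-bit int range
ids = ["avc", "2", "0a", "", big]
assert where_gt_zero(ids, cast_string_to_int) == ["2"]          # big value dropped
assert where_gt_zero(ids, cast_string_to_double) == ["2", big]  # big value kept
```

This matches the issue's observation: Hive promotes the string side to double for such comparisons, so the 21-digit value survives there but not in Spark 2.4.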
[jira] [Created] (SPARK-31516) Non-existed metric hiveClientCalls.count of CodeGenerator in Monitoring Doc
ZHANG Wei created SPARK-31516: - Summary: Non-existed metric hiveClientCalls.count of CodeGenerator in Monitoring Doc Key: SPARK-31516 URL: https://issues.apache.org/jira/browse/SPARK-31516 Project: Spark Issue Type: Improvement Components: Documentation, Spark Core Affects Versions: 3.0.0 Reporter: ZHANG Wei There is a duplicated `hiveClientCalls.count` metric in both `namespace=HiveExternalCatalog` and `namespace=CodeGenerator` bullet lists of [Spark Monitoring doc|https://spark.apache.org/docs/3.0.0-preview2/monitoring.html#component-instance--executor], but there is only one inside object HiveCatalogMetrics in [source code|https://github.com/apache/spark/blob/6550d0d5283efdbbd838f3aeaf0476c7f52a0fb1/core/src/main/scala/org/apache/spark/metrics/source/StaticSources.scala#L85]. {quote} * namespace=HiveExternalCatalog ** *note:*: these metrics are conditional to a configuration parameter: {{spark.metrics.staticSources.enabled}} (default is true) ** fileCacheHits.count ** filesDiscovered.count ** +{color:#ff}*hiveClientCalls.count*{color}+ ** parallelListingJobCount.count ** partitionsFetched.count * namespace=CodeGenerator ** *note:*: these metrics are conditional to a configuration parameter: {{spark.metrics.staticSources.enabled}} (default is true) ** compilationTime (histogram) ** generatedClassSize (histogram) ** generatedMethodSize (histogram) ** *{color:#ff}+hiveClientCalls.count+{color}* ** sourceCodeSize (histogram){quote} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31515) Canonicalize Cast should consider the value of needTimeZone
[ https://issues.apache.org/jira/browse/SPARK-31515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089372#comment-17089372 ] JinxinTang commented on SPARK-31515: Could you please provide a code snippet or a picture to illustrate this? > Canonicalize Cast should consider the value of needTimeZone > --- > > Key: SPARK-31515 > URL: https://issues.apache.org/jira/browse/SPARK-31515 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: If we don't need to use `timeZone` information casting > `from` type to `to` type, then the timeZoneId should not influence the > canonicalize result. >Reporter: Yuanjian Li >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
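Pending the requested code piece, here is a minimal sketch of the idea in the issue title (plain Python with invented names — `Cast`, `TIMEZONE_AWARE_CASTS` — not Spark's Catalyst code): canonicalization should drop `timeZoneId` whenever the cast does not actually need timezone information, so that two otherwise-identical casts canonicalize to the same expression.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical subset of (from, to) type pairs whose cast result depends on
# the session timezone; the real condition is Cast.needsTimeZone in Catalyst.
TIMEZONE_AWARE_CASTS = {("string", "timestamp"), ("timestamp", "string"), ("date", "timestamp")}

@dataclass(frozen=True)
class Cast:
    child: str
    from_type: str
    to_type: str
    timezone_id: Optional[str] = None

    def needs_timezone(self) -> bool:
        return (self.from_type, self.to_type) in TIMEZONE_AWARE_CASTS

    def canonicalized(self) -> "Cast":
        # Drop the timezone when the result cannot depend on it, so two
        # otherwise-identical casts compare equal after canonicalization.
        tz = self.timezone_id if self.needs_timezone() else None
        return Cast(self.child, self.from_type, self.to_type, tz)

# int -> long never consults the timezone, so these should canonicalize equal:
a = Cast("col", "int", "long", timezone_id="UTC")
b = Cast("col", "int", "long", timezone_id="Asia/Shanghai")
assert a.canonicalized() == b.canonicalized()
```

Without this rule, the stray `timeZoneId` makes semantically identical casts look different, defeating downstream comparisons such as plan caching or exchange reuse.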
[jira] [Commented] (SPARK-31400) The catalogString doesn't distinguish Vectors in ml and mllib
[ https://issues.apache.org/jira/browse/SPARK-31400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089317#comment-17089317 ] JinxinTang commented on SPARK-31400: Thanks for reporting this issue; a fix is in progress: [PR|https://github.com/apache/spark/pull/28291] > The catalogString doesn't distinguish Vectors in ml and mllib > - > > Key: SPARK-31400 > URL: https://issues.apache.org/jira/browse/SPARK-31400 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.4.5 > Environment: Ubuntu 16.04 >Reporter: Junpei Zhou >Priority: Major > > h2. Bug Description > The `catalogString` is not detailed enough to distinguish > pyspark.ml.linalg.Vectors and pyspark.mllib.linalg.Vectors. > h2. How to reproduce the bug > [Here|https://spark.apache.org/docs/latest/ml-features#minmaxscaler] is an > example from the official document (Python code). If I keep all other lines > untouched, and only modify the Vectors import line, which means: > {code:java} > # from pyspark.ml.linalg import Vectors > from pyspark.mllib.linalg import Vectors > {code} > Or you can directly execute the following code snippet: > {code:java} > from pyspark.ml.feature import MinMaxScaler > # from pyspark.ml.linalg import Vectors > from pyspark.mllib.linalg import Vectors > dataFrame = spark.createDataFrame([ > (0, Vectors.dense([1.0, 0.1, -1.0]),), > (1, Vectors.dense([2.0, 1.1, 1.0]),), > (2, Vectors.dense([3.0, 10.1, 3.0]),) > ], ["id", "features"]) > scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures") > scalerModel = scaler.fit(dataFrame) > {code} > It will raise an error: > {code:java} > IllegalArgumentException: 'requirement failed: Column features must be of > type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> > but was actually > struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.' > {code} > However, the actual struct and the desired struct are exactly the same > string, which cannot provide useful information to the programmer. 
I would > suggest making the catalogString distinguish pyspark.ml.linalg.Vectors and > pyspark.mllib.linalg.Vectors. > Thanks! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
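The core of the complaint above is that two distinct types can render the same catalog string, making the error message look self-contradictory. Here is a minimal illustration (plain Python; the class names `MLVectorUDT`/`MLlibVectorUDT` and the `require_type` helper are hypothetical stand-ins, not Spark's API) of the suggested fix: name the classes in the error, not just the identical string renderings.

```python
class MLVectorUDT:
    """Hypothetical stand-in for pyspark.ml.linalg.VectorUDT."""
    def catalog_string(self):
        return "struct<type:tinyint,size:int,indices:array<int>,values:array<double>>"

class MLlibVectorUDT:
    """Hypothetical stand-in for pyspark.mllib.linalg.VectorUDT."""
    def catalog_string(self):
        return "struct<type:tinyint,size:int,indices:array<int>,values:array<double>>"

def require_type(actual, expected):
    """Raise an error that names the classes, since the catalog strings collide."""
    if type(actual) is not type(expected):
        raise TypeError(
            f"expected {expected.catalog_string()} ({type(expected).__qualname__}) "
            f"but got {actual.catalog_string()} ({type(actual).__qualname__})"
        )

# The two renderings are identical, which is exactly the reported confusion:
assert MLVectorUDT().catalog_string() == MLlibVectorUDT().catalog_string()
```

With the class names included, the message from the reproduction above would read "expected ... (MLVectorUDT) but got ... (MLlibVectorUDT)" instead of comparing two identical strings.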