[jira] [Comment Edited] (SPARK-14516) Clustering evaluator

2023-03-27 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069815#comment-16069815
 ] 

Marco Gaido edited comment on SPARK-14516 at 3/27/23 9:42 AM:
--

Hello everybody,

I have a proposal for a very efficient Silhouette implementation in a 
distributed environment. Here you can find the link with all the details of the 
solution. As soon as I finish the implementation and the tests, I will 
post the PR for this: 
-[https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view|https://drive.google.com/file/d/0B0Hyo__bG_3fdkNvSVNYX2E3ZU0/view].-
 (please refer to https://arxiv.org/abs/2303.14102).

Please let me know if you have any comments or doubts about it.

Thanks.


was (Author: mgaido):
Hello everybody,

I have a proposal for a very efficient Silhouette implementation in a 
distributed environment. Here you can find the link with all the details of the 
solution. As soon as I finish the implementation and the tests, I will 
post the PR for this: 
https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view.

Please let me know if you have any comments or doubts about it.

Thanks.


> Clustering evaluator
> 
>
> Key: SPARK-14516
> URL: https://issues.apache.org/jira/browse/SPARK-14516
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Ruifeng Zheng
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.3.0
>
>
> MLlib does not have any general-purpose clustering metrics with a ground 
> truth.
> In 
> [Scikit-Learn](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics),
>  there are several kinds of metrics for this.
> It may be meaningful to add some clustering metrics into MLlib.
> This should be added as a {{ClusteringEvaluator}} class extending 
> {{Evaluator}} in spark.ml.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14516) Clustering evaluator

2023-03-27 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705265#comment-17705265
 ] 

Marco Gaido commented on SPARK-14516:
-

As there are some issues with the Google Doc sharing, I prepared a paper that 
explains the metric and its definition. Please disregard the previous link and 
refer to [https://arxiv.org/abs/2303.14102] if you are interested. It will also 
be easier for me to update it in the future if needed. Thanks.

> Clustering evaluator
> 
>
> Key: SPARK-14516
> URL: https://issues.apache.org/jira/browse/SPARK-14516
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Ruifeng Zheng
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.3.0
>
>
> MLlib does not have any general-purpose clustering metrics with a ground 
> truth.
> In 
> [Scikit-Learn](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics),
>  there are several kinds of metrics for this.
> It may be meaningful to add some clustering metrics into MLlib.
> This should be added as a {{ClusteringEvaluator}} class extending 
> {{Evaluator}} in spark.ml.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36673) Incorrect Unions of struct with mismatched field name case

2021-09-06 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17410488#comment-17410488
 ] 

Marco Gaido commented on SPARK-36673:
-

AFAIK, in SQL the field names inside a struct are case sensitive, while the names 
of normal columns are not. I am not sure about the right behavior here, but I 
would probably expect an error at analysis time. The current behavior is 
definitely not correct.
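
A minimal sketch of the check behind this comment (using the {{df1}}/{{df2}} from the reproduction quoted below; the config lookup is only illustrative):

{code}
// column resolution is case-insensitive by default for top-level names
spark.conf.get("spark.sql.caseSensitive")   // "false" unless changed

// the positional union from the report currently keeps both INNER and inner in the struct
df1.union(df2).printSchema()
{code}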

> Incorrect Unions of struct with mismatched field name case
> --
>
> Key: SPARK-36673
> URL: https://issues.apache.org/jira/browse/SPARK-36673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.2.0
>Reporter: Shardul Mahadik
>Priority: Major
>
> If a nested field has different casing on two sides of the union, the 
> resultant schema of the union will contain both fields in its schema.
> {code:java}
> scala> val df1 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS 
> INNER")))
> df1: org.apache.spark.sql.DataFrame = [id: bigint, nested: struct<INNER: bigint>]
> val df2 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS inner")))
> df2: org.apache.spark.sql.DataFrame = [id: bigint, nested: struct<inner: bigint>]
> scala> df1.union(df2).printSchema
> root
>  |-- id: long (nullable = false)
>  |-- nested: struct (nullable = false)
>  |    |-- INNER: long (nullable = false)
>  |    |-- inner: long (nullable = false)
>  {code}
> This seems like a bug. I would expect that Spark SQL would either just union 
> by index or, if the user has requested {{unionByName}}, then it should match 
> fields case-insensitively if {{spark.sql.caseSensitive}} is {{false}}.
> However the output data only has one nested column
> {code:java}
> scala> df1.union(df2).show()
> +---+------+
> | id|nested|
> +---+------+
> |  0|   {0}|
> |  1|   {5}|
> |  0|   {0}|
> |  1|   {5}|
> +---+------+
> {code}
> Trying to project fields of {{nested}} throws an error:
> {code:java}
> scala> df1.union(df2).select("nested.*").show()
> java.lang.ArrayIndexOutOfBoundsException: 1
>   at org.apache.spark.sql.types.StructType.apply(StructType.scala:414)
>   at 
> org.apache.spark.sql.catalyst.expressions.GetStructField.dataType(complexTypeExtractors.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:192)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Project.$anonfun$output$1(basicLogicalOperators.scala:63)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Project.output(basicLogicalOperators.scala:63)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Union.$anonfun$output$3(basicLogicalOperators.scala:260)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.immutable.List.map(List.scala:298)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Union.output(basicLogicalOperators.scala:260)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.outputSet$lzycompute(QueryPlan.scala:49)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.outputSet(QueryPlan.scala:49)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ColumnPruning$$anonfun$apply$8.applyOrElse(Optimizer.scala:747)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ColumnPruning$$anonfun$apply$8.applyOrElse(Optimizer.scala:695)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:316)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:316)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:171)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:169)
>   at 
> 

[jira] [Commented] (SPARK-29667) implicitly convert mismatched datatypes on right side of "IN" operator

2019-12-04 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987712#comment-16987712
 ] 

Marco Gaido commented on SPARK-29667:
-

I can't agree more with you [~hyukjin.kwon]. I think that having different 
coercion rules for the two types of IN is very confusing. It'd be great for 
such things to be consistent across the whole framework in order to avoid 
"surprises" for users, IMHO.

> implicitly convert mismatched datatypes on right side of "IN" operator
> --
>
> Key: SPARK-29667
> URL: https://issues.apache.org/jira/browse/SPARK-29667
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Jessie Lin
>Priority: Minor
>
> Ran into an error on this SQL.
> Mismatched columns:
> {code}
> [(a.`id`:decimal(28,0), db1.table1.`id`:decimal(18,0))] 
> {code}
> The SQL {{AND}} clause:
> {code}
>   AND   a.id in (select id from db1.table1 where col1 = 1 group by id)
> {code}
> Once I cast {{decimal(18,0)}} to {{decimal(28,0)}} explicitly above, the sql 
> ran just fine. Can the sql engine cast implicitly in this case?
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29123) DecimalType multiplication precision loss

2019-09-19 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933560#comment-16933560
 ] 

Marco Gaido commented on SPARK-29123:
-

[~benny] the point here is: Spark can represent decimals with max precision 38. 
When defining the result type, multiple options are possible.

We decided to follow the SQLServer implementation by default (you can check their 
docs too). There isn't much more documentation because there isn't much more to 
say. Since the precision is limited, there are 2 options:

 - in case you do not allow precision loss, an overflow can happen. In that 
case Spark returns NULL.
 - in case you allow precision loss, precision loss is preferred over overflow. 
This is the behavior SQLServer has and it is ANSI compliant.

You can see the PR for SPARK-22036 for more details.
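
A minimal sketch of toggling between the two behaviors from the Scala shell (the config name is the one discussed in this thread; exact result types may vary by Spark version):

{code}
// default: precision loss is allowed, so decimal(38,10) * decimal(38,10) -> decimal(38,6)
spark.conf.set("spark.sql.decimalOperations.allowPrecisionLoss", "true")
spark.sql("SELECT CAST(233 AS DECIMAL(38,10)) * CAST(1.1403218880 AS DECIMAL(38,10)) AS p").printSchema()

// precision loss disallowed: the scale is preserved, and results that cannot fit return NULL
spark.conf.set("spark.sql.decimalOperations.allowPrecisionLoss", "false")
spark.sql("SELECT CAST(233 AS DECIMAL(38,10)) * CAST(1.1403218880 AS DECIMAL(38,10)) AS p").printSchema()
{code}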


> DecimalType multiplication precision loss 
> --
>
> Key: SPARK-29123
> URL: https://issues.apache.org/jira/browse/SPARK-29123
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Benny Lu
>Priority: Major
>
> When doing multiplication with PySpark, it seems PySpark is losing precision.
> For example, when multiplying two decimals with precision 38,10, it returns 
> 38,6 instead of 38,10. It also truncates the result to three decimals, which 
> is an incorrect result. 
> {code:java}
> from decimal import Decimal
> from pyspark.sql.types import DecimalType, StructType, StructField
> schema = StructType([StructField("amount", DecimalType(38,10)), 
> StructField("fx", DecimalType(38,10))])
> df = spark.createDataFrame([(Decimal(233.00), Decimal(1.1403218880))], 
> schema=schema)
> df.printSchema()
> df = df.withColumn("amount_usd", df.amount * df.fx)
> df.printSchema()
> df.show()
> {code}
> Result
> {code:java}
> >>> df.printSchema()
> root
>  |-- amount: decimal(38,10) (nullable = true)
>  |-- fx: decimal(38,10) (nullable = true)
>  |-- amount_usd: decimal(38,6) (nullable = true)
> >>> df = df.withColumn("amount_usd", df.amount * df.fx)
> >>> df.printSchema()
> root
>  |-- amount: decimal(38,10) (nullable = true)
>  |-- fx: decimal(38,10) (nullable = true)
>  |-- amount_usd: decimal(38,6) (nullable = true)
> >>> df.show()
> +------+------------+----------+
> |amount|          fx|amount_usd|
> +------+------------+----------+
> |233.00|1.1403218880|265.695000|
> +------+------------+----------+
> {code}
> When rounding to two decimals, it returns 265.70 but the correct result 
> should be 265.69499 and when rounded to two decimals, it should be 265.69.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29123) DecimalType multiplication precision loss

2019-09-19 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933128#comment-16933128
 ] 

Marco Gaido commented on SPARK-29123:
-

You can set {{spark.sql.decimalOperations.allowPrecisionLoss}} to {{false}} if 
you do not want to risk truncation in your operations. Otherwise, properly 
tuning the precision and scale of your input schema helps too.

> DecimalType multiplication precision loss 
> --
>
> Key: SPARK-29123
> URL: https://issues.apache.org/jira/browse/SPARK-29123
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Benny Lu
>Priority: Major
>
> When doing multiplication with PySpark, it seems PySpark is losing precision.
> For example, when multiplying two decimals with precision 38,10, it returns 
> 38,6 instead of 38,10. It also truncates the result to three decimals, which 
> is an incorrect result. 
> {code:java}
> from decimal import Decimal
> from pyspark.sql.types import DecimalType, StructType, StructField
> schema = StructType([StructField("amount", DecimalType(38,10)), 
> StructField("fx", DecimalType(38,10))])
> df = spark.createDataFrame([(Decimal(233.00), Decimal(1.1403218880))], 
> schema=schema)
> df.printSchema()
> df = df.withColumn("amount_usd", df.amount * df.fx)
> df.printSchema()
> df.show()
> {code}
> Result
> {code:java}
> >>> df.printSchema()
> root
>  |-- amount: decimal(38,10) (nullable = true)
>  |-- fx: decimal(38,10) (nullable = true)
>  |-- amount_usd: decimal(38,6) (nullable = true)
> >>> df = df.withColumn("amount_usd", df.amount * df.fx)
> >>> df.printSchema()
> root
>  |-- amount: decimal(38,10) (nullable = true)
>  |-- fx: decimal(38,10) (nullable = true)
>  |-- amount_usd: decimal(38,6) (nullable = true)
> >>> df.show()
> +------+------------+----------+
> |amount|          fx|amount_usd|
> +------+------------+----------+
> |233.00|1.1403218880|265.695000|
> +------+------------+----------+
> {code}
> When rounding to two decimals, it returns 265.70 but the correct result 
> should be 265.69499 and when rounded to two decimals, it should be 265.69.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View

2019-09-10 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16926650#comment-16926650
 ] 

Marco Gaido commented on SPARK-29038:
-

[~cltlfcjin] currently Spark has something similar, which is query caching, 
where the user can also select the level of caching performed. My 
understanding is that your proposal is to do something very similar, just with 
a different syntax, more DB oriented. Is my understanding correct?
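
A minimal sketch of the existing query caching being referred to, with a user-selected storage level (the table name is illustrative):

{code}
import org.apache.spark.storage.StorageLevel

val t = spark.table("some_table")          // any table or query result
t.persist(StorageLevel.MEMORY_AND_DISK)    // user-chosen caching level
t.count()                                  // materializes the cache
{code}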

> SPIP: Support Spark Materialized View
> -
>
> Key: SPARK-29038
> URL: https://issues.apache.org/jira/browse/SPARK-29038
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> Materialized view is an important approach in DBMS to cache data to 
> accelerate queries. By creating a materialized view through SQL, the data 
> that can be cached is very flexible, and needs to be configured arbitrarily 
> according to specific usage scenarios. The Materialization Manager 
> automatically updates the cache data according to changes in detail source 
> tables, simplifying user work. When a user submits a query, the Spark optimizer 
> rewrites the execution plan based on the available materialized view to 
> determine the optimal execution plan.
> Details in [design 
> doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29038) SPIP: Support Spark Materialized View

2019-09-10 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16926650#comment-16926650
 ] 

Marco Gaido edited comment on SPARK-29038 at 9/10/19 1:40 PM:
--

[~cltlfcjin] currently Spark has something similar, which is query caching, 
where the user can also select the level of caching performed. My understanding 
is that your proposal is to do something very similar, just with a different 
syntax, more DB oriented. Is my understanding correct?


was (Author: mgaido):
[~cltlfcjin] currently Spark has something similar, which is query caching, 
where the user can also select the level of caching performed. My 
understanding is that your proposal is to do something very similar, just with 
a different syntax, more DB oriented. Is my understanding correct?

> SPIP: Support Spark Materialized View
> -
>
> Key: SPARK-29038
> URL: https://issues.apache.org/jira/browse/SPARK-29038
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> Materialized view is an important approach in DBMS to cache data to 
> accelerate queries. By creating a materialized view through SQL, the data 
> that can be cached is very flexible, and needs to be configured arbitrarily 
> according to specific usage scenarios. The Materialization Manager 
> automatically updates the cache data according to changes in detail source 
> tables, simplifying user work. When a user submits a query, the Spark optimizer 
> rewrites the execution plan based on the available materialized view to 
> determine the optimal execution plan.
> Details in [design 
> doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29009) Returning pojo from udf not working

2019-09-07 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924774#comment-16924774
 ] 

Marco Gaido commented on SPARK-29009:
-

Why do you think this is a bug? If you want a struct to be returned by your 
UDF, you should return a {{Row}}.
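
A minimal sketch of the suggestion above, shown in Scala (the schema fields mirror the POJO in the report; names and order are illustrative):

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types._

val segmentSchema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("statusDateTime", DateType, nullable = true),
  StructField("healthPointRatio", IntegerType, nullable = false)))

// return a Row matching the declared schema instead of the POJO itself,
// so Catalyst knows how to convert the result
val parseSegment = udf(
  (s: String, s2: String) => Row(1, java.sql.Date.valueOf(java.time.LocalDate.now()), 2),
  segmentSchema)
{code}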

> Returning pojo from udf not working
> ---
>
> Key: SPARK-29009
> URL: https://issues.apache.org/jira/browse/SPARK-29009
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Tomasz Belina
>Priority: Major
>
>  It looks like Spark is unable to construct a row from a POJO returned from a UDF.
> Given the POJO:
> {code:java}
> public class SegmentStub {
> private int id;
> private Date statusDateTime;
> private int healthPointRatio;
> }
> {code}
> Registration of the UDF:
> {code:java}
> public class ParseResultsUdf {
> public String registerUdf(SparkSession sparkSession) {
> Encoder<SegmentStub> encoder = Encoders.bean(SegmentStub.class);
> final StructType schema = encoder.schema();
> sparkSession.udf().register(UDF_NAME,
> (UDF2<String, String, SegmentStub>) (s, s2) -> new 
> SegmentStub(1, Date.valueOf(LocalDate.now()), 2),
> schema
> );
> return UDF_NAME;
> }
> }
> {code}
> Test code:
> {code:java}
> List<String[]> strings = Arrays.asList(new String[]{"one", "two"}, new 
> String[]{"3", "4"});
> JavaRDD<Row> rowJavaRDD = 
> sparkContext.parallelize(strings).map(RowFactory::create);
> StructType schema = DataTypes
> .createStructType(new StructField[] { 
> DataTypes.createStructField("foe1", DataTypes.StringType, false),
> DataTypes.createStructField("foe2", 
> DataTypes.StringType, false) });
> Dataset<Row> dataFrame = 
> sparkSession.sqlContext().createDataFrame(rowJavaRDD, schema);
> Seq<Column> columnSeq = new Set.Set2<>(col("foe1"), 
> col("foe2")).toSeq();
> dataFrame.select(callUDF(udfName, columnSeq)).show();
> {code}
>  throws exception: 
> {code:java}
> Caused by: java.lang.IllegalArgumentException: The value (SegmentStub(id=1, 
> statusDateTime=2019-09-06, healthPointRatio=2)) of the type (udf.SegmentStub) 
> cannot be converted to struct
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:262)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:238)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:396)
>   ... 21 more
> }
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28610) Support larger buffer for sum of long

2019-09-03 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921471#comment-16921471
 ] 

Marco Gaido commented on SPARK-28610:
-

Hi [~Gengliang.Wang]. That's a different thing: you are doing 3 {{Add}}s and 
then a sum of 1 number. To reproduce this, you should create a table with 3 
rows with those values and sum them. Thanks.

> Support larger buffer for sum of long
> -
>
> Key: SPARK-28610
> URL: https://issues.apache.org/jira/browse/SPARK-28610
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Marco Gaido
>Priority: Major
>
> The sum of a long field currently uses a buffer of type long.
> When the flag for throwing exceptions on overflow for arithmetic operations 
> is turned on, this is a problem in case there are intermediate overflows 
> which are then resolved by other rows. Indeed, in such a case, we are 
> throwing an exception, while the result is representable in a long value. An 
> example of this issue can be seen running:
> {code}
> val df = sc.parallelize(Seq(100L, Long.MaxValue, -1000L)).toDF("a")
> df.select(sum($"a")).show()
> {code}
> According to [~cloud_fan]'s suggestion in 
> https://github.com/apache/spark/pull/21599, we should introduce a flag in 
> order to let users choose among a wider datatype for the sum buffer using a 
> config, so that the above issue can be fixed.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28939) SQL configuration are not always propagated

2019-09-01 Thread Marco Gaido (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-28939:

Description: 
The SQL configurations are propagated to executors in order to be effective.
Unfortunately, in some cases, we fail to propagate them, making them 
ineffective.

The problem happens every time {{rdd}} or {{queryExecution.toRdd}} is used, 
and this is pretty frequent in the codebase.

Please notice that there are 2 parts to this issue:
 - when a user directly uses those APIs
 - when Spark invokes them (e.g. throughout the ML lib and other usages, or the 
{{describe}} method on the {{Dataset}} class)



  was:
The SQL configurations are propagated to executors in order to be effective.
Unfortunately, in some cases, we fail to propagate them, making them 
ineffective.

For an example, please see the {{describe}} method on the {{Dataset}} class.


> SQL configuration are not always propagated
> ---
>
> Key: SPARK-28939
> URL: https://issues.apache.org/jira/browse/SPARK-28939
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Marco Gaido
>Priority: Major
>
> The SQL configurations are propagated to executors in order to be effective.
> Unfortunately, in some cases, we fail to propagate them, making them 
> ineffective.
> The problem happens every time {{rdd}} or {{queryExecution.toRdd}} is used, 
> and this is pretty frequent in the codebase.
> Please notice that there are 2 parts to this issue:
>  - when a user directly uses those APIs
>  - when Spark invokes them (e.g. throughout the ML lib and other usages, or the 
> {{describe}} method on the {{Dataset}} class)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28939) SQL configuration are not always propagated

2019-09-01 Thread Marco Gaido (Jira)
Marco Gaido created SPARK-28939:
---

 Summary: SQL configuration are not always propagated
 Key: SPARK-28939
 URL: https://issues.apache.org/jira/browse/SPARK-28939
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.4
Reporter: Marco Gaido


The SQL configurations are propagated to executors in order to be effective.
Unfortunately, in some cases, we fail to propagate them, making them 
ineffective.

For an example, please see the {{describe}} method on the {{Dataset}} class.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28916) Generated SpecificSafeProjection.apply method grows beyond 64 KB when use SparkSQL

2019-08-31 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920153#comment-16920153
 ] 

Marco Gaido commented on SPARK-28916:
-

I think the problem is related to subexpression elimination. I've not been able 
to confirm it since, for some reason, I am not able to disable it: even though I 
set the config to false, it is performed anyway. Maybe I am missing something 
there. Anyway, you may try and set 
{{spark.sql.subexpressionElimination.enabled}} to {{false}}. Meanwhile I am 
working on a fix. Thanks.
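
A minimal sketch of the workaround mentioned above (a plain config toggle; whether it actually avoids the 64 KB limit in this case is unconfirmed):

{code}
// disable subexpression elimination for the current session
spark.conf.set("spark.sql.subexpressionElimination.enabled", "false")
// or, equivalently, via SQL
spark.sql("SET spark.sql.subexpressionElimination.enabled=false")
{code}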

> Generated SpecificSafeProjection.apply method grows beyond 64 KB when use  
> SparkSQL
> ---
>
> Key: SPARK-28916
> URL: https://issues.apache.org/jira/browse/SPARK-28916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.3
>Reporter: MOBIN
>Priority: Major
>
> Can be reproduced by the following steps:
> 1. Create a table with 5000 fields
> 2. val data=spark.sql("select * from spark64kb limit 10");
> 3. data.describe()
> Then, the following error occurred:
> {code:java}
> WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 (TID 0, localhost, 
> executor 1): org.codehaus.janino.InternalCompilerException: failed to 
> compile: org.codehaus.janino.InternalCompilerException: Compiling 
> "GeneratedClass": Code of method 
> "apply(Ljava/lang/Object;)Ljava/lang/Object;" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection"
>  grows beyond 64 KB
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1298)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1376)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1373)
>   at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
>   at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
>   at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
>   at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1238)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:143)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.generate(GenerateMutableProjection.scala:44)
>   at 
> org.apache.spark.sql.execution.SparkPlan.newMutableProjection(SparkPlan.scala:385)
>   at 
> org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3$$anonfun$4.apply(SortAggregateExec.scala:96)
>   at 
> org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3$$anonfun$4.apply(SortAggregateExec.scala:95)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator.generateProcessRow(AggregationIterator.scala:180)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator.(AggregationIterator.scala:199)
>   at 
> org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.(SortBasedAggregationIterator.scala:40)
>   at 
> org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:86)
>   at 
> org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:77)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:823)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:823)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:121)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> 

[jira] [Commented] (SPARK-28916) Generated SpecificSafeProjection.apply method grows beyond 64 KB when use SparkSQL

2019-08-31 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920074#comment-16920074
 ] 

Marco Gaido commented on SPARK-28916:
-

Thanks for reporting this. I am checking it.

> Generated SpecificSafeProjection.apply method grows beyond 64 KB when use  
> SparkSQL
> ---
>
> Key: SPARK-28916
> URL: https://issues.apache.org/jira/browse/SPARK-28916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.3
>Reporter: MOBIN
>Priority: Major
>
> Can be reproduced by the following steps:
> 1. Create a table with 5000 fields
> 2. val data=spark.sql("select * from spark64kb limit 10");
> 3. data.describe()
> Then, the following error occurred:
> {code:java}
> WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 (TID 0, localhost, 
> executor 1): org.codehaus.janino.InternalCompilerException: failed to 
> compile: org.codehaus.janino.InternalCompilerException: Compiling 
> "GeneratedClass": Code of method 
> "apply(Ljava/lang/Object;)Ljava/lang/Object;" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection"
>  grows beyond 64 KB
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1298)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1376)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1373)
>   at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
>   at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
>   at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
>   at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1238)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:143)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.generate(GenerateMutableProjection.scala:44)
>   at 
> org.apache.spark.sql.execution.SparkPlan.newMutableProjection(SparkPlan.scala:385)
>   at 
> org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3$$anonfun$4.apply(SortAggregateExec.scala:96)
>   at 
> org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3$$anonfun$4.apply(SortAggregateExec.scala:95)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator.generateProcessRow(AggregationIterator.scala:180)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator.(AggregationIterator.scala:199)
>   at 
> org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.(SortBasedAggregationIterator.scala:40)
>   at 
> org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:86)
>   at 
> org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:77)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:823)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:823)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:121)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.codehaus.janino.InternalCompilerException: Compiling 
> "GeneratedClass": Code of method 
> "apply(Ljava/lang/Object;)Ljava/lang/Object;" of class 
> 

[jira] [Commented] (SPARK-28934) Add `spark.sql.compatiblity.mode`

2019-08-31 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920072#comment-16920072
 ] 

Marco Gaido commented on SPARK-28934:
-

Hi [~smilegator]! Thanks for opening this. I am wondering whether it may be 
worth reopening SPARK-28610 and enabling the option for the pgSQL compatibility 
mode. [~cloud_fan] what do you think?

> Add `spark.sql.compatiblity.mode`
> -
>
> Key: SPARK-28934
> URL: https://issues.apache.org/jira/browse/SPARK-28934
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> This issue aims to add `spark.sql.compatiblity.mode` whose values are `spark` 
> or `pgSQL` case-insensitively to control PostgreSQL compatibility features.
>  
> Apache Spark 3.0.0 can start with `spark.sql.parser.ansi.enabled=false` and 
> `spark.sql.compatiblity.mode=spark`.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28610) Support larger buffer for sum of long

2019-08-28 Thread Marco Gaido (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido resolved SPARK-28610.
-
Resolution: Won't Fix

Since the perf regression introduced by the change would be very high, this 
won't be fixed. Thanks.

> Support larger buffer for sum of long
> -
>
> Key: SPARK-28610
> URL: https://issues.apache.org/jira/browse/SPARK-28610
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Marco Gaido
>Priority: Major
>
> The sum of a long field currently uses a buffer of type long.
> When the flag for throwing exceptions on overflow for arithmetic operations 
> is turned on, this is a problem in case there are intermediate overflows 
> which are then resolved by other rows. Indeed, in such a case, we are 
> throwing an exception, while the result is representable in a long value. An 
> example of this issue can be seen running:
> {code}
> val df = sc.parallelize(Seq(100L, Long.MaxValue, -1000L)).toDF("a")
> df.select(sum($"a")).show()
> {code}
> According to [~cloud_fan]'s suggestion in 
> https://github.com/apache/spark/pull/21599, we should introduce a flag in 
> order to let users choose among a wider datatype for the sum buffer using a 
> config, so that the above issue can be fixed.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28611) Histogram's height is different

2019-08-04 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899585#comment-16899585
 ] 

Marco Gaido commented on SPARK-28611:
-

Mmmh... that's weird! How can you get a different result than what was obtained 
on the CI? How are you running the code?

> Histogram's height is different
> --
>
> Key: SPARK-28611
> URL: https://issues.apache.org/jira/browse/SPARK-28611
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Yuming Wang
>Priority: Major
>
> {code:sql}
> CREATE TABLE desc_col_table (key int COMMENT 'column_comment') USING PARQUET;
> -- Test output for histogram statistics
> SET spark.sql.statistics.histogram.enabled=true;
> SET spark.sql.statistics.histogram.numBins=2;
> INSERT INTO desc_col_table values 1, 2, 3, 4;
> ANALYZE TABLE desc_col_table COMPUTE STATISTICS FOR COLUMNS key;
> DESC EXTENDED desc_col_table key;
> {code}
> {noformat}
> spark-sql> DESC EXTENDED desc_col_table key;
> col_name  key
> data_type int
> comment   column_comment
> min   1
> max   4
> num_nulls 0
> distinct_count4
> avg_col_len   4
> max_col_len   4
> histogram height: 4.0, num_of_bins: 2
> bin_0 lower_bound: 1.0, upper_bound: 2.0, distinct_count: 2
> bin_1 lower_bound: 2.0, upper_bound: 4.0, distinct_count: 2
> {noformat}
> But our result is:
> https://github.com/apache/spark/blob/v2.4.3/sql/core/src/test/resources/sql-tests/results/describe-table-column.sql.out#L231-L242



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28610) Support larger buffer for sum of long

2019-08-03 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-28610:
---

 Summary: Support larger buffer for sum of long
 Key: SPARK-28610
 URL: https://issues.apache.org/jira/browse/SPARK-28610
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Marco Gaido


The sum of a long field currently uses a buffer of type long.

When the flag for throwing exceptions on overflow for arithmetic operations is 
turned on, this is a problem in case there are intermediate overflows which are 
then resolved by other rows. Indeed, in such a case, we are throwing an 
exception, while the result is representable in a long value. An example of 
this issue can be seen running:

{code}
val df = sc.parallelize(Seq(100L, Long.MaxValue, -1000L)).toDF("a")
df.select(sum($"a")).show()
{code}

According to [~cloud_fan]'s suggestion in 
https://github.com/apache/spark/pull/21599, we should introduce a flag in order 
to let users choose among a wider datatype for the sum buffer using a config, 
so that the above issue can be fixed.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28512) New optional mode: throw runtime exceptions on casting failures

2019-07-25 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892692#comment-16892692
 ] 

Marco Gaido commented on SPARK-28512:
-

Thanks for pinging me [~maropu]. It is not the same issue, I think, because in 
SPARK-28470 we are dealing only with cases where there is an overflow. Here 
there is no overflow: the value is simply not valid for the target type.
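
A minimal sketch of the distinction drawn above (default, non-ANSI behavior; the outputs are not reproduced here):

{code}
// not a valid int at all: today this silently becomes NULL (the subject of this ticket)
spark.sql("SELECT CAST('abc' AS INT)").show()

// a valid number that simply does not fit the target type: an overflow, i.e. SPARK-28470 territory
spark.sql("SELECT CAST(12345678901 AS INT)").show()
{code}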

> New optional mode: throw runtime exceptions on casting failures
> ---
>
> Key: SPARK-28512
> URL: https://issues.apache.org/jira/browse/SPARK-28512
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> In popular DBMS like MySQL/PostgreSQL/Oracle, runtime exceptions are thrown 
> on casting, e.g. cast('abc' as int) 
> While in Spark, the result is converted as null silently. It is by design 
> since we don't want a long-running job aborted by some casting failure. But 
> there are scenarios that users want to make sure all the data conversion are 
> correct, like the way they use MySQL/PostgreSQL/Oracle.
> If the changes touch too much code, we can limit the new optional mode to 
> table insertion first. By default the new behavior is disabled.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28470) Honor spark.sql.decimalOperations.nullOnOverflow in Cast

2019-07-22 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16889979#comment-16889979
 ] 

Marco Gaido commented on SPARK-28470:
-

Thanks for checking this Wenchen! I will work on this ASAP. Thanks.

> Honor spark.sql.decimalOperations.nullOnOverflow in Cast
> 
>
> Key: SPARK-28470
> URL: https://issues.apache.org/jira/browse/SPARK-28470
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> cast long to decimal or decimal to decimal can overflow, we should respect 
> the new config if overflow happens.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28225) Unexpected behavior for Window functions

2019-07-20 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16889518#comment-16889518
 ] 

Marco Gaido edited comment on SPARK-28225 at 7/20/19 2:43 PM:
--

Let me cite the PostgreSQL documentation to explain the behavior:

??When an aggregate function is used as a window function, it aggregates over 
the rows within the current row's window frame. An aggregate used with ORDER BY 
and the default window frame definition produces a "running sum" type of 
behavior, which may or may not be what's wanted. To obtain aggregation over the 
whole partition, omit ORDER BY or use ROWS BETWEEN UNBOUNDED PRECEDING AND 
UNBOUNDED FOLLOWING. Other frame specifications can be used to obtain other 
effects.??

So the returned values seem correct to me.


was (Author: mgaido):
Let me cite the PostgreSQL documentation to explain the behavior:

{noformat}
When an aggregate function is used as a window function, it aggregates over the 
rows within the current row's window frame. An aggregate used with ORDER BY and 
the default window frame definition produces a "running sum" type of behavior, 
which may or may not be what's wanted. To obtain aggregation over the whole 
partition, omit ORDER BY or use ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED 
FOLLOWING. Other frame specifications can be used to obtain other effects.
{noformat}

So the returned values seem correct to me.


> Unexpected behavior for Window functions
> 
>
> Key: SPARK-28225
> URL: https://issues.apache.org/jira/browse/SPARK-28225
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Andrew Leverentz
>Priority: Major
>
> I've noticed some odd behavior when combining the "first" aggregate function 
> with an ordered Window.
> In particular, I'm working with columns created using the syntax
> {code}
> first($"y", ignoreNulls = true).over(Window.orderBy($"x"))
> {code}
> Below, I'm including some code which reproduces this issue in a Databricks 
> notebook.
> *Code:*
> {code:java}
> import org.apache.spark.sql.functions.first
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{StructType,StructField,IntegerType}
> val schema = StructType(Seq(
>   StructField("x", IntegerType, false),
>   StructField("y", IntegerType, true),
>   StructField("z", IntegerType, true)
> ))
> val input =
>   spark.createDataFrame(sc.parallelize(Seq(
> Row(101, null, 11),
> Row(102, null, 12),
> Row(103, null, 13),
> Row(203, 24, null),
> Row(201, 26, null),
> Row(202, 25, null)
>   )), schema = schema)
> input.show
> val output = input
>   .withColumn("u1", first($"y", ignoreNulls = 
> true).over(Window.orderBy($"x".asc_nulls_last)))
>   .withColumn("u2", first($"y", ignoreNulls = 
> true).over(Window.orderBy($"x".asc)))
>   .withColumn("u3", first($"y", ignoreNulls = 
> true).over(Window.orderBy($"x".desc_nulls_last)))
>   .withColumn("u4", first($"y", ignoreNulls = 
> true).over(Window.orderBy($"x".desc)))
>   .withColumn("u5", first($"z", ignoreNulls = 
> true).over(Window.orderBy($"x".asc_nulls_last)))
>   .withColumn("u6", first($"z", ignoreNulls = 
> true).over(Window.orderBy($"x".asc)))
>   .withColumn("u7", first($"z", ignoreNulls = 
> true).over(Window.orderBy($"x".desc_nulls_last)))
>   .withColumn("u8", first($"z", ignoreNulls = 
> true).over(Window.orderBy($"x".desc)))
> output.show
> {code}
> *Expectation:*
> Based on my understanding of how ordered-Window and aggregate functions work, 
> the results I expected to see were:
>  * u1 = u2 = constant value of 26
>  * u3 = u4 = constant value of 24
>  * u5 = u6 = constant value of 11
>  * u7 = u8 = constant value of 13
> However, columns u1, u2, u7, and u8 contain some unexpected nulls. 
> *Results:*
> {code:java}
> +---+----+----+----+----+---+---+---+---+----+----+
> |  x|   y|   z|  u1|  u2| u3| u4| u5| u6|  u7|  u8|
> +---+----+----+----+----+---+---+---+---+----+----+
> |203|  24|null|  26|  26| 24| 24| 11| 11|null|null|
> |202|  25|null|  26|  26| 24| 24| 11| 11|null|null|
> |201|  26|null|  26|  26| 24| 24| 11| 11|null|null|
> |103|null|  13|null|null| 24| 24| 11| 11|  13|  13|
> |102|null|  12|null|null| 24| 24| 11| 11|  13|  13|
> |101|null|  11|null|null| 24| 24| 11| 11|  13|  13|
> +---+----+----+----+----+---+---+---+---+----+----+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28225) Unexpected behavior for Window functions

2019-07-20 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16889518#comment-16889518
 ] 

Marco Gaido commented on SPARK-28225:
-

Let me cite the PostgreSQL documentation to explain the behavior:

{noformat}
When an aggregate function is used as a window function, it aggregates over the 
rows within the current row's window frame. An aggregate used with ORDER BY and 
the default window frame definition produces a "running sum" type of behavior, 
which may or may not be what's wanted. To obtain aggregation over the whole 
partition, omit ORDER BY or use ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED 
FOLLOWING. Other frame specifications can be used to obtain other effects.
{noformat}

So the returned values seem correct to me.
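
A minimal sketch of the frame specification mentioned in the quote, using the {{input}} DataFrame from the reproduction below (assumes {{spark.implicits._}} is imported):

{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.first

// pin the frame to the whole partition despite the ORDER BY
val wholeFrame = Window
  .orderBy($"x".asc_nulls_last)
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

val u1Fixed = input.withColumn("u1", first($"y", ignoreNulls = true).over(wholeFrame))
{code}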


> Unexpected behavior for Window functions
> 
>
> Key: SPARK-28225
> URL: https://issues.apache.org/jira/browse/SPARK-28225
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Andrew Leverentz
>Priority: Major
>
> I've noticed some odd behavior when combining the "first" aggregate function 
> with an ordered Window.
> In particular, I'm working with columns created using the syntax
> {code}
> first($"y", ignoreNulls = true).over(Window.orderBy($"x"))
> {code}
> Below, I'm including some code which reproduces this issue in a Databricks 
> notebook.
> *Code:*
> {code:java}
> import org.apache.spark.sql.functions.first
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{StructType,StructField,IntegerType}
> val schema = StructType(Seq(
>   StructField("x", IntegerType, false),
>   StructField("y", IntegerType, true),
>   StructField("z", IntegerType, true)
> ))
> val input =
>   spark.createDataFrame(sc.parallelize(Seq(
> Row(101, null, 11),
> Row(102, null, 12),
> Row(103, null, 13),
> Row(203, 24, null),
> Row(201, 26, null),
> Row(202, 25, null)
>   )), schema = schema)
> input.show
> val output = input
>   .withColumn("u1", first($"y", ignoreNulls = 
> true).over(Window.orderBy($"x".asc_nulls_last)))
>   .withColumn("u2", first($"y", ignoreNulls = 
> true).over(Window.orderBy($"x".asc)))
>   .withColumn("u3", first($"y", ignoreNulls = 
> true).over(Window.orderBy($"x".desc_nulls_last)))
>   .withColumn("u4", first($"y", ignoreNulls = 
> true).over(Window.orderBy($"x".desc)))
>   .withColumn("u5", first($"z", ignoreNulls = 
> true).over(Window.orderBy($"x".asc_nulls_last)))
>   .withColumn("u6", first($"z", ignoreNulls = 
> true).over(Window.orderBy($"x".asc)))
>   .withColumn("u7", first($"z", ignoreNulls = 
> true).over(Window.orderBy($"x".desc_nulls_last)))
>   .withColumn("u8", first($"z", ignoreNulls = 
> true).over(Window.orderBy($"x".desc)))
> output.show
> {code}
> *Expectation:*
> Based on my understanding of how ordered-Window and aggregate functions work, 
> the results I expected to see were:
>  * u1 = u2 = constant value of 26
>  * u3 = u4 = constant value of 24
>  * u5 = u6 = constant value of 11
>  * u7 = u8 = constant value of 13
> However, columns u1, u2, u7, and u8 contain some unexpected nulls. 
> *Results:*
> {code:java}
> +---+----+----+----+----+---+---+---+---+----+----+
> |  x|   y|   z|  u1|  u2| u3| u4| u5| u6|  u7|  u8|
> +---+----+----+----+----+---+---+---+---+----+----+
> |203|  24|null|  26|  26| 24| 24| 11| 11|null|null|
> |202|  25|null|  26|  26| 24| 24| 11| 11|null|null|
> |201|  26|null|  26|  26| 24| 24| 11| 11|null|null|
> |103|null|  13|null|null| 24| 24| 11| 11|  13|  13|
> |102|null|  12|null|null| 24| 24| 11| 11|  13|  13|
> |101|null|  11|null|null| 24| 24| 11| 11|  13|  13|
> +---+----+----+----+----+---+---+---+---+----+----+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28386) Cannot resolve ORDER BY columns with GROUP BY and HAVING

2019-07-20 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16889422#comment-16889422
 ] 

Marco Gaido commented on SPARK-28386:
-

I think this is a duplicate of SPARK-26741. I have a PR for it, but it seems a 
bit stuck. Any help in reviewing it would be greatly appreciated. Thanks.

> Cannot resolve ORDER BY columns with GROUP BY and HAVING
> 
>
> Key: SPARK-28386
> URL: https://issues.apache.org/jira/browse/SPARK-28386
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce:
> {code:sql}
> CREATE TABLE test_having (a int, b int, c string, d string) USING parquet;
> INSERT INTO test_having VALUES (0, 1, '', 'A');
> INSERT INTO test_having VALUES (1, 2, '', 'b');
> INSERT INTO test_having VALUES (2, 2, '', 'c');
> INSERT INTO test_having VALUES (3, 3, '', 'D');
> INSERT INTO test_having VALUES (4, 3, '', 'e');
> INSERT INTO test_having VALUES (5, 3, '', 'F');
> INSERT INTO test_having VALUES (6, 4, '', 'g');
> INSERT INTO test_having VALUES (7, 4, '', 'h');
> INSERT INTO test_having VALUES (8, 4, '', 'I');
> INSERT INTO test_having VALUES (9, 4, '', 'j');
> SELECT lower(c), count(c) FROM test_having
>   GROUP BY lower(c) HAVING count(*) > 2
>   ORDER BY lower(c);
> {code}
> {noformat}
> spark-sql> SELECT lower(c), count(c) FROM test_having
>  > GROUP BY lower(c) HAVING count(*) > 2
>  > ORDER BY lower(c);
> Error in query: cannot resolve '`c`' given input columns: [lower(c), 
> count(c)]; line 3 pos 19;
> 'Sort ['lower('c) ASC NULLS FIRST], true
> +- Project [lower(c)#158, count(c)#159L]
>+- Filter (count(1)#161L > cast(2 as bigint))
>   +- Aggregate [lower(c#7)], [lower(c#7) AS lower(c)#158, count(c#7) AS 
> count(c)#159L, count(1) AS count(1)#161L]
>  +- SubqueryAlias test_having
> +- Relation[a#5,b#6,c#7,d#8] parquet
> {noformat}
> But it works when setting an alias:
> {noformat}
> spark-sql> SELECT lower(c) withAias, count(c) FROM test_having
>  > GROUP BY lower(c) HAVING count(*) > 2
>  > ORDER BY withAias;
> 3
>   4
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23758) MLlib 2.4 Roadmap

2019-07-16 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886406#comment-16886406
 ] 

Marco Gaido commented on SPARK-23758:
-

[~dongjoon] it seems weird to set the affected version to 3.0 for this one. 
Shall we close it instead?

> MLlib 2.4 Roadmap
> -
>
> Key: SPARK-23758
> URL: https://issues.apache.org/jira/browse/SPARK-23758
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> h1. Roadmap process
> This roadmap is a master list for MLlib improvements we are working on during 
> this release.  This includes ML-related changes in PySpark and SparkR.
> *What is planned for the next release?*
> * This roadmap lists issues which at least one Committer has prioritized.  
> See details below in "Instructions for committers."
> * This roadmap only lists larger or more critical issues.
> *How can contributors influence this roadmap?*
> * If you believe an issue should be in this roadmap, please discuss the issue 
> on JIRA and/or the dev mailing list.  Make sure to ping Committers since at 
> least one must agree to shepherd the issue.
> * For general discussions, use this JIRA or the dev mailing list.  For 
> specific issues, please comment on those issues or the mailing list.
> * Vote for & watch issues which are important to you.
> ** MLlib, sorted by: [Votes | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20votes%20DESC]
>  or [Watchers | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20Watchers%20DESC]
> ** SparkR, sorted by: [Votes | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20votes%20DESC]
>  or [Watchers | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20Watchers%20DESC]
> h2. Target Version and Priority
> This section describes the meaning of Target Version and Priority.
> || Category | Target Version | Priority | Shepherd | Put on roadmap? | In 
> next release? ||
> | [1 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Blocker%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)]
>  | next release | Blocker | *must* | *must* | *must* |
> | [2 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Critical%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)]
>  | next release | Critical | *must* | yes, unless small | *best effort* |
> | [3 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Major%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)]
>  | next release | Major | *must* | optional | *best effort* |
> | [4 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Minor%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)]
>  | next release | Minor | optional | no | maybe |
> | [5 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Trivial%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)]
>  | next release | Trivial | optional | no | maybe |
> | [6 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20"In%20Progress"%2C%20Reopened)%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20"Target%20Version%2Fs"%20in%20(EMPTY)%20AND%20Shepherd%20not%20in%20(EMPTY)%20ORDER%20BY%20priority%20DESC]
>  | (empty) | (any) | yes | no | maybe |
> | [7 | 
> 

[jira] [Commented] (SPARK-28222) Feature importance outputs different values in GBT and Random Forest in 2.3.3 and 2.4 pyspark version

2019-07-13 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884412#comment-16884412
 ] 

Marco Gaido commented on SPARK-28222:
-

[~eneriwrt] do you have a simple repro for this? I can try and check it if I 
have an example to debug.

> Feature importance outputs different values in GBT and Random Forest in 2.3.3 
> and 2.4 pyspark version
> -
>
> Key: SPARK-28222
> URL: https://issues.apache.org/jira/browse/SPARK-28222
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3
>Reporter: eneriwrt
>Priority: Minor
>
> Feature importance values obtained in a binary classification project differ 
> depending on whether version 2.3.3 or 2.4.0 is used. It happens in Random 
> Forest and GBT. It turns out that the values matching sklearn's output are 
> the ones from the 2.3.3 version.
> As an example:
> *SPARK 2.4*
>  MODEL RandomForestClassifier_gini [0.0, 0.4117930839002269, 
> 0.06894132653061226, 0.15857667209786705, 0.2974447311021076, 
> 0.06324418636918638]
>  MODEL RandomForestClassifier_entropy [0.0, 0.3864372497988694, 
> 0.06578883597468652, 0.17433924485055197, 0.31754597164210124, 
> 0.055888697733790925]
>  MODEL GradientBoostingClassifier [0.0, 0.7556, 
> 0.24438, 0.0, 1.4602196686471875e-17, 0.0]
> *SPARK 2.3.3*
>  MODEL RandomForestClassifier_gini [0.0, 0.40957086167800455, 
> 0.06894132653061226, 0.16413222765342259, 0.2974447311021076, 
> 0.05991085303585305]
>  MODEL RandomForestClassifier_entropy [0.0, 0.3864372497988694, 
> 0.06578883597468652, 0.18789704501922055, 0.30398817147343266, 
> 0.055888697733790925]
>  MODEL GradientBoostingClassifier [0.0, 0.7555, 
> 0.24438, 0.0, 2.4326753518951276e-17, 0.0]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28316) Decimal precision issue

2019-07-13 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884398#comment-16884398
 ] 

Marco Gaido commented on SPARK-28316:
-

Well, IIUC, this is just the result of Postgres having no limit on decimal 
precision, while Spark's Decimal max precision is 38. Our decimal 
implementation follows SQL Server's behavior (as does Hive's).
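
For concreteness, a minimal sketch of the 38-digit cap at work (Spark shell; 
the exact result type may differ across versions, and the default 
{{spark.sql.decimalOperations.allowPrecisionLoss=true}} is assumed):

{code:scala}
// Multiplying two decimal(38, 10) values would need precision 77 and scale 20; that exceeds the
// 38-digit cap, so the scale of the result is reduced (to 6 by default) to keep the integral part.
spark.sql(
  "SELECT CAST(-34338492.215397047 AS DECIMAL(38, 10)) * " +
  "CAST(-34338492.215397047 AS DECIMAL(38, 10)) AS product").printSchema()
// product: decimal(38,6)  <- scale reduced from 20 to 6 so the value fits in 38 digits
{code}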

> Decimal precision issue
> ---
>
> Key: SPARK-28316
> URL: https://issues.apache.org/jira/browse/SPARK-28316
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Multiply check:
> {code:sql}
> -- Spark SQL
> spark-sql> select cast(-34338492.215397047 as decimal(38, 10)) * 
> cast(-34338492.215397047 as decimal(38, 10));
> 1179132047626883.596862
> -- PostgreSQL
> postgres=# select cast(-34338492.215397047 as numeric(38, 10)) * 
> cast(-34338492.215397047 as numeric(38, 10));
>?column?
> ---
>  1179132047626883.59686213585632020900
> (1 row)
> {code}
> Division check:
> {code:sql}
> -- Spark SQL
> spark-sql> select cast(93901.57763026 as decimal(38, 10)) / cast(4.31 as 
> decimal(38, 10));
> 21786.908963
> -- PostgreSQL
> postgres=# select cast(93901.57763026 as numeric(38, 10)) / cast(4.31 as 
> numeric(38, 10));
>   ?column?
> 
>  21786.908962937355
> (1 row)
> {code}
> POWER(10, LN(value)) check:
> {code:sql}
> -- Spark SQL
> spark-sql> SELECT CAST(POWER(cast('10' as decimal(38, 18)), 
> LN(ABS(round(cast(-24926804.04504742 as decimal(38, 10)),200 AS 
> decimal(38, 10));
> 107511333880051856
> -- PostgreSQL
> postgres=# SELECT CAST(POWER(cast('10' as numeric(38, 18)), 
> LN(ABS(round(cast(-24926804.04504742 as numeric(38, 10)),200 AS 
> numeric(38, 10));
>  power
> ---
>  107511333880052007.0414112467
> (1 row)
> {code}
> AVG, STDDEV and VARIANCE returns double type:
> {code:sql}
> -- Spark SQL
> spark-sql> create temporary view t1 as select * from values
>  >   (cast(-24926804.04504742 as decimal(38, 10))),
>  >   (cast(16397.038491 as decimal(38, 10))),
>  >   (cast(7799461.4119 as decimal(38, 10)))
>  >   as t1(t);
> spark-sql> SELECT AVG(t), STDDEV(t), VARIANCE(t) FROM t1;
> -5703648.53155214   1.7096528995154984E7   2.922913036821751E14
> -- PostgreSQL
> postgres=# SELECT AVG(t), STDDEV(t), VARIANCE(t)  from (values 
> (cast(-24926804.04504742 as decimal(38, 10))), (cast(16397.038491 as 
> decimal(38, 10))), (cast(7799461.4119 as decimal(38, 10 t1(t);
>   avg  |stddev |   
> variance
> ---+---+--
>  -5703648.53155214 | 17096528.99515498420743029415 | 
> 292291303682175.094017569588
> (1 row)
> {code}
> EXP returns double type:
> {code:sql}
> -- Spark SQL
> spark-sql> select exp(cast(1.0 as decimal(31,30)));
> 2.718281828459045
> -- PostgreSQL
> postgres=# select exp(cast(1.0 as decimal(31,30)));
>exp
> --
>  2.718281828459045235360287471353
> (1 row)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28324) The LOG function using 10 as the base, but Spark using E

2019-07-13 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884397#comment-16884397
 ] 

Marco Gaido commented on SPARK-28324:
-

+1 for [~srowen]'s opinion. I don't think it is a good idea to change the 
behavior here.
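
For what it's worth, users who want base 10 already have explicit forms, so 
there seems to be no need to change the default; a quick check (Spark shell, 
values approximate):

{code:scala}
// log(x) is the natural logarithm in Spark; log10(x) and log(10, x) give the base-10 logarithm.
spark.sql("SELECT log(100), log10(100), log(10, 100)").show()
// ~4.605, 2.0, 2.0
{code}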

> The LOG function using 10 as the base, but Spark using E
> 
>
> Key: SPARK-28324
> URL: https://issues.apache.org/jira/browse/SPARK-28324
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Spark SQL:
> {code:sql}
> spark-sql> select log(10);
> 2.302585092994046
> {code}
> PostgreSQL:
> {code:sql}
> postgres=# select log(10);
>  log
> -
>1
> (1 row)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28348) Avoid cast twice for decimal type

2019-07-11 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883165#comment-16883165
 ] 

Marco Gaido commented on SPARK-28348:
-

Mmmh, yes, you're right.

> Avoid cast twice for decimal type
> -
>
> Key: SPARK-28348
> URL: https://issues.apache.org/jira/browse/SPARK-28348
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3
>Reporter: Yuming Wang
>Priority: Major
>
> Spark 2.1:
> {code:scala}
> scala> sql("select cast(cast(-34338492.215397047 as decimal(38, 10)) * 
> cast(-34338492.215397047 as decimal(38, 10)) as decimal(38, 
> 18))").show(false);
> +---+
> |CAST((CAST(-34338492.215397047 AS DECIMAL(38,10)) * CAST(-34338492.215397047 
> AS DECIMAL(38,10))) AS DECIMAL(38,18))|
> +---+
> |1179132047626883.596862135856320209  
>   |
> +---+
> {code}
> Spark 2.3:
> {code:scala}
> scala> sql("select cast(cast(-34338492.215397047 as decimal(38, 10)) * 
> cast(-34338492.215397047 as decimal(38, 10)) as decimal(38, 
> 18))").show(false);
> +---+
> |CAST((CAST(-34338492.215397047 AS DECIMAL(38,10)) * CAST(-34338492.215397047 
> AS DECIMAL(38,10))) AS DECIMAL(38,18))|
> +---+
> |1179132047626883.596862  
>   |
> +---+
> {code}
> I think we do not need to cast result to {{decimal(38, 6)}} and then cast 
> result to {{decimal(38, 18)}} for this case.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28348) Avoid cast twice for decimal type

2019-07-11 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882773#comment-16882773
 ] 

Marco Gaido commented on SPARK-28348:
-

No, I don't think that's a good idea. Setting the result type to {{decimal(p3, 
s3)}} could cause the overflow issues that 
https://issues.apache.org/jira/browse/SPARK-22036 was done to address. On the 
other hand, avoiding the cast would change the result type. So I don't see a 
good way to change this.

> Avoid cast twice for decimal type
> -
>
> Key: SPARK-28348
> URL: https://issues.apache.org/jira/browse/SPARK-28348
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3
>Reporter: Yuming Wang
>Priority: Major
>
> Spark 2.1:
> {code:scala}
> scala> sql("select cast(cast(-34338492.215397047 as decimal(38, 10)) * 
> cast(-34338492.215397047 as decimal(38, 10)) as decimal(38, 
> 18))").show(false);
> +---+
> |CAST((CAST(-34338492.215397047 AS DECIMAL(38,10)) * CAST(-34338492.215397047 
> AS DECIMAL(38,10))) AS DECIMAL(38,18))|
> +---+
> |1179132047626883.596862135856320209  
>   |
> +---+
> {code}
> Spark 2.3:
> {code:scala}
> scala> sql("select cast(cast(-34338492.215397047 as decimal(38, 10)) * 
> cast(-34338492.215397047 as decimal(38, 10)) as decimal(38, 
> 18))").show(false);
> +---+
> |CAST((CAST(-34338492.215397047 AS DECIMAL(38,10)) * CAST(-34338492.215397047 
> AS DECIMAL(38,10))) AS DECIMAL(38,18))|
> +---+
> |1179132047626883.596862  
>   |
> +---+
> {code}
> I think we do not need to cast result to {{decimal(38, 6)}} and then cast 
> result to {{decimal(38, 18)}} for this case.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28348) Avoid cast twice for decimal type

2019-07-11 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882741#comment-16882741
 ] 

Marco Gaido commented on SPARK-28348:
-

There is no cast to `decimal(38, 6)`. The reason why the result is "truncated" 
at the 6th scale digit is explained in 
https://issues.apache.org/jira/browse/SPARK-22036 and it is controlled by 
{{spark.sql.decimalOperations.allowPrecisionLoss}}. I see no issue here, 
honestly.
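
To make the two behaviors concrete, a minimal sketch (Spark shell; I believe 
the output matches what is shown in the description, but the exact formatting 
may differ by version):

{code:scala}
// Default: precision loss is allowed, so the result type is decimal(38, 6) and the scale is cut.
spark.conf.set("spark.sql.decimalOperations.allowPrecisionLoss", true)
spark.sql("SELECT CAST(-34338492.215397047 AS DECIMAL(38, 10)) * " +
          "CAST(-34338492.215397047 AS DECIMAL(38, 10))").show(false)
// 1179132047626883.596862

// Precision loss disallowed: the result type keeps scale 20, i.e. decimal(38, 20). This particular
// product still fits in 38 digits, so the full value comes back; larger values would become null.
spark.conf.set("spark.sql.decimalOperations.allowPrecisionLoss", false)
spark.sql("SELECT CAST(-34338492.215397047 AS DECIMAL(38, 10)) * " +
          "CAST(-34338492.215397047 AS DECIMAL(38, 10))").show(false)
// 1179132047626883.59686213585632020900
{code}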

> Avoid cast twice for decimal type
> -
>
> Key: SPARK-28348
> URL: https://issues.apache.org/jira/browse/SPARK-28348
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3
>Reporter: Yuming Wang
>Priority: Major
>
> Spark 2.1:
> {code:scala}
> scala> sql("select cast(cast(-34338492.215397047 as decimal(38, 10)) * 
> cast(-34338492.215397047 as decimal(38, 10)) as decimal(38, 
> 18))").show(false);
> +---+
> |CAST((CAST(-34338492.215397047 AS DECIMAL(38,10)) * CAST(-34338492.215397047 
> AS DECIMAL(38,10))) AS DECIMAL(38,18))|
> +---+
> |1179132047626883.596862135856320209  
>   |
> +---+
> {code}
> Spark 2.3:
> {code:scala}
> scala> sql("select cast(cast(-34338492.215397047 as decimal(38, 10)) * 
> cast(-34338492.215397047 as decimal(38, 10)) as decimal(38, 
> 18))").show(false);
> +---+
> |CAST((CAST(-34338492.215397047 AS DECIMAL(38,10)) * CAST(-34338492.215397047 
> AS DECIMAL(38,10))) AS DECIMAL(38,18))|
> +---+
> |1179132047626883.596862  
>   |
> +---+
> {code}
> I think we do not need to cast result to {{decimal(38, 6)}} and then cast 
> result to {{decimal(38, 18)}} for this case.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28348) Avoid cast twice for decimal type

2019-07-11 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882741#comment-16882741
 ] 

Marco Gaido edited comment on SPARK-28348 at 7/11/19 8:23 AM:
--

There is no cast to {{decimal(38, 6)}}. The reason why the result is 
"truncated" at the 6th scale digit is explained in 
https://issues.apache.org/jira/browse/SPARK-22036 and it is controlled by 
{{spark.sql.decimalOperations.allowPrecisionLoss}}. I see no issue here, 
honestly.


was (Author: mgaido):
There is no cast to `decimal(38, 6)`. The reson why the result is "truncated" 
at the 6th scale number is explained in 
https://issues.apache.org/jira/browse/SPARK-22036 and it it controlled by 
{{spark.sql.decimalOperations.allowPrecisionLoss}}. I see no issue here 
honestly.

> Avoid cast twice for decimal type
> -
>
> Key: SPARK-28348
> URL: https://issues.apache.org/jira/browse/SPARK-28348
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3
>Reporter: Yuming Wang
>Priority: Major
>
> Spark 2.1:
> {code:scala}
> scala> sql("select cast(cast(-34338492.215397047 as decimal(38, 10)) * 
> cast(-34338492.215397047 as decimal(38, 10)) as decimal(38, 
> 18))").show(false);
> +---+
> |CAST((CAST(-34338492.215397047 AS DECIMAL(38,10)) * CAST(-34338492.215397047 
> AS DECIMAL(38,10))) AS DECIMAL(38,18))|
> +---+
> |1179132047626883.596862135856320209  
>   |
> +---+
> {code}
> Spark 2.3:
> {code:scala}
> scala> sql("select cast(cast(-34338492.215397047 as decimal(38, 10)) * 
> cast(-34338492.215397047 as decimal(38, 10)) as decimal(38, 
> 18))").show(false);
> +---+
> |CAST((CAST(-34338492.215397047 AS DECIMAL(38,10)) * CAST(-34338492.215397047 
> AS DECIMAL(38,10))) AS DECIMAL(38,18))|
> +---+
> |1179132047626883.596862  
>   |
> +---+
> {code}
> I think we do not need to cast result to {{decimal(38, 6)}} and then cast 
> result to {{decimal(38, 18)}} for this case.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28322) DIV support decimal type

2019-07-10 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881798#comment-16881798
 ] 

Marco Gaido commented on SPARK-28322:
-

Thanks for pinging me [~yumwang], I'll work on this on the weekend. Thanks!

> DIV support decimal type
> 
>
> Key: SPARK-28322
> URL: https://issues.apache.org/jira/browse/SPARK-28322
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Spark SQL:
> {code:sql}
> spark-sql> SELECT DIV(CAST(10 AS DECIMAL), CAST(3 AS DECIMAL));
> Error in query: cannot resolve '(CAST(10 AS DECIMAL(10,0)) div CAST(3 AS 
> DECIMAL(10,0)))' due to data type mismatch: '(CAST(10 AS DECIMAL(10,0)) div 
> CAST(3 AS DECIMAL(10,0)))' requires integral type, not decimal(10,0); line 1 
> pos 7;
> 'Project [unresolvedalias((cast(10 as decimal(10,0)) div cast(3 as 
> decimal(10,0))), None)]
> +- OneRowRelation
> {code}
> PostgreSQL:
> {code:sql}
> postgres=# SELECT DIV(CAST(10 AS DECIMAL), CAST(3 AS DECIMAL));
>  div
> -
>3
> (1 row)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28067) Incorrect results in decimal aggregation with whole-stage code gen enabled

2019-07-08 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16880700#comment-16880700
 ] 

Marco Gaido commented on SPARK-28067:
-

I cannot reproduce in 2.4.0 either:

{code}

spark-2.4.0-bin-hadoop2.7 xxx$ ./bin/spark-shell
2019-07-08 22:52:11 WARN  NativeCodeLoader:62 - Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
Spark context Web UI available at http://xxx:4040
Spark context available as 'sc' (master = local[*], app id = 
local-1562619141279).
Spark session available as 'spark'.
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
  /_/
 
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_152)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val df = Seq(
 |  (BigDecimal("1000"), 1),
 |  (BigDecimal("1000"), 1),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2)).toDF("decNum", "intNum")
df: org.apache.spark.sql.DataFrame = [decNum: decimal(38,18), intNum: int]

scala> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, 
"intNum").agg(sum("decNum"))
df2: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]

scala> df2.show(40,false)
+---+
|sum(decNum)|
+---+
|null   |
+---+
{code}

> Incorrect results in decimal aggregation with whole-stage code gen enabled
> --
>
> Key: SPARK-28067
> URL: https://issues.apache.org/jira/browse/SPARK-28067
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.4.0
> Environment: Ubuntu LTS 16.04
> Oracle Java 1.8.0_201
> spark-2.4.3-bin-without-hadoop
> spark-shell
>Reporter: Mark Sirek
>Priority: Minor
>  Labels: correctness
>
> The following test case involving a join followed by a sum aggregation 
> returns the wrong answer for the sum:
>  
> {code:java}
> val df = Seq(
>  (BigDecimal("1000"), 1),
>  (BigDecimal("1000"), 1),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2)).toDF("decNum", "intNum")
> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, 
> "intNum").agg(sum("decNum"))
> scala> df2.show(40,false)
>  ---
> sum(decNum)
> ---
> 4000.00
> ---
>  
> {code}
>  
> The result should be 104000..
> It appears a partial sum is computed for each join key, as the result 
> returned would be the answer for all rows matching intNum === 1.
> If only the rows with intNum === 2 are included, the answer given is null:
>  
> {code:java}
> scala> val df3 = df.filter($"intNum" === lit(2))
>  df3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [decNum: 
> decimal(38,18), intNum: int]
> scala> val df4 = df3.withColumnRenamed("decNum", "decNum2").join(df3, 
> "intNum").agg(sum("decNum"))
>  df4: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]
> scala> df4.show(40,false)
>  ---
> sum(decNum)
> ---
> null
> ---
>  
> {code}
>  
> The correct answer, 10., doesn't fit in 
> the DataType picked for the result, decimal(38,18), so an overflow occurs, 
> which Spark then converts to null.
> The first example, which doesn't filter out the intNum === 1 values should 
> also return null, indicating overflow, but it doesn't.  This may mislead the 
> user to think a valid sum was computed.
> If whole-stage code gen is turned off:
> spark.conf.set("spark.sql.codegen.wholeStage", false)
> ... incorrect results are not returned because the 

[jira] [Created] (SPARK-28235) Decimal sum return type

2019-07-02 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-28235:
---

 Summary: Decimal sum return type
 Key: SPARK-28235
 URL: https://issues.apache.org/jira/browse/SPARK-28235
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Marco Gaido


Our implementation of decimal operations follows SQLServer behavior. As per 
https://docs.microsoft.com/it-it/sql/t-sql/functions/sum-transact-sql?view=sql-server-2017,
 the result of the sum operation should be `DECIMAL(38, s)`, while currently we 
set it to `DECIMAL(10 + p, s)`. This means that with large datasets we may 
incur overflow, even though the value could have been represented with higher 
precision; SQLServer returns correct results in that case.
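
A minimal sketch of the current behavior (Spark shell; exact types may vary by 
version):

{code:scala}
import org.apache.spark.sql.functions.{col, sum}
import org.apache.spark.sql.types.DecimalType

// sum over a decimal(20, 2) column currently gets result type decimal(30, 2), i.e. precision + 10
// (capped at 38). Following SQLServer it would be decimal(38, 2), leaving much more headroom.
val df = spark.range(3).select(col("id").cast(DecimalType(20, 2)).as("d"))
df.agg(sum(col("d"))).printSchema()
// root
//  |-- sum(d): decimal(30,2) (nullable = true)
{code}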



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28222) Feature importance outputs different values in GBT and Random Forest in 2.3.3 and 2.4 pyspark version

2019-07-02 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877267#comment-16877267
 ] 

Marco Gaido commented on SPARK-28222:
-

Mmmmh, there has been a bug fix for this (see SPARK-26721), but it should be in 
3.0 only AFAIK. The question is: which is the right value? Can you compare it 
with other libs like sklearn?

> Feature importance outputs different values in GBT and Random Forest in 2.3.3 
> and 2.4 pyspark version
> -
>
> Key: SPARK-28222
> URL: https://issues.apache.org/jira/browse/SPARK-28222
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3
>Reporter: eneriwrt
>Priority: Minor
>
> Feature importance values obtained in a binary classification project differ 
> depending on whether version 2.3.3 or 2.4.0 is used. It happens in Random 
> Forest and GBT.
> As an example:
> *SPARK 2.4*
> MODEL RandomForestClassifier_gini [0.0, 0.4117930839002269, 
> 0.06894132653061226, 0.15857667209786705, 0.2974447311021076, 
> 0.06324418636918638]
> MODEL RandomForestClassifier_entropy [0.0, 0.3864372497988694, 
> 0.06578883597468652, 0.17433924485055197, 0.31754597164210124, 
> 0.055888697733790925]
> MODEL GradientBoostingClassifier [0.0, 0.7556, 
> 0.24438, 0.0, 1.4602196686471875e-17, 0.0]
> *SPARK 2.3.3*
> MODEL RandomForestClassifier_gini [0.0, 0.40957086167800455, 
> 0.06894132653061226, 0.16413222765342259, 0.2974447311021076, 
> 0.05991085303585305]
> MODEL RandomForestClassifier_entropy [0.0, 0.3864372497988694, 
> 0.06578883597468652, 0.18789704501922055, 0.30398817147343266, 
> 0.055888697733790925]
> MODEL GradientBoostingClassifier [0.0, 0.7555, 
> 0.24438, 0.0, 2.4326753518951276e-17, 0.0]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28186) array_contains returns null instead of false when one of the items in the array is null

2019-07-01 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16876509#comment-16876509
 ] 

Marco Gaido commented on SPARK-28186:
-

You're right about that. The equivalent in Postgres is {{=ANY}}, which behaves 
like current Spark. So I don't see a strong motivation to change the current 
Spark behavior.
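
The same three-valued logic can be seen with {{IN}} over a list containing 
NULL (Spark shell; output formatting may vary):

{code:scala}
// "not found, but a null was present" evaluates to unknown (null), just like array_contains above.
spark.sql("SELECT 'a' IN ('a', 'b', NULL) AS has_a, 'd' IN ('a', 'b', NULL) AS has_d").show()
// +-----+-----+
// |has_a|has_d|
// +-----+-----+
// | true| null|
// +-----+-----+
{code}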

> array_contains returns null instead of false when one of the items in the 
> array is null
> ---
>
> Key: SPARK-28186
> URL: https://issues.apache.org/jira/browse/SPARK-28186
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Alex Kushnir
>Priority: Major
>
> If array of items contains a null item then array_contains returns true if 
> item is found but if item is not found it returns null instead of false
> Seq(
> (1, Seq("a", "b", "c")),
> (2, Seq("a", "b", null, "c"))
> ).toDF("id", "vals").createOrReplaceTempView("tbl")
> spark.sql("select id, vals, array_contains(vals, 'a') as has_a, 
> array_contains(vals, 'd') as has_d from tbl").show
> +---+----------+-----+-----+
> |id |vals      |has_a|has_d|
> +---+----------+-----+-----+
> |1  |[a, b, c] |true |false|
> |2  |[a, b,, c]|true |null |
> +---+----------+-----+-----+



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28186) array_contains returns null instead of false when one of the items in the array is null

2019-07-01 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16876357#comment-16876357
 ] 

Marco Gaido commented on SPARK-28186:
-

Do you know of any SQL DB with the behavior you are suggesting?

> array_contains returns null instead of false when one of the items in the 
> array is null
> ---
>
> Key: SPARK-28186
> URL: https://issues.apache.org/jira/browse/SPARK-28186
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Alex Kushnir
>Priority: Major
>
> If array of items contains a null item then array_contains returns true if 
> item is found but if item is not found it returns null instead of false
> Seq(
> (1, Seq("a", "b", "c")),
> (2, Seq("a", "b", null, "c"))
> ).toDF("id", "vals").createOrReplaceTempView("tbl")
> spark.sql("select id, vals, array_contains(vals, 'a') as has_a, 
> array_contains(vals, 'd') as has_d from tbl").show
> +---+----------+-----+-----+
> |id |vals      |has_a|has_d|
> +---+----------+-----+-----+
> |1  |[a, b, c] |true |false|
> |2  |[a, b,, c]|true |null |
> +---+----------+-----+-----+



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28186) array_contains returns null instead of false when one of the items in the array is null

2019-06-29 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16875447#comment-16875447
 ] 

Marco Gaido commented on SPARK-28186:
-

This is the right behavior AFAIK. Why are you saying it is wrong?

> array_contains returns null instead of false when one of the items in the 
> array is null
> ---
>
> Key: SPARK-28186
> URL: https://issues.apache.org/jira/browse/SPARK-28186
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Alex Kushnir
>Priority: Major
>
> If array of items contains a null item then array_contains returns true if 
> item is found but if item is not found it returns null instead of false
> Seq(
> (1, Seq("a", "b", "c")),
> (2, Seq("a", "b", null, "c"))
> ).toDF("id", "vals").createOrReplaceTempView("tbl")
> spark.sql("select id, vals, array_contains(vals, 'a') as has_a, 
> array_contains(vals, 'd') as has_d from tbl").show
> +---+----------+-----+-----+
> |id |vals      |has_a|has_d|
> +---+----------+-----+-----+
> |1  |[a, b, c] |true |false|
> |2  |[a, b,, c]|true |null |
> +---+----------+-----+-----+



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28201) Revisit MakeDecimal behavior on overflow

2019-06-28 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874765#comment-16874765
 ] 

Marco Gaido commented on SPARK-28201:
-

I'll create a PR for this ASAP.

> Revisit MakeDecimal behavior on overflow
> 
>
> Key: SPARK-28201
> URL: https://issues.apache.org/jira/browse/SPARK-28201
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Marco Gaido
>Priority: Major
>
> As pointed out in 
> https://github.com/apache/spark/pull/20350#issuecomment-505997469, in special 
> cases of decimal aggregation we are using the `MakeDecimal` operator.
> This operator has an ill-defined behavior in case of overflow; namely, what 
> it does currently is:
>  - if codegen is enabled it returns null;
>  -  in interpreted mode it throws an `IllegalArgumentException`.
> So we should make its behavior uniform with other similar cases; in 
> particular, we should honor the value of the conf introduced in SPARK-23179 
> and behave accordingly, i.e.:
>  - returning null if the flag is true;
>  - throwing an `ArithmeticException` if the flag is false.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28201) Revisit MakeDecimal behavior on overflow

2019-06-28 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-28201:
---

 Summary: Revisit MakeDecimal behavior on overflow
 Key: SPARK-28201
 URL: https://issues.apache.org/jira/browse/SPARK-28201
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Marco Gaido


As pointed out in 
https://github.com/apache/spark/pull/20350#issuecomment-505997469, in special 
cases of decimal aggregation we are using the `MakeDecimal` operator.

This operator has an ill-defined behavior in case of overflow; namely, what it 
does currently is:

 - if codegen is enabled it returns null;
 - in interpreted mode it throws an `IllegalArgumentException`.

So we should make its behavior uniform with other similar cases; in particular, 
we should honor the value of the conf introduced in SPARK-23179 and behave 
accordingly (a sketch of the intended logic follows below), i.e.:

 - returning null if the flag is true;
 - throwing an `ArithmeticException` if the flag is false.
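
An illustrative sketch of the intended logic only (plain Scala, NOT the actual 
MakeDecimal code), with a single {{nullOnOverflow}} flag mirroring the conf 
from SPARK-23179:

{code:scala}
// Hypothetical helper: None plays the role of SQL NULL, a thrown ArithmeticException the "fail" path.
def makeDecimal(unscaled: Long, precision: Int, scale: Int,
                nullOnOverflow: Boolean): Option[BigDecimal] = {
  val value = BigDecimal(BigInt(unscaled), scale)
  if (value.precision <= precision) {
    Some(value)               // the value fits the target precision
  } else if (nullOnOverflow) {
    None                      // flag is true: overflow becomes null
  } else {                    // flag is false: overflow fails loudly
    throw new ArithmeticException(
      s"Decimal precision ${value.precision} exceeds max precision $precision")
  }
}
{code}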



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28200) Overflow handling in `ExpressionEncoder`

2019-06-28 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-28200:
---

 Summary: Overflow handling in `ExpressionEncoder`
 Key: SPARK-28200
 URL: https://issues.apache.org/jira/browse/SPARK-28200
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Marco Gaido


As pointed out in https://github.com/apache/spark/pull/20350, we are currently 
not checking the overflow when serializing a java/scala `BigDecimal` in 
`ExpressionEncoder` / `ScalaReflection`.

We should add this check there too.
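
A sketch of the scenario (Spark shell, with {{spark.implicits._}} in scope); 
the value below is just an illustration of a decimal that cannot fit the 
inferred {{decimal(38, 18)}} schema:

{code:scala}
import spark.implicits._

// A Scala BigDecimal is encoded as decimal(38, 18), i.e. at most 20 integral digits. A value with
// more integral digits overflows; the point of this ticket is that the encoder should check that.
val tooBig = BigDecimal("123456789012345678901234567890")  // 30 integral digits
Seq(tooBig).toDS().show(false)
{code}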



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28067) Incorrect results in decimal aggregation with whole-stage code gen enabled

2019-06-24 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870880#comment-16870880
 ] 

Marco Gaido commented on SPARK-28067:
-

No, it is the same. Are you sure about your configs?

{code}
macmarco:spark mark9$ git log -5 --oneline
5ad1053f3e (HEAD, apache/master) [SPARK-28128][PYTHON][SQL] Pandas Grouped UDFs 
skip empty partitions
113f8c8d13 [SPARK-28132][PYTHON] Update document type conversion for Pandas 
UDFs (pyarrow 0.13.0, pandas 0.24.2, Python 3.7)
9b9d81b821 [SPARK-28131][PYTHON] Update document type conversion between Python 
data and SQL types in normal UDFs (Python 3.7)
54da3bbfb2 [SPARK-28127][SQL] Micro optimization on TreeNode's mapChildren 
method
47f54b1ec7 [SPARK-28118][CORE] Add `spark.eventLog.compression.codec` 
configuration
macmarco:spark mark9$ ./bin/spark-shell 
19/06/24 09:17:58 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
Spark context Web UI available at http://.:4040
Spark context available as 'sc' (master = local[*], app id = 
local-1561360686725).
Spark session available as 'spark'.
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
  /_/
 
Using Scala version 2.12.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_152)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val df = Seq(
 |  (BigDecimal("1000"), 1),
 |  (BigDecimal("1000"), 1),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2)).toDF("decNum", "intNum")
df: org.apache.spark.sql.DataFrame = [decNum: decimal(38,18), intNum: int]

scala> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, 
"intNum").agg(sum("decNum"))
df2: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]

scala> df2.explain
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[sum(decNum#14)])
+- Exchange SinglePartition
   +- *(1) HashAggregate(keys=[], functions=[partial_sum(decNum#14)])
  +- *(1) Project [decNum#14]
 +- *(1) BroadcastHashJoin [intNum#8], [intNum#15], Inner, BuildLeft
:- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, 
int, false] as bigint)))
:  +- LocalTableScan [intNum#8]
+- LocalTableScan [decNum#14, intNum#15]



scala> df2.show(40,false)
+---+   
|sum(decNum)|
+---+
|null   |
+---+
{code}

> Incorrect results in decimal aggregation with whole-stage code gen enabled
> --
>
> Key: SPARK-28067
> URL: https://issues.apache.org/jira/browse/SPARK-28067
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.4.0
> Environment: Ubuntu LTS 16.04
> Oracle Java 1.8.0_201
> spark-2.4.3-bin-without-hadoop
> spark-shell
>Reporter: Mark Sirek
>Priority: Minor
>  Labels: correctness
>
> The following test case involving a join followed by a sum aggregation 
> returns the wrong answer for the sum:
>  
> {code:java}
> val df = Seq(
>  (BigDecimal("1000"), 1),
>  (BigDecimal("1000"), 1),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2)).toDF("decNum", "intNum")
> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, 
> "intNum").agg(sum("decNum"))
> scala> df2.show(40,false)
>  ---
> sum(decNum)
> ---
> 4000.00
> ---
>  
> {code}
>  
> The result should be 

[jira] [Commented] (SPARK-28135) ceil/ceiling/floor/power returns incorrect values

2019-06-22 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870360#comment-16870360
 ] 

Marco Gaido commented on SPARK-28135:
-

[~Tonix517] tickets are assigned only once the PR is merged and the ticket is 
closed. So please go ahead and submit the PR; the committer who eventually 
merges it will assign the ticket to you. Thanks.

> ceil/ceiling/floor/power returns incorrect values
> -
>
> Key: SPARK-28135
> URL: https://issues.apache.org/jira/browse/SPARK-28135
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> spark-sql> select ceil(double(1.2345678901234e+200)), 
> ceiling(double(1.2345678901234e+200)), floor(double(1.2345678901234e+200)), 
> power('1', 'NaN');
> 9223372036854775807   9223372036854775807 9223372036854775807 NaN
> {noformat}
> {noformat}
> postgres=# select ceil(1.2345678901234e+200::float8), 
> ceiling(1.2345678901234e+200::float8), floor(1.2345678901234e+200::float8), 
> power('1', 'NaN');
>  ceil |   ceiling|floor | power
> --+--+--+---
>  1.2345678901234e+200 | 1.2345678901234e+200 | 1.2345678901234e+200 | 1
> (1 row)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28067) Incorrect results in decimal aggregation with whole-stage code gen enabled

2019-06-22 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870293#comment-16870293
 ] 

Marco Gaido commented on SPARK-28067:
-

I cannot reproduce on master. It always returns null with whole stage codegen 
enabled.

> Incorrect results in decimal aggregation with whole-stage code gen enabled
> --
>
> Key: SPARK-28067
> URL: https://issues.apache.org/jira/browse/SPARK-28067
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.4.0
> Environment: Ubuntu LTS 16.04
> Oracle Java 1.8.0_201
> spark-2.4.3-bin-without-hadoop
> spark-shell
>Reporter: Mark Sirek
>Priority: Minor
>  Labels: correctness
>
> The following test case involving a join followed by a sum aggregation 
> returns the wrong answer for the sum:
>  
> {code:java}
> val df = Seq(
>  (BigDecimal("1000"), 1),
>  (BigDecimal("1000"), 1),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2)).toDF("decNum", "intNum")
> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, 
> "intNum").agg(sum("decNum"))
> scala> df2.show(40,false)
>  ---
> sum(decNum)
> ---
> 4000.00
> ---
>  
> {code}
>  
> The result should be 104000..
> It appears a partial sum is computed for each join key, as the result 
> returned would be the answer for all rows matching intNum === 1.
> If only the rows with intNum === 2 are included, the answer given is null:
>  
> {code:java}
> scala> val df3 = df.filter($"intNum" === lit(2))
>  df3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [decNum: 
> decimal(38,18), intNum: int]
> scala> val df4 = df3.withColumnRenamed("decNum", "decNum2").join(df3, 
> "intNum").agg(sum("decNum"))
>  df4: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]
> scala> df4.show(40,false)
>  ---
> sum(decNum)
> ---
> null
> ---
>  
> {code}
>  
> The correct answer, 10., doesn't fit in 
> the DataType picked for the result, decimal(38,18), so an overflow occurs, 
> which Spark then converts to null.
> The first example, which doesn't filter out the intNum === 1 values should 
> also return null, indicating overflow, but it doesn't.  This may mislead the 
> user to think a valid sum was computed.
> If whole-stage code gen is turned off:
> spark.conf.set("spark.sql.codegen.wholeStage", false)
> ... incorrect results are not returned because the overflow is caught as an 
> exception:
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 39 
> exceeds max precision 38
>  
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28060) Float/Double type can not accept some special inputs

2019-06-22 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870288#comment-16870288
 ] 

Marco Gaido commented on SPARK-28060:
-

This is a duplicate of SPARK-27768, isn't it? Or rather, is SPARK-27768 a 
subpart of this one? Anyway, shall we close either this one or SPARK-27768?

> Float/Double type can not accept some special inputs
> 
>
> Key: SPARK-28060
> URL: https://issues.apache.org/jira/browse/SPARK-28060
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> ||Query||Spark SQL||PostgreSQL||
> |SELECT float('nan');|NULL|NaN|
> |SELECT float('   NAN  ');|NULL|NaN|
> |SELECT float('infinity');|NULL|Infinity|
> |SELECT float('  -INFINiTY   ');|NULL|-Infinity|
> ||Query||Spark SQL||PostgreSQL||
> |SELECT double('nan');|NULL|NaN|
> |SELECT double('   NAN  ');|NULL|NaN|
> |SELECT double('infinity');|NULL|Infinity|
> |SELECT double('  -INFINiTY   ');|NULL|-Infinity|



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27820) case insensitive resolver should be used in GetMapValue

2019-06-22 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870286#comment-16870286
 ] 

Marco Gaido commented on SPARK-27820:
-

+1 for [~hyukjin.kwon]'s comment.

> case insensitive resolver should be used in GetMapValue
> ---
>
> Key: SPARK-27820
> URL: https://issues.apache.org/jira/browse/SPARK-27820
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: Michel Lemay
>Priority: Minor
>
> When extracting a key value from a MapType, it calls GetMapValue 
> (complexTypeExtractors.scala) and only use the map type ordering. It should 
> use the resolver instead.
> Starting spark with: `{{spark-shell --conf spark.sql.caseSensitive=false`}}
> Given dataframe:
>  {{val df = List(Map("a" -> 1), Map("A" -> 2)).toDF("m")}}
> And executing any of these will only return one row: case insensitive in the 
> name of the column but case sensitive match in the keys of the map.
> {{df.filter($"M.A".isNotNull).count}}
>  {{df.filter($"M"("A").isNotNull).count 
> df.filter($"M".getField("A").isNotNull).count}}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28024) Incorrect numeric values when out of range

2019-06-21 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16869879#comment-16869879
 ] 

Marco Gaido commented on SPARK-28024:
-

[~joshrosen] thanks for linking them. Yes, I did try to get them in, because it 
is a very confusing and unexpected behavior for many users, especially when 
migrating workloads from other SQL systems. Moreover, having them as configs 
lets users choose the behavior they prefer. But I received negative feedback on 
them, as you can see. I hope that, since several other people have since argued 
this is a problem, those PRs may be reconsidered. Moreover, now that we are 
approaching 3.0, it may be a good moment for them. I am updating them to 
resolve conflicts. This issue may be closed as a duplicate IMHO. Thanks.

> Incorrect numeric values when out of range
> --
>
> Key: SPARK-28024
> URL: https://issues.apache.org/jira/browse/SPARK-28024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Critical
>  Labels: correctness
> Attachments: SPARK-28024.png
>
>
> For example
> Case 1:
> {code:sql}
> select tinyint(128) * tinyint(2); -- 0
> select smallint(2147483647) * smallint(2); -- -2
> select int(2147483647) * int(2); -- -2
> SELECT smallint((-32768)) * smallint(-1); -- -32768
> {code}
> Case 2:
> {code:sql}
> spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
> 0.0   -0.0
> {code}
> Case 3:
> {code:sql}
> spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
> 0.0   -0.0
> {code}
> Case 4:
> {code:sql}
> spark-sql> select exp(-1.2345678901234E200);
> 0.0
> postgres=# select exp(-1.2345678901234E200);
> ERROR:  value overflows numeric format
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24149) Automatic namespaces discovery in HDFS federation

2019-05-30 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16851896#comment-16851896
 ] 

Marco Gaido commented on SPARK-24149:
-

That's true. The point is: if you want to access a different cluster, it is 
pretty natural to think that you need to add information for obtaining 
credentials. But it is pretty weird that, when accessing different paths of the 
same cluster, you have to specify several configs.
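
For reference, a sketch of the workaround named in the description; 
{{hdfs://ns1}} and {{hdfs://ns2}} are placeholder namespace URIs, not real 
clusters:

{code:scala}
import org.apache.spark.sql.SparkSession

// Listing every namespace explicitly makes Spark fetch a delegation token for each of them,
// instead of only for the defaultFS; the goal of this ticket is to make that discovery automatic.
val spark = SparkSession.builder()
  .appName("federation-example")                                           // placeholder app name
  .config("spark.yarn.access.hadoopFileSystems", "hdfs://ns1,hdfs://ns2")   // placeholder namespaces
  .getOrCreate()
{code}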

> Automatic namespaces discovery in HDFS federation
> -
>
> Key: SPARK-24149
> URL: https://issues.apache.org/jira/browse/SPARK-24149
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.4.0
>
>
> Hadoop 3 introduced HDFS federation.
> Spark fails to write on different namespaces when Hadoop federation is turned 
> on and the cluster is secure. This happens because Spark looks for the 
> delegation token only for the defaultFS configured and not for all the 
> available namespaces. A workaround is the usage of the property 
> {{spark.yarn.access.hadoopFileSystems}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24149) Automatic namespaces discovery in HDFS federation

2019-05-25 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16848097#comment-16848097
 ] 

Marco Gaido commented on SPARK-24149:
-

[~Dhruve Ashar] the use case for this change is, for instance, when you have a 
partitioned table whose partitions are on different namespaces and there is no 
viewFS configured. In that case, a user running a query on that table may or 
may not get an exception when reading it. Please notice that the user running 
the query may be different from the user who created the table, so they may not 
even be aware of this situation, and understanding what the problem is may be 
pretty hard.

> Automatic namespaces discovery in HDFS federation
> -
>
> Key: SPARK-24149
> URL: https://issues.apache.org/jira/browse/SPARK-24149
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.4.0
>
>
> Hadoop 3 introduced HDFS federation.
> Spark fails to write on different namespaces when Hadoop federation is turned 
> on and the cluster is secure. This happens because Spark looks for the 
> delegation token only for the defaultFS configured and not for all the 
> available namespaces. A workaround is the usage of the property 
> {{spark.yarn.access.hadoopFileSystems}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27761) Make UDF nondeterministic by default(?)

2019-05-20 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16843914#comment-16843914
 ] 

Marco Gaido commented on SPARK-27761:
-

Yes, I think this is a good idea. The default behavior would then be what most 
users expect.
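
For reference, today's API (as the description notes) treats a UDF as 
deterministic unless it is explicitly marked otherwise; a minimal sketch 
(Spark shell):

{code:scala}
import org.apache.spark.sql.functions.udf

// Without asNondeterministic(), the optimizer is free to collapse or reorder calls to this UDF,
// which is surprising for something like a random generator; the proposal is to flip the default.
val rnd = udf(() => scala.util.Random.nextDouble()).asNondeterministic()
spark.udf.register("rnd", rnd)
{code}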

> Make UDF nondeterministic by default(?)
> ---
>
> Key: SPARK-27761
> URL: https://issues.apache.org/jira/browse/SPARK-27761
> Project: Spark
>  Issue Type: Brainstorming
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Sunitha Kambhampati
>Priority: Minor
>
> Opening this issue as a followup from a discussion/question on this PR for an 
> optimization involving deterministic udf: 
> https://github.com/apache/spark/pull/24593#pullrequestreview-237361795  
> "We even should discuss whether all UDFs must be deterministic or 
> non-deterministic by default."
> Basically today in Spark 2.4,  Scala UDFs are marked deterministic by default 
> and it is implicit. To mark a udf as non deterministic, they need to call 
> this method asNondeterministic().
> The concern's expressed are that users are not aware of this property and its 
> implications.  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27684) Reduce ScalaUDF conversion overheads for primitives

2019-05-15 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16840242#comment-16840242
 ] 

Marco Gaido commented on SPARK-27684:
-

I can try to work on it, but most likely I will start next week.

> Reduce ScalaUDF conversion overheads for primitives
> ---
>
> Key: SPARK-27684
> URL: https://issues.apache.org/jira/browse/SPARK-27684
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Priority: Major
>
> I believe that we can reduce ScalaUDF overheads when operating over primitive 
> types.
> In [ScalaUDF's 
> doGenCode|https://github.com/apache/spark/blob/5a8aad01c2aaf0ceef8e9a3cfabbd2e88c8d9f0d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala#L991]
>  we have logic to convert UDF function input types from Catalyst internal 
> types to Scala types (for example, this is used to convert UTF8Strings to 
> Java Strings). Similarly, we convert UDF return types.
> However, UDF input argument conversion is effectively a no-op for primitive 
> types because {{CatalystTypeConverters.createToScalaConverter()}} returns 
> {{identity}} in those cases. UDF result conversion is a little trickier 
> because {{createToCatalystConverter()}} returns [a 
> function|https://github.com/apache/spark/blob/5a8aad01c2aaf0ceef8e9a3cfabbd2e88c8d9f0d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L413]
>  that handles {{Option[Primitive]}}, but it might be the case that the 
> Option-boxing is unusable via ScalaUDF (in which case the conversion truly is 
> an {{identity}} no-op).
> These unnecessary no-op conversions could be quite expensive because each 
> call involves an index into the {{references}} array to get the converters, a 
> second index into the converters array to get the correct converter for the 
> nth input argument, and, finally, the converter invocation itself:
> {code:java}
> Object project_arg_0 = false ? null : ((scala.Function1[]) references[1] /* 
> converters */)[0].apply(project_value_3);{code}
> In these cases, I believe that we can reduce lookup / invocation overheads by 
> modifying the ScalaUDF code generation to eliminate the conversion calls for 
> primitives and directly assign the unconverted result, e.g.
> {code:java}
> Object project_arg_0 = false ? null : project_value_3;{code}
> To cleanly handle the case where we have a multi-argument UDF accepting a 
> mixture of primitive and non-primitive types, we might be able to keep the 
> {{converters}} array the same size (so indexes stay the same) but omit the 
> invocation of the converters for the primitive arguments (e.g. {{converters}} 
> is sparse / contains unused entries in case of primitives).
> I spotted this optimization while trying to construct some quick benchmarks 
> to measure UDF invocation overheads. For example:
> {code:java}
> spark.udf.register("identity", (x: Int) => x)
> sql("select id, id * 2, id * 3 from range(1000 * 1000 * 1000)").rdd.count() 
> // ~ 52 seconds
> sql("select identity(id), identity(id * 2), identity(id * 3) from range(1000 
> * 1000 * 1000)").rdd.count() // ~84 seconds{code}
> I'm curious to see whether the optimization suggested here can close this 
> performance gap. It'd also be a good idea to construct more principled 
> microbenchmarks covering multi-argument UDFs, projections involving multiple 
> UDFs over different input and output types, etc.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27684) Reduce ScalaUDF conversion overheads for primitives

2019-05-14 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839297#comment-16839297
 ] 

Marco Gaido commented on SPARK-27684:
-

I agree on this too.

> Reduce ScalaUDF conversion overheads for primitives
> ---
>
> Key: SPARK-27684
> URL: https://issues.apache.org/jira/browse/SPARK-27684
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Priority: Major
>
> I believe that we can reduce ScalaUDF overheads when operating over primitive 
> types.
> In [ScalaUDF's 
> doGenCode|https://github.com/apache/spark/blob/5a8aad01c2aaf0ceef8e9a3cfabbd2e88c8d9f0d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala#L991]
>  we have logic to convert UDF function input types from Catalyst internal 
> types to Scala types (for example, this is used to convert UTF8Strings to 
> Java Strings). Similarly, we convert UDF return types.
> However, UDF input argument conversion is effectively a no-op for primitive 
> types because {{CatalystTypeConverters.createToScalaConverter()}} returns 
> {{identity}} in those cases. UDF result conversion is a little trickier 
> because {{createToCatalystConverter()}} returns [a 
> function|https://github.com/apache/spark/blob/5a8aad01c2aaf0ceef8e9a3cfabbd2e88c8d9f0d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L413]
>  that handles {{Option[Primitive]}}, but it might be the case that the 
> Option-boxing is unusable via ScalaUDF (in which case the conversion truly is 
> an {{identity}} no-op).
> These unnecessary no-op conversions could be quite expensive because each 
> call involves an index into the {{references}} array to get the converters, a 
> second index into the converters array to get the correct converter for the 
> nth input argument, and, finally, the converter invocation itself:
> {code:java}
> Object project_arg_0 = false ? null : ((scala.Function1[]) references[1] /* 
> converters */)[0].apply(project_value_3);{code}
> In these cases, I believe that we can reduce lookup / invocation overheads by 
> modifying the ScalaUDF code generation to eliminate the conversion calls for 
> primitives and directly assign the unconverted result, e.g.
> {code:java}
> Object project_arg_0 = false ? null : project_value_3;{code}
> To cleanly handle the case where we have a multi-argument UDF accepting a 
> mixture of primitive and non-primitive types, we might be able to keep the 
> {{converters}} array the same size (so indexes stay the same) but omit the 
> invocation of the converters for the primitive arguments (e.g. {{converters}} 
> is sparse / contains unused entries in case of primitives).
> I spotted this optimization while trying to construct some quick benchmarks 
> to measure UDF invocation overheads. For example:
> {code:java}
> spark.udf.register("identity", (x: Int) => x)
> sql("select id, id * 2, id * 3 from range(1000 * 1000 * 1000)").rdd.count() 
> // ~ 52 seconds
> sql("select identity(id), identity(id * 2), identity(id * 3) from range(1000 
> * 1000 * 1000)").rdd.count() // ~84 seconds{code}
> I'm curious to see whether the optimization suggested here can close this 
> performance gap. It'd also be a good idea to construct more principled 
> microbenchmarks covering multi-argument UDFs, projections involving multiple 
> UDFs over different input and output types, etc.
>  
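
To make the "sparse converters" proposal above a bit more concrete, here is a plain-Scala illustration of the idea (this is not Spark code, and none of these names exist in the codebase):

{code:scala}
// Keep the converters array the same size, so argument indexes stay stable,
// but store null for primitive arguments and skip the call for them.
val inputIsPrimitive: Array[Boolean] = Array(true, false)   // e.g. (Int, String)

val converters: Array[Any => Any] = inputIsPrimitive.map { isPrimitive =>
  if (isPrimitive) null                // unused entry: conversion is a no-op
  else (v: Any) => v.toString          // stand-in for a real Catalyst converter
}

def convertArg(i: Int, value: Any): Any = {
  val conv = converters(i)
  if (conv == null) value              // primitive: direct assignment
  else conv(value)                     // non-primitive: invoke the converter
}
{code}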



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27685) `union` doesn't promote non-nullable columns of struct to nullable

2019-05-14 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839295#comment-16839295
 ] 

Marco Gaido commented on SPARK-27685:
-

This is a duplicate of SPARK-26812.

> `union` doesn't promote non-nullable columns of struct to nullable
> --
>
> Key: SPARK-27685
> URL: https://issues.apache.org/jira/browse/SPARK-27685
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Huon Wilson
>Priority: Major
>  Labels: correctness
>
> When doing a {{union}} of two dataframes, a column that is nullable in one of 
> the dataframes will be nullable in the union, promoting the non-nullable one 
> to be nullable. 
> This doesn't happen properly for columns nested as subcolumns of a 
> {{struct}}. It seems to just take the nullability of the first dataframe in 
> the union, meaning a nullable column will become non-nullable, resulting in 
> invalid values.
> {code:scala}
> case class X(x: Option[Long])
> case class Nested(nested: X)
> // First, just work with normal columns
> val df1 = Seq(1L, 2L).toDF("x")
> val df2 = Seq(Some(3L), None).toDF("x")
> df1.printSchema
> // root
> //  |-- x: long (nullable = false)
> df2.printSchema
> // root
> //  |-- x: long (nullable = true)
> (df1 union df2).printSchema
> // root
> //  |-- x: long (nullable = true)
> (df1 union df2).as[X].collect
> // res19: Array[X] = Array(X(Some(1)), X(Some(2)), X(Some(3)), X(None))
> (df1 union df2).select("*").show
> // ++
> // |   x|
> // ++
> // |   1|
> // |   2|
> // |   3|
> // |null|
> // ++
> // Now, the same with the 'x' column within a struct:
> val struct1 = df1.select(struct('x) as "nested")
> val struct2 = df2.select(struct('x) as "nested")
> struct1.printSchema
> // root
> //  |-- nested: struct (nullable = false)
> //  ||-- x: long (nullable = false)
> struct2.printSchema
> // root
> //  |-- nested: struct (nullable = false)
> //  ||-- x: long (nullable = true)
> // BAD: the x column is not nullable
> (struct1 union struct2).printSchema
> // root
> //  |-- nested: struct (nullable = false)
> //  ||-- x: long (nullable = false)
> // BAD: the last x value became "Some(0)", instead of "None"
> (struct1 union struct2).as[Nested].collect
> // res23: Array[Nested] = Array(Nested(X(Some(1))), Nested(X(Some(2))), 
> Nested(X(Some(3))), Nested(X(Some(0
> // BAD: showing just the nested columns throws a NPE
> (struct1 union struct2).select("nested.*").show
> // java.lang.NullPointerException
> //  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply(Unknown
>  Source)
> //  at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collectFromPlan$1.apply(Dataset.scala:3387)
> // ...
> //  at org.apache.spark.sql.Dataset.show(Dataset.scala:714)
> //  ... 49 elided
> // Flipping the order makes x nullable as desired
> (struct2 union struct1).printSchema
> // root
> //  |-- nested: struct (nullable = false)
> //  ||-- x: long (nullable = true)
> (struct2 union struct1).as[Y].collect
> // res26: Array[Y] = Array(Y(X(Some(3))), Y(X(None)), Y(X(Some(1))), 
> Y(X(Some(2
> (struct2 union struct1).select("nested.*").show
> // ++
> // |   x|
> // ++
> // |   3|
> // |null|
> // |   1|
> // |   2|
> // ++
> {code}
> Note the three {{BAD}} lines, where the union of structs became non-nullable 
> and resulted in invalid values and exceptions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27685) `union` doesn't promote non-nullable columns of struct to nullable

2019-05-14 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido resolved SPARK-27685.
-
Resolution: Duplicate

> `union` doesn't promote non-nullable columns of struct to nullable
> --
>
> Key: SPARK-27685
> URL: https://issues.apache.org/jira/browse/SPARK-27685
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Huon Wilson
>Priority: Major
>  Labels: correctness
>
> When doing a {{union}} of two dataframes, a column that is nullable in one of 
> the dataframes will be nullable in the union, promoting the non-nullable one 
> to be nullable. 
> This doesn't happen properly for columns nested as subcolumns of a 
> {{struct}}. It seems to just take the nullability of the first dataframe in 
> the union, meaning a nullable column will become non-nullable, resulting in 
> invalid values.
> {code:scala}
> case class X(x: Option[Long])
> case class Nested(nested: X)
> // First, just work with normal columns
> val df1 = Seq(1L, 2L).toDF("x")
> val df2 = Seq(Some(3L), None).toDF("x")
> df1.printSchema
> // root
> //  |-- x: long (nullable = false)
> df2.printSchema
> // root
> //  |-- x: long (nullable = true)
> (df1 union df2).printSchema
> // root
> //  |-- x: long (nullable = true)
> (df1 union df2).as[X].collect
> // res19: Array[X] = Array(X(Some(1)), X(Some(2)), X(Some(3)), X(None))
> (df1 union df2).select("*").show
> // ++
> // |   x|
> // ++
> // |   1|
> // |   2|
> // |   3|
> // |null|
> // ++
> // Now, the same with the 'x' column within a struct:
> val struct1 = df1.select(struct('x) as "nested")
> val struct2 = df2.select(struct('x) as "nested")
> struct1.printSchema
> // root
> //  |-- nested: struct (nullable = false)
> //  ||-- x: long (nullable = false)
> struct2.printSchema
> // root
> //  |-- nested: struct (nullable = false)
> //  ||-- x: long (nullable = true)
> // BAD: the x column is not nullable
> (struct1 union struct2).printSchema
> // root
> //  |-- nested: struct (nullable = false)
> //  ||-- x: long (nullable = false)
> // BAD: the last x value became "Some(0)", instead of "None"
> (struct1 union struct2).as[Nested].collect
> // res23: Array[Nested] = Array(Nested(X(Some(1))), Nested(X(Some(2))), 
> Nested(X(Some(3))), Nested(X(Some(0
> // BAD: showing just the nested columns throws a NPE
> (struct1 union struct2).select("nested.*").show
> // java.lang.NullPointerException
> //  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply(Unknown
>  Source)
> //  at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collectFromPlan$1.apply(Dataset.scala:3387)
> // ...
> //  at org.apache.spark.sql.Dataset.show(Dataset.scala:714)
> //  ... 49 elided
> // Flipping the order makes x nullable as desired
> (struct2 union struct1).printSchema
> // root
> //  |-- nested: struct (nullable = false)
> //  ||-- x: long (nullable = true)
> (struct2 union struct1).as[Y].collect
> // res26: Array[Y] = Array(Y(X(Some(3))), Y(X(None)), Y(X(Some(1))), 
> Y(X(Some(2
> (struct2 union struct1).select("nested.*").show
> // ++
> // |   x|
> // ++
> // |   3|
> // |null|
> // |   1|
> // |   2|
> // ++
> {code}
> Note the three {{BAD}} lines, where the union of structs became non-nullable 
> and resulted in invalid values and exceptions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26182) Cost increases when optimizing scalaUDF

2019-05-09 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16836486#comment-16836486
 ] 

Marco Gaido commented on SPARK-26182:
-

Actually you just need to mark it {{asNondeterministic}} to avoid the double 
execution.
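
For reference, a minimal sketch of that workaround (table and column names are illustrative):

{code:scala}
import org.apache.spark.sql.functions.udf
import spark.implicits._

val splitUDF = udf((x: String) => Map("a" -> x, "b" -> x)).asNondeterministic()
spark.udf.register("splitUDF", splitUDF)

Seq("foo", "bar").toDF("x").createOrReplaceTempView("input_table")

// Because splitUDF is non-deterministic, CollapseProject will not inline it
// into the outer projection, so it is evaluated only once per row.
spark.sql("""
  select g['a'], g['b']
  from (select splitUDF(x) as g from input_table) tbl
""").show()
{code}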

> Cost increases when optimizing scalaUDF
> ---
>
> Key: SPARK-26182
> URL: https://issues.apache.org/jira/browse/SPARK-26182
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.4.0
>Reporter: bupt_ljy
>Priority: Major
>
> Let's assume that we have a udf called splitUDF which outputs a map data.
>  The SQL
> {code:java}
> select
> g['a'], g['b']
> from
>( select splitUDF(x) as g from table) tbl
> {code}
> will be optimized to the same logical plan of
> {code:java}
> select splitUDF(x)['a'], splitUDF(x)['b'] from table
> {code}
> which means that the splitUDF is executed twice instead of once.
> The optimization is from CollapseProject. 
>  I'm not sure whether this is a bug or not. Please tell me if I was wrong 
> about this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27089) Loss of precision during decimal division

2019-05-09 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido resolved SPARK-27089.
-
Resolution: Information Provided

> Loss of precision during decimal division
> -
>
> Key: SPARK-27089
> URL: https://issues.apache.org/jira/browse/SPARK-27089
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: ylo0ztlmtusq
>Priority: Major
>
> Spark loses decimal places when dividing decimal numbers.
>  
> Expected behavior (In Spark 2.2.3 or before)
>  
> {code:java}
> scala> val sql = """select cast(cast(3 as decimal(38,14)) / cast(9 as 
> decimal(38,14)) as decimal(38,14)) val"""
> sql: String = select cast(cast(3 as decimal(38,14)) / cast(9 as 
> decimal(38,14)) as decimal(38,14)) val
> scala> spark.sql(sql).show
> 19/03/07 21:23:51 WARN ObjectStore: Failed to get database global_temp, 
> returning NoSuchObjectException
> ++
> | val|
> ++
> |0.33|
> ++
> {code}
>  
> Current behavior (In Spark 2.3.2 and later)
>  
> {code:java}
> scala> val sql = """select cast(cast(3 as decimal(38,14)) / cast(9 as 
> decimal(38,14)) as decimal(38,14)) val"""
> sql: String = select cast(cast(3 as decimal(38,14)) / cast(9 as 
> decimal(38,14)) as decimal(38,14)) val
> scala> spark.sql(sql).show
> ++
> | val|
> ++
> |0.33|
> ++
> {code}
>  
> Seems to be caused by {{promote_precision(38, 6)}}
>  
> {code:java}
> scala> spark.sql(sql).explain(true)
> == Parsed Logical Plan ==
> Project [cast((cast(3 as decimal(38,14)) / cast(9 as decimal(38,14))) as 
> decimal(38,14)) AS val#20]
> +- OneRowRelation
> == Analyzed Logical Plan ==
> val: decimal(38,14)
> Project [cast(CheckOverflow((promote_precision(cast(cast(3 as decimal(38,14)) 
> as decimal(38,14))) / promote_precision(cast(cast(9 as decimal(38,14)) as 
> decimal(38,14, DecimalType(38,6)) as decimal(38,14)) AS val#20]
> +- OneRowRelation
> == Optimized Logical Plan ==
> Project [0.33 AS val#20]
> +- OneRowRelation
> == Physical Plan ==
> *(1) Project [0.33 AS val#20]
> +- Scan OneRowRelation[]
> {code}
>  
> Source https://stackoverflow.com/q/55046492



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27612) Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None

2019-05-01 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831097#comment-16831097
 ] 

Marco Gaido commented on SPARK-27612:
-

I don't have a python3 env, sorry...

> Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays 
> of None
> -
>
> Key: SPARK-27612
> URL: https://issues.apache.org/jira/browse/SPARK-27612
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> This seems to only affect Python 3.
> When creating a DataFrame with type {{ArrayType(IntegerType(), True)}} there 
> ends up being rows that are filled with None.
>  
> {code:java}
> In [1]: from pyspark.sql.types import ArrayType, IntegerType  
>    
> In [2]: df = spark.createDataFrame([[1, 2, 3, 4]] * 100, 
> ArrayType(IntegerType(), True))     
> In [3]: df.distinct().collect()   
>    
> Out[3]: [Row(value=[None, None, None, None]), Row(value=[1, 2, 3, 4])]
> {code}
>  
> From this example, it is consistently at elements 97, 98:
> {code}
> In [5]: df.collect()[-5:] 
>    
> Out[5]: 
> [Row(value=[1, 2, 3, 4]),
>  Row(value=[1, 2, 3, 4]),
>  Row(value=[None, None, None, None]),
>  Row(value=[None, None, None, None]),
>  Row(value=[1, 2, 3, 4])]
> {code}
> This also happens with a type of {{ArrayType(ArrayType(IntegerType(), True))}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27332) Filter Pushdown duplicates expensive ScalarSubquery (discarding result)

2019-05-01 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16830947#comment-16830947
 ] 

Marco Gaido commented on SPARK-27332:
-

[~dzklip] actually, Spark was not using the ScalarSubquery result for filtering 
on the data source even before that PR. Hence the execution was performed twice 
with no gain on data source filtering. So we first fixed the duplicated 
execution, and then in SPARK-26893 the ability to use the subquery result in 
partition filters was added.

> Filter Pushdown duplicates expensive ScalarSubquery (discarding result)
> ---
>
> Key: SPARK-27332
> URL: https://issues.apache.org/jira/browse/SPARK-27332
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.4.0
>Reporter: David Stewart Zink
>Priority: Major
>
> Test Query:  SELECT a,b FROM r0 WHERE a <= (SELECT AVG(a) from r1) LIMIT 5000
> r0,r1 have schema \{a:Double, b:String}
> IF r0 is a CsvRelation than no problem, but if it is a ParquetRelation then 
> the table scan is performed twice and the AVG(a) is simply discarded in one 
> branch. This appears to have something to do with pushing the filter down 
> into the table scan.
> Breakpoints in org.apache.spark.sql.execution.ScalarSubquery's methods 
> updateResult(), eval(), and doGenCode() seems sufficient to demonstrate that 
> one of the two computed scalars is never used.
> The duplicated ScalarSubquery can be seen in this (physical) plan:
> {noformat}
> == Parsed Logical Plan ==
> 'Project ['a, 'b]
> +- 'Filter ('a <= scalar-subquery#8 [])
>:  +- 'Project [unresolvedalias('AVG('a), None)]
>: +- 'UnresolvedRelation `r1`
>+- 'UnresolvedRelation `r0`
> == Analyzed Logical Plan ==
> a: double, b: string
> Project [a#9, b#10]
> +- Filter (a#9 <= scalar-subquery#8 [])
>:  +- Aggregate [avg(a#15) AS avg(a)#22]
>: +- SubqueryAlias `r1`
>:+- Project [a#15, b#16]
>:   +- Relation[a#15,b#16] 
> CsvRelation(,Some(/tmp/r1),true,,,",\,#,PERMISSIVE,COMMONS,false,false,true,StructType(StructField(a,DoubleType,false),
>  StructField(b,StringType,true)),false,null,,null)
>+- SubqueryAlias `r0`
>   +- Project [a#9, b#10]
>  +- Relation[a#9,b#10] parquet
> == Optimized Logical Plan ==
> Filter (isnotnull(a#9) && (a#9 <= scalar-subquery#8 []))
> :  +- Aggregate [avg(a#15) AS avg(a)#22]
> : +- Project [a#15]
> :+- Relation[a#15,b#16] 
> CsvRelation(,Some(/tmp/r1),true,,,",\,#,PERMISSIVE,COMMONS,false,false,true,StructType(StructField(a,DoubleType,false),
>  StructField(b,StringType,true)),false,null,,null)
> +- Relation[a#9,b#10] parquet
> == Physical Plan ==
> *(1) Project [a#9, b#10]
> +- *(1) Filter (isnotnull(a#9) && (a#9 <= Subquery subquery8))
>:  +- Subquery subquery8
>: +- *(2) HashAggregate(keys=[], functions=[avg(a#15)], 
> output=[avg(a)#22])
>:+- Exchange SinglePartition
>:   +- *(1) HashAggregate(keys=[], functions=[partial_avg(a#15)], 
> output=[sum#27, count#28L])
>:  +- *(1) Scan 
> CsvRelation(,Some(/tmp/r1),true,,,",\,#,PERMISSIVE,COMMONS,false,false,true,StructType(StructField(a,DoubleType,false),
>  StructField(b,StringType,true)),false,null,,null) [a#15] PushedFilters: [], 
> ReadSchema: struct
>+- *(1) FileScan parquet [a#9,b#10] Batched: true, Format: Parquet, 
> Location: InMemoryFileIndex[file:/tmp/r0], PartitionFilters: [], 
> PushedFilters: [IsNotNull(a)], ReadSchema: struct
>  +- Subquery subquery8
> +- *(2) HashAggregate(keys=[], functions=[avg(a#15)], 
> output=[avg(a)#22])
>+- Exchange SinglePartition
>   +- *(1) HashAggregate(keys=[], 
> functions=[partial_avg(a#15)], output=[sum#27, count#28L])
>  +- *(1) Scan 
> CsvRelation(,Some(/tmp/r1),true,,,",\,#,PERMISSIVE,COMMONS,false,false,true,StructType(StructField(a,DoubleType,false),
>  StructField(b,StringType,true)),false,null,,null) [a#15] PushedFilters: [], 
> ReadSchema: struct
> {noformat}
> Whereas when I change r0 to be a CsvRelation:
> {noformat}
> == Parsed Logical Plan ==
> 'Project ['a, 'b]
> +- 'Filter ('a <= scalar-subquery#0 [])
>:  +- 'Project [unresolvedalias('AVG('a), None)]
>: +- 'UnresolvedRelation `r1`
>+- 'UnresolvedRelation `r0`
> == Analyzed Logical Plan ==
> a: double, b: string
> Project [a#1, b#2]
> +- Filter (a#1 <= scalar-subquery#0 [])
>:  +- Aggregate [avg(a#7) AS avg(a)#14]
>: +- SubqueryAlias `r1`
>:+- Project [a#7, b#8]
>:   +- Relation[a#7,b#8] 
> CsvRelation(,Some(/tmp/r1),true,,,",\,#,PERMISSIVE,COMMONS,false,false,true,StructType(StructField(a,DoubleType,false),
>  

[jira] [Commented] (SPARK-27612) Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays of None

2019-05-01 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16830945#comment-16830945
 ] 

Marco Gaido commented on SPARK-27612:
-

I am not able to reproduce...

{code}

      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
      /_/

Using Python version 2.7.10 (default, Oct 6 2017 22:29:07)
SparkSession available as 'spark'.
>>> from pyspark.sql.types import ArrayType, IntegerType 
>>> df = spark.createDataFrame([[1, 2, 3, 4]] * 100, ArrayType(IntegerType(), 
>>> True)) 
>>> df.distinct().collect() 
[Row(value=[1, 2, 3, 4])] 
>>>

{code}

> Creating a DataFrame in PySpark with ArrayType produces some Rows with Arrays 
> of None
> -
>
> Key: SPARK-27612
> URL: https://issues.apache.org/jira/browse/SPARK-27612
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> When creating a DataFrame with type {{ArrayType(IntegerType(), True)}} there 
> ends up being rows that are filled with None.
>  
> {code:java}
> In [1]: from pyspark.sql.types import ArrayType, IntegerType  
>    
> In [2]: df = spark.createDataFrame([[1, 2, 3, 4]] * 100, 
> ArrayType(IntegerType(), True))     
> In [3]: df.distinct().collect()   
>    
> Out[3]: [Row(value=[None, None, None, None]), Row(value=[1, 2, 3, 4])]
> {code}
>  
> From this example, it is consistently at elements 97, 98:
> {code:python}
> In [5]: df.collect()[-5:] 
>    
> Out[5]: 
> [Row(value=[1, 2, 3, 4]),
>  Row(value=[1, 2, 3, 4]),
>  Row(value=[None, None, None, None]),
>  Row(value=[None, None, None, None]),
>  Row(value=[1, 2, 3, 4])]
> {code}
> This also happens with a type of {{ArrayType(ArrayType(IntegerType(), True))}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27607) Improve performance of Row.toString()

2019-05-01 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16830933#comment-16830933
 ] 

Marco Gaido commented on SPARK-27607:
-

Hi [~joshrosen], are you working on it? If not I can take it. Thanks.

> Improve performance of Row.toString()
> -
>
> Key: SPARK-27607
> URL: https://issues.apache.org/jira/browse/SPARK-27607
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Priority: Minor
>
> I have a job which ends up calling {{org.apache.spark.sql.Row.toString}} on 
> every row in a massive dataset (the reasons for this are slightly odd and 
> it's a bit non-trivial to change the job to avoid this step). 
> {{Row.toString}} is implemented by first constructing a WrappedArray 
> containing the Row's values (by calling {{toSeq}}) and then turning that 
> array into a string with {{mkString}}. We might be able to get a small 
> performance win by pipelining these steps, using an imperative loop to append 
> fields to a StringBuilder as soon as they're retrieved (thereby cutting out a 
> few layers of Scala collections indirection).
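
A minimal sketch of the pipelined approach described above (not the actual implementation) might look like:

{code:scala}
import org.apache.spark.sql.Row

// Append each field to a StringBuilder as soon as it is retrieved, instead of
// building a Seq via toSeq and then calling mkString on it.
def rowToString(row: Row): String = {
  val sb = new StringBuilder("[")
  var i = 0
  while (i < row.length) {
    if (i > 0) sb.append(',')
    sb.append(row.get(i))
    i += 1
  }
  sb.append(']').toString
}
{code}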



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27287) PCAModel.load() does not honor spark configs

2019-04-23 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823876#comment-16823876
 ] 

Marco Gaido commented on SPARK-27287:
-

[~dharmesh.kakadia] the point is: if you set a config on the {{SparkSession}}, 
it is not "copied" to the {{SparkContext}} until the first SQL job occurs. 
Since the PCAModel.load method uses the SparkContext, if you do not submit a 
SQL job before calling it, all the configurations set in the SQL session are 
ignored. Above I suggested two ideas to fix this on the Spark side and avoid 
such a problem.
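
Until this is fixed in Spark, a workaround sketch based on the behaviour described above (the key and paths are the placeholders from the report) is to run any SQL job before loading the model:

{code:scala}
// Workaround sketch: trigger a SQL job first so the session configs reach the
// SparkContext used by PCAModel.load(). Key and paths are placeholders.
import org.apache.spark.ml.feature.PCAModel
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("pca-load")
  .config("fs.azure.account.key.test.blob.core.windows.net", "Xosad==")
  .getOrCreate()

spark.range(1).count()   // any SQL job, forcing the session confs to be applied

val model = PCAModel.load("wasb://t...@test.blob.core.windows.net/model")
{code}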

> PCAModel.load() does not honor spark configs
> 
>
> Key: SPARK-27287
> URL: https://issues.apache.org/jira/browse/SPARK-27287
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Dharmesh Kakadia
>Priority: Major
>
> PCAModel.load() does not seem to be using the configurations set on the 
> current spark session. 
> Repro:
>  
> The following will fail to read the data because the storage account 
> credentials config is not used/propagated. 
> conf.set("fs.azure.account.key.test.blob.core.windows.net","Xosad==")
> spark = 
> SparkSession.builder.appName("dharmesh").config(conf=conf).master('spark://spark-master:7077').getOrCreate()
> model = PCAModel.load('wasb://t...@test.blob.core.windows.net/model')
>  
> The following however works:
> conf.set("fs.azure.account.key.test.blob.core.windows.net","Xosad==")
> spark = 
> SparkSession.builder.appName("dharmesh").config(conf=conf).master('spark://spark-master:7077').getOrCreate()
> blah = 
> spark.read.json('wasb://t...@test.blob.core.windows.net/somethingelse/')
> blah.show()
> model = PCAModel.load('wasb://t...@test.blob.core.windows.net/model')
>  
> It looks like spark.read...() does force the use of the config once and then 
> PCAModel.load() will work correctly. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26218) Throw exception on overflow for integers

2019-04-12 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16816087#comment-16816087
 ] 

Marco Gaido commented on SPARK-26218:
-

[~rxin] I see that. But the reasons for this are:

 - the current behavior is against the SQL standard;
 - having incorrect results in some use cases is worse than having a job 
failure;
 - introducing a config which lets users choose the behavior allows them to 
decide which is the best option for their specific use cases. Moreover, I can 
also see people having the overflow check turned on in dev environments, so 
that they can find issues, and turned off in prod in order to avoid wasting 
hours as you mentioned.
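
As a small illustration of the two behaviors being discussed (plain JVM level, independent of any Spark config):

{code:scala}
// Java-style arithmetic silently wraps on overflow; the SQL-standard behaviour
// proposed here would raise an error instead, like Math.addExact does.
val a = Int.MaxValue
val wrapped = a + 1                       // -2147483648, silently wrong
val checked =
  try java.lang.Math.addExact(a, 1).toString
  catch { case e: ArithmeticException => s"error: ${e.getMessage}" }
{code}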

> Throw exception on overflow for integers
> 
>
> Key: SPARK-26218
> URL: https://issues.apache.org/jira/browse/SPARK-26218
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Marco Gaido
>Priority: Major
>
> SPARK-24598 just updated the documentation in order to state that our 
> addition is a Java style one and not a SQL style. But in order to follow the 
> SQL standard we should instead throw an exception if an overflow occurs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27278) Optimize GetMapValue when the map is a foldable and the key is not

2019-04-07 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811808#comment-16811808
 ] 

Marco Gaido commented on SPARK-27278:
-

[~huonw] I think the point is: in the existing case which is optimized, the 
{{CreateMap}} operation is removed and replaced with the {{CASE ... WHEN}} 
syntax. This is fine because the code generated by the {{CreateMap}} operation 
is linear in size with the number of elements, exactly as for the {{CASE ... 
WHEN}} approach, so the overall generated code size is similar. When the 
{{CreateMap}} operation is replaced with a {{Literal}} instead, that code is 
not there and we replace the existing for loop with a list of if statements, so 
the code size is significantly bigger. Even though trivial tests show that the 
second case is faster, generating bigger code can lead to huge perf issues in 
more complex scenarios (the worst cases are probably when it causes the method 
size to grow beyond 4k so no JIT is performed anymore, or even causes the 
failure to use WholeStageCodegen): so the performance effect of introducing it 
would be highly query-dependent and hard to predict.

> Optimize GetMapValue when the map is a foldable and the key is not
> --
>
> Key: SPARK-27278
> URL: https://issues.apache.org/jira/browse/SPARK-27278
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4.0
>Reporter: Huon Wilson
>Priority: Minor
>
> With a map that isn't constant-foldable, spark will optimise an access to a 
> series of {{CASE WHEN ... THEN ... WHEN ... THEN ... END}}, for instance
> {code:none}
> scala> spark.range(1000).select(map(lit(1), lit(1), lit(2), 'id)('id) as 
> "x").explain
> == Physical Plan ==
> *(1) Project [CASE WHEN (cast(id#180L as int) = 1) THEN 1 WHEN (cast(id#180L 
> as int) = 2) THEN id#180L END AS x#182L]
> +- *(1) Range (0, 1000, step=1, splits=12)
> {code}
> This results in an efficient series of ifs and elses, in the code generation:
> {code:java}
> /* 037 */   boolean project_isNull_3 = false;
> /* 038 */   int project_value_3 = -1;
> /* 039 */   if (!false) {
> /* 040 */ project_value_3 = (int) project_expr_0_0;
> /* 041 */   }
> /* 042 */
> /* 043 */   boolean project_value_2 = false;
> /* 044 */   project_value_2 = project_value_3 == 1;
> /* 045 */   if (!false && project_value_2) {
> /* 046 */ project_caseWhenResultState_0 = (byte)(false ? 1 : 0);
> /* 047 */ project_project_value_1_0 = 1L;
> /* 048 */ continue;
> /* 049 */   }
> /* 050 */
> /* 051 */   boolean project_isNull_8 = false;
> /* 052 */   int project_value_8 = -1;
> /* 053 */   if (!false) {
> /* 054 */ project_value_8 = (int) project_expr_0_0;
> /* 055 */   }
> /* 056 */
> /* 057 */   boolean project_value_7 = false;
> /* 058 */   project_value_7 = project_value_8 == 2;
> /* 059 */   if (!false && project_value_7) {
> /* 060 */ project_caseWhenResultState_0 = (byte)(false ? 1 : 0);
> /* 061 */ project_project_value_1_0 = project_expr_0_0;
> /* 062 */ continue;
> /* 063 */   }
> {code}
> If the map can be constant folded, the constant folding happens first, and 
> the {{SimplifyExtractValueOps}} optimisation doesn't trigger, resulting doing 
> a map traversal and more dynamic checks:
> {code:none}
> scala> spark.range(1000).select(map(lit(1), lit(1), lit(2), lit(2))('id) as 
> "x").explain
> == Physical Plan ==
> *(1) Project [keys: [1,2], values: [1,2][cast(id#195L as int)] AS x#197]
> +- *(1) Range (0, 1000, step=1, splits=12)
> {code}
> The {{keys: ..., values: ...}} is from the {{ArrayBasedMapData}} type, which 
> is what is stored in the {{Literal}} form of the {{map(...)}} expression in 
> that select. The code generated is less efficient, since it has to do a 
> manual dynamic traversal of the map's array of keys, with type casts etc.:
> {code:java}
> /* 099 */   int project_index_0 = 0;
> /* 100 */   boolean project_found_0 = false;
> /* 101 */   while (project_index_0 < project_length_0 && 
> !project_found_0) {
> /* 102 */ final int project_key_0 = 
> project_keys_0.getInt(project_index_0);
> /* 103 */ if (project_key_0 == project_value_2) {
> /* 104 */   project_found_0 = true;
> /* 105 */ } else {
> /* 106 */   project_index_0++;
> /* 107 */ }
> /* 108 */   }
> /* 109 */
> /* 110 */   if (!project_found_0) {
> /* 111 */ project_isNull_0 = true;
> /* 112 */   } else {
> /* 113 */ project_value_0 = 
> project_values_0.getInt(project_index_0);
> /* 114 */   }
> {code}
> It 

[jira] [Commented] (SPARK-27287) PCAModel.load() does not honor spark configs

2019-04-02 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16807780#comment-16807780
 ] 

Marco Gaido commented on SPARK-27287:
-

I think the problem here is that the configurations are copied from the 
SparkSession to the SparkContext when the SparkSession is used, but they are 
not if the SparkContext is used directly (as it is done when loading a model in 
ML). I think there are two possible solutions here: fix the way the contexts 
handle properties (preferred), or use the SparkSession when reading ML models 
(easier workaround).

> PCAModel.load() does not honor spark configs
> 
>
> Key: SPARK-27287
> URL: https://issues.apache.org/jira/browse/SPARK-27287
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Dharmesh Kakadia
>Priority: Major
>
> PCAModel.load() does not seem to be using the configurations set on the 
> current spark session. 
> Repro:
>  
> The following will fail to read the data because the storage account 
> credentials config is not used/propagated. 
> conf.set("fs.azure.account.key.test.blob.core.windows.net","Xosad==")
> spark = 
> SparkSession.builder.appName("dharmesh").config(conf=conf).master('spark://spark-master:7077').getOrCreate()
> model = PCAModel.load('wasb://t...@test.blob.core.windows.net/model')
>  
> The following however works:
> conf.set("fs.azure.account.key.test.blob.core.windows.net","Xosad==")
> spark = 
> SparkSession.builder.appName("dharmesh").config(conf=conf).master('spark://spark-master:7077').getOrCreate()
> blah = 
> spark.read.json('wasb://t...@test.blob.core.windows.net/somethingelse/')
> blah.show()
> model = PCAModel.load('wasb://t...@test.blob.core.windows.net/model')
>  
> It looks like spark.read...() does force the use of the config once and then 
> PCAModel.load() will work correctly. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27283) BigDecimal arithmetic losing precision

2019-03-28 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16803767#comment-16803767
 ] 

Marco Gaido commented on SPARK-27283:
-

{quote}
I guess I'm mostly frustrated that the SQL standard leaves decimal and 
floating-point arithmetics implementation-specific.
{quote}

Yes, I think so.

> BigDecimal arithmetic losing precision
> --
>
> Key: SPARK-27283
> URL: https://issues.apache.org/jira/browse/SPARK-27283
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Mats
>Priority: Minor
>  Labels: decimal, float, sql
>
> When performing arithmetics between doubles and decimals, the resulting value 
> is always a double. This is very strange to me; when an exact type is present 
> as one of the inputs, I would expect that the inexact type is lifted and the 
> result presented exactly, rather than lowering the exact type to the inexact 
> and presenting a result that may contain rounding errors. The choice to use a 
> decimal was probably taken because rounding errors were deemed an issue.
> When performing arithmetics between decimals and integers, the expected 
> behaviour is seen; the result is a decimal.
> See the following example:
> {code:java}
> import org.apache.spark.sql.functions
> val df = sparkSession.createDataFrame(Seq(Tuple1(0L))).toDF("a")
> val decimalInt = df.select(functions.lit(BigDecimal(3.14)) + functions.lit(1) 
> as "d")
> val decimalDouble = df.select(functions.lit(BigDecimal(3.14)) + 
> functions.lit(1.0) as "d")
> decimalInt.schema.printTreeString()
> decimalInt.show()
> decimalDouble.schema.printTreeString()
> decimalDouble.show(){code}
> which produces this output (with possible variation on the rounding error):
> {code:java}
> root
> |-- d: decimal(4,2) (nullable = true)
> ++
> | d  |
> ++
> |4.14|
> ++
> root
> |-- d: double (nullable = false)
> +-+
> | d   |
> +-+
> |4.141|
> +-+
> {code}
>  
>  I would argue that this is a bug, and that the correct thing to do would be 
> to lift the result to a decimal also when one operand is a double.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27282) Spark incorrect results when using UNION with GROUP BY clause

2019-03-26 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16801930#comment-16801930
 ] 

Marco Gaido commented on SPARK-27282:
-

Please do not use Blocker/Critical as they are reserved for committers. This 
bug has already been fixed, I think: I am not able to reproduce it on master. 
Could you please try as well? Unfortunately I am not sure which JIRA actually 
fixed this behavior though...

> Spark incorrect results when using UNION with GROUP BY clause
> -
>
> Key: SPARK-27282
> URL: https://issues.apache.org/jira/browse/SPARK-27282
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit, SQL
>Affects Versions: 2.3.2
> Environment: I'm using :
> IntelliJ  IDEA ==> 2018.1.4
> spark-sql and spark-core ==> 2.3.2.3.1.0.0-78 (for HDP 3.1)
> scala ==> 2.11.8
>Reporter: Sofia
>Priority: Major
>
> When using a UNION clause after a GROUP BY clause in Spark, the results 
> obtained are wrong.
> The following example illustrates this issue:
> {code:java}
> CREATE TABLE test_un (
> col1 varchar(255),
> col2 varchar(255),
> col3 varchar(255),
> col4 varchar(255)
> );
> INSERT INTO test_un (col1, col2, col3, col4)
> VALUES (1,1,2,4),
> (1,1,2,4),
> (1,1,3,5),
> (2,2,2,null);
> {code}
> I used the following code :
> {code:java}
> val x = Toolkit.HiveToolkit.getDataFromHive("test","test_un")
> val  y = x
>.filter(col("col4")isNotNull)
>   .groupBy("col1", "col2","col3")
>   .agg(count(col("col3")).alias("cnt"))
>   .withColumn("col_name", lit("col3"))
>   .select(col("col1"), col("col2"), 
> col("col_name"),col("col3").alias("col_value"), col("cnt"))
> val z = x
>   .filter(col("col4")isNotNull)
>   .groupBy("col1", "col2","col4")
>   .agg(count(col("col4")).alias("cnt"))
>   .withColumn("col_name", lit("col4"))
>   .select(col("col1"), col("col2"), 
> col("col_name"),col("col4").alias("col_value"), col("cnt"))
> y.union(z).show()
> {code}
>  And I obtained the following results:
> ||col1||col2||col_name||col_value||cnt||
> |1|1|col3|5|1|
> |1|1|col3|4|2|
> |1|1|col4|5|1|
> |1|1|col4|4|2|
> Expected results:
> ||col1||col2||col_name||col_value||cnt||
> |1|1|col3|3|1|
> |1|1|col3|2|2|
> |1|1|col4|4|2|
> |1|1|col4|5|1|
> But when I remove the last row of the table, I obtain the correct results.
> {code:java}
> (2,2,2,null){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27282) Spark incorrect results when using UNION with GROUP BY clause

2019-03-26 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-27282:

Priority: Major  (was: Blocker)

> Spark incorrect results when using UNION with GROUP BY clause
> -
>
> Key: SPARK-27282
> URL: https://issues.apache.org/jira/browse/SPARK-27282
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit, SQL
>Affects Versions: 2.3.2
> Environment: I'm using :
> IntelliJ  IDEA ==> 2018.1.4
> spark-sql and spark-core ==> 2.3.2.3.1.0.0-78 (for HDP 3.1)
> scala ==> 2.11.8
>Reporter: Sofia
>Priority: Major
>
> When using a UNION clause after a GROUP BY clause in Spark, the results 
> obtained are wrong.
> The following example illustrates this issue:
> {code:java}
> CREATE TABLE test_un (
> col1 varchar(255),
> col2 varchar(255),
> col3 varchar(255),
> col4 varchar(255)
> );
> INSERT INTO test_un (col1, col2, col3, col4)
> VALUES (1,1,2,4),
> (1,1,2,4),
> (1,1,3,5),
> (2,2,2,null);
> {code}
> I used the following code :
> {code:java}
> val x = Toolkit.HiveToolkit.getDataFromHive("test","test_un")
> val  y = x
>.filter(col("col4")isNotNull)
>   .groupBy("col1", "col2","col3")
>   .agg(count(col("col3")).alias("cnt"))
>   .withColumn("col_name", lit("col3"))
>   .select(col("col1"), col("col2"), 
> col("col_name"),col("col3").alias("col_value"), col("cnt"))
> val z = x
>   .filter(col("col4")isNotNull)
>   .groupBy("col1", "col2","col4")
>   .agg(count(col("col4")).alias("cnt"))
>   .withColumn("col_name", lit("col4"))
>   .select(col("col1"), col("col2"), 
> col("col_name"),col("col4").alias("col_value"), col("cnt"))
> y.union(z).show()
> {code}
>  And I obtained the following results:
> ||col1||col2||col_name||col_value||cnt||
> |1|1|col3|5|1|
> |1|1|col3|4|2|
> |1|1|col4|5|1|
> |1|1|col4|4|2|
> Expected results:
> ||col1||col2||col_name||col_value||cnt||
> |1|1|col3|3|1|
> |1|1|col3|2|2|
> |1|1|col4|4|2|
> |1|1|col4|5|1|
> But when I remove the last row of the table, I obtain the correct results.
> {code:java}
> (2,2,2,null){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27283) BigDecimal arithmetic losing precision

2019-03-26 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16801778#comment-16801778
 ] 

Marco Gaido commented on SPARK-27283:
-

[~Mats_SX] another issue which could happen when using Decimal instead of 
double is not being able to represent very big values (Spark's maximum decimal 
precision is 38).
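
A quick illustration of that limitation (with the default settings of Spark 2.x, where a decimal overflow on cast is expected to yield null):

{code:scala}
// A value with more than 38 digits of precision cannot be represented by
// Spark's DecimalType, while a double can still hold it (approximately).
spark.sql("select cast(1e40 as double) as d, cast(1e40 as decimal(38,0)) as dec").show()
// expected: d = 1.0E40, dec = null (overflow)
{code}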

> BigDecimal arithmetic losing precision
> --
>
> Key: SPARK-27283
> URL: https://issues.apache.org/jira/browse/SPARK-27283
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Mats
>Priority: Minor
>  Labels: decimal, float, sql
>
> When performing arithmetics between doubles and decimals, the resulting value 
> is always a double. This is very strange to me; when an exact type is present 
> as one of the inputs, I would expect that the inexact type is lifted and the 
> result presented exactly, rather than lowering the exact type to the inexact 
> and presenting a result that may contain rounding errors. The choice to use a 
> decimal was probably taken because rounding errors were deemed an issue.
> When performing arithmetics between decimals and integers, the expected 
> behaviour is seen; the result is a decimal.
> See the following example:
> {code:java}
> import org.apache.spark.sql.functions
> val df = sparkSession.createDataFrame(Seq(Tuple1(0L))).toDF("a")
> val decimalInt = df.select(functions.lit(BigDecimal(3.14)) + functions.lit(1) 
> as "d")
> val decimalDouble = df.select(functions.lit(BigDecimal(3.14)) + 
> functions.lit(1.0) as "d")
> decimalInt.schema.printTreeString()
> decimalInt.show()
> decimalDouble.schema.printTreeString()
> decimalDouble.show(){code}
> which produces this output (with possible variation on the rounding error):
> {code:java}
> root
> |-- d: decimal(4,2) (nullable = true)
> ++
> | d  |
> ++
> |4.14|
> ++
> root
> |-- d: double (nullable = false)
> +-+
> | d   |
> +-+
> |4.141|
> +-+
> {code}
>  
>  I would argue that this is a bug, and that the correct thing to do would be 
> to lift the result to a decimal also when one operand is a double.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27243) RuleExecutor throws exception when dumping time spent with no rule executed

2019-03-22 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-27243:
---

 Summary: RuleExecutor throws exception when dumping time spent 
with no rule executed
 Key: SPARK-27243
 URL: https://issues.apache.org/jira/browse/SPARK-27243
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Marco Gaido


RuleExecutor can dump the time spent in the analyzer/optimizer rules. When no 
rule is executed or the RuleExecutor has just been reset this results in an 
exception, rather than returning an empty summary, which should be the result 
of this operation.
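
A minimal repro sketch (assuming a fresh session in which no analyzer or optimizer rule has run yet):

{code:scala}
import org.apache.spark.sql.catalyst.rules.RuleExecutor

// With no rule executed yet, dumping the time spent should return an empty
// summary rather than throwing an exception.
println(RuleExecutor.dumpTimeSpent())
{code}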



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27193) CodeFormatter should format multi comment lines correctly

2019-03-18 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-27193:

Priority: Trivial  (was: Major)

> CodeFormatter should format multi comment lines correctly
> -
>
> Key: SPARK-27193
> URL: https://issues.apache.org/jira/browse/SPARK-27193
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: wuyi
>Priority: Trivial
>
> when enabling `spark.sql.codegen.comments`, there will be multiple comment 
> lines. However, CodeFormatter cannot handle multiple comment lines currently:
>  
> Generated code:
> /* 001 */ public Object generate(Object[] references) {
> /* 002 */   return new GeneratedIteratorForCodegenStage1(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ /**
> * Codegend pipeline for stage (id=1)
> * *(1) Project [(id#0L + 1) AS (id + 1)#3L]
> * +- *(1) Filter (id#0L = 1)
> *+- *(1) Range (0, 10, step=1, splits=4)
> */
> /* 006 */ // codegenStageId=1
> /* 007 */ final class GeneratedIteratorForCodegenStage1 extends 
> org.apache.spark.sql.execution.BufferedRowIterator {



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27089) Loss of precision during decimal division

2019-03-18 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16794984#comment-16794984
 ] 

Marco Gaido commented on SPARK-27089:
-

You can set {{spark.sql.decimalOperations.allowPrecisionLoss}} to {{false}} 
and you will get the original behavior. Please see SPARK-22036 for more details.
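
For example (a sketch; the config has to be set before the query is planned):

{code:scala}
// Disable precision loss to get back the pre-2.3 behaviour for this division.
spark.conf.set("spark.sql.decimalOperations.allowPrecisionLoss", "false")
spark.sql("""select cast(cast(3 as decimal(38,14)) / cast(9 as decimal(38,14))
             as decimal(38,14)) val""").show()
// with the flag disabled, the result keeps the full 14 decimal places instead
// of being truncated to a scale of 6
{code}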

> Loss of precision during decimal division
> -
>
> Key: SPARK-27089
> URL: https://issues.apache.org/jira/browse/SPARK-27089
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: ylo0ztlmtusq
>Priority: Major
>
> Spark loses decimal places when dividing decimal numbers.
>  
> Expected behavior (In Spark 2.2.3 or before)
>  
> {code:java}
> scala> val sql = """select cast(cast(3 as decimal(38,14)) / cast(9 as 
> decimal(38,14)) as decimal(38,14)) val"""
> sql: String = select cast(cast(3 as decimal(38,14)) / cast(9 as 
> decimal(38,14)) as decimal(38,14)) val
> scala> spark.sql(sql).show
> 19/03/07 21:23:51 WARN ObjectStore: Failed to get database global_temp, 
> returning NoSuchObjectException
> ++
> | val|
> ++
> |0.33|
> ++
> {code}
>  
> Current behavior (In Spark 2.3.2 and later)
>  
> {code:java}
> scala> val sql = """select cast(cast(3 as decimal(38,14)) / cast(9 as 
> decimal(38,14)) as decimal(38,14)) val"""
> sql: String = select cast(cast(3 as decimal(38,14)) / cast(9 as 
> decimal(38,14)) as decimal(38,14)) val
> scala> spark.sql(sql).show
> ++
> | val|
> ++
> |0.33|
> ++
> {code}
>  
> Seems to be caused by {{promote_precision(38, 6)}}
>  
> {code:java}
> scala> spark.sql(sql).explain(true)
> == Parsed Logical Plan ==
> Project [cast((cast(3 as decimal(38,14)) / cast(9 as decimal(38,14))) as 
> decimal(38,14)) AS val#20]
> +- OneRowRelation
> == Analyzed Logical Plan ==
> val: decimal(38,14)
> Project [cast(CheckOverflow((promote_precision(cast(cast(3 as decimal(38,14)) 
> as decimal(38,14))) / promote_precision(cast(cast(9 as decimal(38,14)) as 
> decimal(38,14, DecimalType(38,6)) as decimal(38,14)) AS val#20]
> +- OneRowRelation
> == Optimized Logical Plan ==
> Project [0.33 AS val#20]
> +- OneRowRelation
> == Physical Plan ==
> *(1) Project [0.33 AS val#20]
> +- Scan OneRowRelation[]
> {code}
>  
> Source https://stackoverflow.com/q/55046492



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27018) Checkpointed RDD deleted prematurely when using GBTClassifier

2019-03-07 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16786984#comment-16786984
 ] 

Marco Gaido commented on SPARK-27018:
-

The PeriodicCheckpointer is still there in master, you can check it on GitHub: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/PeriodicCheckpointer.scala.
 You can check the file history in git to see what was changed and how. I'd 
recommend, anyway, testing whether current master is still affected by this 
issue, as it may have been fixed since 2.2.

> Checkpointed RDD deleted prematurely when using GBTClassifier
> -
>
> Key: SPARK-27018
> URL: https://issues.apache.org/jira/browse/SPARK-27018
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.2, 2.2.3, 2.3.3, 2.4.0
> Environment: OS: Ubuntu Linux 18.10
> Java: java version "1.8.0_201"
> Java(TM) SE Runtime Environment (build 1.8.0_201-b09)
> Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode)
> Reproducible with a single-node Spark in standalone mode.
> Reproducible with Zeppelin or Spark shell.
>  
>Reporter: Piotr Kołaczkowski
>Priority: Major
> Attachments: 
> Fix_check_if_the_next_checkpoint_exists_before_deleting_the_old_one.patch
>
>
> Steps to reproduce:
> {noformat}
> import org.apache.spark.ml.linalg.Vectors
> import org.apache.spark.ml.classification.GBTClassifier
> case class Row(features: org.apache.spark.ml.linalg.Vector, label: Int)
> sc.setCheckpointDir("/checkpoints")
> val trainingData = sc.parallelize(1 to 2426874, 256).map(x => 
> Row(Vectors.dense(x, x + 1, x * 2 % 10), if (x % 5 == 0) 1 else 0)).toDF
> val classifier = new GBTClassifier()
>   .setLabelCol("label")
>   .setFeaturesCol("features")
>   .setProbabilityCol("probability")
>   .setMaxIter(100)
>   .setMaxDepth(10)
>   .setCheckpointInterval(2)
> classifier.fit(trainingData){noformat}
>  
> The last line fails with:
> {noformat}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 56.0 failed 10 times, most recent failure: Lost task 0.9 in stage 56.0 
> (TID 12058, 127.0.0.1, executor 0): java.io.FileNotFoundException: 
> /checkpoints/191c9209-0955-440f-8c11-f042bdf7f804/rdd-51
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$1.applyOrElse(DseFileSystem.scala:63)
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$1.applyOrElse(DseFileSystem.scala:61)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem.com$datastax$bdp$fs$hadoop$DseFileSystem$$translateToHadoopExceptions(DseFileSystem.scala:70)
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$6.apply(DseFileSystem.scala:264)
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$6.apply(DseFileSystem.scala:264)
> at 
> com.datastax.bdp.fs.hadoop.DseFsInputStream.input(DseFsInputStream.scala:31)
> at 
> com.datastax.bdp.fs.hadoop.DseFsInputStream.openUnderlyingDataSource(DseFsInputStream.scala:39)
> at com.datastax.bdp.fs.hadoop.DseFileSystem.open(DseFileSystem.scala:269)
> at 
> org.apache.spark.rdd.ReliableCheckpointRDD$.readCheckpointFile(ReliableCheckpointRDD.scala:292)
> at 
> org.apache.spark.rdd.ReliableCheckpointRDD.compute(ReliableCheckpointRDD.scala:100)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:322)
> at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
> at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
> at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
> at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
> at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)

[jira] [Commented] (SPARK-27018) Checkpointed RDD deleted prematurely when using GBTClassifier

2019-03-01 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16781657#comment-16781657
 ] 

Marco Gaido commented on SPARK-27018:
-

First the issue is fixed on master and then it is backported to the other 
branches (if there are no conflicts, the committer will do it while merging to 
master). Anyway, 2.2 is EOL, so it will eventually be backported only to 
2.3/2.4.

> Checkpointed RDD deleted prematurely when using GBTClassifier
> -
>
> Key: SPARK-27018
> URL: https://issues.apache.org/jira/browse/SPARK-27018
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.2, 2.2.3, 2.3.3, 2.4.0
> Environment: OS: Ubuntu Linux 18.10
> Java: java version "1.8.0_201"
> Java(TM) SE Runtime Environment (build 1.8.0_201-b09)
> Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode)
> Reproducible with a single-node Spark in standalone mode.
> Reproducible with Zeppelin or Spark shell.
>  
>Reporter: Piotr Kołaczkowski
>Priority: Major
> Attachments: 
> Fix_check_if_the_next_checkpoint_exists_before_deleting_the_old_one.patch
>
>
> Steps to reproduce:
> {noformat}
> import org.apache.spark.ml.linalg.Vectors
> import org.apache.spark.ml.classification.GBTClassifier
> case class Row(features: org.apache.spark.ml.linalg.Vector, label: Int)
> sc.setCheckpointDir("/checkpoints")
> val trainingData = sc.parallelize(1 to 2426874, 256).map(x => 
> Row(Vectors.dense(x, x + 1, x * 2 % 10), if (x % 5 == 0) 1 else 0)).toDF
> val classifier = new GBTClassifier()
>   .setLabelCol("label")
>   .setFeaturesCol("features")
>   .setProbabilityCol("probability")
>   .setMaxIter(100)
>   .setMaxDepth(10)
>   .setCheckpointInterval(2)
> classifier.fit(trainingData){noformat}
>  
> The last line fails with:
> {noformat}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 56.0 failed 10 times, most recent failure: Lost task 0.9 in stage 56.0 
> (TID 12058, 127.0.0.1, executor 0): java.io.FileNotFoundException: 
> /checkpoints/191c9209-0955-440f-8c11-f042bdf7f804/rdd-51
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$1.applyOrElse(DseFileSystem.scala:63)
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$1.applyOrElse(DseFileSystem.scala:61)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem.com$datastax$bdp$fs$hadoop$DseFileSystem$$translateToHadoopExceptions(DseFileSystem.scala:70)
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$6.apply(DseFileSystem.scala:264)
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$6.apply(DseFileSystem.scala:264)
> at 
> com.datastax.bdp.fs.hadoop.DseFsInputStream.input(DseFsInputStream.scala:31)
> at 
> com.datastax.bdp.fs.hadoop.DseFsInputStream.openUnderlyingDataSource(DseFsInputStream.scala:39)
> at com.datastax.bdp.fs.hadoop.DseFileSystem.open(DseFileSystem.scala:269)
> at 
> org.apache.spark.rdd.ReliableCheckpointRDD$.readCheckpointFile(ReliableCheckpointRDD.scala:292)
> at 
> org.apache.spark.rdd.ReliableCheckpointRDD.compute(ReliableCheckpointRDD.scala:100)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:322)
> at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
> at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
> at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
> at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
> at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)

[jira] [Commented] (SPARK-27018) Checkpointed RDD deleted prematurely when using GBTClassifier

2019-03-01 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16781558#comment-16781558
 ] 

Marco Gaido commented on SPARK-27018:
-

[~pkolaczk] thanks for reporting the issue. Spark works with PRs, not patches. 
Can you please submit a PR against the master branch? Thanks.

> Checkpointed RDD deleted prematurely when using GBTClassifier
> -
>
> Key: SPARK-27018
> URL: https://issues.apache.org/jira/browse/SPARK-27018
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.2, 2.2.3, 2.3.3, 2.4.0
> Environment: OS: Ubuntu Linux 18.10
> Java: java version "1.8.0_201"
> Java(TM) SE Runtime Environment (build 1.8.0_201-b09)
> Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode)
> Reproducible with a single-node Spark in standalone mode.
> Reproducible with Zeppelin or Spark shell.
>  
>Reporter: Piotr Kołaczkowski
>Priority: Major
> Attachments: 
> Fix_check_if_the_next_checkpoint_exists_before_deleting_the_old_one.patch
>
>
> Steps to reproduce:
> {noformat}
> import org.apache.spark.ml.linalg.Vectors
> import org.apache.spark.ml.classification.GBTClassifier
> case class Row(features: org.apache.spark.ml.linalg.Vector, label: Int)
> sc.setCheckpointDir("/checkpoints")
> val trainingData = sc.parallelize(1 to 2426874, 256).map(x => 
> Row(Vectors.dense(x, x + 1, x * 2 % 10), if (x % 5 == 0) 1 else 0)).toDF
> val classifier = new GBTClassifier()
>   .setLabelCol("label")
>   .setFeaturesCol("features")
>   .setProbabilityCol("probability")
>   .setMaxIter(100)
>   .setMaxDepth(10)
>   .setCheckpointInterval(2)
> classifier.fit(trainingData){noformat}
>  
> The last line fails with:
> {noformat}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 56.0 failed 10 times, most recent failure: Lost task 0.9 in stage 56.0 
> (TID 12058, 127.0.0.1, executor 0): java.io.FileNotFoundException: 
> /checkpoints/191c9209-0955-440f-8c11-f042bdf7f804/rdd-51
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$1.applyOrElse(DseFileSystem.scala:63)
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$1.applyOrElse(DseFileSystem.scala:61)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem.com$datastax$bdp$fs$hadoop$DseFileSystem$$translateToHadoopExceptions(DseFileSystem.scala:70)
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$6.apply(DseFileSystem.scala:264)
> at 
> com.datastax.bdp.fs.hadoop.DseFileSystem$$anonfun$6.apply(DseFileSystem.scala:264)
> at 
> com.datastax.bdp.fs.hadoop.DseFsInputStream.input(DseFsInputStream.scala:31)
> at 
> com.datastax.bdp.fs.hadoop.DseFsInputStream.openUnderlyingDataSource(DseFsInputStream.scala:39)
> at com.datastax.bdp.fs.hadoop.DseFileSystem.open(DseFileSystem.scala:269)
> at 
> org.apache.spark.rdd.ReliableCheckpointRDD$.readCheckpointFile(ReliableCheckpointRDD.scala:292)
> at 
> org.apache.spark.rdd.ReliableCheckpointRDD.compute(ReliableCheckpointRDD.scala:100)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:322)
> at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
> at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
> at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
> at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
> at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
> at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
> at 
> 

[jira] [Commented] (SPARK-26996) Scalar Subquery not handled properly in Spark 2.4

2019-02-26 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778485#comment-16778485
 ] 

Marco Gaido commented on SPARK-26996:
-

Thanks [~dongjoon]!

> Scalar Subquery not handled properly in Spark 2.4 
> --
>
> Key: SPARK-26996
> URL: https://issues.apache.org/jira/browse/SPARK-26996
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Ilya Peysakhov
>Priority: Critical
>
> Spark 2.4 reports an error when querying a dataframe that has only 1 row and 
> 1 column (scalar subquery). 
>  
> Reproducer is below. No other data is needed to reproduce the error.
> This will write a table of dates and strings, write another "fact" table of 
> ints and dates, then read both tables as views and filter the "fact" based on 
> the max(date) from the first table. This is done within spark-shell in spark 
> 2.4 vanilla (also reproduced in AWS EMR 5.20.0)
> -
> spark.sql("select '2018-01-01' as latest_date, 'source1' as source UNION ALL 
> select '2018-01-02', 'source2' UNION ALL select '2018-01-03' , 'source3' 
> UNION ALL select '2018-01-04' ,'source4' 
> ").write.mode("overwrite").save("/latest_dates")
>  val mydatetable = spark.read.load("/latest_dates")
>  mydatetable.createOrReplaceTempView("latest_dates")
> spark.sql("select 50 as mysum, '2018-01-01' as date UNION ALL select 100, 
> '2018-01-02' UNION ALL select 300, '2018-01-03' UNION ALL select 3444, 
> '2018-01-01' UNION ALL select 600, '2018-08-30' 
> ").write.mode("overwrite").partitionBy("date").save("/mypartitioneddata")
>  val source1 = spark.read.load("/mypartitioneddata")
>  source1.createOrReplaceTempView("source1")
> spark.sql("select max(date), 'source1' as category from source1 where date >= 
> (select latest_date from latest_dates where source='source1') ").show
>  
>  
> Error summary
> —
> java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> scalar-subquery#35 []
>  at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:258)
>  at 
> org.apache.spark.sql.catalyst.expressions.ScalarSubquery.eval(subquery.scala:246)
> ---
> This reproducer works in previous versions (2.3.2, 2.3.1, etc).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26996) Scalar Subquery not handled properly in Spark 2.4

2019-02-26 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16778453#comment-16778453
 ] 

Marco Gaido commented on SPARK-26996:
-

I have not been able to reproduce this on current master, though. I'll try to 
reproduce it on the 2.4 branch. If the problem is still there, we might want to 
find the patch that fixed it on master and backport it.

> Scalar Subquery not handled properly in Spark 2.4 
> --
>
> Key: SPARK-26996
> URL: https://issues.apache.org/jira/browse/SPARK-26996
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Ilya Peysakhov
>Priority: Critical
>
> Spark 2.4 reports an error when querying a dataframe that has only 1 row and 
> 1 column (scalar subquery). 
>  
> Reproducer is below. No other data is needed to reproduce the error.
> This will write a table of dates and strings, write another "fact" table of 
> ints and dates, then read both tables as views and filter the "fact" based on 
> the max(date) from the first table. This is done within spark-shell in spark 
> 2.4 vanilla (also reproduced in AWS EMR 5.20.0)
> -
> spark.sql("select '2018-01-01' as latest_date, 'source1' as source UNION ALL 
> select '2018-01-02', 'source2' UNION ALL select '2018-01-03' , 'source3' 
> UNION ALL select '2018-01-04' ,'source4' 
> ").write.mode("overwrite").save("/latest_dates")
>  val mydatetable = spark.read.load("/latest_dates")
>  mydatetable.createOrReplaceTempView("latest_dates")
> spark.sql("select 50 as mysum, '2018-01-01' as date UNION ALL select 100, 
> '2018-01-02' UNION ALL select 300, '2018-01-03' UNION ALL select 3444, 
> '2018-01-01' UNION ALL select 600, '2018-08-30' 
> ").write.mode("overwrite").partitionBy("date").save("/mypartitioneddata")
>  val source1 = spark.read.load("/mypartitioneddata")
>  source1.createOrReplaceTempView("source1")
> spark.sql("select max(date), 'source1' as category from source1 where date >= 
> (select latest_date from latest_dates where source='source1') ").show
>  
>  
> Error summary
> —
> java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> scalar-subquery#35 []
>  at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:258)
>  at 
> org.apache.spark.sql.catalyst.expressions.ScalarSubquery.eval(subquery.scala:246)
> ---
> This reproducer works in previous versions (2.3.2, 2.3.1, etc).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26996) Scalar Subquery not handled properly in Spark 2.4

2019-02-26 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-26996:

Component/s: (was: Spark Core)
 SQL

> Scalar Subquery not handled properly in Spark 2.4 
> --
>
> Key: SPARK-26996
> URL: https://issues.apache.org/jira/browse/SPARK-26996
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Ilya Peysakhov
>Priority: Critical
>
> Spark 2.4 reports an error when querying a dataframe that has only 1 row and 
> 1 column (scalar subquery). 
>  
> Reproducer is below. No other data is needed to reproduce the error.
> This will write a table of dates and strings, write another "fact" table of 
> ints and dates, then read both tables as views and filter the "fact" based on 
> the max(date) from the first table. This is done within spark-shell in spark 
> 2.4 vanilla (also reproduced in AWS EMR 5.20.0)
> -
> spark.sql("select '2018-01-01' as latest_date, 'source1' as source UNION ALL 
> select '2018-01-02', 'source2' UNION ALL select '2018-01-03' , 'source3' 
> UNION ALL select '2018-01-04' ,'source4' 
> ").write.mode("overwrite").save("/latest_dates")
>  val mydatetable = spark.read.load("/latest_dates")
>  mydatetable.createOrReplaceTempView("latest_dates")
> spark.sql("select 50 as mysum, '2018-01-01' as date UNION ALL select 100, 
> '2018-01-02' UNION ALL select 300, '2018-01-03' UNION ALL select 3444, 
> '2018-01-01' UNION ALL select 600, '2018-08-30' 
> ").write.mode("overwrite").partitionBy("date").save("/mypartitioneddata")
>  val source1 = spark.read.load("/mypartitioneddata")
>  source1.createOrReplaceTempView("source1")
> spark.sql("select max(date), 'source1' as category from source1 where date >= 
> (select latest_date from latest_dates where source='source1') ").show
>  
>  
> Error summary
> —
> java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> scalar-subquery#35 []
>  at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:258)
>  at 
> org.apache.spark.sql.catalyst.expressions.ScalarSubquery.eval(subquery.scala:246)
> ---
> This reproducer works in previous versions (2.3.2, 2.3.1, etc).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26974) Invalid data in grouped cached dataset, formed by joining a large cached dataset with a small dataset

2019-02-26 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1686#comment-1686
 ] 

Marco Gaido commented on SPARK-26974:
-

Can you please try a newer Spark version (2.4.0)? If the problem is still 
present, can you please provide a simple reproducer (i.e. two files with sample 
data that still reproduce the issue, plus the exact code reproducing it)?

> Invalid data in grouped cached dataset, formed by joining a large cached 
> dataset with a small dataset
> -
>
> Key: SPARK-26974
> URL: https://issues.apache.org/jira/browse/SPARK-26974
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Core, SQL
>Affects Versions: 2.2.0
>Reporter: Utkarsh Sharma
>Priority: Major
>  Labels: caching, data-corruption, data-integrity
>
> The initial datasets are derived from hive tables using the spark.table() 
> functions.
> Dataset descriptions:
> *+Sales+* dataset (close to 10 billion rows) with the following columns (and 
> sample rows) : 
> ||ItemId (bigint)||CustomerId (bigint)||qty_sold (bigint)||
> |1|1|20|
> |1|2|30|
> |2|1|40|
>  
> +*Customer*+ Dataset (close to 5 rows) with the following columns (and 
> sample rows):
> ||CustomerId (bigint)||CustomerGrpNbr (smallint)||
> |1|1|
> |2|2|
> |3|1|
>  
> I am doing the following steps:
>  # Caching sales dataset with close to 10 billion rows.
>  # Doing an inner join of 'sales' with 'customer' dataset
>  
>  # Doing group by on the resultant dataset, based on CustomerGrpNbr column to 
> get sum(qty_sold) and stddev(qty_sold) values in the customer groups.
>  # Caching the resultant grouped dataset.
>  # Doing a .count() on the grouped dataset.
> The step 5 count is supposed to return only 20, because when you do a 
> customer.select("CustomerGrpNbr").distinct().count you get 20 values. 
> However, you get a value of around 65,000 in step 5.
> Following are the commands I am running in spark-shell:
> {code:java}
> var sales = spark.table("sales_table")
> var customer = spark.table("customer_table")
> var finalDf = sales.join(customer, 
> "CustomerId").groupBy("CustomerGrpNbr").agg(sum("qty_sold"), 
> stddev("qty_sold"))
> sales.cache()
> finalDf.cache()
> finalDf.count() // returns around 65k rows and the count keeps on varying 
> each run
> customer.select("CustomerGrpNbr").distinct().count() //returns 20{code}
> I have been able to replicate the same behavior using the Java API as well. 
> This anomalous behavior disappears, however, when I remove the caching 
> statements, i.e. if I run the following in spark-shell, it works as expected:
> {code:java}
> var sales = spark.table("sales_table")
> var customer = spark.table("customer_table")
> var finalDf = sales.join(customer, 
> "CustomerId").groupBy("CustomerGrpNbr").agg(sum("qty_sold"), 
> stddev("qty_sold")) 
> finalDf.count() // returns 20 
> customer.select("CustomerGrpNbr").distinct().count() //returns 20
> {code}
> The tables in hive from which the datasets are built do not change during 
> this entire process. So why does the caching cause this problem?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26947) Pyspark KMeans Clustering job fails on large values of k

2019-02-26 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1665#comment-1665
 ] 

Marco Gaido commented on SPARK-26947:
-

Could you also please provide a heap dump of the JVM? You can use 
{{-XX:+HeapDumpOnOutOfMemoryError}} to obtain one, passing it to the Java 
options.
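
A minimal sketch of how such a flag is typically passed, assuming the standard 
{{spark.executor.extraJavaOptions}} / {{spark.driver.extraJavaOptions}} settings 
(the app name is illustrative):

{code:java}
import org.apache.spark.sql.SparkSession

// Executor-side JVM flags can be set when building the session; the driver-side
// equivalent (spark.driver.extraJavaOptions) must be passed to spark-submit
// instead, because the driver JVM is already running at this point.
val spark = SparkSession.builder()
  .appName("heap-dump-sketch")
  .config("spark.executor.extraJavaOptions",
    "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp")
  .getOrCreate()
{code}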

> Pyspark KMeans Clustering job fails on large values of k
> 
>
> Key: SPARK-26947
> URL: https://issues.apache.org/jira/browse/SPARK-26947
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib, PySpark
>Affects Versions: 2.4.0
>Reporter: Parth Gandhi
>Priority: Minor
> Attachments: clustering_app.py
>
>
> We recently had a case where a user's pyspark job running KMeans clustering 
> was failing for large values of k. I was able to reproduce the same issue 
> with dummy dataset. I have attached the code as well as the data in the JIRA. 
> The stack trace is printed below from Java:
>  
> {code:java}
> Exception in thread "Thread-10" java.lang.OutOfMemoryError: Java heap space
>   at java.util.Arrays.copyOf(Arrays.java:3332)
>   at 
> java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
>   at 
> java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:649)
>   at java.lang.StringBuilder.append(StringBuilder.java:202)
>   at py4j.Protocol.getOutputCommand(Protocol.java:328)
>   at py4j.commands.CallCommand.execute(CallCommand.java:81)
>   at py4j.GatewayConnection.run(GatewayConnection.java:238)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> Python:
> {code:java}
> Traceback (most recent call last):
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1159, in send_command
> raise Py4JNetworkError("Answer from Java side is empty")
> py4j.protocol.Py4JNetworkError: Answer from Java side is empty
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 985, in send_command
> response = connection.send_command(command)
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1164, in send_command
> "Error while receiving", e, proto.ERROR_ON_RECEIVE)
> py4j.protocol.Py4JNetworkError: Error while receiving
> Traceback (most recent call last):
>   File "clustering_app.py", line 154, in 
> main(args)
>   File "clustering_app.py", line 145, in main
> run_clustering(sc, args.input_path, args.output_path, 
> args.num_clusters_list)
>   File "clustering_app.py", line 136, in run_clustering
> clustersTable, cluster_Centers = clustering(sc, documents, output_path, 
> k, max_iter)
>   File "clustering_app.py", line 68, in clustering
> cluster_Centers = km_model.clusterCenters()
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/clustering.py",
>  line 337, in clusterCenters
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/wrapper.py",
>  line 55, in _call_java
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/common.py",
>  line 109, in _java2py
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/sql/utils.py",
>  line 63, in deco
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 336, in get_return_value
> py4j.protocol.Py4JError: An error occurred while calling 
> z:org.apache.spark.ml.python.MLSerDe.dumps
> {code}
> The command with which the application was launched is given below:
> {code:java}
> $SPARK_HOME/bin/spark-submit --master yarn --deploy-mode cluster --conf 
> spark.executor.memory=20g --conf spark.driver.memory=20g --conf 
> spark.executor.memoryOverhead=4g --conf spark.driver.memoryOverhead=4g --conf 
> spark.kryoserializer.buffer.max=2000m --conf spark.driver.maxResultSize=12g 
> ~/clustering_app.py --input_path hdfs:///user/username/part-v001x 
> --output_path hdfs:///user/username --num_clusters_list 1
> {code}
> The input dataset is approximately 90 MB in size and the assigned heap memory 
> to both driver and executor is close to 20 GB. This only happens for large 
> values of k.



--
This message was sent by Atlassian JIRA

[jira] [Commented] (SPARK-26988) Spark overwrites spark.scheduler.pool if set in configs

2019-02-26 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1655#comment-1655
 ] 

Marco Gaido commented on SPARK-26988:
-

This does indeed seem to be an issue for any property set using 
`sc.setLocalProperty`, as such properties are not tracked in the session state. 
This may actually cause regressions. cc [~cloud_fan] who worked on this. I 
cannot think of a good solution right now. A possible approach would be to 
introduce some kind of callback that puts the configs set directly on the 
SparkContext into the session state, but it is not a clean solution.
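
A minimal sketch of the reported behavior (pool names and the trivial query are 
illustrative):

{code:java}
import org.apache.spark.sql.SparkSession

// Default pool set in the configuration when the session is created.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("pool-override-sketch")
  .config("spark.scheduler.pool", "defaultPool")
  .getOrCreate()

// Per the fair-scheduler docs, this should route jobs from this thread to "fastPool"...
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "fastPool")

// ...but, as reported, SQL executions copy the session conf (still holding
// "defaultPool") over the local properties, so the configured pool wins again.
spark.range(1000L).count()
{code}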

> Spark overwrites spark.scheduler.pool if set in configs
> ---
>
> Key: SPARK-26988
> URL: https://issues.apache.org/jira/browse/SPARK-26988
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.4.0
>Reporter: Dave DeCaprio
>Priority: Minor
>
> If you set a default spark.scheduler.pool in your configuration when you 
> create a SparkSession and then you attempt to override that configuration by 
> calling setLocalProperty on a SparkSession, as described in the Spark 
> documentation - 
> [https://spark.apache.org/docs/latest/job-scheduling.html#fair-scheduler-pools]
>  - it won't work.
> Spark will go with the original pool name.
> I've traced this down to SQLExecution.withSQLConfPropagated, which copies any 
> key that starts with "spark" from the the session state to the local 
> properties.  The can end up overwriting the scheduler, which is set by 
> spark.scheduler.pool



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26911) Spark do not see column in table

2019-02-18 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16770964#comment-16770964
 ] 

Marco Gaido commented on SPARK-26911:
-

Could you please check whether current master is still affected? Moreover, can 
you provide a reproducer? Otherwise it is impossible to investigate the issue. 
Thanks.

> Spark do not see column in table
> 
>
> Key: SPARK-26911
> URL: https://issues.apache.org/jira/browse/SPARK-26911
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment: PySpark (Spark 2.3.1)
>Reporter: Vitaly Larchenkov
>Priority: Major
>
> Spark cannot find column that actually exists in array
> {code:java}
> org.apache.spark.sql.AnalysisException: cannot resolve '`id`' given input 
> columns: [flid.palfl_timestamp, flid.id, flid.pal_state, flid.prs_id, 
> flid.bank_id, flid.wr_id, flid.link_id]; {code}
>  
>  
> {code:java}
> ---
> Py4JJavaError Traceback (most recent call last)
> /usr/share/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>  62 try:
> ---> 63 return f(*a, **kw)
>  64 except py4j.protocol.Py4JJavaError as e:
> /usr/share/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in 
> get_return_value(answer, gateway_client, target_id, name)
> 327 "An error occurred while calling {0}{1}{2}.\n".
> --> 328 format(target_id, ".", name), value)
> 329 else:
> Py4JJavaError: An error occurred while calling o35.sql.
> : org.apache.spark.sql.AnalysisException: cannot resolve '`id`' given input 
> columns: [flid.palfl_timestamp, flid.id, flid.pal_state, flid.prs_id, 
> flid.bank_id, flid.wr_id, flid.link_id]; line 10 pos 98;
> 'Project ['multiples.id, 'multiples.link_id]{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26881) Scaling issue with Gramian computation for RowMatrix: too many results sent to driver

2019-02-18 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16770881#comment-16770881
 ] 

Marco Gaido commented on SPARK-26881:
-

This may have been fixed/improved by SPARK-26228; could you try on current 
master? cc [~srowen]
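
For anyone re-testing this, a minimal sketch of the kind of computation that 
triggers the large treeAggregate results (the sizes here are illustrative and 
much smaller than the reporter's ~30k columns):

{code:java}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Build a moderately wide RowMatrix and compute its Gramian, which is what PCA
// does internally; the driver collects an nCols x nCols dense matrix.
val nCols = 2000
val rows = sc.parallelize(1 to 100000, 200)
  .map(_ => Vectors.dense(Array.fill(nCols)(scala.util.Random.nextDouble())))
val mat = new RowMatrix(rows)
val gram = mat.computeGramianMatrix()
{code}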

> Scaling issue with Gramian computation for RowMatrix: too many results sent 
> to driver
> -
>
> Key: SPARK-26881
> URL: https://issues.apache.org/jira/browse/SPARK-26881
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: Rafael RENAUDIN-AVINO
>Priority: Minor
>
> This issue hit me when running PCA on a large dataset (~1 billion rows, ~30k 
> columns).
> Computing Gramian of a big RowMatrix allows to reproduce the issue.
>  
> The problem arises in the treeAggregate phase of the gramian matrix 
> computation: results sent to driver are enormous.
> A potential solution to this could be to replace the hard coded depth (2) of 
> the tree aggregation by a heuristic computed based on the number of 
> partitions, driver max result size, and memory size of the dense vectors that 
> are being aggregated, cf below for more detail:
> (nb_partitions)^(1/depth) * dense_vector_size <= driver_max_result_size
> I have a potential fix ready (currently testing it at scale), but I'd like to 
> hear the community opinion about such a fix to know if it's worth investing 
> my time into a clean pull request.
>  
> Note that I only faced this issue with spark 2.2 but I suspect it affects 
> later versions as well. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26894) Fix Alias handling in AggregateEstimation

2019-02-15 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16769845#comment-16769845
 ] 

Marco Gaido commented on SPARK-26894:
-

It will be assigned once the PR is merged. Thanks.

> Fix Alias handling in AggregateEstimation
> -
>
> Key: SPARK-26894
> URL: https://issues.apache.org/jira/browse/SPARK-26894
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Venkata krishnan Sowrirajan
>Priority: Major
>
> Aliases are not handled separately in AggregateEstimation similar to 
> ProjectEstimation due to which stats are not getting propagated when CBO is 
> enabled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26893) Allow pushdown of partition pruning subquery filters to file source

2019-02-15 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16769474#comment-16769474
 ] 

Marco Gaido commented on SPARK-26893:
-

Actually this was done intentionally in order to avoid subquery re-execution 
(please note that subqueries can be expensive to compute, so the wasted 
resources can be significant). Please see 
https://github.com/apache/spark/pull/22518 for more details. If we do want to 
support this, we'd need to fix that problem in a way similar to what was 
initially proposed in the aforementioned PR. cc [~cloud_fan]

> Allow pushdown of partition pruning subquery filters to file source
> ---
>
> Key: SPARK-26893
> URL: https://issues.apache.org/jira/browse/SPARK-26893
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Peter Toth
>Priority: Minor
>
> File source doesn't use subquery filters for partition pruning. But it could 
> use those filters with a minor improvement.
> This query is an example:
> {noformat}
> CREATE TABLE a (id INT, p INT) USING PARQUET PARTITIONED BY (p)
> CREATE TABLE b (id INT) USING PARQUET
> SELECT * FROM a WHERE p <= (SELECT MIN(id) FROM b){noformat}
> Where the executed plan of the SELECT currently is:
> {noformat}
> *(1) Filter (p#252L <= Subquery subquery250)
> : +- Subquery subquery250
> : +- *(2) HashAggregate(keys=[], functions=[min(id#253L)], 
> output=[min(id)#255L])
> : +- Exchange SinglePartition
> : +- *(1) HashAggregate(keys=[], functions=[partial_min(id#253L)], 
> output=[min#259L])
> : +- *(1) FileScan parquet default.b[id#253L] Batched: true, DataFilters: [], 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/ptoth/git2/spark2/common/kvstore/spark-warehouse/b],
>  PartitionFilters: [], PushedFilters: [], ReadSchema: struct
> +- *(1) FileScan parquet default.a[id#251L,p#252L] Batched: true, 
> DataFilters: [], Format: Parquet, Location: 
> PrunedInMemoryFileIndex[file:/Users/ptoth/git2/spark2/common/kvstore/spark-warehouse/a/p=0,
>  file:..., PartitionCount: 2, PartitionFilters: [isnotnull(p#252L)], 
> PushedFilters: [], ReadSchema: struct
> {noformat}
> But it could be: 
> {noformat}
> *(1) FileScan parquet default.a[id#251L,p#252L] Batched: true, DataFilters: 
> [], Format: Parquet, Location: 
> PrunedInMemoryFileIndex[file:/Users/ptoth/git2/spark2/common/kvstore/spark-warehouse/a/p=0,
>  file:..., PartitionFilters: [isnotnull(p#252L), (p#252L <= Subquery 
> subquery250)], PushedFilters: [], ReadSchema: struct
> +- Subquery subquery250
> +- *(2) HashAggregate(keys=[], functions=[min(id#253L)], 
> output=[min(id)#255L])
> +- Exchange SinglePartition
> +- *(1) HashAggregate(keys=[], functions=[partial_min(id#253L)], 
> output=[min#259L])
> +- *(1) FileScan parquet default.b[id#253L] Batched: true, DataFilters: [], 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/ptoth/git2/spark2/common/kvstore/spark-warehouse/b],
>  PartitionFilters: [], PushedFilters: [], ReadSchema: struct
> {noformat}
> and so partition pruning could work in {{FileSourceScanExec}}.
>  Please note that {{PartitionCount}} metadata can't be computed before 
> execution so in this case it is no longer part of the plan.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26829) In place standard scaler so the column remains same after transformation

2019-02-13 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16766939#comment-16766939
 ] 

Marco Gaido commented on SPARK-26829:
-

You can set the output column name and rename it as you want after the 
transformation. The only question this ticket really poses is: why do we 
currently forbid overwriting an existing column in several places (e.g. PCA, 
MinMaxScaler, StandardScaler)? This is consistent behavior and there are easy 
workarounds, so I am not sure it is worth changing.
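
A minimal sketch of the workaround on a toy DataFrame (column names are 
illustrative):

{code:java}
import org.apache.spark.ml.feature.StandardScaler
import org.apache.spark.ml.linalg.Vectors

val df = Seq((0, Vectors.dense(1.0, 2.0)), (1, Vectors.dense(3.0, 4.0)))
  .toDF("id", "features")

val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("features_scaled")

// Emulate an "in place" transformation: scale, drop the original column and
// rename the scaled one back to the original name.
val inPlace = scaler.fit(df).transform(df)
  .drop("features")
  .withColumnRenamed("features_scaled", "features")
{code}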

> In place standard scaler so the column remains same after transformation
> 
>
> Key: SPARK-26829
> URL: https://issues.apache.org/jira/browse/SPARK-26829
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.2
>Reporter: Santokh Singh
>Priority: Major
>
> Standard scaler and some similar transformations takes input column name and 
> produce a new column, either accepting output column or generating new one 
> with some random name after performing transformation.
>  "inplace" flag  on true does not generate new column in output in dataframe 
> after transformation; preserves schema of df.
> "inplace" flag on false works the way its currently working.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26782) Wrong column resolved when joining twice with the same dataframe

2019-01-30 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido resolved SPARK-26782.
-
Resolution: Duplicate

> Wrong column resolved when joining twice with the same dataframe
> 
>
> Key: SPARK-26782
> URL: https://issues.apache.org/jira/browse/SPARK-26782
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Vladimir Prus
>Priority: Major
>
> # Execute the following code:
>  
> {code:java}
> {
>  val events = Seq(("a", 0)).toDF("id", "ts")
>  val dim = Seq(("a", 0, 24), ("a", 24, 48)).toDF("id", "start", "end")
>  
>  val dimOriginal = dim.as("dim")
>  val dimShifted = dim.as("dimShifted")
> val r = events
>  .join(dimOriginal, "id")
>  .where(dimOriginal("start") <= $"ts" && $"ts" < dimOriginal("end"))
> val r2 = r 
>  .join(dimShifted, "id")
>  .where(dimShifted("start") <= $"ts" + 24 && $"ts" + 24 < dimShifted("end"))
>  
>  r2.show() 
>  r2.explain(true)
> }
> {code}
>  
>  # Expected effect:
>  ** One row is shown
>  ** Logical plan shows two independent joints with "dim" and "dimShifted"
>  # Observed effect:
>  ** No rows are printed.
>  ** Logical plan shows two filters are applied:
>  *** 'Filter ((start#17 <= ('ts + 24)) && (('ts + 24) < end#18))'
>  *** Filter ((start#17 <= ts#6) && (ts#6 < end#18))
>  ** Both these filters refer to the same start#17 and start#18 columns, so 
> they are applied to the same dataframe, not two different ones.
>  ** It appears that dimShifted("start") is resolved to be identical to 
> dimOriginal("start")
>  # I get the desired effect if I replace the second where with 
> {code:java}
> .where($"dimShifted.start" <= $"ts" + 24 && $"ts" + 24 < $"dimShifted.end")
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26782) Wrong column resolved when joining twice with the same dataframe

2019-01-30 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16755987#comment-16755987
 ] 

Marco Gaido commented on SPARK-26782:
-

This is a duplicate of many others. I also started a thread on the dev mailing 
list regarding this problem. Let me close this as a duplicate.

> Wrong column resolved when joining twice with the same dataframe
> 
>
> Key: SPARK-26782
> URL: https://issues.apache.org/jira/browse/SPARK-26782
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Vladimir Prus
>Priority: Major
>
> # Execute the following code:
>  
> {code:java}
> {
>  val events = Seq(("a", 0)).toDF("id", "ts")
>  val dim = Seq(("a", 0, 24), ("a", 24, 48)).toDF("id", "start", "end")
>  
>  val dimOriginal = dim.as("dim")
>  val dimShifted = dim.as("dimShifted")
> val r = events
>  .join(dimOriginal, "id")
>  .where(dimOriginal("start") <= $"ts" && $"ts" < dimOriginal("end"))
> val r2 = r 
>  .join(dimShifted, "id")
>  .where(dimShifted("start") <= $"ts" + 24 && $"ts" + 24 < dimShifted("end"))
>  
>  r2.show() 
>  r2.explain(true)
> }
> {code}
>  
>  # Expected effect:
>  ** One row is shown
>  ** Logical plan shows two independent joints with "dim" and "dimShifted"
>  # Observed effect:
>  ** No rows are printed.
>  ** Logical plan shows two filters are applied:
>  *** 'Filter ((start#17 <= ('ts + 24)) && (('ts + 24) < end#18))'
>  *** Filter ((start#17 <= ts#6) && (ts#6 < end#18))
>  ** Both these filters refer to the same start#17 and start#18 columns, so 
> they are applied to the same dataframe, not two different ones.
>  ** It appears that dimShifted("start") is resolved to be identical to 
> dimOriginal("start")
>  # I get the desired effect if I replace the second where with 
> {code:java}
> .where($"dimShifted.start" <= $"ts" + 24 && $"ts" + 24 < $"dimShifted.end")
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26779) NullPointerException when disable wholestage codegen

2019-01-30 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16755881#comment-16755881
 ] 

Marco Gaido commented on SPARK-26779:
-

I'd say this is most likely just a duplicate of SPARK-23731.

> NullPointerException when disable wholestage codegen
> 
>
> Key: SPARK-26779
> URL: https://issues.apache.org/jira/browse/SPARK-26779
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiaoju Wu
>Priority: Trivial
>
> When running TPCDSSuite with wholestage codegen disabled, NPE is thrown in q9:
> java.lang.NullPointerException at
> org.apache.spark.sql.execution.FileSourceScanExec.(DataSourceScanExec.scala:170)
>  at 
> org.apache.spark.sql.execution.FileSourceScanExec.doCanonicalize(DataSourceScanExec.scala:613)
>  at 
> org.apache.spark.sql.execution.FileSourceScanExec.doCanonicalize(DataSourceScanExec.scala:160)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at 
> scala.collection.immutable.List.map(List.scala:285) at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:224)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at 
> scala.collection.immutable.List.map(List.scala:285) at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:224)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at 
> scala.collection.immutable.List.map(List.scala:285) at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:224)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at 
> scala.collection.immutable.List.map(List.scala:285) at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:224)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized$lzycompute(QueryPlan.scala:210)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.canonicalized(QueryPlan.scala:209)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:224)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> 

[jira] [Commented] (SPARK-25420) Dataset.count() every time is different.

2019-01-30 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16755837#comment-16755837
 ] 

Marco Gaido commented on SPARK-25420:
-

[~jeffrey.mak] I cannot reproduce your issue on the current master branch. I 
created a test.csv file with the data you provided above and ran:

{code}
scala> val drkcard_0_df = spark.read.csv("test.csv")
drkcard_0_df: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 4 
more fields]

scala> drkcard_0_df.show()
+---+---+---+---+++
|_c0|_c1|_c2|_c3| _c4| _c5|
+---+---+---+---+++
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  A|18091921273200820...|John|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  A|18091921273200820...| Tom|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  B|18091921273200820...|Mary|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  B|18091921273200820...|   Mabel|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  C|18091921273200820...|   James|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  C|18091921273200820...|Laurence|
+---+---+---+---+++

scala> val dropDup_0 = 
drkcard_0_df.dropDuplicates(Seq("_c0","_c1","_c2","_c3","_c4"))
dropDup_0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_c0: 
string, _c1: string ... 4 more fields]

scala> dropDup_0.show()
+---+---+---+---++-+
|_c0|_c1|_c2|_c3| _c4|  _c5|
+---+---+---+---++-+
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  C|18091921273200820...|James|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  B|18091921273200820...| Mary|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  A|18091921273200820...| John|
+---+---+---+---++-+


scala> dropDup_0.where("_c0='2018-09-21 00:00:00' and _c1='TDT_DSC_ITM' and 
_c2='83' and _c4='180919212732008200218'").show()
+---+---+---+---++-+
|_c0|_c1|_c2|_c3| _c4|  _c5|
+---+---+---+---++-+
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  C|18091921273200820...|James|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  B|18091921273200820...| Mary|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  A|18091921273200820...| John|
+---+---+---+---++-+


scala> dropDup_0.where("_c0='2018-09-21 00:00:00' and _c1='TDT_DSC_ITM' and 
_c2='83' and _c4='180919212732008200218'").show()
+---+---+---+---++-+
|_c0|_c1|_c2|_c3| _c4|  _c5|
+---+---+---+---++-+
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  C|18091921273200820...|James|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  B|18091921273200820...| Mary|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  A|18091921273200820...| John|
+---+---+---+---++-+


scala> dropDup_0.where("_c0='2018-09-21 00:00:00' and _c1='TDT_DSC_ITM' and 
_c2='83' and _c4='180919212732008200218'").show()
+---+---+---+---++-+
|_c0|_c1|_c2|_c3| _c4|  _c5|
+---+---+---+---++-+
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  C|18091921273200820...|James|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  B|18091921273200820...| Mary|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  A|18091921273200820...| John|
+---+---+---+---++-+


scala> dropDup_0.where("_c0='2018-09-21 00:00:00' and _c1='TDT_DSC_ITM' and 
_c2='83' and _c4='180919212732008200218'").show()
+---+---+---+---++-+
|_c0|_c1|_c2|_c3| _c4|  _c5|
+---+---+---+---++-+
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  C|18091921273200820...|James|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  B|18091921273200820...| Mary|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  A|18091921273200820...| John|
+---+---+---+---++-+


scala> dropDup_0.where("_c0='2018-09-21 00:00:00' and _c1='TDT_DSC_ITM' and 
_c2='83' and _c4='180919212732008200218'").show()
+---+---+---+---++-+
|_c0|_c1|_c2|_c3| _c4|  _c5|
+---+---+---+---++-+
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  C|18091921273200820...|James|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  B|18091921273200820...| Mary|
|2018-09-21 00:00:00|TDT_DSC_ITM| 83|  A|18091921273200820...| 

[jira] [Commented] (SPARK-25420) Dataset.count() every time is different.

2019-01-30 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16755831#comment-16755831
 ] 

Marco Gaido commented on SPARK-25420:
-

[~jeffrey.mak] [~kabhwan] I agree that your case does not look correct at first 
sight. Could you please post your query plan? I'll try to reproduce it with the 
data and code you provided here.

> Dataset.count()  every time is different.
> -
>
> Key: SPARK-25420
> URL: https://issues.apache.org/jira/browse/SPARK-25420
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.3.0
> Environment: spark2.3
> standalone
>Reporter: huanghuai
>Priority: Major
>  Labels: SQL
>
> Dataset dataset = sparkSession.read().format("csv").option("sep", 
> ",").option("inferSchema", "true")
>  .option("escape", Constants.DEFAULT_CSV_ESCAPE).option("header", "true")
>  .option("encoding", "UTF-8")
>  .load("hdfs://192.168.1.26:9000/data/caopan/07-08_WithHead30M.csv");
> System.out.println("source count="+dataset.count());
> Dataset dropDuplicates = dataset.dropDuplicates(new 
> String[]\{"DATE","TIME","VEL","COMPANY"});
> System.out.println("dropDuplicates count1="+dropDuplicates.count());
> System.out.println("dropDuplicates count2="+dropDuplicates.count());
> Dataset filter = dropDuplicates.filter("jd > 120.85 and wd > 30.66 
> and (status = 0 or status = 1)");
> System.out.println("filter count1="+filter.count());
> System.out.println("filter count2="+filter.count());
> System.out.println("filter count3="+filter.count());
> System.out.println("filter count4="+filter.count());
> System.out.println("filter count5="+filter.count());
>  
>  
> --The above is code 
> ---
>  
>  
> console output:
> source count=459275
> dropDuplicates count1=453987
> dropDuplicates count2=453987
> filter count1=445798
> filter count2=445797
> filter count3=445797
> filter count4=445798
> filter count5=445799
>  
> question:
>  
> Why is filter.count() different everytime?
> if I remove dropDuplicates() everything will be ok!!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26767) Filter on a dropDuplicates dataframe gives inconsistency result

2019-01-29 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16754882#comment-16754882
 ] 

Marco Gaido edited comment on SPARK-26767 at 1/29/19 11:13 AM:
---

IIRC a similar JIRA was reported: SPARK-25420. Please check the comments there; 
your case may be the same.


was (Author: mgaido):
IIRC a similar JIRA was reported. Could you please try a newer version (ideally 
the current branch-2.3)? This may have been fixed.

> Filter on a dropDuplicates dataframe gives inconsistency result
> ---
>
> Key: SPARK-26767
> URL: https://issues.apache.org/jira/browse/SPARK-26767
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: To repeat the problem,
> (1) create a csv file with records holding same values for a subset of 
> columns (e.g. colA, colB, colC).
> (2) read the csv file as a spark dataframe and then use dropDuplicates to 
> dedup the subset of columns (i.e. dropDuplicates(["colA", "colB", "colC"]))
> (3) select the resulting dataframe with where clause. (i.e. df.where("colA = 
> 'A' and colB='B' and colG='G' and colH='H').show(100,False))
>  
> => When (3) is rerun, it gives different number of resulting rows.
>Reporter: Jeffrey
>Priority: Major
>
> To repeat the problem,
> (1) create a csv file with records holding same values for a subset of 
> columns (e.g. colA, colB, colC).
> (2) read the csv file as a spark dataframe and then use dropDuplicates to 
> dedup the subset of columns (i.e. dropDuplicates(["colA", "colB", "colC"]))
> (3) select the resulting dataframe with where clause. (i.e. df.where("colA = 
> 'A' and colB='B' and colG='G' and colH='H').show(100,False))
>  
> => When (3) is rerun, it gives different number of resulting rows.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26767) Filter on a dropDuplicates dataframe gives inconsistency result

2019-01-29 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16754882#comment-16754882
 ] 

Marco Gaido commented on SPARK-26767:
-

IIRC a similar JIRA was reported. Could you please try a newer version (ideally 
the current branch-2.3)? This may have been fixed.

> Filter on a dropDuplicates dataframe gives inconsistency result
> ---
>
> Key: SPARK-26767
> URL: https://issues.apache.org/jira/browse/SPARK-26767
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: To repeat the problem,
> (1) create a csv file with records holding same values for a subset of 
> columns (e.g. colA, colB, colC).
> (2) read the csv file as a spark dataframe and then use dropDuplicates to 
> dedup the subset of columns (i.e. dropDuplicates(["colA", "colB", "colC"]))
> (3) select the resulting dataframe with where clause. (i.e. df.where("colA = 
> 'A' and colB='B' and colG='G' and colH='H').show(100,False))
>  
> => When (3) is rerun, it gives different number of resulting rows.
>Reporter: Jeffrey
>Priority: Major
>
> To repeat the problem,
> (1) create a csv file with records holding same values for a subset of 
> columns (e.g. colA, colB, colC).
> (2) read the csv file as a spark dataframe and then use dropDuplicates to 
> dedup the subset of columns (i.e. dropDuplicates(["colA", "colB", "colC"]))
> (3) select the resulting dataframe with where clause. (i.e. df.where("colA = 
> 'A' and colB='B' and colG='G' and colH='H').show(100,False))
>  
> => When (3) is rerun, it gives different number of resulting rows.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26752) Multiple aggregate methods in the same column in DataFrame

2019-01-29 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16754719#comment-16754719
 ] 

Marco Gaido commented on SPARK-26752:
-

I agree with you [~hyukjin.kwon]. Actually I'd rather propose deprecating and 
removing the overload taking {{Map[String, String]}}. It is a subset of what is 
doable with {{(String, String)*}}, so it seems a bit useless to me. I think that 
would also prevent tickets like this one from being created...
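
A minimal sketch showing that the {{(String, String)*}} overload already covers 
the Map use case, including several aggregations on the same column, which 
duplicate Map keys cannot express (the data is illustrative):

{code:java}
val df = Seq(("a", 1), ("a", 2), ("b", 10)).toDF("col1", "col2")

// Multiple aggregations on the same column via the varargs overload.
df.groupBy("col1").agg("col2" -> "min", "col2" -> "max", "col2" -> "avg").show()
{code}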

> Multiple aggregate methods in the same column in DataFrame
> --
>
> Key: SPARK-26752
> URL: https://issues.apache.org/jira/browse/SPARK-26752
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Guilherme Beltramini
>Priority: Minor
>
> The agg function in 
> [org.apache.spark.sql.RelationalGroupedDataset|https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.RelationalGroupedDataset]
>  accepts as input:
>  * Column*
>  * Map[String, String]
>  * (String, String)*
> I'm proposing to add Map[String, Seq[String]], where the keys are the columns 
> to aggregate, and the values are the functions to apply the aggregation. Here 
> is a similar question: 
> http://apache-spark-user-list.1001560.n3.nabble.com/DataFrame-multiple-agg-on-the-same-column-td29541.html.
> In the example below (running in spark-shell, with Spark 2.4.0), I'm showing 
> a workaround. What I'm proposing is that agg should accept aggMap as input:
> {code:java}
> scala> val df = Seq(("a", 1), ("a", 2), ("a", 3), ("a", 4), ("b", 10), ("b", 
> 20), ("c", 100)).toDF("col1", "col2")
> df: org.apache.spark.sql.DataFrame = [col1: string, col2: int]
> scala> df.show
> +++
> |col1|col2|
> +++
> |   a|   1|
> |   a|   2|
> |   a|   3|
> |   a|   4|
> |   b|  10|
> |   b|  20|
> |   c| 100|
> +++
> scala> val aggMap = Map("col1" -> Seq("count"), "col2" -> Seq("min", "max", 
> "mean"))
> aggMap: scala.collection.immutable.Map[String,Seq[String]] = Map(col1 -> 
> List(count), col2 -> List(min, max, mean))
> scala> val aggSeq = aggMap.toSeq.flatMap{ case (c: String, fns: Seq[String]) 
> => Seq(c).zipAll(fns, c, "") }
> aggSeq: Seq[(String, String)] = ArrayBuffer((col1,count), (col2,min), 
> (col2,max), (col2,mean))
> scala> val dfAgg = df.groupBy("col1").agg(aggSeq.head, aggSeq.tail: _*)
> dfAgg: org.apache.spark.sql.DataFrame = [col1: string, count(col1): bigint 
> ... 3 more fields]
> scala> dfAgg.orderBy("col1").show
> ++---+-+-+-+
> |col1|count(col1)|min(col2)|max(col2)|avg(col2)|
> ++---+-+-+-+
> |   a|  4|1|4|  2.5|
> |   b|  2|   10|   20| 15.0|
> |   c|  1|  100|  100|100.0|
> ++---+-+-+-+
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18484) case class datasets - ability to specify decimal precision and scale

2019-01-24 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751422#comment-16751422
 ] 

Marco Gaido commented on SPARK-18484:
-

[~bonazzaf] please do not delete comments, as they may be useful for other 
people with the same doubts.

That's right: there is currently no way to specify the precision and scale of a
{{BigDecimal}} field, so the default (38, 18) is used. It is not a real issue,
though, because you can achieve the same result with DataFrames; you just need to
specify the schema.
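
As a rough sketch of what "specify the schema" means here (the column names and values are assumed, mirroring the example in the description below):

{code:scala}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Declare the decimal precision/scale up front instead of relying on the
// encoder default of decimal(38,18).
val schema = StructType(Seq(
  StructField("id", StringType),
  StructField("money", DecimalType(10, 2))
))

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(
    Row("1", BigDecimal("22.50")),
    Row("2", BigDecimal("500.66"))
  )),
  schema
)

df.printSchema()  // money: decimal(10,2)
{code}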

> case class datasets - ability to specify decimal precision and scale
> 
>
> Key: SPARK-18484
> URL: https://issues.apache.org/jira/browse/SPARK-18484
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Damian Momot
>Priority: Major
>
> Currently, when using a decimal type (BigDecimal in a Scala case class), there is no
> way to enforce precision and scale. This is quite critical when saving data,
> both for space usage and for compatibility with external systems (for example a
> Hive table), because Spark saves the data as Decimal(38,18).
> {code}
> case class TestClass(id: String, money: BigDecimal)
> val testDs = spark.createDataset(Seq(
>   TestClass("1", BigDecimal("22.50")),
>   TestClass("2", BigDecimal("500.66"))
> ))
> testDs.printSchema()
> {code}
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- money: decimal(38,18) (nullable = true)
> {code}
> A workaround is to convert the Dataset to a DataFrame before saving and manually cast
> to a specific decimal precision/scale:
> {code}
> import org.apache.spark.sql.types.DecimalType
> val testDf = testDs.toDF()
> testDf
>   .withColumn("money", testDf("money").cast(DecimalType(10,2)))
>   .printSchema()
> {code}
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- money: decimal(10,2) (nullable = true)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20162) Reading data from MySQL - Cannot up cast from decimal(30,6) to decimal(38,18)

2019-01-23 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749920#comment-16749920
 ] 

Marco Gaido commented on SPARK-20162:
-

[~bonazzaf] what you just reported is an invalid use case, and Spark's answer is
the right one. Since {{Thing}} contains a BigDecimal, Spark infers it as the
default decimal type, which is DECIMAL(38, 18). But since you created the
DataFrame using DECIMAL(38, 10), you are asking a DECIMAL(38, 10) value to fit into a
DECIMAL(38, 18): this is not always possible, as it may cause overflows. So Spark
throws the exception. This is the correct behavior.
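
To make the overflow argument concrete (a worked note added here, not part of the original comment): the number of digits available to the left of the decimal point is precision minus scale, so the cast can lose integer digits even though both types have precision 38.

{code:scala}
// Integer digits available on each side of the cast.
val sourceIntDigits = 38 - 10  // 28 digits before the decimal point
val targetIntDigits = 38 - 18  // only 20 digits before the decimal point

// A value such as 10^27 fits in DECIMAL(38,10) but not in DECIMAL(38,18),
// which is why the up-cast is rejected as potentially truncating.
assert(sourceIntDigits > targetIntDigits)
{code}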

> Reading data from MySQL - Cannot up cast from decimal(30,6) to decimal(38,18)
> -
>
> Key: SPARK-20162
> URL: https://issues.apache.org/jira/browse/SPARK-20162
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Miroslav Spehar
>Priority: Major
>
> While reading data from MySQL, type conversion doesn't work for the Decimal type
> when the decimal in the database has lower precision/scale than the one Spark
> expects.
> Error:
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up 
> cast `DECIMAL_AMOUNT` from decimal(30,6) to decimal(38,18) as it may truncate
> The type path of the target object is:
> - field (class: "org.apache.spark.sql.types.Decimal", name: "DECIMAL_AMOUNT")
> - root class: "com.misp.spark.Structure"
> You can either add an explicit cast to the input data or choose a higher 
> precision type of the field in the target object;
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:2119)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34$$anonfun$applyOrElse$14.applyOrElse(Analyzer.scala:2141)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34$$anonfun$applyOrElse$14.applyOrElse(Analyzer.scala:2136)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:288)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:288)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:287)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5$$anonfun$apply$11.apply(TreeNode.scala:360)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:358)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:329)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:293)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:248)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:258)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$6.apply(QueryPlan.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:267)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:236)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34.applyOrElse(Analyzer.scala:2136)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34.applyOrElse(Analyzer.scala:2132)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)

[jira] [Commented] (SPARK-26639) The reuse subquery function maybe does not work in SPARK SQL

2019-01-20 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16747501#comment-16747501
 ] 

Marco Gaido commented on SPARK-26639:
-

[~Jk_Self] I checked and there is only one subquery node in the plan, so I
can't see how any subquery could be executed more than once. Maybe the stages
you are checking are not the ones related to the subquery?
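
One quick way to double-check this (a sketch, not taken from the ticket; it assumes a DataFrame {{df}} holding the query in question) is to look for subquery-related nodes in the executed physical plan before comparing stages in the UI:

{code:scala}
// Dump the executed physical plan and keep only the lines mentioning subqueries.
val planText = df.queryExecution.executedPlan.toString
planText.split("\n").filter(_.toLowerCase.contains("subquery")).foreach(println)
{code}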

> The reuse subquery function maybe does not work in SPARK SQL
> 
>
> Key: SPARK-26639
> URL: https://issues.apache.org/jira/browse/SPARK-26639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
>
> The subquery reuse feature was implemented in
> [https://github.com/apache/spark/pull/14548]
> In my test, I found that the visualized plan does show the subquery being executed
> only once, but the stages of the same subquery may execute more than once.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26645) CSV infer schema bug infers decimal(9,-1)

2019-01-17 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16745146#comment-16745146
 ] 

Marco Gaido commented on SPARK-26645:
-

The error is on the Python side; I will submit a PR shortly. Thanks for reporting
this.
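
For context on the negative scale the reporter asks about below: it comes directly from how {{java.math.BigDecimal}} parses scientific notation, as this small plain-Scala illustration (not from the ticket) shows:

{code:scala}
// "1.18927098E9" is stored as unscaledValue 118927098 with scale -1,
// i.e. 118927098 * 10^1, so the inferred type is decimal(9,-1):
// precision 9 (digits in the unscaled value) and scale -1.
val bd = new java.math.BigDecimal("1.18927098E9")
println(bd.precision)      // 9
println(bd.scale)          // -1
println(bd.unscaledValue)  // 118927098
{code}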

> CSV infer schema bug infers decimal(9,-1)
> -
>
> Key: SPARK-26645
> URL: https://issues.apache.org/jira/browse/SPARK-26645
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ohad Raviv
>Priority: Minor
>
> We have a file /tmp/t1/file.txt that contains only one line: "1.18927098E9".
> running:
> {code:python}
> df = spark.read.csv('/tmp/t1', header=False, inferSchema=True, sep='\t')
> print df.dtypes
> {code}
> causes:
> {noformat}
> ValueError: Could not parse datatype: decimal(9,-1)
> {noformat}
> I'm not sure where the bug is - inferSchema or dtypes?
> I saw that it is legal to have a decimal with a negative scale in the code
> (CSVInferSchema.scala):
> {code:scala}
> if (bigDecimal.scale <= 0) {
> // `DecimalType` conversion can fail when
> //   1. The precision is bigger than 38.
> //   2. scale is bigger than precision.
> DecimalType(bigDecimal.precision, bigDecimal.scale)
>   } 
> {code}
> but what does it mean?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26639) The reuse subquery function maybe does not work in SPARK SQL

2019-01-17 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16744799#comment-16744799
 ] 

Marco Gaido commented on SPARK-26639:
-

I see, then let me investigate this further. I think I already know the problem
and have a solution for it, but I need to test whether I am right. I'll create a
PR for this in a few days, as I am a bit overloaded now. Thanks.

> The reuse subquery function maybe does not work in SPARK SQL
> 
>
> Key: SPARK-26639
> URL: https://issues.apache.org/jira/browse/SPARK-26639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
>
> The subquery reuse feature was implemented in
> [https://github.com/apache/spark/pull/14548]
> In my test, I found that the visualized plan does show the subquery being executed
> only once, but the stages of the same subquery may execute more than once.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26639) The reuse subquery function maybe does not work in SPARK SQL

2019-01-17 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16744783#comment-16744783
 ] 

Marco Gaido commented on SPARK-26639:
-

This may be a duplicate of SPARK-25482. Could you please try it on the current master?

> The reuse subquery function maybe does not work in SPARK SQL
> 
>
> Key: SPARK-26639
> URL: https://issues.apache.org/jira/browse/SPARK-26639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ke Jia
>Priority: Major
>
> The subquery reuse feature was implemented in
> [https://github.com/apache/spark/pull/14548]
> In my test, I found that the visualized plan does show the subquery being executed
> only once, but the stages of the same subquery may execute more than once.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26569) Fixed point for batch Operator Optimizations never reached when optimize logicalPlan

2019-01-14 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16742305#comment-16742305
 ] 

Marco Gaido commented on SPARK-26569:
-

[~chenfan] could you please try a more recent version of Spark? The current master
branch would be ideal, but Spark 2.4 should be enough. For instance, SPARK-25402
fixed BooleanSimplification, so most likely this problem has already been solved.

> Fixed point for batch Operator Optimizations never reached when optimize 
> logicalPlan
> 
>
> Key: SPARK-26569
> URL: https://issues.apache.org/jira/browse/SPARK-26569
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment:  
>  
>Reporter: Chen Fan
>Priority: Major
>
> There is a somewhat complicated Spark app using the Dataset API that runs once a
> day, and I noticed that the app hangs once in a while.
> I added some logging and compared two driver logs, one from a successful run and
> one from a failed run. Here are some results of the investigation:
> 1. Usually the app runs correctly, but sometimes it hangs after finishing job 1.
> !image-2019-01-08-19-53-20-509.png!
> 2. According to the logs I added, the successful app always reaches the fixed
> point at iteration 7 of the Operator Optimizations batch, but the failed app
> never reaches this fixed point.
> {code:java}
> 2019-01-04,11:35:34,199 DEBUG org.apache.spark.sql.execution.SparkOptimizer: 
> === Result of Batch Operator Optimizations ===
> 2019-01-04,14:00:42,847 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> iteration is 6/100, for batch Operator Optimizations
> 2019-01-04,14:00:42,851 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin ===
>  
> 2019-01-04,14:00:42,852 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.PushDownPredicate 
> ===
>  
> 2019-01-04,14:00:42,903 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints ===
>  
> 2019-01-04,14:00:42,939 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.BooleanSimplification ===
>  
> 2019-01-04,14:00:42,951 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.PruneFilters ===
>  
> 2019-01-04,14:00:42,970 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> iteration is 7/100, for batch Operator Optimizations
> 2019-01-04,14:00:42,971 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> Fixed point reached for batch Operator Optimizations after 7 iterations.
> 2019-01-04,14:13:15,616 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> iteration is 45/100, for batch Operator Optimizations
> 2019-01-04,14:13:15,619 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin ===
>  
> 2019-01-04,14:13:15,620 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.PushDownPredicate 
> ===
>  
> 2019-01-04,14:13:59,529 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints ===
>  
> 2019-01-04,14:13:59,806 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.BooleanSimplification ===
>  
> 2019-01-04,14:13:59,845 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> iteration is 46/100, for batch Operator Optimizations
> 2019-01-04,14:13:59,849 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin ===
>  
> 2019-01-04,14:13:59,849 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.PushDownPredicate 
> ===
>  
> 2019-01-04,14:14:45,340 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints ===
>  
> 2019-01-04,14:14:45,631 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.BooleanSimplification ===
>  
> 2019-01-04,14:14:45,678 INFO org.apache.spark.sql.execution.SparkOptimizer: 
> iteration is 47/100, for batch Operator Optimizations
> {code}
> 3. The difference between the two logical plans appears in BooleanSimplification
> within an iteration; before this rule is applied, the two logical plans are the same:
> {code:java}
> // just a head part of plan
> Project [model#2486, version#12, 
