[jira] [Created] (SPARK-32126) Scope Session.active in IncrementalExecution

2020-06-28 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-32126:
-

 Summary: Scope Session.active in IncrementalExecution
 Key: SPARK-32126
 URL: https://issues.apache.org/jira/browse/SPARK-32126
 Project: Spark
  Issue Type: Bug
  Components: SQL, Structured Streaming
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun









[jira] [Updated] (SPARK-32126) Scope Session.active in IncrementalExecution

2020-06-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32126:
--
Component/s: (was: SQL)

> Scope Session.active in IncrementalExecution
> 
>
> Key: SPARK-32126
> URL: https://issues.apache.org/jira/browse/SPARK-32126
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Assigned] (SPARK-32126) Scope Session.active in IncrementalExecution

2020-06-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32126:


Assignee: (was: Apache Spark)

> Scope Session.active in IncrementalExecution
> 
>
> Key: SPARK-32126
> URL: https://issues.apache.org/jira/browse/SPARK-32126
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Commented] (SPARK-32126) Scope Session.active in IncrementalExecution

2020-06-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147539#comment-17147539
 ] 

Apache Spark commented on SPARK-32126:
--

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/28936

> Scope Session.active in IncrementalExecution
> 
>
> Key: SPARK-32126
> URL: https://issues.apache.org/jira/browse/SPARK-32126
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Assigned] (SPARK-32126) Scope Session.active in IncrementalExecution

2020-06-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32126:


Assignee: Apache Spark

> Scope Session.active in IncrementalExecution
> 
>
> Key: SPARK-32126
> URL: https://issues.apache.org/jira/browse/SPARK-32126
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Resolved] (SPARK-32090) UserDefinedType.equal() does not have symmetry

2020-06-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32090.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28923
[https://github.com/apache/spark/pull/28923]

> UserDefinedType.equal() does not have symmetry 
> ---
>
> Key: SPARK-32090
> URL: https://issues.apache.org/jira/browse/SPARK-32090
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0, 2.3.0, 2.4.0, 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.1.0
>
>
> ExampleSubTypeUDT.userClass is a subclass of ExampleBaseTypeUDT.userClass
> val udt1 = new ExampleBaseTypeUDT
> val udt2 = new ExampleSubTypeUDT
> println(udt1 == udt2) // true
> println(udt2 == udt1) // false






[jira] [Assigned] (SPARK-32090) UserDefinedType.equal() does not have symmetry

2020-06-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-32090:
-

Assignee: wuyi

> UserDefinedType.equal() does not have symmetry 
> ---
>
> Key: SPARK-32090
> URL: https://issues.apache.org/jira/browse/SPARK-32090
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0, 2.3.0, 2.4.0, 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>
> ExampleSubTypeUDT.userClass is a subclass of ExampleBaseTypeUDT.userClass
> val udt1 = new ExampleBaseTypeUDT
> val udt2 = new ExampleSubTypeUDT
> println(udt1 == udt2) // true
> println(udt2 == udt1) // false






[jira] [Comment Edited] (SPARK-32096) Support top-N sort for Spark SQL rank window function

2020-06-28 Thread Zikun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147548#comment-17147548
 ] 

Zikun edited comment on SPARK-32096 at 6/29/20, 4:44 AM:
-

Yes, we need to do top-N sort for each window partition in each physical 
partition. And I think this is doable. We are working on a POC of this 
improvement.


was (Author: xuzikun2003):
Yes, we need to do top-N sort for each window partition in each physical 
partition.

> Support top-N sort for Spark SQL rank window function
> -
>
> Key: SPARK-32096
> URL: https://issues.apache.org/jira/browse/SPARK-32096
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Any environment that supports Spark.
>Reporter: Zikun
>Priority: Major
>
> In Spark SQL, there are two sort execution operators, *_SortExec_* and 
> *_TakeOrderedAndProjectExec_*.
> *_SortExec_* is a general sort operator and does not support top-N sort.
> *_TakeOrderedAndProjectExec_* is the operator for top-N sort in Spark.
> The Spark SQL rank window function needs to sort the data locally, and it 
> relies on *_SortExec_* to sort the data in each physical data partition. 
> When a filter on the window rank (e.g. rank <= 100) is specified in a user's 
> query, the filter can actually be pushed down to SortExec so that SortExec 
> performs a top-N sort.
> Right now SortExec does not support top-N sort, so we need to extend 
> SortExec with that capability.
> Alternatively, if SortExec is not considered the right place for this, we can 
> create a new execution plan called topNSortExec to do the top-N sort in each 
> local partition when a filter on the window rank is specified.
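For illustration, here is a minimal PySpark sketch of the query shape described 
above (column names and the threshold are placeholders); the rank filter is 
what the proposed top-N sort would exploit:

{code:python}
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.range(10000).withColumn("grp", F.col("id") % 10)

# A rank window function followed by a filter on the rank. Today the plan
# fully sorts each physical partition (SortExec); the idea above is to let
# the rank <= 100 predicate drive a top-N sort instead.
w = Window.partitionBy("grp").orderBy(F.desc("id"))
top100 = df.withColumn("rnk", F.rank().over(w)).filter(F.col("rnk") <= 100)
top100.explain()  # the physical plan currently shows Window over a full Sort
{code}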






[jira] [Updated] (SPARK-32123) [Python] Setting `spark.sql.session.timeZone` only partially respected

2020-06-28 Thread Toby Harradine (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toby Harradine updated SPARK-32123:
---
Affects Version/s: (was: 2.3.1)
   3.0.0

> [Python] Setting `spark.sql.session.timeZone` only partially respected
> --
>
> Key: SPARK-32123
> URL: https://issues.apache.org/jira/browse/SPARK-32123
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Toby Harradine
>Priority: Major
>  Labels: bulk-closed
>
> The setting `spark.sql.session.timeZone` is respected by PySpark when 
> converting from and to Pandas, as described 
> [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics].
>  However, when timestamps are converted directly to Python's `datetime` 
> objects, it is ignored and the system's timezone is used.
> This can be checked by the following code snippet:
> {code:java}
> import pyspark.sql
> spark = (pyspark
>  .sql
>  .SparkSession
>  .builder
>  .master('local[1]')
>  .config("spark.sql.session.timeZone", "UTC")
>  .getOrCreate()
> )
> df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
> df = df.withColumn("ts", df["ts"].astype("timestamp"))
> print(df.toPandas().iloc[0,0])
> print(df.collect()[0][0])
> {code}
> Which for me prints (the exact result depends on the timezone of your system, 
> mine is Europe/Berlin)
> {code:java}
> 2018-06-01 01:00:00
> 2018-06-01 03:00:00
> {code}
> Hence, the method `toPandas` respected the timezone setting (UTC), but the 
> method `collect` ignored it and converted the timestamp to my system's 
> timezone.
> The cause for this behaviour is that the methods `toInternal` and 
> `fromInternal` of PySpark's `TimestampType` class don't take the setting 
> `spark.sql.session.timeZone` into account and use the system timezone instead.
> If the maintainers agree that this should be fixed, I would try to come up 
> with a patch. 
>  
>  






[jira] [Created] (SPARK-32123) [Python] Setting `spark.sql.session.timeZone` only partially respected

2020-06-28 Thread Toby Harradine (Jira)
Toby Harradine created SPARK-32123:
--

 Summary: [Python] Setting `spark.sql.session.timeZone` only 
partially respected
 Key: SPARK-32123
 URL: https://issues.apache.org/jira/browse/SPARK-32123
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.1
Reporter: Toby Harradine


The setting `spark.sql.session.timeZone` is respected by PySpark when 
converting from and to Pandas, as described 
[here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics].
However, when timestamps are converted directly to Python's `datetime` objects, 
it is ignored and the system's timezone is used.

This can be checked by the following code snippet:
{code:java}
import pyspark.sql

spark = (pyspark
 .sql
 .SparkSession
 .builder
 .master('local[1]')
 .config("spark.sql.session.timeZone", "UTC")
 .getOrCreate()
)

df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
df = df.withColumn("ts", df["ts"].astype("timestamp"))

print(df.toPandas().iloc[0,0])
print(df.collect()[0][0])
{code}
Which for me prints (the exact result depends on the timezone of your system, 
mine is Europe/Berlin)
{code:java}
2018-06-01 01:00:00
2018-06-01 03:00:00
{code}
Hence, the method `toPandas` respected the timezone setting (UTC), but the 
method `collect` ignored it and converted the timestamp to my system's timezone.

The cause for this behaviour is that the methods `toInternal` and 
`fromInternal` of PySpark's `TimestampType` class don't take the setting 
`spark.sql.session.timeZone` into account and use the system timezone instead.

If the maintainers agree that this should be fixed, I would try to come up with 
a patch. 
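To make the cause concrete, here is a small standalone sketch (not PySpark's 
actual `TimestampType` implementation) of a conversion that honours an 
explicitly configured timezone instead of the system timezone; it assumes 
Python 3.9+ for `zoneinfo`:

{code:python}
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

def from_internal(ts_micros, session_tz="UTC"):
    """Convert microseconds since the Unix epoch (UTC) to a datetime in an
    explicitly configured timezone, rather than the system timezone."""
    return (datetime.fromtimestamp(ts_micros / 1_000_000, tz=timezone.utc)
            .astimezone(ZoneInfo(session_tz)))

# 2018-06-01 01:00:00 UTC expressed as internal microseconds
micros = int(datetime(2018, 6, 1, 1, tzinfo=timezone.utc).timestamp() * 1_000_000)
print(from_internal(micros, "UTC"))            # 2018-06-01 01:00:00+00:00
print(from_internal(micros, "Europe/Berlin"))  # 2018-06-01 03:00:00+02:00
{code}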

 

 






[jira] [Updated] (SPARK-32123) [Python] Setting `spark.sql.session.timeZone` only partially respected

2020-06-28 Thread Toby Harradine (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toby Harradine updated SPARK-32123:
---
Description: 
Reopening SPARK-25244 as it is unresolved as of versions 2.4.6 and 3.0.0.

The setting `spark.sql.session.timeZone` is respected by PySpark when 
converting from and to Pandas, as described 
[here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics].
However, when timestamps are converted directly to Python's `datetime` objects, 
it is ignored and the system's timezone is used.

This can be checked by the following code snippet:
{code:java}
import pyspark.sql

spark = (pyspark
 .sql
 .SparkSession
 .builder
 .master('local[1]')
 .config("spark.sql.session.timeZone", "UTC")
 .getOrCreate()
)

df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
df = df.withColumn("ts", df["ts"].astype("timestamp"))

print(df.toPandas().iloc[0,0])
print(df.collect()[0][0])
{code}
Which for me prints (the exact result depends on the timezone of your system, 
mine is Europe/Berlin)
{code:java}
2018-06-01 01:00:00
2018-06-01 03:00:00
{code}
Hence, the method `toPandas` respected the timezone setting (UTC), but the 
method `collect` ignored it and converted the timestamp to my system's timezone.

The cause for this behaviour is that the methods `toInternal` and 
`fromInternal` of PySpark's `TimestampType` class don't take the setting 
`spark.sql.session.timeZone` into account and use the system timezone instead.

If the maintainers agree that this should be fixed, I would try to come up with 
a patch. 

 

 

  was:
The setting `spark.sql.session.timeZone` is respected by PySpark when 
converting from and to Pandas, as described 
[here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics].
However, when timestamps are converted directly to Python's `datetime` objects, 
it is ignored and the system's timezone is used.

This can be checked by the following code snippet:
{code:java}
import pyspark.sql

spark = (pyspark
 .sql
 .SparkSession
 .builder
 .master('local[1]')
 .config("spark.sql.session.timeZone", "UTC")
 .getOrCreate()
)

df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
df = df.withColumn("ts", df["ts"].astype("timestamp"))

print(df.toPandas().iloc[0,0])
print(df.collect()[0][0])
{code}
Which for me prints (the exact result depends on the timezone of your system, 
mine is Europe/Berlin)
{code:java}
2018-06-01 01:00:00
2018-06-01 03:00:00
{code}
Hence, the method `toPandas` respected the timezone setting (UTC), but the 
method `collect` ignored it and converted the timestamp to my system's timezone.

The cause for this behaviour is that the methods `toInternal` and 
`fromInternal` of PySpark's `TimestampType` class don't take the setting 
`spark.sql.session.timeZone` into account and use the system timezone instead.

If the maintainers agree that this should be fixed, I would try to come up with 
a patch. 

 

 


> [Python] Setting `spark.sql.session.timeZone` only partially respected
> --
>
> Key: SPARK-32123
> URL: https://issues.apache.org/jira/browse/SPARK-32123
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Toby Harradine
>Priority: Major
>
> Reopening SPARK-25244 as it is unresolved as of versions 2.4.6 and 3.0.0.
> The setting `spark.sql.session.timeZone` is respected by PySpark when 
> converting from and to Pandas, as described 
> [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics].
>  However, when timestamps are converted directly to Python's `datetime` 
> objects, it is ignored and the system's timezone is used.
> This can be checked by the following code snippet:
> {code:java}
> import pyspark.sql
> spark = (pyspark
>  .sql
>  .SparkSession
>  .builder
>  .master('local[1]')
>  .config("spark.sql.session.timeZone", "UTC")
>  .getOrCreate()
> )
> df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
> df = df.withColumn("ts", df["ts"].astype("timestamp"))
> print(df.toPandas().iloc[0,0])
> print(df.collect()[0][0])
> {code}
> Which for me prints (the exact result depends on the timezone of your system, 
> mine is Europe/Berlin)
> {code:java}
> 2018-06-01 01:00:00
> 2018-06-01 03:00:00
> {code}
> Hence, the method `toPandas` respected the timezone setting (UTC), but the 
> method `collect` ignored it and converted the timestamp to my system's 
> timezone.
> The cause for this behaviour is that the methods `toInternal` and 
> `fromInternal` of PySpark's `TimestampType` class don't take the setting 
> `spark.sql.session.timeZone` into account and use the system timezone instead.

[jira] [Updated] (SPARK-32123) [Python] Setting `spark.sql.session.timeZone` only partially respected

2020-06-28 Thread Toby Harradine (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toby Harradine updated SPARK-32123:
---
Labels:   (was: bulk-closed)

> [Python] Setting `spark.sql.session.timeZone` only partially respected
> --
>
> Key: SPARK-32123
> URL: https://issues.apache.org/jira/browse/SPARK-32123
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Toby Harradine
>Priority: Major
>
> The setting `spark.sql.session.timeZone` is respected by PySpark when 
> converting from and to Pandas, as described 
> [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics].
>  However, when timestamps are converted directly to Python's `datetime` 
> objects, it is ignored and the system's timezone is used.
> This can be checked by the following code snippet:
> {code:java}
> import pyspark.sql
> spark = (pyspark
>  .sql
>  .SparkSession
>  .builder
>  .master('local[1]')
>  .config("spark.sql.session.timeZone", "UTC")
>  .getOrCreate()
> )
> df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
> df = df.withColumn("ts", df["ts"].astype("timestamp"))
> print(df.toPandas().iloc[0,0])
> print(df.collect()[0][0])
> {code}
> Which for me prints (the exact result depends on the timezone of your system, 
> mine is Europe/Berlin)
> {code:java}
> 2018-06-01 01:00:00
> 2018-06-01 03:00:00
> {code}
> Hence, the method `toPandas` respected the timezone setting (UTC), but the 
> method `collect` ignored it and converted the timestamp to my system's 
> timezone.
> The cause for this behaviour is that the methods `toInternal` and 
> `fromInternal` of PySpark's `TimestampType` class don't take the setting 
> `spark.sql.session.timeZone` into account and use the system timezone instead.
> If the maintainers agree that this should be fixed, I would try to come up 
> with a patch. 
>  
>  






[jira] [Updated] (SPARK-32123) [Python] Setting `spark.sql.session.timeZone` only partially respected

2020-06-28 Thread Toby Harradine (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toby Harradine updated SPARK-32123:
---
Description: 
Reopening SPARK-25244 as it is unresolved as of versions 2.4.6 and 3.0.0.

The setting `spark.sql.session.timeZone` is respected by PySpark when 
converting from and to Pandas, as described 
[here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics].
However, when timestamps are converted directly to Python's `datetime` objects, 
it is ignored and the system's timezone is used.

This can be checked by the following code snippet:
{code:java}
import pyspark.sql

spark = (pyspark
 .sql
 .SparkSession
 .builder
 .master('local[1]')
 .config("spark.sql.session.timeZone", "UTC")
 .getOrCreate()
)

df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
df = df.withColumn("ts", df["ts"].astype("timestamp"))

print(df.toPandas().iloc[0,0])
print(df.collect()[0][0])
{code}
Which for me prints (the exact result depends on the timezone of your system, 
mine is Europe/Berlin)
{code:java}
2018-06-01 01:00:00
2018-06-01 03:00:00
{code}
Hence, the method `toPandas` respected the timezone setting (UTC), but the 
method `collect` ignored it and converted the timestamp to my system's timezone.

The cause for this behaviour is that the methods `toInternal` and 
`fromInternal` of PySpark's `TimestampType` class don't take the setting 
`spark.sql.session.timeZone` into account and use the system timezone instead.

 

 

  was:
Reopening SPARK-25244 as it is unresolved as of versions 2.4.6 and 3.0.0.

The setting `spark.sql.session.timeZone` is respected by PySpark when 
converting from and to Pandas, as described 
[here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics].
However, when timestamps are converted directly to Python's `datetime` objects, 
it is ignored and the system's timezone is used.

This can be checked by the following code snippet:
{code:java}
import pyspark.sql

spark = (pyspark
 .sql
 .SparkSession
 .builder
 .master('local[1]')
 .config("spark.sql.session.timeZone", "UTC")
 .getOrCreate()
)

df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
df = df.withColumn("ts", df["ts"].astype("timestamp"))

print(df.toPandas().iloc[0,0])
print(df.collect()[0][0])
{code}
Which for me prints (the exact result depends on the timezone of your system, 
mine is Europe/Berlin)
{code:java}
2018-06-01 01:00:00
2018-06-01 03:00:00
{code}
Hence, the method `toPandas` respected the timezone setting (UTC), but the 
method `collect` ignored it and converted the timestamp to my system's timezone.

The cause for this behaviour is that the methods `toInternal` and 
`fromInternal` of PySpark's `TimestampType` class don't take the setting 
`spark.sql.session.timeZone` into account and use the system timezone instead.

If the maintainers agree that this should be fixed, I would try to come up with 
a patch. 

 

 


> [Python] Setting `spark.sql.session.timeZone` only partially respected
> --
>
> Key: SPARK-32123
> URL: https://issues.apache.org/jira/browse/SPARK-32123
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Toby Harradine
>Priority: Major
>
> Reopening SPARK-25244 as it is unresolved as of versions 2.4.6 and 3.0.0.
> The setting `spark.sql.session.timeZone` is respected by PySpark when 
> converting from and to Pandas, as described 
> [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics].
>  However, when timestamps are converted directly to Python's `datetime` 
> objects, it is ignored and the system's timezone is used.
> This can be checked by the following code snippet:
> {code:java}
> import pyspark.sql
> spark = (pyspark
>  .sql
>  .SparkSession
>  .builder
>  .master('local[1]')
>  .config("spark.sql.session.timeZone", "UTC")
>  .getOrCreate()
> )
> df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
> df = df.withColumn("ts", df["ts"].astype("timestamp"))
> print(df.toPandas().iloc[0,0])
> print(df.collect()[0][0])
> {code}
> Which for me prints (the exact result depends on the timezone of your system, 
> mine is Europe/Berlin)
> {code:java}
> 2018-06-01 01:00:00
> 2018-06-01 03:00:00
> {code}
> Hence, the method `toPandas` respected the timezone setting (UTC), but the 
> method `collect` ignored it and converted the timestamp to my system's 
> timezone.
> The cause for this behaviour is that the methods `toInternal` and 
> `fromInternal` of PySpark's `TimestampType` class don't take the setting 
> `spark.sql.session.timeZone` into account and use the system timezone instead.

[jira] [Updated] (SPARK-32123) [Python] Setting `spark.sql.session.timeZone` only partially respected

2020-06-28 Thread Toby Harradine (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Toby Harradine updated SPARK-32123:
---
Description: 
Reopening SPARK-25244 as it is unresolved as of versions 2.4.6 and 3.0.0.

The setting {{spark.sql.session.timeZone}} is respected by PySpark when 
converting from and to Pandas, as described 
[here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics].
However, when timestamps are converted directly to Python's {{datetime}} 
objects, it is ignored and the system's timezone is used.

This can be checked by the following code snippet:
{code:java}
import pyspark.sql

spark = (pyspark
 .sql
 .SparkSession
 .builder
 .master('local[1]')
 .config("spark.sql.session.timeZone", "UTC")
 .getOrCreate()
)

df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
df = df.withColumn("ts", df["ts"].astype("timestamp"))

print(df.toPandas().iloc[0,0])
print(df.collect()[0][0])
{code}
Which for me prints (the exact result depends on the timezone of your system, 
mine is Europe/Berlin)
{code:java}
2018-06-01 01:00:00
2018-06-01 03:00:00
{code}
Hence, the method {{toPandas}} respected the timezone setting (UTC), but the 
method {{collect}} ignored it and converted the timestamp to my system's 
timezone.

The cause for this behaviour is that the methods {{toInternal}} and 
{{fromInternal}} of PySpark's {{TimestampType}} class don't take the setting 
{{spark.sql.session.timeZone}} into account and use the system timezone instead.

 

 

  was:
Reopening SPARK-25244 as it is unresolved as of versions 2.4.6 and 3.0.0.

The setting `spark.sql.session.timeZone` is respected by PySpark when 
converting from and to Pandas, as described 
[here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics].
However, when timestamps are converted directly to Python's `datetime` objects, 
it is ignored and the system's timezone is used.

This can be checked by the following code snippet:
{code:java}
import pyspark.sql

spark = (pyspark
 .sql
 .SparkSession
 .builder
 .master('local[1]')
 .config("spark.sql.session.timeZone", "UTC")
 .getOrCreate()
)

df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
df = df.withColumn("ts", df["ts"].astype("timestamp"))

print(df.toPandas().iloc[0,0])
print(df.collect()[0][0])
{code}
Which for me prints (the exact result depends on the timezone of your system, 
mine is Europe/Berlin)
{code:java}
2018-06-01 01:00:00
2018-06-01 03:00:00
{code}
Hence, the method `toPandas` respected the timezone setting (UTC), but the 
method `collect` ignored it and converted the timestamp to my system's timezone.

The cause for this behaviour is that the methods `toInternal` and 
`fromInternal` of PySpark's `TimestampType` class don't take the setting 
`spark.sql.session.timeZone` into account and use the system timezone instead.

 

 


> [Python] Setting `spark.sql.session.timeZone` only partially respected
> --
>
> Key: SPARK-32123
> URL: https://issues.apache.org/jira/browse/SPARK-32123
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Toby Harradine
>Priority: Major
>
> Reopening SPARK-25244 as it is unresolved as of versions 2.4.6 and 3.0.0.
> The setting {{spark.sql.session.timeZone}} is respected by PySpark when 
> converting from and to Pandas, as described 
> [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics].
>  However, when timestamps are converted directly to Python's {{datetime}} 
> objects, it is ignored and the system's timezone is used.
> This can be checked by the following code snippet:
> {code:java}
> import pyspark.sql
> spark = (pyspark
>  .sql
>  .SparkSession
>  .builder
>  .master('local[1]')
>  .config("spark.sql.session.timeZone", "UTC")
>  .getOrCreate()
> )
> df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
> df = df.withColumn("ts", df["ts"].astype("timestamp"))
> print(df.toPandas().iloc[0,0])
> print(df.collect()[0][0])
> {code}
> Which for me prints (the exact result depends on the timezone of your system, 
> mine is Europe/Berlin)
> {code:java}
> 2018-06-01 01:00:00
> 2018-06-01 03:00:00
> {code}
> Hence, the method {{toPandas}} respected the timezone setting (UTC), but the 
> method {{collect}} ignored it and converted the timestamp to my system's 
> timezone.
> The cause for this behaviour is that the methods {{toInternal}} and 
> {{fromInternal}} of PySpark's {{TimestampType}} class don't take the setting 
> {{spark.sql.session.timeZone}} into account and use the system timezone instead.

[jira] [Commented] (SPARK-25244) [Python] Setting `spark.sql.session.timeZone` only partially respected

2020-06-28 Thread Toby Harradine (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147464#comment-17147464
 ] 

Toby Harradine commented on SPARK-25244:


Thanks for letting me know.

I've just created SPARK-32123, which marks the affected version as 3.0.0.

> [Python] Setting `spark.sql.session.timeZone` only partially respected
> --
>
> Key: SPARK-25244
> URL: https://issues.apache.org/jira/browse/SPARK-25244
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Anton Daitche
>Priority: Major
>  Labels: bulk-closed
>
> The setting `spark.sql.session.timeZone` is respected by PySpark when 
> converting from and to Pandas, as described 
> [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics].
>  However, when timestamps are converted directly to Python's `datetime` 
> objects, it is ignored and the system's timezone is used.
> This can be checked by the following code snippet:
> {code:java}
> import pyspark.sql
> spark = (pyspark
>  .sql
>  .SparkSession
>  .builder
>  .master('local[1]')
>  .config("spark.sql.session.timeZone", "UTC")
>  .getOrCreate()
> )
> df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
> df = df.withColumn("ts", df["ts"].astype("timestamp"))
> print(df.toPandas().iloc[0,0])
> print(df.collect()[0][0])
> {code}
> Which for me prints (the exact result depends on the timezone of your system, 
> mine is Europe/Berlin)
> {code:java}
> 2018-06-01 01:00:00
> 2018-06-01 03:00:00
> {code}
> Hence, the method `toPandas` respected the timezone setting (UTC), but the 
> method `collect` ignored it and converted the timestamp to my system's 
> timezone.
> The cause for this behaviour is that the methods `toInternal` and 
> `fromInternal` of PySpark's `TimestampType` class don't take the setting 
> `spark.sql.session.timeZone` into account and use the system timezone instead.
> If the maintainers agree that this should be fixed, I would try to come up 
> with a patch. 
>  
>  






[jira] [Comment Edited] (SPARK-25244) [Python] Setting `spark.sql.session.timeZone` only partially respected

2020-06-28 Thread Toby Harradine (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147464#comment-17147464
 ] 

Toby Harradine edited comment on SPARK-25244 at 6/28/20, 10:44 PM:
---

Thanks for letting me know.

I've just created SPARK-32123, which marks the affected version as 3.0.0. The 
reproduction steps and analysis are the same as here.


was (Author: toby.harradine):
Thanks for letting me know.

I've just created SPARK-32123, which marks the affected version as 3.0.0.

> [Python] Setting `spark.sql.session.timeZone` only partially respected
> --
>
> Key: SPARK-25244
> URL: https://issues.apache.org/jira/browse/SPARK-25244
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Anton Daitche
>Priority: Major
>  Labels: bulk-closed
>
> The setting `spark.sql.session.timeZone` is respected by PySpark when 
> converting from and to Pandas, as described 
> [here|http://spark.apache.org/docs/latest/sql-programming-guide.html#timestamp-with-time-zone-semantics].
>  However, when timestamps are converted directly to Python's `datetime` 
> objects, it is ignored and the system's timezone is used.
> This can be checked by the following code snippet:
> {code:java}
> import pyspark.sql
> spark = (pyspark
>  .sql
>  .SparkSession
>  .builder
>  .master('local[1]')
>  .config("spark.sql.session.timeZone", "UTC")
>  .getOrCreate()
> )
> df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
> df = df.withColumn("ts", df["ts"].astype("timestamp"))
> print(df.toPandas().iloc[0,0])
> print(df.collect()[0][0])
> {code}
> Which for me prints (the exact result depends on the timezone of your system, 
> mine is Europe/Berlin)
> {code:java}
> 2018-06-01 01:00:00
> 2018-06-01 03:00:00
> {code}
> Hence, the method `toPandas` respected the timezone setting (UTC), but the 
> method `collect` ignored it and converted the timestamp to my system's 
> timezone.
> The cause for this behaviour is that the methods `toInternal` and 
> `fromInternal` of PySpark's `TimestampType` class don't take the setting 
> `spark.sql.session.timeZone` into account and use the system timezone instead.
> If the maintainers agree that this should be fixed, I would try to come up 
> with a patch. 
>  
>  






[jira] [Created] (SPARK-32124) [SHS] Failed to parse FetchFailed TaskEndReason from event log produced by Spark 2.4

2020-06-28 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-32124:


 Summary: [SHS] Failed to parse FetchFailed TaskEndReason from 
event log produced by Spark 2.4 
 Key: SPARK-32124
 URL: https://issues.apache.org/jira/browse/SPARK-32124
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Zhongwei Zhu


When reading an event log produced by Spark 2.4.4, parsing TaskEndReason failed 
due to the missing field "Map Index".
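For illustration, a minimal sketch of the backward-compatible parsing idea (not 
Spark's actual JsonProtocol code; the JSON fields shown are abbreviated and the 
default value is only an example):

{code:python}
import json

# A FetchFailed task-end reason roughly as an older event log might record it,
# without the "Map Index" field that newer versions write.
event = json.loads("""
  {"Reason": "FetchFailed", "Shuffle ID": 7, "Map ID": 42, "Reduce ID": 3}
""")

# Treat "Map Index" as optional so that old logs still parse.
map_index = event.get("Map Index", -1)
print(map_index)  # -1 when the field is absent
{code}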

 






[jira] [Commented] (SPARK-32124) [SHS] Failed to parse FetchFailed TaskEndReason from event log produced by Spark 2.4

2020-06-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147465#comment-17147465
 ] 

Apache Spark commented on SPARK-32124:
--

User 'warrenzhu25' has created a pull request for this issue:
https://github.com/apache/spark/pull/28941

> [SHS] Failed to parse FetchFailed TaskEndReason from event log produced by 
> Spark 2.4 
> 
>
> Key: SPARK-32124
> URL: https://issues.apache.org/jira/browse/SPARK-32124
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Zhongwei Zhu
>Priority: Minor
>
> When reading an event log produced by Spark 2.4.4, parsing TaskEndReason 
> failed due to the missing field "Map Index".
>  






[jira] [Assigned] (SPARK-32124) [SHS] Failed to parse FetchFailed TaskEndReason from event log produced by Spark 2.4

2020-06-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32124:


Assignee: (was: Apache Spark)

> [SHS] Failed to parse FetchFailed TaskEndReason from event log produced by 
> Spark 2.4 
> 
>
> Key: SPARK-32124
> URL: https://issues.apache.org/jira/browse/SPARK-32124
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Zhongwei Zhu
>Priority: Minor
>
> When reading an event log produced by Spark 2.4.4, parsing TaskEndReason 
> failed due to the missing field "Map Index".
>  






[jira] [Assigned] (SPARK-32124) [SHS] Failed to parse FetchFailed TaskEndReason from event log produced by Spark 2.4

2020-06-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32124:


Assignee: Apache Spark

> [SHS] Failed to parse FetchFailed TaskEndReason from event log produced by 
> Spark 2.4 
> 
>
> Key: SPARK-32124
> URL: https://issues.apache.org/jira/browse/SPARK-32124
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Zhongwei Zhu
>Assignee: Apache Spark
>Priority: Minor
>
> When reading an event log produced by Spark 2.4.4, parsing TaskEndReason 
> failed due to the missing field "Map Index".
>  






[jira] [Commented] (SPARK-32096) Support top-N sort for Spark SQL rank window function

2020-06-28 Thread Zikun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147538#comment-17147538
 ] 

Zikun commented on SPARK-32096:
---

[~viirya] Yes, a filter of window rank <= 100 means a top-100 sort. And yes 
again for your second statement. The filter predicate needs to be applied on 
each window partition. 

> Support top-N sort for Spark SQL rank window function
> -
>
> Key: SPARK-32096
> URL: https://issues.apache.org/jira/browse/SPARK-32096
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Any environment that supports Spark.
>Reporter: Zikun
>Priority: Major
>
> In Spark SQL, there are two sort execution operators, *_SortExec_* and 
> *_TakeOrderedAndProjectExec_*.
> *_SortExec_* is a general sort operator and does not support top-N sort.
> *_TakeOrderedAndProjectExec_* is the operator for top-N sort in Spark.
> The Spark SQL rank window function needs to sort the data locally, and it 
> relies on *_SortExec_* to sort the data in each physical data partition. 
> When a filter on the window rank (e.g. rank <= 100) is specified in a user's 
> query, the filter can actually be pushed down to SortExec so that SortExec 
> performs a top-N sort.
> Right now SortExec does not support top-N sort, so we need to extend 
> SortExec with that capability.
> Alternatively, if SortExec is not considered the right place for this, we can 
> create a new execution plan called topNSortExec to do the top-N sort in each 
> local partition when a filter on the window rank is specified.






[jira] [Commented] (SPARK-32096) Support top-N sort for Spark SQL rank window function

2020-06-28 Thread Zikun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147548#comment-17147548
 ] 

Zikun commented on SPARK-32096:
---

Yes, we need to do top-N sort for each window partition in each physical 
partition.

> Support top-N sort for Spark SQL rank window function
> -
>
> Key: SPARK-32096
> URL: https://issues.apache.org/jira/browse/SPARK-32096
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Any environment that supports Spark.
>Reporter: Zikun
>Priority: Major
>
> In Spark SQL, there are two sort execution operators, *_SortExec_* and 
> *_TakeOrderedAndProjectExec_*.
> *_SortExec_* is a general sort operator and does not support top-N sort.
> *_TakeOrderedAndProjectExec_* is the operator for top-N sort in Spark.
> The Spark SQL rank window function needs to sort the data locally, and it 
> relies on *_SortExec_* to sort the data in each physical data partition. 
> When a filter on the window rank (e.g. rank <= 100) is specified in a user's 
> query, the filter can actually be pushed down to SortExec so that SortExec 
> performs a top-N sort.
> Right now SortExec does not support top-N sort, so we need to extend 
> SortExec with that capability.
> Alternatively, if SortExec is not considered the right place for this, we can 
> create a new execution plan called topNSortExec to do the top-N sort in each 
> local partition when a filter on the window rank is specified.






[jira] [Created] (SPARK-32122) Exception while writing dataframe with enum fields

2020-06-28 Thread Sai kiran Krishna murthy (Jira)
Sai kiran Krishna murthy created SPARK-32122:


 Summary: Exception while writing dataframe with enum fields
 Key: SPARK-32122
 URL: https://issues.apache.org/jira/browse/SPARK-32122
 Project: Spark
  Issue Type: Question
  Components: SQL
Affects Versions: 2.4.3
Reporter: Sai kiran Krishna murthy


I have an Avro schema with one field that is an enum, and I am trying to 
enforce this schema when writing my dataframe. The code looks something like 
this:
{code:java}
case class Name1(id:String,count:Int,val_type:String)

val schema = """{
 |  "type" : "record",
 |  "name" : "name1",
 |  "namespace" : "com.data",
 |  "fields" : [
 |  {
 |"name" : "id",
 |"type" : "string"
 |  },
 |  {
 |"name" : "count",
 |"type" : "int"
 |  },
 |  {
 |"name" : "val_type",
 |"type" : {
 |  "type" : "enum",
 |  "name" : "ValType",
 |  "symbols" : [ "s1", "s2" ]
 |}
 |  }
 |  ]
 |}""".stripMargin

val df = Seq(
Name1("1",2,"s1"),
Name1("1",3,"s2"),
Name1("1",4,"s2"),
Name1("11",2,"s1")).toDF()

df.write.format("avro").option("avroSchema",schema).save("data/tes2/")
{code}
This code fails with the following exception:

 
{noformat}
2020-06-28 23:28:10 ERROR Utils:91 - Aborting task
org.apache.avro.AvroRuntimeException: Not a union: "string"
at org.apache.avro.Schema.getTypes(Schema.java:299)
at 
org.apache.spark.sql.avro.AvroSerializer.org$apache$spark$sql$avro$AvroSerializer$$resolveNullableType(AvroSerializer.scala:229)
at 
org.apache.spark.sql.avro.AvroSerializer$$anonfun$3.apply(AvroSerializer.scala:209)
at 
org.apache.spark.sql.avro.AvroSerializer$$anonfun$3.apply(AvroSerializer.scala:208)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:296)
at 
org.apache.spark.sql.avro.AvroSerializer.newStructConverter(AvroSerializer.scala:208)
at 
org.apache.spark.sql.avro.AvroSerializer.<init>(AvroSerializer.scala:51)
at 
org.apache.spark.sql.avro.AvroOutputWriter.serializer$lzycompute(AvroOutputWriter.scala:42)
at 
org.apache.spark.sql.avro.AvroOutputWriter.serializer(AvroOutputWriter.scala:42)
at 
org.apache.spark.sql.avro.AvroOutputWriter.write(AvroOutputWriter.scala:64)
at 
org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:137)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:245)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2020-06-28 23:28:10 ERROR Utils:91 - Aborting task{noformat}
 

I understand this is because the type of val_type is `String` in the case 
class. Can you please advise how I can solve this problem without having to 
change the underlying Avro schema? 

Thanks!




[jira] [Commented] (SPARK-32125) [UI] Support get taskList by status in Web UI and SHS Rest API

2020-06-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147469#comment-17147469
 ] 

Apache Spark commented on SPARK-32125:
--

User 'warrenzhu25' has created a pull request for this issue:
https://github.com/apache/spark/pull/28942

> [UI] Support get taskList by status in Web UI and SHS Rest API
> --
>
> Key: SPARK-32125
> URL: https://issues.apache.org/jira/browse/SPARK-32125
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Zhongwei Zhu
>Priority: Minor
>
> Support fetching taskList by status as below:
> /applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/taskList?status=failed






[jira] [Commented] (SPARK-32125) [UI] Support get taskList by status in Web UI and SHS Rest API

2020-06-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147470#comment-17147470
 ] 

Apache Spark commented on SPARK-32125:
--

User 'warrenzhu25' has created a pull request for this issue:
https://github.com/apache/spark/pull/28942

> [UI] Support get taskList by status in Web UI and SHS Rest API
> --
>
> Key: SPARK-32125
> URL: https://issues.apache.org/jira/browse/SPARK-32125
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Zhongwei Zhu
>Priority: Minor
>
> Support fetching taskList by status as below:
> /applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/taskList?status=failed






[jira] [Assigned] (SPARK-32125) [UI] Support get taskList by status in Web UI and SHS Rest API

2020-06-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32125:


Assignee: (was: Apache Spark)

> [UI] Support get taskList by status in Web UI and SHS Rest API
> --
>
> Key: SPARK-32125
> URL: https://issues.apache.org/jira/browse/SPARK-32125
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Zhongwei Zhu
>Priority: Minor
>
> Support fetching taskList by status as below:
> /applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/taskList?status=failed






[jira] [Assigned] (SPARK-32125) [UI] Support get taskList by status in Web UI and SHS Rest API

2020-06-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32125:


Assignee: Apache Spark

> [UI] Support get taskList by status in Web UI and SHS Rest API
> --
>
> Key: SPARK-32125
> URL: https://issues.apache.org/jira/browse/SPARK-32125
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Zhongwei Zhu
>Assignee: Apache Spark
>Priority: Minor
>
> Support fetching taskList by status as below:
> /applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/taskList?status=failed






[jira] [Assigned] (SPARK-32124) [SHS] Failed to parse FetchFailed TaskEndReason from event log produced by Spark 2.4

2020-06-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-32124:
-

Assignee: Zhongwei Zhu

> [SHS] Failed to parse FetchFailed TaskEndReason from event log produced by 
> Spark 2.4 
> 
>
> Key: SPARK-32124
> URL: https://issues.apache.org/jira/browse/SPARK-32124
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Zhongwei Zhu
>Assignee: Zhongwei Zhu
>Priority: Minor
>
> When reading an event log produced by Spark 2.4.4, parsing TaskEndReason 
> failed due to the missing field "Map Index".
>  






[jira] [Resolved] (SPARK-32124) [SHS] Failed to parse FetchFailed TaskEndReason from event log produced by Spark 2.4

2020-06-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32124.
---
Fix Version/s: 3.1.0
   3.0.1
   Resolution: Fixed

Issue resolved by pull request 28941
[https://github.com/apache/spark/pull/28941]

> [SHS] Failed to parse FetchFailed TaskEndReason from event log produced by 
> Spark 2.4 
> 
>
> Key: SPARK-32124
> URL: https://issues.apache.org/jira/browse/SPARK-32124
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Zhongwei Zhu
>Assignee: Zhongwei Zhu
>Priority: Minor
> Fix For: 3.0.1, 3.1.0
>
>
> When reading an event log produced by Spark 2.4.4, parsing TaskEndReason 
> failed due to the missing field "Map Index".
>  






[jira] [Created] (SPARK-32125) [UI] Support get taskList by status in Web UI and SHS Rest API

2020-06-28 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-32125:


 Summary: [UI] Support get taskList by status in Web UI and SHS 
Rest API
 Key: SPARK-32125
 URL: https://issues.apache.org/jira/browse/SPARK-32125
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 3.0.0
Reporter: Zhongwei Zhu


Support fetching taskList by status as below:

/applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/taskList?status=failed
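For example, the proposed endpoint could then be queried like this (host, port, 
and IDs are placeholders; 18080 is the default History Server port):

{code:python}
import requests

base = "http://localhost:18080/api/v1"  # Spark History Server REST root
app_id, stage_id, attempt_id = "app-20200628000000-0001", 3, 0

url = f"{base}/applications/{app_id}/stages/{stage_id}/{attempt_id}/taskList"
failed_tasks = requests.get(url, params={"status": "failed"}).json()
print(len(failed_tasks), "failed tasks")
{code}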






[jira] [Commented] (SPARK-32096) Support top-N sort for Spark SQL rank window function

2020-06-28 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147504#comment-17147504
 ] 

L. C. Hsieh commented on SPARK-32096:
-

Does a filter on the window rank (e.g. rank <= 100) mean a top-100 sort? Such a 
filter means the rows with rank <= 100 for each window partition. Each physical 
partition could contain many window partitions, so the filter predicate needs 
to be applied to each window partition.

> Support top-N sort for Spark SQL rank window function
> -
>
> Key: SPARK-32096
> URL: https://issues.apache.org/jira/browse/SPARK-32096
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Any environment that supports Spark.
>Reporter: Zikun
>Priority: Major
>
> In Spark SQL, there are two sort execution operators, *_SortExec_* and 
> *_TakeOrderedAndProjectExec_*.
> *_SortExec_* is a general sort operator and does not support top-N sort.
> *_TakeOrderedAndProjectExec_* is the operator for top-N sort in Spark.
> The Spark SQL rank window function needs to sort the data locally, and it 
> relies on *_SortExec_* to sort the data in each physical data partition. 
> When a filter on the window rank (e.g. rank <= 100) is specified in a user's 
> query, the filter can actually be pushed down to SortExec so that SortExec 
> performs a top-N sort.
> Right now SortExec does not support top-N sort, so we need to extend 
> SortExec with that capability.
> Alternatively, if SortExec is not considered the right place for this, we can 
> create a new execution plan called topNSortExec to do the top-N sort in each 
> local partition when a filter on the window rank is specified.






[jira] [Resolved] (SPARK-32126) Scope Session.active in IncrementalExecution

2020-06-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32126.
---
Fix Version/s: 3.1.0
   3.0.1
   Resolution: Fixed

Issue resolved by pull request 28936
[https://github.com/apache/spark/pull/28936]

> Scope Session.active in IncrementalExecution
> 
>
> Key: SPARK-32126
> URL: https://issues.apache.org/jira/browse/SPARK-32126
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>







[jira] [Assigned] (SPARK-32126) Scope Session.active in IncrementalExecution

2020-06-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-32126:
-

Assignee: Yuanjian Li

> Scope Session.active in IncrementalExecution
> 
>
> Key: SPARK-32126
> URL: https://issues.apache.org/jira/browse/SPARK-32126
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>







[jira] [Commented] (SPARK-32096) Support top-N sort for Spark SQL rank window function

2020-06-28 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147542#comment-17147542
 ] 

L. C. Hsieh commented on SPARK-32096:
-

Then I think it is not simply a top-N sort...

You need to do a top-N sort for each window partition in each physical partition.

> Support top-N sort for Spark SQL rank window function
> -
>
> Key: SPARK-32096
> URL: https://issues.apache.org/jira/browse/SPARK-32096
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Any environment that supports Spark.
>Reporter: Zikun
>Priority: Major
>
> In Spark SQL, there are two types of sort execution, *_SortExec_* and 
> *_TakeOrderedAndProjectExec_*.
> *_SortExec_* is a general sort execution and does not support top-N sort.
> *_TakeOrderedAndProjectExec_* is the execution for top-N sort in Spark.
> The Spark SQL rank window function needs to sort the data locally, and it relies 
> on the execution plan *_SortExec_* to sort the data in each physical data 
> partition. When a filter on the window rank (e.g. rank <= 100) is specified 
> in a user's query, the filter can actually be pushed down to SortExec, and 
> then we can let SortExec perform a top-N sort.
> Right now SortExec does not support top-N sort, so we need to extend its 
> capability to support it.
> Alternatively, if SortExec is not considered the right execution choice, we can 
> create a new execution plan called topNSortExec that does a top-N sort in each 
> local partition when a filter on the window rank is specified.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31823) Improve the current Spark Scheduler test framework

2020-06-28 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147224#comment-17147224
 ] 

jiaan.geng commented on SPARK-31823:


This ticket is just for improving the test framework of the Spark Scheduler. The 
initial request came from [~jiangxb]

> Improve the current Spark Scheduler test framework
> --
>
> Key: SPARK-31823
> URL: https://issues.apache.org/jira/browse/SPARK-31823
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Tests
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Minor
>
> The major sources of Spark Scheduler unit test cases are 
> DAGSchedulerSuite, TaskSchedulerImplSuite, and TaskSetManagerSuite. These test 
> suites have played an important role in ensuring the Spark Scheduler behaves as 
> we expect; however, we should significantly improve them now to provide a 
> better organized and more extensible test framework, to further support 
> the evolution of the Spark Scheduler.
> The major limitations of the current Spark Scheduler test framework:
> * The test framework was designed at a very early stage of Spark, so it 
> doesn't integrate well with features introduced later, e.g. barrier 
> execution, indeterminate stages, zombie task sets, and resource profiles.
> * Many test cases are added in a hacky way and don't fully utilize or extend the 
> original test framework (while they could have), which leads to a heavy 
> maintenance burden.
> * The test cases are not well organized; many are appended case by 
> case, and each test file consists of thousands of LOCs.
> * Flaky test cases are frequently introduced because there is no standard way to 
> generate test data and verify the results.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31095) Upgrade netty-all to 4.1.47.Final

2020-06-28 Thread Xiaochen Ouyang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147241#comment-17147241
 ] 

Xiaochen Ouyang commented on SPARK-31095:
-

Hello [~dongjoon], can the netty-all upgrade solve the CVE-2020-9480 security 
vulnerability mentioned on the Spark official website? Thanks!

> Upgrade netty-all to 4.1.47.Final
> -
>
> Key: SPARK-31095
> URL: https://issues.apache.org/jira/browse/SPARK-31095
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.5, 3.0.0, 3.1.0
>Reporter: Vishwas Vijaya Kumar
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: security
> Fix For: 2.4.6, 3.0.0
>
>
> Upgrade version of io.netty_netty-all to 4.1.44.Final 
> [CVE-2019-20445|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-20445]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32114) Change name of the slaves file, to something more acceptable

2020-06-28 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147227#comment-17147227
 ] 

L. C. Hsieh commented on SPARK-32114:
-

I think this might be a duplicate of SPARK-32004.

> Change name of the slaves file, to something more acceptable
> 
>
> Key: SPARK-32114
> URL: https://issues.apache.org/jira/browse/SPARK-32114
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Arvind Krishnan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32118) Use fine-grained read write lock for each database in HiveExternalCatalog

2020-06-28 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-32118:
--

 Summary: Use fine-grained read write lock for each database in 
HiveExternalCatalog
 Key: SPARK-32118
 URL: https://issues.apache.org/jira/browse/SPARK-32118
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0, 3.1.0
Reporter: Lantao Jin


In HiveExternalCatalog, all metastore operations are synchronized on the same 
object lock. In a high-traffic Spark Thrift Server or Spark driver, users' 
queries may be stuck behind any long-running operation. For example, if a user is 
accessing a table which contains a huge number of partitions, the operation 
loadDynamicPartitions() holds the object lock for a long time, and all other 
queries block waiting for the lock.
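
A minimal sketch (not the change in the linked PR; the object and method names below are made up) of what a per-database read-write lock could look like, assuming each metastore call can be classified as a read or a write:
{code:scala}
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.locks.ReentrantReadWriteLock

// One ReentrantReadWriteLock per database instead of a single global lock:
// reads on the same database can run concurrently, writes are exclusive,
// and operations on different databases never block each other.
object DatabaseLocks {
  private val locks = new ConcurrentHashMap[String, ReentrantReadWriteLock]()

  private def lockFor(db: String): ReentrantReadWriteLock =
    locks.computeIfAbsent(db, _ => new ReentrantReadWriteLock())

  def withReadLock[T](db: String)(body: => T): T = {
    val lock = lockFor(db).readLock()
    lock.lock()
    try body finally lock.unlock()
  }

  def withWriteLock[T](db: String)(body: => T): T = {
    val lock = lockFor(db).writeLock()
    lock.lock()
    try body finally lock.unlock()
  }
}
{code}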



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32118) Use fine-grained read write lock for each database in HiveExternalCatalog

2020-06-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32118:


Assignee: Apache Spark

> Use fine-grained read write lock for each database in HiveExternalCatalog
> -
>
> Key: SPARK-32118
> URL: https://issues.apache.org/jira/browse/SPARK-32118
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Lantao Jin
>Assignee: Apache Spark
>Priority: Major
>
> In HiveExternalCatalog, all metastore operations are synchronized on the same 
> object lock. In a high-traffic Spark Thrift Server or Spark driver, users' 
> queries may be stuck behind any long-running operation. For example, if a user is 
> accessing a table which contains a huge number of partitions, the operation 
> loadDynamicPartitions() holds the object lock for a long time, and all other 
> queries block waiting for the lock.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32118) Use fine-grained read write lock for each database in HiveExternalCatalog

2020-06-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147353#comment-17147353
 ] 

Apache Spark commented on SPARK-32118:
--

User 'LantaoJin' has created a pull request for this issue:
https://github.com/apache/spark/pull/28938

> Use fine-grained read write lock for each database in HiveExternalCatalog
> -
>
> Key: SPARK-32118
> URL: https://issues.apache.org/jira/browse/SPARK-32118
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>
> In HiveExternalCatalog, all metastore operations are synchronized on the same 
> object lock. In a high-traffic Spark Thrift Server or Spark driver, users' 
> queries may be stuck behind any long-running operation. For example, if a user is 
> accessing a table which contains a huge number of partitions, the operation 
> loadDynamicPartitions() holds the object lock for a long time, and all other 
> queries block waiting for the lock.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32118) Use fine-grained read write lock for each database in HiveExternalCatalog

2020-06-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32118:


Assignee: (was: Apache Spark)

> Use fine-grained read write lock for each database in HiveExternalCatalog
> -
>
> Key: SPARK-32118
> URL: https://issues.apache.org/jira/browse/SPARK-32118
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>
> In HiveExternalCatalog, all metastore operations are synchronized on the same 
> object lock. In a high-traffic Spark Thrift Server or Spark driver, users' 
> queries may be stuck behind any long-running operation. For example, if a user is 
> accessing a table which contains a huge number of partitions, the operation 
> loadDynamicPartitions() holds the object lock for a long time, and all other 
> queries block waiting for the lock.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32109) SQL hash function handling of nulls makes collision too likely

2020-06-28 Thread Chen Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147328#comment-17147328
 ] 

Chen Zhang commented on SPARK-32109:


The logic in the source code can be represented by the following pseudocode.
{code:scala}
def computeHash(value: Any, hashSeed: Long): Long = {
  value match {
case null => hashSeed
case b: Boolean => hashInt(if (b) 1 else 0, hashSeed)  // Murmur3Hash
case i: Int => hashInt(i, hashSeed)
...
  }
}
val seed = 42L
var hash = seed
var i = 0
val len = columns.length
while (i < len) {
  hash = computeHash(columns(i).value, hash)
  i += 1
}
hash
{code}
I can solve this problem by modifying the following code
 (the eval and doGenCode functions in the 
org.apache.spark.sql.catalyst.expressions.HashExpression class):
{code:scala}
override def eval(input: InternalRow = null): Any = {
  var hash = seed
  var i = 0
  val len = children.length
  while (i < len) {
//hash = computeHash(children(i).eval(input), children(i).dataType, hash)
hash = (31 * hash) + computeHash(children(i).eval(input), 
children(i).dataType, hash)
i += 1
  }
  hash
}
{code}
But I don't think it's necessary to modify the code, and if we do, it will 
affect the existing data distribution.
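
(Because the null case simply returns the current seed unchanged, a null contributes nothing to the running hash, which is why hash('bar', null) and hash(null, 'bar') collide.) A possible user-side workaround, rather than changing Spark, is to substitute a distinct sentinel per column for nulls before hashing; the column names and sentinel strings below are made up:
{code:scala}
import org.apache.spark.sql.functions.{coalesce, col, hash, lit}
import spark.implicits._  // assumes an active SparkSession named spark (as in spark-shell)

// Replace nulls with per-column sentinels so the position of a null
// affects the resulting hash value.
val df = Seq(("john", null), (null, "john")).toDF("name", "alias")
val hashed = df.withColumn("hash",
  hash(coalesce(col("name"), lit("__null_name__")),
       coalesce(col("alias"), lit("__null_alias__"))))
hashed.show()
{code}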

> SQL hash function handling of nulls makes collision too likely
> --
>
> Key: SPARK-32109
> URL: https://issues.apache.org/jira/browse/SPARK-32109
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: koert kuipers
>Priority: Minor
>
> This ticket is about org.apache.spark.sql.functions.hash and Spark's handling 
> of nulls when hashing sequences.
> {code:java}
> scala> spark.sql("SELECT hash('bar', null)").show()
> +---+
> |hash(bar, NULL)|
> +---+
> |-1808790533|
> +---+
> scala> spark.sql("SELECT hash(null, 'bar')").show()
> +---+
> |hash(NULL, bar)|
> +---+
> |-1808790533|
> +---+
>  {code}
> These are different sequences, e.g. these could be positions 0 and 1 in a 
> DataFrame, which are different columns with entirely different meanings. The 
> hashes should not be the same.
> another example:
> {code:java}
> scala> Seq(("john", null), (null, "john")).toDF("name", 
> "alias").withColumn("hash", hash(col("name"), col("alias"))).show
> ++-+-+
> |name|alias| hash|
> ++-+-+
> |john| null|487839701|
> |null| john|487839701|
> ++-+-+ {code}
> Instead of ignoring nulls, each null should apply a transform to the hash so that 
> the order of elements, including the nulls, matters for the outcome.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32060) Huber loss Convergence

2020-06-28 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-32060:
-
Attachment: huber.xlsx

> Huber loss Convergence
> --
>
> Key: SPARK-32060
> URL: https://issues.apache.org/jira/browse/SPARK-32060
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Minor
> Attachments: huber.xlsx
>
>
> |performance test in https://issues.apache.org/jira/browse/SPARK-31783,
>  Huber loss seems to start to diverge after 70 iters.
>   {code:java}
>  for (size <- Seq(1, 4, 16, 64); iter <- Seq(10, 50, 100)) {
>     Thread.sleep(1)
>     val hlir = new 
> LinearRegression().setLoss("huber").setSolver("l-bfgs").setMaxIter(iter).setTol(0)
>     val start = System.currentTimeMillis
>     val model = hlir.setBlockSize(size).fit(df)
>     val end = System.currentTimeMillis
>     println((model.uid, size, iter, end - start, 
> model.summary.objectiveHistory.last, model.summary.totalIterations, 
> model.coefficients.toString.take(100)))
> }{code}|
> | |
> | |
> | |
> | |
> | |
> | |
> | |
> | |
> |result:|
> |blockSize=1|
> |(linReg_887d29a0b42b,1,10,34222,12.600287516874573,11,[-1.128806276706593,8.677674008637235,9.388511222747894,8.55780534824698,34.241366265505654,26.96490)|
> |(linReg_fa87d52d3e2f,1,50,134017,1.7265674039265724,51,[-1.2409375311919224,-0.36565818648554393,1.0271741000977583,-0.5264376930209739,-1.544463380879014,)|
> |(linReg_b2a07f6fa653,1,100,259137,0.7519335552972538,101,[-0.3821288691282684,0.22040814987367136,0.07747613675383101,0.16130205219214436,1.2347926613828966,)|
> blockSize=4|
> |(linReg_779f6890aee9,4,10,7241,12.600287516879131,11,[-1.128806276706101,8.677674008649985,9.38851122275203,8.557805348259139,34.241366265511715,26.96490)|
> |(linReg_0e6d961e054f,4,50,11691,1.726567383577527,51,[-1.2409376473684588,-0.3656580427637058,1.0271741488856692,-0.5264377459728347,-1.5444635623477996,)|
> |(linReg_1e12fafab7d2,4,100,17966,0.796858465032771,101,[-0.014663920062692357,-0.057216366204118345,0.1764582527782608,0.12141286532514688,1.58266258533765)|
> blockSize=16|
> |(linReg_5ad195c843bb,16,10,7338,12.600287516896273,11,[-1.1288062767576779,8.677674008672964,9.388511222753797,8.557805348281347,34.24136626552257,26.9649)|
> |(linReg_686fe7849c42,16,50,12093,1.7265673762478049,51,[-1.2409376965631724,-0.3656579898205299,1.0271741857198382,-0.5264377659307408,-1.5444636325154564,)|
> |(linReg_cc934209aac1,16,100,18253,0.7844992170383625,101,[-0.4230952901291041,0.08770018558785676,0.2719402480140563,0.08602481376955884,0.8763149744964053,-)|
> blockSize=64|
> |(linReg_2de48672cf40,64,10,7956,12.600287516883563,11,[-1.1288062767198885,8.677674008655007,9.388511222751507,8.557805348264019,34.24136626551386,26.9649)|
> |(linReg_a4ed072bdf00,64,50,14423,1.7265674032944005,51,[-1.240937585330031,-0.36565823041213286,1.02717419529322,-0.5264376482700692,-1.5444634018412484,0.)|
> |(linReg_ed9bf8e6db3d,64,100,22680,0.7508904951409897,101,[-0.39923222418441695,0.2591603128603928,0.025707538173424214,0.06178131424518882,1.3651702157456522)|



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32060) Huber loss Convergence

2020-06-28 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147261#comment-17147261
 ] 

zhengruifeng commented on SPARK-32060:
--

{code:java}
import org.apache.spark.ml.regression._
import org.apache.spark.storage.StorageLevel
val df = spark.read.option("numFeatures", 
"2000").format("libsvm").load("/data1/Datasets/epsilon/epsilon_normalized.t")
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count

val lir = new 
LinearRegression().setMaxIter(200).setSolver("l-bfgs").setLoss("huber")

val results = Seq(1, 4, 16, 64, 256, 1024, 4096).map { size => val start = 
System.currentTimeMillis; val model = lir.setBlockSize(size).fit(df); val end = 
System.currentTimeMillis; (size, model, end - start) } {code}
 

model coef:
{code:java}
scala> results.map(_._2.coefficients).foreach(coef => 
println(coef.toString.take(200)))
[-0.1609083025667508,-0.1504208122473649,0.7857316265190127,0.1905294278240982,0.48613646504894936,-0.026194861709278365,0.590635887747112,0.03185142111622796,8.347531055523673,0.05032008235983659,0.0
[-0.14168611353422972,-0.09988761525554064,0.5465392380563737,0.1948729061499901,0.4763355879043651,-0.3012279914216939,0.6313906259537879,0.09533675545276975,10.461020810672274,0.15677230833505942,-0
[0.0129107378236514,-0.023733643262643805,0.7206248421409548,0.1281202961920889,0.6331850100541732,-0.07297545577093478,0.7943888663518902,0.1345404102446435,10.426743282094897,0.022989137878464405,0.
[0.030744371107965504,-0.18953315635218193,0.7474602191912736,0.1759290649344934,0.48334851886329333,-0.18612454543317197,0.623576899875435,0.10960148194302292,9.305819813630439,0.07680152463656026,-0
[0.06489015002773292,-0.2013517907421197,0.7090030134636589,0.05515361023479412,0.3904484093136326,0.11987256805921637,0.550217950324033,0.0557189628809737,7.24524505892832,-0.09041629158543917,0.0809
[-0.18300047132898184,-0.21732260127922864,0.8444018472270687,0.10275527109275327,0.07750772677176482,0.2282620884662859,0.5299055708518087,0.07284146396600312,7.7820378386877245,-0.014623101293592242
[-0.09575146808314546,-0.2307269364289983,0.8121553524047764,0.14527766692142594,0.4327749717709629,-0.024082387632074886,0.6239466285761414,0.03986689640912914,7.6761329131634435,-0.0369776197065
 {code}
 

objectiveHistory is also attached

 

> Huber loss Convergence
> --
>
> Key: SPARK-32060
> URL: https://issues.apache.org/jira/browse/SPARK-32060
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Minor
>
> |performance test in https://issues.apache.org/jira/browse/SPARK-31783,
>  Huber loss seems to start to diverge after 70 iters.
>   {code:java}
>  for (size <- Seq(1, 4, 16, 64); iter <- Seq(10, 50, 100)) {
>     Thread.sleep(1)
>     val hlir = new 
> LinearRegression().setLoss("huber").setSolver("l-bfgs").setMaxIter(iter).setTol(0)
>     val start = System.currentTimeMillis
>     val model = hlir.setBlockSize(size).fit(df)
>     val end = System.currentTimeMillis
>     println((model.uid, size, iter, end - start, 
> model.summary.objectiveHistory.last, model.summary.totalIterations, 
> model.coefficients.toString.take(100)))
> }{code}|
> | |
> | |
> | |
> | |
> | |
> | |
> | |
> | |
> |result:|
> |blockSize=1|
> |(linReg_887d29a0b42b,1,10,34222,12.600287516874573,11,[-1.128806276706593,8.677674008637235,9.388511222747894,8.55780534824698,34.241366265505654,26.96490)|
> |(linReg_fa87d52d3e2f,1,50,134017,1.7265674039265724,51,[-1.2409375311919224,-0.36565818648554393,1.0271741000977583,-0.5264376930209739,-1.544463380879014,)|
> |(linReg_b2a07f6fa653,1,100,259137,0.7519335552972538,101,[-0.3821288691282684,0.22040814987367136,0.07747613675383101,0.16130205219214436,1.2347926613828966,)|
> blockSize=4|
> |(linReg_779f6890aee9,4,10,7241,12.600287516879131,11,[-1.128806276706101,8.677674008649985,9.38851122275203,8.557805348259139,34.241366265511715,26.96490)|
> |(linReg_0e6d961e054f,4,50,11691,1.726567383577527,51,[-1.2409376473684588,-0.3656580427637058,1.0271741488856692,-0.5264377459728347,-1.5444635623477996,)|
> |(linReg_1e12fafab7d2,4,100,17966,0.796858465032771,101,[-0.014663920062692357,-0.057216366204118345,0.1764582527782608,0.12141286532514688,1.58266258533765)|
> blockSize=16|
> |(linReg_5ad195c843bb,16,10,7338,12.600287516896273,11,[-1.1288062767576779,8.677674008672964,9.388511222753797,8.557805348281347,34.24136626552257,26.9649)|
> |(linReg_686fe7849c42,16,50,12093,1.7265673762478049,51,[-1.2409376965631724,-0.3656579898205299,1.0271741857198382,-0.5264377659307408,-1.5444636325154564,)|
> |(linReg_cc934209aac1,16,100,18253,0.7844992170383625,101,[-0.4230952901291041,0.08770018558785676,0.2719402480140563,0.08602481376955884,0.8763149744964053,-)|
> blockSize=64|
> 

[jira] [Updated] (SPARK-32060) Huber loss Convergence

2020-06-28 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-32060:
-
Description: 
|performance test in https://issues.apache.org/jira/browse/SPARK-31783,
 Huber loss seems to start to diverge after 70 iters.
  {code:java}
 for (size <- Seq(1, 4, 16, 64); iter <- Seq(10, 50, 100)) {
    Thread.sleep(1)
    val hlir = new 
LinearRegression().setLoss("huber").setSolver("l-bfgs").setMaxIter(iter).setTol(0)
    val start = System.currentTimeMillis
    val model = hlir.setBlockSize(size).fit(df)
    val end = System.currentTimeMillis
    println((model.uid, size, iter, end - start, 
model.summary.objectiveHistory.last, model.summary.totalIterations, 
model.coefficients.toString.take(100)))
}{code}|
| |
| |
| |
| |
| |
| |
| |
| |
|result:|
|blockSize=1|
|(linReg_887d29a0b42b,1,10,34222,12.600287516874573,11,[-1.128806276706593,8.677674008637235,9.388511222747894,8.55780534824698,34.241366265505654,26.96490)|
|(linReg_fa87d52d3e2f,1,50,134017,1.7265674039265724,51,[-1.2409375311919224,-0.36565818648554393,1.0271741000977583,-0.5264376930209739,-1.544463380879014,)|
|(linReg_b2a07f6fa653,1,100,259137,0.7519335552972538,101,[-0.3821288691282684,0.22040814987367136,0.07747613675383101,0.16130205219214436,1.2347926613828966,)|

blockSize=4|
|(linReg_779f6890aee9,4,10,7241,12.600287516879131,11,[-1.128806276706101,8.677674008649985,9.38851122275203,8.557805348259139,34.241366265511715,26.96490)|
|(linReg_0e6d961e054f,4,50,11691,1.726567383577527,51,[-1.2409376473684588,-0.3656580427637058,1.0271741488856692,-0.5264377459728347,-1.5444635623477996,)|
|(linReg_1e12fafab7d2,4,100,17966,0.796858465032771,101,[-0.014663920062692357,-0.057216366204118345,0.1764582527782608,0.12141286532514688,1.58266258533765)|

blockSize=16|
|(linReg_5ad195c843bb,16,10,7338,12.600287516896273,11,[-1.1288062767576779,8.677674008672964,9.388511222753797,8.557805348281347,34.24136626552257,26.9649)|
|(linReg_686fe7849c42,16,50,12093,1.7265673762478049,51,[-1.2409376965631724,-0.3656579898205299,1.0271741857198382,-0.5264377659307408,-1.5444636325154564,)|
|(linReg_cc934209aac1,16,100,18253,0.7844992170383625,101,[-0.4230952901291041,0.08770018558785676,0.2719402480140563,0.08602481376955884,0.8763149744964053,-)|

blockSize=64|
|(linReg_2de48672cf40,64,10,7956,12.600287516883563,11,[-1.1288062767198885,8.677674008655007,9.388511222751507,8.557805348264019,34.24136626551386,26.9649)|
|(linReg_a4ed072bdf00,64,50,14423,1.7265674032944005,51,[-1.240937585330031,-0.36565823041213286,1.02717419529322,-0.5264376482700692,-1.5444634018412484,0.)|
|(linReg_ed9bf8e6db3d,64,100,22680,0.7508904951409897,101,[-0.39923222418441695,0.2591603128603928,0.025707538173424214,0.06178131424518882,1.3651702157456522)|

  was:
|performace test in https://issues.apache.org/jira/browse/SPARK-31783,
Huber loss seems start to diverge since 50 iters.
  {code:java}
 for (size <- Seq(1, 4, 16, 64); iter <- Seq(10, 50, 100)) {
    Thread.sleep(1)
    val hlir = new 
LinearRegression().setLoss("huber").setSolver("l-bfgs").setMaxIter(iter).setTol(0)
    val start = System.currentTimeMillis
    val model = hlir.setBlockSize(size).fit(df)
    val end = System.currentTimeMillis
    println((model.uid, size, iter, end - start, 
model.summary.objectiveHistory.last, model.summary.totalIterations, 
model.coefficients.toString.take(100)))
}{code}|
| |
| |
| |
| |
| |
| |
| |
| |
|result:|
|blockSize=1|
|(linReg_887d29a0b42b,1,10,34222,12.600287516874573,11,[-1.128806276706593,8.677674008637235,9.388511222747894,8.55780534824698,34.241366265505654,26.96490)|
|(linReg_fa87d52d3e2f,1,50,134017,1.7265674039265724,51,[-1.2409375311919224,-0.36565818648554393,1.0271741000977583,-0.5264376930209739,-1.544463380879014,)|
|(linReg_b2a07f6fa653,1,100,259137,0.7519335552972538,101,[-0.3821288691282684,0.22040814987367136,0.07747613675383101,0.16130205219214436,1.2347926613828966,)|

blockSize=4|
|(linReg_779f6890aee9,4,10,7241,12.600287516879131,11,[-1.128806276706101,8.677674008649985,9.38851122275203,8.557805348259139,34.241366265511715,26.96490)|
|(linReg_0e6d961e054f,4,50,11691,1.726567383577527,51,[-1.2409376473684588,-0.3656580427637058,1.0271741488856692,-0.5264377459728347,-1.5444635623477996,)|
|(linReg_1e12fafab7d2,4,100,17966,0.796858465032771,101,[-0.014663920062692357,-0.057216366204118345,0.1764582527782608,0.12141286532514688,1.58266258533765)|

blockSize=16|
|(linReg_5ad195c843bb,16,10,7338,12.600287516896273,11,[-1.1288062767576779,8.677674008672964,9.388511222753797,8.557805348281347,34.24136626552257,26.9649)|
|(linReg_686fe7849c42,16,50,12093,1.7265673762478049,51,[-1.2409376965631724,-0.3656579898205299,1.0271741857198382,-0.5264377659307408,-1.5444636325154564,)|
|(linReg_cc934209aac1,16,100,18253,0.7844992170383625,101,[-0.4230952901291041,0.08770018558785676,0.2719402480140563,0.08602481376955884,0.8763149744964053,-)|

blockSize=64|

[jira] [Commented] (SPARK-31851) Redesign PySpark documentation

2020-06-28 Thread Manish Khobragade (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147271#comment-17147271
 ] 

Manish Khobragade commented on SPARK-31851:
---

I would also like to help with this. 

> Redesign PySpark documentation
> --
>
> Key: SPARK-31851
> URL: https://issues.apache.org/jira/browse/SPARK-31851
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, PySpark, Spark Core, SQL, Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
>
> Currently, the PySpark documentation 
> (https://spark.apache.org/docs/latest/api/python/index.html) is rather 
> poorly written compared to other projects.
> See, for example, Koalas (https://koalas.readthedocs.io/en/latest/).
> PySpark is becoming more and more important in Spark, and we should improve this 
> documentation so people can easily follow it.
> Reference: 
> - https://koalas.readthedocs.io/en/latest/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32060) Huber loss Convergence

2020-06-28 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-32060:
-
Attachment: (was: huber.xlsx)

> Huber loss Convergence
> --
>
> Key: SPARK-32060
> URL: https://issues.apache.org/jira/browse/SPARK-32060
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Minor
>
> |performance test in https://issues.apache.org/jira/browse/SPARK-31783,
> Huber loss seems to start to diverge after 50 iters.
>   {code:java}
>  for (size <- Seq(1, 4, 16, 64); iter <- Seq(10, 50, 100)) {
>     Thread.sleep(1)
>     val hlir = new 
> LinearRegression().setLoss("huber").setSolver("l-bfgs").setMaxIter(iter).setTol(0)
>     val start = System.currentTimeMillis
>     val model = hlir.setBlockSize(size).fit(df)
>     val end = System.currentTimeMillis
>     println((model.uid, size, iter, end - start, 
> model.summary.objectiveHistory.last, model.summary.totalIterations, 
> model.coefficients.toString.take(100)))
> }{code}|
> | |
> | |
> | |
> | |
> | |
> | |
> | |
> | |
> |result:|
> |blockSize=1|
> |(linReg_887d29a0b42b,1,10,34222,12.600287516874573,11,[-1.128806276706593,8.677674008637235,9.388511222747894,8.55780534824698,34.241366265505654,26.96490)|
> |(linReg_fa87d52d3e2f,1,50,134017,1.7265674039265724,51,[-1.2409375311919224,-0.36565818648554393,1.0271741000977583,-0.5264376930209739,-1.544463380879014,)|
> |(linReg_b2a07f6fa653,1,100,259137,0.7519335552972538,101,[-0.3821288691282684,0.22040814987367136,0.07747613675383101,0.16130205219214436,1.2347926613828966,)|
> blockSize=4|
> |(linReg_779f6890aee9,4,10,7241,12.600287516879131,11,[-1.128806276706101,8.677674008649985,9.38851122275203,8.557805348259139,34.241366265511715,26.96490)|
> |(linReg_0e6d961e054f,4,50,11691,1.726567383577527,51,[-1.2409376473684588,-0.3656580427637058,1.0271741488856692,-0.5264377459728347,-1.5444635623477996,)|
> |(linReg_1e12fafab7d2,4,100,17966,0.796858465032771,101,[-0.014663920062692357,-0.057216366204118345,0.1764582527782608,0.12141286532514688,1.58266258533765)|
> blockSize=16|
> |(linReg_5ad195c843bb,16,10,7338,12.600287516896273,11,[-1.1288062767576779,8.677674008672964,9.388511222753797,8.557805348281347,34.24136626552257,26.9649)|
> |(linReg_686fe7849c42,16,50,12093,1.7265673762478049,51,[-1.2409376965631724,-0.3656579898205299,1.0271741857198382,-0.5264377659307408,-1.5444636325154564,)|
> |(linReg_cc934209aac1,16,100,18253,0.7844992170383625,101,[-0.4230952901291041,0.08770018558785676,0.2719402480140563,0.08602481376955884,0.8763149744964053,-)|
> blockSize=64|
> |(linReg_2de48672cf40,64,10,7956,12.600287516883563,11,[-1.1288062767198885,8.677674008655007,9.388511222751507,8.557805348264019,34.24136626551386,26.9649)|
> |(linReg_a4ed072bdf00,64,50,14423,1.7265674032944005,51,[-1.240937585330031,-0.36565823041213286,1.02717419529322,-0.5264376482700692,-1.5444634018412484,0.)|
> |(linReg_ed9bf8e6db3d,64,100,22680,0.7508904951409897,101,[-0.39923222418441695,0.2591603128603928,0.025707538173424214,0.06178131424518882,1.3651702157456522)|



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32060) Huber loss Convergence

2020-06-28 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147261#comment-17147261
 ] 

zhengruifeng edited comment on SPARK-32060 at 6/28/20, 8:34 AM:


{code:java}
import org.apache.spark.ml.regression._
import org.apache.spark.storage.StorageLevel
val df = spark.read.option("numFeatures", 
"2000").format("libsvm").load("/data1/Datasets/epsilon/epsilon_normalized.t")
df.persist(StorageLevel.MEMORY_AND_DISK)
df.countval lir = new 
LinearRegression().setMaxIter(200).setSolver("l-bfgs").setLoss("huber")

val results = Seq(1, 4, 16, 64, 256, 1024, 4096).map { size => val start = 
System.currentTimeMillis; val model = lir.setBlockSize(size).fit(df); val end = 
System.currentTimeMillis; (size, model, end - start) } {code}
 

model coef:
{code:java}
scala> results.map(_._2.coefficients).foreach(coef => 
println(coef.toString.take(200)))
[-0.1609083025667508,-0.1504208122473649,0.7857316265190127,0.1905294278240982,0.48613646504894936,-0.026194861709278365,0.590635887747112,0.03185142111622796,8.347531055523673,0.05032008235983659,0.0
[-0.14168611353422972,-0.09988761525554064,0.5465392380563737,0.1948729061499901,0.4763355879043651,-0.3012279914216939,0.6313906259537879,0.09533675545276975,10.461020810672274,0.15677230833505942,-0
[0.0129107378236514,-0.023733643262643805,0.7206248421409548,0.1281202961920889,0.6331850100541732,-0.07297545577093478,0.7943888663518902,0.1345404102446435,10.426743282094897,0.022989137878464405,0.
[0.030744371107965504,-0.18953315635218193,0.7474602191912736,0.1759290649344934,0.48334851886329333,-0.18612454543317197,0.623576899875435,0.10960148194302292,9.305819813630439,0.07680152463656026,-0
[0.06489015002773292,-0.2013517907421197,0.7090030134636589,0.05515361023479412,0.3904484093136326,0.11987256805921637,0.550217950324033,0.0557189628809737,7.24524505892832,-0.09041629158543917,0.0809
[-0.18300047132898184,-0.21732260127922864,0.8444018472270687,0.10275527109275327,0.07750772677176482,0.2282620884662859,0.5299055708518087,0.07284146396600312,7.7820378386877245,-0.014623101293592242
[-0.09575146808314546,-0.2307269364289983,0.8121553524047764,0.14527766692142594,0.4327749717709629,-0.024082387632074886,0.6239466285761414,0.03986689640912914,7.6761329131634435,-0.0369776197065
 {code}
 

objectiveHistory is also attached

 


was (Author: podongfeng):
{code:java}
import org.apache.spark.ml.regression._
import org.apache.spark.storage.StorageLevel
val df = spark.read.option("numFeatures", 
"2000").format("libsvm").load("/data1/Datasets/epsilon/epsilon_normalized.t")
df.persist(StorageLevel.MEMORY_AND_DISK)
df.countval lir = new 
LinearRegression().setMaxIter(200).setSolver("l-bfgs").setLoss("huber")val 
results = Seq(1, 4, 16, 64, 256, 1024, 4096).map { size => val start = 
System.currentTimeMillis; val model = lir.setBlockSize(size).fit(df); val end = 
System.currentTimeMillis; (size, model, end - start) } {code}
 

model coef:
{code:java}
scala> results.map(_._2.coefficients).foreach(coef => 
println(coef.toString.take(200)))
[-0.1609083025667508,-0.1504208122473649,0.7857316265190127,0.1905294278240982,0.48613646504894936,-0.026194861709278365,0.590635887747112,0.03185142111622796,8.347531055523673,0.05032008235983659,0.0
[-0.14168611353422972,-0.09988761525554064,0.5465392380563737,0.1948729061499901,0.4763355879043651,-0.3012279914216939,0.6313906259537879,0.09533675545276975,10.461020810672274,0.15677230833505942,-0
[0.0129107378236514,-0.023733643262643805,0.7206248421409548,0.1281202961920889,0.6331850100541732,-0.07297545577093478,0.7943888663518902,0.1345404102446435,10.426743282094897,0.022989137878464405,0.
[0.030744371107965504,-0.18953315635218193,0.7474602191912736,0.1759290649344934,0.48334851886329333,-0.18612454543317197,0.623576899875435,0.10960148194302292,9.305819813630439,0.07680152463656026,-0
[0.06489015002773292,-0.2013517907421197,0.7090030134636589,0.05515361023479412,0.3904484093136326,0.11987256805921637,0.550217950324033,0.0557189628809737,7.24524505892832,-0.09041629158543917,0.0809
[-0.18300047132898184,-0.21732260127922864,0.8444018472270687,0.10275527109275327,0.07750772677176482,0.2282620884662859,0.5299055708518087,0.07284146396600312,7.7820378386877245,-0.014623101293592242
[-0.09575146808314546,-0.2307269364289983,0.8121553524047764,0.14527766692142594,0.4327749717709629,-0.024082387632074886,0.6239466285761414,0.03986689640912914,7.6761329131634435,-0.0369776197065
 {code}
 

objectiveHistory is also attached

 

> Huber loss Convergence
> --
>
> Key: SPARK-32060
> URL: https://issues.apache.org/jira/browse/SPARK-32060
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Minor
> Attachments: huber.xlsx
>
>
> 

[jira] [Comment Edited] (SPARK-32060) Huber loss Convergence

2020-06-28 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147261#comment-17147261
 ] 

zhengruifeng edited comment on SPARK-32060 at 6/28/20, 8:34 AM:


{code:java}
import org.apache.spark.ml.regression._
import org.apache.spark.storage.StorageLevel
val df = spark.read.option("numFeatures", 
"2000").format("libsvm").load("/data1/Datasets/epsilon/epsilon_normalized.t")
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count

val lir = new 
LinearRegression().setMaxIter(200).setSolver("l-bfgs").setLoss("huber")

val results = Seq(1, 4, 16, 64, 256, 1024, 4096).map { size => val start = 
System.currentTimeMillis; val model = lir.setBlockSize(size).fit(df); val end = 
System.currentTimeMillis; (size, model, end - start) } {code}
 

model coef:
{code:java}
scala> results.map(_._2.coefficients).foreach(coef => 
println(coef.toString.take(200)))
[-0.1609083025667508,-0.1504208122473649,0.7857316265190127,0.1905294278240982,0.48613646504894936,-0.026194861709278365,0.590635887747112,0.03185142111622796,8.347531055523673,0.05032008235983659,0.0
[-0.14168611353422972,-0.09988761525554064,0.5465392380563737,0.1948729061499901,0.4763355879043651,-0.3012279914216939,0.6313906259537879,0.09533675545276975,10.461020810672274,0.15677230833505942,-0
[0.0129107378236514,-0.023733643262643805,0.7206248421409548,0.1281202961920889,0.6331850100541732,-0.07297545577093478,0.7943888663518902,0.1345404102446435,10.426743282094897,0.022989137878464405,0.
[0.030744371107965504,-0.18953315635218193,0.7474602191912736,0.1759290649344934,0.48334851886329333,-0.18612454543317197,0.623576899875435,0.10960148194302292,9.305819813630439,0.07680152463656026,-0
[0.06489015002773292,-0.2013517907421197,0.7090030134636589,0.05515361023479412,0.3904484093136326,0.11987256805921637,0.550217950324033,0.0557189628809737,7.24524505892832,-0.09041629158543917,0.0809
[-0.18300047132898184,-0.21732260127922864,0.8444018472270687,0.10275527109275327,0.07750772677176482,0.2282620884662859,0.5299055708518087,0.07284146396600312,7.7820378386877245,-0.014623101293592242
[-0.09575146808314546,-0.2307269364289983,0.8121553524047764,0.14527766692142594,0.4327749717709629,-0.024082387632074886,0.6239466285761414,0.03986689640912914,7.6761329131634435,-0.0369776197065
 {code}
 

objectiveHistory is also attached

 


was (Author: podongfeng):
{code:java}
import org.apache.spark.ml.regression._
import org.apache.spark.storage.StorageLevel
val df = spark.read.option("numFeatures", 
"2000").format("libsvm").load("/data1/Datasets/epsilon/epsilon_normalized.t")
df.persist(StorageLevel.MEMORY_AND_DISK)
df.countval lir = new 
LinearRegression().setMaxIter(200).setSolver("l-bfgs").setLoss("huber")

val results = Seq(1, 4, 16, 64, 256, 1024, 4096).map { size => val start = 
System.currentTimeMillis; val model = lir.setBlockSize(size).fit(df); val end = 
System.currentTimeMillis; (size, model, end - start) } {code}
 

model coef:
{code:java}
scala> results.map(_._2.coefficients).foreach(coef => 
println(coef.toString.take(200)))
[-0.1609083025667508,-0.1504208122473649,0.7857316265190127,0.1905294278240982,0.48613646504894936,-0.026194861709278365,0.590635887747112,0.03185142111622796,8.347531055523673,0.05032008235983659,0.0
[-0.14168611353422972,-0.09988761525554064,0.5465392380563737,0.1948729061499901,0.4763355879043651,-0.3012279914216939,0.6313906259537879,0.09533675545276975,10.461020810672274,0.15677230833505942,-0
[0.0129107378236514,-0.023733643262643805,0.7206248421409548,0.1281202961920889,0.6331850100541732,-0.07297545577093478,0.7943888663518902,0.1345404102446435,10.426743282094897,0.022989137878464405,0.
[0.030744371107965504,-0.18953315635218193,0.7474602191912736,0.1759290649344934,0.48334851886329333,-0.18612454543317197,0.623576899875435,0.10960148194302292,9.305819813630439,0.07680152463656026,-0
[0.06489015002773292,-0.2013517907421197,0.7090030134636589,0.05515361023479412,0.3904484093136326,0.11987256805921637,0.550217950324033,0.0557189628809737,7.24524505892832,-0.09041629158543917,0.0809
[-0.18300047132898184,-0.21732260127922864,0.8444018472270687,0.10275527109275327,0.07750772677176482,0.2282620884662859,0.5299055708518087,0.07284146396600312,7.7820378386877245,-0.014623101293592242
[-0.09575146808314546,-0.2307269364289983,0.8121553524047764,0.14527766692142594,0.4327749717709629,-0.024082387632074886,0.6239466285761414,0.03986689640912914,7.6761329131634435,-0.0369776197065
 {code}
 

objectiveHistory is also attached

 

> Huber loss Convergence
> --
>
> Key: SPARK-32060
> URL: https://issues.apache.org/jira/browse/SPARK-32060
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Minor
> Attachments: huber.xlsx
>
>
> 

[jira] [Commented] (SPARK-32108) Silent mode of spark-sql is broken

2020-06-28 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147264#comment-17147264
 ] 

Lantao Jin commented on SPARK-32108:


[~maxgekk] I think it works. The INFO logs are only printed during spark-sql startup.

> Silent mode of spark-sql is broken
> --
>
> Key: SPARK-32108
> URL: https://issues.apache.org/jira/browse/SPARK-32108
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> 1. I download the recent release Spark 3.0 from 
> http://spark.apache.org/downloads.html
> 2. Run bin/spark-sql -S, it prints a lot of INFO
> {code}
> ➜  ~ ./spark-3.0/bin/spark-sql -S
> 20/06/26 20:43:38 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> log4j:WARN No appenders could be found for logger 
> (org.apache.hadoop.hive.conf.HiveConf).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
> info.
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 20/06/26 20:43:39 INFO SharedState: spark.sql.warehouse.dir is not set, but 
> hive.metastore.warehouse.dir is set. Setting spark.sql.warehouse.dir to the 
> value of hive.metastore.warehouse.dir ('/user/hive/warehouse').
> 20/06/26 20:43:39 INFO SharedState: Warehouse path is '/user/hive/warehouse'.
> 20/06/26 20:43:39 INFO SessionState: Created HDFS directory: 
> /tmp/hive/maximgekk/a47e882c-86a3-42b9-b43f-9dab0dd8492a
> 20/06/26 20:43:39 INFO SessionState: Created local directory: 
> /var/folders/p3/dfs6mf655d7fnjrsjvldh0tcgn/T/maximgekk/a47e882c-86a3-42b9-b43f-9dab0dd8492a
> 20/06/26 20:43:39 INFO SessionState: Created HDFS directory: 
> /tmp/hive/maximgekk/a47e882c-86a3-42b9-b43f-9dab0dd8492a/_tmp_space.db
> 20/06/26 20:43:39 INFO SparkContext: Running Spark version 3.0.0
> 20/06/26 20:43:39 INFO ResourceUtils: 
> ==
> 20/06/26 20:43:39 INFO ResourceUtils: Resources for spark.driver:
> 20/06/26 20:43:39 INFO ResourceUtils: 
> ==
> 20/06/26 20:43:39 INFO SparkContext: Submitted application: 
> SparkSQL::192.168.1.78
> 20/06/26 20:43:39 INFO SecurityManager: Changing view acls to: maximgekk
> 20/06/26 20:43:39 INFO SecurityManager: Changing modify acls to: maximgekk
> 20/06/26 20:43:39 INFO SecurityManager: Changing view acls groups to:
> 20/06/26 20:43:39 INFO SecurityManager: Changing modify acls groups to:
> 20/06/26 20:43:39 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users  with view permissions: Set(maximgekk); 
> groups with view permissions: Set(); users  with modify permissions: 
> Set(maximgekk); groups with modify permissions: Set()
> 20/06/26 20:43:39 INFO Utils: Successfully started service 'sparkDriver' on 
> port 59414.
> 20/06/26 20:43:39 INFO SparkEnv: Registering MapOutputTracker
> 20/06/26 20:43:39 INFO SparkEnv: Registering BlockManagerMaster
> 20/06/26 20:43:39 INFO BlockManagerMasterEndpoint: Using 
> org.apache.spark.storage.DefaultTopologyMapper for getting topology 
> information
> 20/06/26 20:43:39 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint 
> up
> 20/06/26 20:43:39 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
> 20/06/26 20:43:39 INFO DiskBlockManager: Created local directory at 
> /private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tcgn/T/blockmgr-c1d041ad-dd46-4d11-bbd0-e8ba27d3bf69
> 20/06/26 20:43:39 INFO MemoryStore: MemoryStore started with capacity 408.9 
> MiB
> 20/06/26 20:43:39 INFO SparkEnv: Registering OutputCommitCoordinator
> 20/06/26 20:43:40 INFO Utils: Successfully started service 'SparkUI' on port 
> 4040.
> 20/06/26 20:43:40 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at 
> http://192.168.1.78:4040
> 20/06/26 20:43:40 INFO Executor: Starting executor ID driver on host 
> 192.168.1.78
> 20/06/26 20:43:40 INFO Utils: Successfully started service 
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 59415.
> 20/06/26 20:43:40 INFO NettyBlockTransferService: Server created on 
> 192.168.1.78:59415
> 20/06/26 20:43:40 INFO BlockManager: Using 
> org.apache.spark.storage.RandomBlockReplicationPolicy for block replication 
> policy
> 20/06/26 20:43:40 INFO BlockManagerMaster: Registering BlockManager 
> BlockManagerId(driver, 192.168.1.78, 59415, None)
> 20/06/26 20:43:40 INFO BlockManagerMasterEndpoint: Registering block manager 
> 192.168.1.78:59415 with 408.9 MiB RAM, BlockManagerId(driver, 192.168.1.78, 
> 59415, None)
> 20/06/26 20:43:40 INFO BlockManagerMaster: Registered BlockManager 
> BlockManagerId(driver, 192.168.1.78, 59415, None)

[jira] [Resolved] (SPARK-32117) Thread spark-listener-group-streams is cpu costing

2020-06-28 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin resolved SPARK-32117.

Resolution: Won't Fix

> Thread spark-listener-group-streams is cpu costing
> --
>
> Key: SPARK-32117
> URL: https://issues.apache.org/jira/browse/SPARK-32117
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> In a busy driver (OLAP), the thread spark-listener-group-streams is consuming 
> significant CPU even in a non-streaming application.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32117) Thread spark-listener-group-streams is cpu costing

2020-06-28 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147278#comment-17147278
 ] 

Lantao Jin commented on SPARK-32117:


I think it might be fixed by SPARK-29423

> Thread spark-listener-group-streams is cpu costing
> --
>
> Key: SPARK-32117
> URL: https://issues.apache.org/jira/browse/SPARK-32117
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> In a busy driver (OLAP), the thread spark-listener-group-streams is consuming 
> significant CPU even in a non-streaming application.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32060) Huber loss Convergence

2020-06-28 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-32060:
-
Attachment: image-2020-06-28-18-05-28-867.png

> Huber loss Convergence
> --
>
> Key: SPARK-32060
> URL: https://issues.apache.org/jira/browse/SPARK-32060
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Minor
> Attachments: huber.xlsx, image-2020-06-28-18-05-28-867.png
>
>
> |performance test in https://issues.apache.org/jira/browse/SPARK-31783,
>  Huber loss seems to start to diverge after 70 iters.
>   {code:java}
>  for (size <- Seq(1, 4, 16, 64); iter <- Seq(10, 50, 100)) {
>     Thread.sleep(1)
>     val hlir = new 
> LinearRegression().setLoss("huber").setSolver("l-bfgs").setMaxIter(iter).setTol(0)
>     val start = System.currentTimeMillis
>     val model = hlir.setBlockSize(size).fit(df)
>     val end = System.currentTimeMillis
>     println((model.uid, size, iter, end - start, 
> model.summary.objectiveHistory.last, model.summary.totalIterations, 
> model.coefficients.toString.take(100)))
> }{code}|
> | |
> | |
> | |
> | |
> | |
> | |
> | |
> | |
> |result:|
> |blockSize=1|
> |(linReg_887d29a0b42b,1,10,34222,12.600287516874573,11,[-1.128806276706593,8.677674008637235,9.388511222747894,8.55780534824698,34.241366265505654,26.96490)|
> |(linReg_fa87d52d3e2f,1,50,134017,1.7265674039265724,51,[-1.2409375311919224,-0.36565818648554393,1.0271741000977583,-0.5264376930209739,-1.544463380879014,)|
> |(linReg_b2a07f6fa653,1,100,259137,0.7519335552972538,101,[-0.3821288691282684,0.22040814987367136,0.07747613675383101,0.16130205219214436,1.2347926613828966,)|
> blockSize=4|
> |(linReg_779f6890aee9,4,10,7241,12.600287516879131,11,[-1.128806276706101,8.677674008649985,9.38851122275203,8.557805348259139,34.241366265511715,26.96490)|
> |(linReg_0e6d961e054f,4,50,11691,1.726567383577527,51,[-1.2409376473684588,-0.3656580427637058,1.0271741488856692,-0.5264377459728347,-1.5444635623477996,)|
> |(linReg_1e12fafab7d2,4,100,17966,0.796858465032771,101,[-0.014663920062692357,-0.057216366204118345,0.1764582527782608,0.12141286532514688,1.58266258533765)|
> blockSize=16|
> |(linReg_5ad195c843bb,16,10,7338,12.600287516896273,11,[-1.1288062767576779,8.677674008672964,9.388511222753797,8.557805348281347,34.24136626552257,26.9649)|
> |(linReg_686fe7849c42,16,50,12093,1.7265673762478049,51,[-1.2409376965631724,-0.3656579898205299,1.0271741857198382,-0.5264377659307408,-1.5444636325154564,)|
> |(linReg_cc934209aac1,16,100,18253,0.7844992170383625,101,[-0.4230952901291041,0.08770018558785676,0.2719402480140563,0.08602481376955884,0.8763149744964053,-)|
> blockSize=64|
> |(linReg_2de48672cf40,64,10,7956,12.600287516883563,11,[-1.1288062767198885,8.677674008655007,9.388511222751507,8.557805348264019,34.24136626551386,26.9649)|
> |(linReg_a4ed072bdf00,64,50,14423,1.7265674032944005,51,[-1.240937585330031,-0.36565823041213286,1.02717419529322,-0.5264376482700692,-1.5444634018412484,0.)|
> |(linReg_ed9bf8e6db3d,64,100,22680,0.7508904951409897,101,[-0.39923222418441695,0.2591603128603928,0.025707538173424214,0.06178131424518882,1.3651702157456522)|



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32060) Huber loss Convergence

2020-06-28 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147295#comment-17147295
 ] 

zhengruifeng commented on SPARK-32060:
--

According to the convergence curves for different blockSize values, the objective 
value starts to diverge around iter=70, but finally converges to 0.67~0.68 at iter=200.

As to the solutions, the coefficients look different.

 

Refer to [https://en.wikipedia.org/wiki/Least_absolute_deviations]:

*L1-Loss is robust, but not stable (possibly multiple solutions); L2-Loss is 
not very robust, but is stable (always one solution)*

Huber is a mix of both L1-Loss and L2-Loss: at each iteration, some instances 
are used with L1-Loss, while others with L2-Loss. So I personally think Huber 
sits between L1-Loss and L2-Loss, and there may be multiple solutions in Huber 
regression.
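
For reference, a minimal sketch of the standard Huber objective on a single residual (textbook definition with threshold epsilon, not Spark's exact internal formulation): it is quadratic (L2-like) near zero and linear (L1-like) in the tails.
{code:scala}
// Standard Huber loss for one residual r with threshold epsilon:
// quadratic for |r| <= epsilon, linear beyond it.
def huberLoss(r: Double, epsilon: Double): Double =
  if (math.abs(r) <= epsilon) 0.5 * r * r
  else epsilon * (math.abs(r) - 0.5 * epsilon)
{code}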

 

ping [~weichenxu123]

 

 

> Huber loss Convergence
> --
>
> Key: SPARK-32060
> URL: https://issues.apache.org/jira/browse/SPARK-32060
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Minor
> Attachments: huber.xlsx, image-2020-06-28-18-05-28-867.png
>
>
> |performance test in https://issues.apache.org/jira/browse/SPARK-31783,
>  Huber loss seems to start to diverge after 70 iters.
>   {code:java}
>  for (size <- Seq(1, 4, 16, 64); iter <- Seq(10, 50, 100)) {
>     Thread.sleep(1)
>     val hlir = new 
> LinearRegression().setLoss("huber").setSolver("l-bfgs").setMaxIter(iter).setTol(0)
>     val start = System.currentTimeMillis
>     val model = hlir.setBlockSize(size).fit(df)
>     val end = System.currentTimeMillis
>     println((model.uid, size, iter, end - start, 
> model.summary.objectiveHistory.last, model.summary.totalIterations, 
> model.coefficients.toString.take(100)))
> }{code}|
> | |
> | |
> | |
> | |
> | |
> | |
> | |
> | |
> |result:|
> |blockSize=1|
> |(linReg_887d29a0b42b,1,10,34222,12.600287516874573,11,[-1.128806276706593,8.677674008637235,9.388511222747894,8.55780534824698,34.241366265505654,26.96490)|
> |(linReg_fa87d52d3e2f,1,50,134017,1.7265674039265724,51,[-1.2409375311919224,-0.36565818648554393,1.0271741000977583,-0.5264376930209739,-1.544463380879014,)|
> |(linReg_b2a07f6fa653,1,100,259137,0.7519335552972538,101,[-0.3821288691282684,0.22040814987367136,0.07747613675383101,0.16130205219214436,1.2347926613828966,)|
> blockSize=4|
> |(linReg_779f6890aee9,4,10,7241,12.600287516879131,11,[-1.128806276706101,8.677674008649985,9.38851122275203,8.557805348259139,34.241366265511715,26.96490)|
> |(linReg_0e6d961e054f,4,50,11691,1.726567383577527,51,[-1.2409376473684588,-0.3656580427637058,1.0271741488856692,-0.5264377459728347,-1.5444635623477996,)|
> |(linReg_1e12fafab7d2,4,100,17966,0.796858465032771,101,[-0.014663920062692357,-0.057216366204118345,0.1764582527782608,0.12141286532514688,1.58266258533765)|
> blockSize=16|
> |(linReg_5ad195c843bb,16,10,7338,12.600287516896273,11,[-1.1288062767576779,8.677674008672964,9.388511222753797,8.557805348281347,34.24136626552257,26.9649)|
> |(linReg_686fe7849c42,16,50,12093,1.7265673762478049,51,[-1.2409376965631724,-0.3656579898205299,1.0271741857198382,-0.5264377659307408,-1.5444636325154564,)|
> |(linReg_cc934209aac1,16,100,18253,0.7844992170383625,101,[-0.4230952901291041,0.08770018558785676,0.2719402480140563,0.08602481376955884,0.8763149744964053,-)|
> blockSize=64|
> |(linReg_2de48672cf40,64,10,7956,12.600287516883563,11,[-1.1288062767198885,8.677674008655007,9.388511222751507,8.557805348264019,34.24136626551386,26.9649)|
> |(linReg_a4ed072bdf00,64,50,14423,1.7265674032944005,51,[-1.240937585330031,-0.36565823041213286,1.02717419529322,-0.5264376482700692,-1.5444634018412484,0.)|
> |(linReg_ed9bf8e6db3d,64,100,22680,0.7508904951409897,101,[-0.39923222418441695,0.2591603128603928,0.025707538173424214,0.06178131424518882,1.3651702157456522)|



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32018) Fix UnsafeRow set overflowed decimal

2020-06-28 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147313#comment-17147313
 ] 

angerszhu commented on SPARK-32018:
---

[~allisonwang-db] 

Can you show a test case to reproduce this?

> Fix UnsafeRow set overflowed decimal
> 
>
> Key: SPARK-32018
> URL: https://issues.apache.org/jira/browse/SPARK-32018
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Allison Wang
>Priority: Major
>
> There is a bug where writing an overflowed decimal into UnsafeRow succeeds, but 
> reading it back throws an ArithmeticException. The exception is thrown when 
> calling {{getDecimal}} on UnsafeRow with a decimal whose precision is greater 
> than the requested precision. Setting the value of the overflowed decimal to null 
> when writing into UnsafeRow should fix this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32112) Easier way to repartition/coalesce DataFrames based on the number of parallel tasks that Spark can process at the same time

2020-06-28 Thread Noritaka Sekiyama (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Noritaka Sekiyama updated SPARK-32112:
--
Description: 
Repartition/coalesce is very important for optimizing a Spark application's 
performance; however, many users struggle with determining the number of 
partitions.
 This issue is to add an easier way to repartition/coalesce DataFrames based on 
the number of parallel tasks that Spark can process at the same time.

It will help Spark users determine the optimal number of partitions.

Expected use-cases:
 - repartition with the calculated number of parallel tasks

Notes:
 - `SparkContext.maxNumConcurrentTasks` might help, but it cannot be accessed by 
Spark apps.
 - `SparkContext.getExecutorMemoryStatus` might help to calculate the number of 
available slots for processing tasks.

  was:
Repartition/coalesce is very important to optimize Spark application's 
performance, however, a lot of users are struggling with determining the number 
of partitions.
 This issue is to add a easier way to repartition/coalesce DataFrames based on 
the number of parallel tasks that Spark can process at the same time.

It will help Spark users to determine the optimal number of partitions.

Expected use-cases:
 - repartition with the calculated parallel tasks

 

There is `SparkContext.maxNumConcurrentTasks` but it cannot be accessed by 
Spark apps.


> Easier way to repartition/coalesce DataFrames based on the number of parallel 
> tasks that Spark can process at the same time
> ---
>
> Key: SPARK-32112
> URL: https://issues.apache.org/jira/browse/SPARK-32112
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Noritaka Sekiyama
>Priority: Major
>
> Repartition/coalesce is very important for optimizing a Spark application's 
> performance; however, many users struggle with determining the 
> number of partitions.
>  This issue is to add an easier way to repartition/coalesce DataFrames based 
> on the number of parallel tasks that Spark can process at the same time.
> It will help Spark users determine the optimal number of partitions.
> Expected use-cases:
>  - repartition with the calculated number of parallel tasks
> Notes:
>  - `SparkContext.maxNumConcurrentTasks` might help, but it cannot be accessed 
> by Spark apps.
>  - `SparkContext.getExecutorMemoryStatus` might help to calculate the number 
> of available slots for processing tasks.
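
A rough, untested sketch of the kind of helper this issue asks for (the helper name is made up; it assumes uniform executor sizing and uses only the existing `SparkContext.getExecutorMemoryStatus` developer API plus the `spark.executor.cores` / `spark.task.cpus` settings):
{code}
import org.apache.spark.sql.SparkSession

// Hypothetical helper, not an existing API: estimate how many tasks can run
// in parallel right now and use that as the partition count.
def estimatedParallelTasks(spark: SparkSession): Int = {
  val sc = spark.sparkContext
  val executorCores = sc.getConf.getInt("spark.executor.cores", 1)
  val cpusPerTask = sc.getConf.getInt("spark.task.cpus", 1)
  // getExecutorMemoryStatus also lists the driver, so subtract it (best effort).
  val numExecutors = math.max(sc.getExecutorMemoryStatus.size - 1, 1)
  math.max(numExecutors * executorCores / cpusPerTask, 1)
}

val df = spark.range(0, 1000000).toDF("id")
val repartitioned = df.repartition(estimatedParallelTasks(spark))
{code}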



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32060) Huber loss Convergence

2020-06-28 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-32060:
-
Attachment: huber.xlsx

> Huber loss Convergence
> --
>
> Key: SPARK-32060
> URL: https://issues.apache.org/jira/browse/SPARK-32060
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Minor
> Attachments: huber.xlsx
>
>
> |performance test in https://issues.apache.org/jira/browse/SPARK-31783,
> Huber loss seems to start diverging after 50 iterations.
>   {code:java}
>  for (size <- Seq(1, 4, 16, 64); iter <- Seq(10, 50, 100)) {
>     Thread.sleep(1)
>     val hlir = new 
> LinearRegression().setLoss("huber").setSolver("l-bfgs").setMaxIter(iter).setTol(0)
>     val start = System.currentTimeMillis
>     val model = hlir.setBlockSize(size).fit(df)
>     val end = System.currentTimeMillis
>     println((model.uid, size, iter, end - start, 
> model.summary.objectiveHistory.last, model.summary.totalIterations, 
> model.coefficients.toString.take(100)))
> }{code}|
> |result:|
> |blockSize=1|
> |(linReg_887d29a0b42b,1,10,34222,12.600287516874573,11,[-1.128806276706593,8.677674008637235,9.388511222747894,8.55780534824698,34.241366265505654,26.96490)|
> |(linReg_fa87d52d3e2f,1,50,134017,1.7265674039265724,51,[-1.2409375311919224,-0.36565818648554393,1.0271741000977583,-0.5264376930209739,-1.544463380879014,)|
> |(linReg_b2a07f6fa653,1,100,259137,0.7519335552972538,101,[-0.3821288691282684,0.22040814987367136,0.07747613675383101,0.16130205219214436,1.2347926613828966,)|
> blockSize=4|
> |(linReg_779f6890aee9,4,10,7241,12.600287516879131,11,[-1.128806276706101,8.677674008649985,9.38851122275203,8.557805348259139,34.241366265511715,26.96490)|
> |(linReg_0e6d961e054f,4,50,11691,1.726567383577527,51,[-1.2409376473684588,-0.3656580427637058,1.0271741488856692,-0.5264377459728347,-1.5444635623477996,)|
> |(linReg_1e12fafab7d2,4,100,17966,0.796858465032771,101,[-0.014663920062692357,-0.057216366204118345,0.1764582527782608,0.12141286532514688,1.58266258533765)|
> blockSize=16|
> |(linReg_5ad195c843bb,16,10,7338,12.600287516896273,11,[-1.1288062767576779,8.677674008672964,9.388511222753797,8.557805348281347,34.24136626552257,26.9649)|
> |(linReg_686fe7849c42,16,50,12093,1.7265673762478049,51,[-1.2409376965631724,-0.3656579898205299,1.0271741857198382,-0.5264377659307408,-1.5444636325154564,)|
> |(linReg_cc934209aac1,16,100,18253,0.7844992170383625,101,[-0.4230952901291041,0.08770018558785676,0.2719402480140563,0.08602481376955884,0.8763149744964053,-)|
> blockSize=64|
> |(linReg_2de48672cf40,64,10,7956,12.600287516883563,11,[-1.1288062767198885,8.677674008655007,9.388511222751507,8.557805348264019,34.24136626551386,26.9649)|
> |(linReg_a4ed072bdf00,64,50,14423,1.7265674032944005,51,[-1.240937585330031,-0.36565823041213286,1.02717419529322,-0.5264376482700692,-1.5444634018412484,0.)|
> |(linReg_ed9bf8e6db3d,64,100,22680,0.7508904951409897,101,[-0.39923222418441695,0.2591603128603928,0.025707538173424214,0.06178131424518882,1.3651702157456522)|



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32117) Thread spark-listener-group-streams is cpu costing

2020-06-28 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-32117:
--

 Summary: Thread spark-listener-group-streams is cpu costing
 Key: SPARK-32117
 URL: https://issues.apache.org/jira/browse/SPARK-32117
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Lantao Jin


In a busy driver (OLAP), the thread spark-listener-group-streams consumes 
significant CPU even in a non-streaming application.
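
One way to confirm the symptom is to dump the CPU time of the listener-bus dispatcher threads inside the driver JVM (a sketch using only the standard JDK ThreadMXBean API; thread CPU time measurement must be supported and enabled on the JVM, otherwise -1 is reported):
{code}
import java.lang.management.ManagementFactory

// Print accumulated CPU time for every spark-listener-group-* dispatcher thread.
val threads = ManagementFactory.getThreadMXBean
threads.getAllThreadIds
  .flatMap(id => Option(threads.getThreadInfo(id)).map(info => (id, info)))
  .filter { case (_, info) => info.getThreadName.startsWith("spark-listener-group") }
  .foreach { case (id, info) =>
    println(s"${info.getThreadName}: ${threads.getThreadCpuTime(id) / 1e6} ms of CPU")
  }
{code}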



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32060) Huber loss Convergence

2020-06-28 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-32060:
-
Attachment: huber.xlsx

> Huber loss Convergence
> --
>
> Key: SPARK-32060
> URL: https://issues.apache.org/jira/browse/SPARK-32060
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Minor
> Attachments: huber.xlsx
>
>
> |performance test in https://issues.apache.org/jira/browse/SPARK-31783,
>  Huber loss seems to start diverging after 70 iterations.
>   {code:java}
>  for (size <- Seq(1, 4, 16, 64); iter <- Seq(10, 50, 100)) {
>     Thread.sleep(1)
>     val hlir = new 
> LinearRegression().setLoss("huber").setSolver("l-bfgs").setMaxIter(iter).setTol(0)
>     val start = System.currentTimeMillis
>     val model = hlir.setBlockSize(size).fit(df)
>     val end = System.currentTimeMillis
>     println((model.uid, size, iter, end - start, 
> model.summary.objectiveHistory.last, model.summary.totalIterations, 
> model.coefficients.toString.take(100)))
> }{code}|
> |result:|
> |blockSize=1|
> |(linReg_887d29a0b42b,1,10,34222,12.600287516874573,11,[-1.128806276706593,8.677674008637235,9.388511222747894,8.55780534824698,34.241366265505654,26.96490)|
> |(linReg_fa87d52d3e2f,1,50,134017,1.7265674039265724,51,[-1.2409375311919224,-0.36565818648554393,1.0271741000977583,-0.5264376930209739,-1.544463380879014,)|
> |(linReg_b2a07f6fa653,1,100,259137,0.7519335552972538,101,[-0.3821288691282684,0.22040814987367136,0.07747613675383101,0.16130205219214436,1.2347926613828966,)|
> blockSize=4|
> |(linReg_779f6890aee9,4,10,7241,12.600287516879131,11,[-1.128806276706101,8.677674008649985,9.38851122275203,8.557805348259139,34.241366265511715,26.96490)|
> |(linReg_0e6d961e054f,4,50,11691,1.726567383577527,51,[-1.2409376473684588,-0.3656580427637058,1.0271741488856692,-0.5264377459728347,-1.5444635623477996,)|
> |(linReg_1e12fafab7d2,4,100,17966,0.796858465032771,101,[-0.014663920062692357,-0.057216366204118345,0.1764582527782608,0.12141286532514688,1.58266258533765)|
> blockSize=16|
> |(linReg_5ad195c843bb,16,10,7338,12.600287516896273,11,[-1.1288062767576779,8.677674008672964,9.388511222753797,8.557805348281347,34.24136626552257,26.9649)|
> |(linReg_686fe7849c42,16,50,12093,1.7265673762478049,51,[-1.2409376965631724,-0.3656579898205299,1.0271741857198382,-0.5264377659307408,-1.5444636325154564,)|
> |(linReg_cc934209aac1,16,100,18253,0.7844992170383625,101,[-0.4230952901291041,0.08770018558785676,0.2719402480140563,0.08602481376955884,0.8763149744964053,-)|
> blockSize=64|
> |(linReg_2de48672cf40,64,10,7956,12.600287516883563,11,[-1.1288062767198885,8.677674008655007,9.388511222751507,8.557805348264019,34.24136626551386,26.9649)|
> |(linReg_a4ed072bdf00,64,50,14423,1.7265674032944005,51,[-1.240937585330031,-0.36565823041213286,1.02717419529322,-0.5264376482700692,-1.5444634018412484,0.)|
> |(linReg_ed9bf8e6db3d,64,100,22680,0.7508904951409897,101,[-0.39923222418441695,0.2591603128603928,0.025707538173424214,0.06178131424518882,1.3651702157456522)|



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32060) Huber loss Convergence

2020-06-28 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-32060:
-
Attachment: (was: huber.xlsx)

> Huber loss Convergence
> --
>
> Key: SPARK-32060
> URL: https://issues.apache.org/jira/browse/SPARK-32060
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Minor
> Attachments: huber.xlsx
>
>
> |performance test in https://issues.apache.org/jira/browse/SPARK-31783,
>  Huber loss seems to start diverging after 70 iterations.
>   {code:java}
>  for (size <- Seq(1, 4, 16, 64); iter <- Seq(10, 50, 100)) {
>     Thread.sleep(1)
>     val hlir = new 
> LinearRegression().setLoss("huber").setSolver("l-bfgs").setMaxIter(iter).setTol(0)
>     val start = System.currentTimeMillis
>     val model = hlir.setBlockSize(size).fit(df)
>     val end = System.currentTimeMillis
>     println((model.uid, size, iter, end - start, 
> model.summary.objectiveHistory.last, model.summary.totalIterations, 
> model.coefficients.toString.take(100)))
> }{code}|
> |result:|
> |blockSize=1|
> |(linReg_887d29a0b42b,1,10,34222,12.600287516874573,11,[-1.128806276706593,8.677674008637235,9.388511222747894,8.55780534824698,34.241366265505654,26.96490)|
> |(linReg_fa87d52d3e2f,1,50,134017,1.7265674039265724,51,[-1.2409375311919224,-0.36565818648554393,1.0271741000977583,-0.5264376930209739,-1.544463380879014,)|
> |(linReg_b2a07f6fa653,1,100,259137,0.7519335552972538,101,[-0.3821288691282684,0.22040814987367136,0.07747613675383101,0.16130205219214436,1.2347926613828966,)|
> blockSize=4|
> |(linReg_779f6890aee9,4,10,7241,12.600287516879131,11,[-1.128806276706101,8.677674008649985,9.38851122275203,8.557805348259139,34.241366265511715,26.96490)|
> |(linReg_0e6d961e054f,4,50,11691,1.726567383577527,51,[-1.2409376473684588,-0.3656580427637058,1.0271741488856692,-0.5264377459728347,-1.5444635623477996,)|
> |(linReg_1e12fafab7d2,4,100,17966,0.796858465032771,101,[-0.014663920062692357,-0.057216366204118345,0.1764582527782608,0.12141286532514688,1.58266258533765)|
> blockSize=16|
> |(linReg_5ad195c843bb,16,10,7338,12.600287516896273,11,[-1.1288062767576779,8.677674008672964,9.388511222753797,8.557805348281347,34.24136626552257,26.9649)|
> |(linReg_686fe7849c42,16,50,12093,1.7265673762478049,51,[-1.2409376965631724,-0.3656579898205299,1.0271741857198382,-0.5264377659307408,-1.5444636325154564,)|
> |(linReg_cc934209aac1,16,100,18253,0.7844992170383625,101,[-0.4230952901291041,0.08770018558785676,0.2719402480140563,0.08602481376955884,0.8763149744964053,-)|
> blockSize=64|
> |(linReg_2de48672cf40,64,10,7956,12.600287516883563,11,[-1.1288062767198885,8.677674008655007,9.388511222751507,8.557805348264019,34.24136626551386,26.9649)|
> |(linReg_a4ed072bdf00,64,50,14423,1.7265674032944005,51,[-1.240937585330031,-0.36565823041213286,1.02717419529322,-0.5264376482700692,-1.5444634018412484,0.)|
> |(linReg_ed9bf8e6db3d,64,100,22680,0.7508904951409897,101,[-0.39923222418441695,0.2591603128603928,0.025707538173424214,0.06178131424518882,1.3651702157456522)|



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32119) ExecutorPlugin doesn't work with Standalone Cluster

2020-06-28 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-32119:
--

 Summary: ExecutorPlugin doesn't work with Standalone Cluster
 Key: SPARK-32119
 URL: https://issues.apache.org/jira/browse/SPARK-32119
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


ExecutorPlugin doesn't work with Standalone Cluster (and possibly with other 
cluster managers except YARN) 
when a jar containing the plugins, and files used by the plugins, are added via 
the --jars and --files options of spark-submit.

This is because jars and files added by --jars and --files are not loaded on 
Executor initialization.
I confirmed it works with YARN because jars/files are distributed via the 
distributed cache.
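
For reference, a minimal sketch of the kind of plugin that hits this (the plugin class and file name are hypothetical; it assumes the org.apache.spark.api.plugin API and a properties file shipped with --files):
{code}
import java.util.{Map => JMap}

import org.apache.spark.SparkFiles
import org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, PluginContext, SparkPlugin}

class MyPlugin extends SparkPlugin {
  // No driver-side component is needed for this example.
  override def driverPlugin(): DriverPlugin = null

  override def executorPlugin(): ExecutorPlugin = new ExecutorPlugin {
    override def init(ctx: PluginContext, extraConf: JMap[String, String]): Unit = {
      // On YARN the file is already in the distributed cache when init runs;
      // on Standalone the jar/file added via --jars/--files is not yet available here.
      val path = SparkFiles.get("plugin-config.properties")
      println(s"plugin config resolved to $path")
    }
  }
}
{code}
Submitted with something like --conf spark.plugins=MyPlugin --jars my-plugin.jar --files plugin-config.properties, this initializes fine on YARN but fails on Standalone as described above.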



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32119) ExecutorPlugin doesn't work with Standalone Cluster

2020-06-28 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-32119:
---
Description: 
ExecutorPlugin can't work with Standalone Cluster (maybe with other cluster 
manager too except YARN. ) 
 when a jar which contains plugins and files used by the plugins are added by 
--jars and --files option with spark-submit.

This is because jars and files added by --jars and --files are not loaded on 
Executor initialization.
 I confirmed it works with YARN because jars/files are distributed as 
distributed cache.

  was:
ExecutorPlugin can't work with Standalone Cluster (maybe with other cluster 
manager too except YARN. ) 
when a jar which contains plugins and files used by the plugins are added by 
--jars and --files option with spark-submit.

This is because jars and files added by --jars and --files are not loaded on 
Executor initialization.
I confirmed it works **with YARN because jars/files are distributed as 
distributed cache.


> ExecutorPlugin doesn't work with Standalone Cluster
> ---
>
> Key: SPARK-32119
> URL: https://issues.apache.org/jira/browse/SPARK-32119
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> ExecutorPlugin can't work with Standalone Cluster (maybe with other cluster 
> manager too except YARN. ) 
>  when a jar which contains plugins and files used by the plugins are added by 
> --jars and --files option with spark-submit.
> This is because jars and files added by --jars and --files are not loaded on 
> Executor initialization.
>  I confirmed it works with YARN because jars/files are distributed as 
> distributed cache.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32115) Incorrect results for SUBSTRING when overflow

2020-06-28 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147413#comment-17147413
 ] 

Dongjoon Hyun commented on SPARK-32115:
---

Thank you, @Yuanjian Li .

> Incorrect results for SUBSTRING when overflow
> -
>
> Key: SPARK-32115
> URL: https://issues.apache.org/jira/browse/SPARK-32115
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>
> SQL query SELECT SUBSTRING("abc", -1207959552, -1207959552) incorrectly 
> returns "abc" against expected output of "".
>  This is a result of integer overflow in addition 
> [https://github.com/apache/spark/blob/8c44d744631516a5cdaf63406e69a9dd11e5b878/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L345]
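
For context, the wrap-around is easy to see in isolation (illustrative snippet only, using the two arguments from the report):
{code}
// Their Int sum wraps around to a large positive value, which then looks like
// a huge "end" index instead of a negative one.
val pos = -1207959552
val len = -1207959552
pos + len          // 1879048192 after Int overflow
pos.toLong + len   // -2415919104, the real sum once widened to Long
{code}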



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32115) Incorrect results for SUBSTRING when overflow

2020-06-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32115:
--
Affects Version/s: 2.3.4

> Incorrect results for SUBSTRING when overflow
> -
>
> Key: SPARK-32115
> URL: https://issues.apache.org/jira/browse/SPARK-32115
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.6, 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>
> SQL query SELECT SUBSTRING("abc", -1207959552, -1207959552) incorrectly 
> returns "abc" against expected output of "".
>  This is a result of integer overflow in addition 
> [https://github.com/apache/spark/blob/8c44d744631516a5cdaf63406e69a9dd11e5b878/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L345]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32119) ExecutorPlugin doesn't work with Standalone Cluster

2020-06-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32119:


Assignee: Kousuke Saruta  (was: Apache Spark)

> ExecutorPlugin doesn't work with Standalone Cluster
> ---
>
> Key: SPARK-32119
> URL: https://issues.apache.org/jira/browse/SPARK-32119
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> ExecutorPlugin can't work with Standalone Cluster (maybe with other cluster 
> manager too except YARN. ) 
>  when a jar which contains plugins and files used by the plugins are added by 
> --jars and --files option with spark-submit.
> This is because jars and files added by --jars and --files are not loaded on 
> Executor initialization.
>  I confirmed it works with YARN because jars/files are distributed as 
> distributed cache.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32115) Incorrect results for SUBSTRING when overflow

2020-06-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32115:
--
Affects Version/s: 2.4.6

> Incorrect results for SUBSTRING when overflow
> -
>
> Key: SPARK-32115
> URL: https://issues.apache.org/jira/browse/SPARK-32115
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>
> SQL query SELECT SUBSTRING("abc", -1207959552, -1207959552) incorrectly 
> returns "abc" against expected output of "".
>  This is a result of integer overflow in addition 
> [https://github.com/apache/spark/blob/8c44d744631516a5cdaf63406e69a9dd11e5b878/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L345]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32119) ExecutorPlugin doesn't work with Standalone Cluster

2020-06-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147414#comment-17147414
 ] 

Apache Spark commented on SPARK-32119:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/28939

> ExecutorPlugin doesn't work with Standalone Cluster
> ---
>
> Key: SPARK-32119
> URL: https://issues.apache.org/jira/browse/SPARK-32119
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> ExecutorPlugin can't work with Standalone Cluster (maybe with other cluster 
> manager too except YARN. ) 
>  when a jar which contains plugins and files used by the plugins are added by 
> --jars and --files option with spark-submit.
> This is because jars and files added by --jars and --files are not loaded on 
> Executor initialization.
>  I confirmed it works with YARN because jars/files are distributed as 
> distributed cache.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32115) Incorrect results for SUBSTRING when overflow

2020-06-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32115:
--
Affects Version/s: 2.2.3

> Incorrect results for SUBSTRING when overflow
> -
>
> Key: SPARK-32115
> URL: https://issues.apache.org/jira/browse/SPARK-32115
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.3, 2.3.4, 2.4.6, 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>
> SQL query SELECT SUBSTRING("abc", -1207959552, -1207959552) incorrectly 
> returns "abc" against expected output of "".
>  This is a result of integer overflow in addition 
> [https://github.com/apache/spark/blob/8c44d744631516a5cdaf63406e69a9dd11e5b878/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L345]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32119) ExecutorPlugin doesn't work with Standalone Cluster

2020-06-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32119:


Assignee: Apache Spark  (was: Kousuke Saruta)

> ExecutorPlugin doesn't work with Standalone Cluster
> ---
>
> Key: SPARK-32119
> URL: https://issues.apache.org/jira/browse/SPARK-32119
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Major
>
> ExecutorPlugin can't work with Standalone Cluster (maybe with other cluster 
> manager too except YARN. ) 
>  when a jar which contains plugins and files used by the plugins are added by 
> --jars and --files option with spark-submit.
> This is because jars and files added by --jars and --files are not loaded on 
> Executor initialization.
>  I confirmed it works with YARN because jars/files are distributed as 
> distributed cache.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32120) Single GPU is allocated multiple times

2020-06-28 Thread Enrico Minack (Jira)
Enrico Minack created SPARK-32120:
-

 Summary: Single GPU is allocated multiple times
 Key: SPARK-32120
 URL: https://issues.apache.org/jira/browse/SPARK-32120
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 3.0.0
Reporter: Enrico Minack


Running spark in a {{local-cluster[2,1,1024]}} with one GPU per worker, task 
and executor and two GPUs provided through a GPU discovery script, the same GPU 
is allocated to both executors.

Discovery script output:
{code}
{"name": "gpu", "addresses": ["0", "1"]}
{code}

Spark local cluster setup through `spark-shell`:
{code}
./spark-3.0.0-bin-hadoop2.7/bin/spark-shell --master "local-cluster[2,1,1024]" 
--conf spark.worker.resource.gpu.discoveryScript=/tmp/gpu.json --conf 
spark.worker.resource.gpu.amount=1 --conf spark.task.resource.gpu.amount=1 
--conf spark.executor.resource.gpu.amount=1
{code}

Executor of this cluster:

Code run in the Spark shell:
{code}
scala> import org.apache.spark.TaskContext
import org.apache.spark.TaskContext

scala> def fn(it: Iterator[java.lang.Long]): Iterator[(String, (String, 
Array[String]))] = { TaskContext.get().resources().mapValues(v => (v.name, 
v.addresses)).iterator }
fn: (it: Iterator[Long])Iterator[(String, (String, Array[String]))]

scala> spark.range(0,2,1,2).mapPartitions(fn).collect
res0: Array[(String, (String, Array[String]))] = Array((gpu,(gpu,Array(1))), 
(gpu,(gpu,Array(1
{code}

The result shows that each task got GPU {{1}}. The executor page shows that 
each task has been run on different executors:


The expected behaviour would have been to have GPU `0` assigned to one executor 
and GPU {{1}} to the other executor. Consequently, each partition / task should 
then see a different GPU.

With Spark 3.0.0-preview2 the allocation was as expected (identical code and 
Spark shell setup):
{code}
res0: Array[(String, (String, Array[String]))] = Array((gpu,(gpu,Array(0))), 
(gpu,(gpu,Array(1
{code}


Happy to contribute a patch if this is an accepted bug.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32120) Single GPU is allocated multiple times

2020-06-28 Thread Enrico Minack (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enrico Minack updated SPARK-32120:
--
Attachment: screenshot-1.png

> Single GPU is allocated multiple times
> --
>
> Key: SPARK-32120
> URL: https://issues.apache.org/jira/browse/SPARK-32120
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 3.0.0
>Reporter: Enrico Minack
>Priority: Major
> Attachments: screenshot-1.png
>
>
> Running spark in a {{local-cluster[2,1,1024]}} with one GPU per worker, task 
> and executor and two GPUs provided through a GPU discovery script, the same 
> GPU is allocated to both executors.
> Discovery script output:
> {code}
> {"name": "gpu", "addresses": ["0", "1"]}
> {code}
> Spark local cluster setup through `spark-shell`:
> {code}
> ./spark-3.0.0-bin-hadoop2.7/bin/spark-shell --master 
> "local-cluster[2,1,1024]" --conf 
> spark.worker.resource.gpu.discoveryScript=/tmp/gpu.json --conf 
> spark.worker.resource.gpu.amount=1 --conf spark.task.resource.gpu.amount=1 
> --conf spark.executor.resource.gpu.amount=1
> {code}
> Executor of this cluster:
> Code run in the Spark shell:
> {code}
> scala> import org.apache.spark.TaskContext
> import org.apache.spark.TaskContext
> scala> def fn(it: Iterator[java.lang.Long]): Iterator[(String, (String, 
> Array[String]))] = { TaskContext.get().resources().mapValues(v => (v.name, 
> v.addresses)).iterator }
> fn: (it: Iterator[Long])Iterator[(String, (String, Array[String]))]
> scala> spark.range(0,2,1,2).mapPartitions(fn).collect
> res0: Array[(String, (String, Array[String]))] = Array((gpu,(gpu,Array(1))), 
> (gpu,(gpu,Array(1
> {code}
> The result shows that each task got GPU {{1}}. The executor page shows that 
> each task has been run on different executors:
> The expected behaviour would have been to have GPU `0` assigned to one 
> executor and GPU {{1}} to the other executor. Consequently, each partition / 
> task should then see a different GPU.
> With Spark 3.0.0-preview2 the allocation was as expected (identical code and 
> Spark shell setup):
> {code}
> res0: Array[(String, (String, Array[String]))] = Array((gpu,(gpu,Array(0))), 
> (gpu,(gpu,Array(1
> {code}
> Happy to contribute a patch if this is an accepted bug.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32115) Incorrect results for SUBSTRING when overflow

2020-06-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32115:
--
Affects Version/s: 2.1.3

> Incorrect results for SUBSTRING when overflow
> -
>
> Key: SPARK-32115
> URL: https://issues.apache.org/jira/browse/SPARK-32115
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.3, 2.2.3, 2.3.4, 2.4.6, 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>
> SQL query SELECT SUBSTRING("abc", -1207959552, -1207959552) incorrectly 
> returns "abc" against expected output of "".
>  This is a result of integer overflow in addition 
> [https://github.com/apache/spark/blob/8c44d744631516a5cdaf63406e69a9dd11e5b878/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L345]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32115) Incorrect results for SUBSTRING when overflow

2020-06-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32115:
--
Affects Version/s: 2.0.2

> Incorrect results for SUBSTRING when overflow
> -
>
> Key: SPARK-32115
> URL: https://issues.apache.org/jira/browse/SPARK-32115
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.6, 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>
> SQL query SELECT SUBSTRING("abc", -1207959552, -1207959552) incorrectly 
> returns "abc" against expected output of "".
>  This is a result of integer overflow in addition 
> [https://github.com/apache/spark/blob/8c44d744631516a5cdaf63406e69a9dd11e5b878/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L345]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32120) Single GPU is allocated multiple times

2020-06-28 Thread Enrico Minack (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enrico Minack updated SPARK-32120:
--
Attachment: screenshot-2.png

> Single GPU is allocated multiple times
> --
>
> Key: SPARK-32120
> URL: https://issues.apache.org/jira/browse/SPARK-32120
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 3.0.0
>Reporter: Enrico Minack
>Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> Running spark in a {{local-cluster[2,1,1024]}} with one GPU per worker, task 
> and executor and two GPUs provided through a GPU discovery script, the same 
> GPU is allocated to both executors.
> Discovery script output:
> {code}
> {"name": "gpu", "addresses": ["0", "1"]}
> {code}
> Spark local cluster setup through `spark-shell`:
> {code}
> ./spark-3.0.0-bin-hadoop2.7/bin/spark-shell --master 
> "local-cluster[2,1,1024]" --conf 
> spark.worker.resource.gpu.discoveryScript=/tmp/gpu.json --conf 
> spark.worker.resource.gpu.amount=1 --conf spark.task.resource.gpu.amount=1 
> --conf spark.executor.resource.gpu.amount=1
> {code}
> Executor of this cluster:
> Code run in the Spark shell:
> {code}
> scala> import org.apache.spark.TaskContext
> import org.apache.spark.TaskContext
> scala> def fn(it: Iterator[java.lang.Long]): Iterator[(String, (String, 
> Array[String]))] = { TaskContext.get().resources().mapValues(v => (v.name, 
> v.addresses)).iterator }
> fn: (it: Iterator[Long])Iterator[(String, (String, Array[String]))]
> scala> spark.range(0,2,1,2).mapPartitions(fn).collect
> res0: Array[(String, (String, Array[String]))] = Array((gpu,(gpu,Array(1))), 
> (gpu,(gpu,Array(1
> {code}
> The result shows that each task got GPU {{1}}. The executor page shows that 
> each task has been run on different executors:
> The expected behaviour would have been to have GPU `0` assigned to one 
> executor and GPU {{1}} to the other executor. Consequently, each partition / 
> task should then see a different GPU.
> With Spark 3.0.0-preview2 the allocation was as expected (identical code and 
> Spark shell setup):
> {code}
> res0: Array[(String, (String, Array[String]))] = Array((gpu,(gpu,Array(0))), 
> (gpu,(gpu,Array(1
> {code}
> Happy to contribute a patch if this is an accepted bug.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32115) Incorrect results for SUBSTRING when overflow

2020-06-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32115:
--
Labels: correctness  (was: )

> Incorrect results for SUBSTRING when overflow
> -
>
> Key: SPARK-32115
> URL: https://issues.apache.org/jira/browse/SPARK-32115
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.6, 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>  Labels: correctness
>
> SQL query SELECT SUBSTRING("abc", -1207959552, -1207959552) incorrectly 
> returns "abc" against expected output of "".
>  This is a result of integer overflow in addition 
> [https://github.com/apache/spark/blob/8c44d744631516a5cdaf63406e69a9dd11e5b878/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L345]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32120) Single GPU is allocated multiple times

2020-06-28 Thread Enrico Minack (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enrico Minack updated SPARK-32120:
--
Attachment: screenshot-3.png

> Single GPU is allocated multiple times
> --
>
> Key: SPARK-32120
> URL: https://issues.apache.org/jira/browse/SPARK-32120
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 3.0.0
>Reporter: Enrico Minack
>Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png, screenshot-3.png
>
>
> Running spark in a {{local-cluster[2,1,1024]}} with one GPU per worker, task 
> and executor and two GPUs provided through a GPU discovery script, the same 
> GPU is allocated to both executors.
> Discovery script output:
> {code}
> {"name": "gpu", "addresses": ["0", "1"]}
> {code}
> Spark local cluster setup through `spark-shell`:
> {code}
> ./spark-3.0.0-bin-hadoop2.7/bin/spark-shell --master 
> "local-cluster[2,1,1024]" --conf 
> spark.worker.resource.gpu.discoveryScript=/tmp/gpu.json --conf 
> spark.worker.resource.gpu.amount=1 --conf spark.task.resource.gpu.amount=1 
> --conf spark.executor.resource.gpu.amount=1
> {code}
> Executor of this cluster:
> Code run in the Spark shell:
> {code}
> scala> import org.apache.spark.TaskContext
> import org.apache.spark.TaskContext
> scala> def fn(it: Iterator[java.lang.Long]): Iterator[(String, (String, 
> Array[String]))] = { TaskContext.get().resources().mapValues(v => (v.name, 
> v.addresses)).iterator }
> fn: (it: Iterator[Long])Iterator[(String, (String, Array[String]))]
> scala> spark.range(0,2,1,2).mapPartitions(fn).collect
> res0: Array[(String, (String, Array[String]))] = Array((gpu,(gpu,Array(1))), 
> (gpu,(gpu,Array(1
> {code}
> The result shows that each task got GPU {{1}}. The executor page shows that 
> each task has been run on different executors:
> The expected behaviour would have been to have GPU `0` assigned to one 
> executor and GPU {{1}} to the other executor. Consequently, each partition / 
> task should then see a different GPU.
> With Spark 3.0.0-preview2 the allocation was as expected (identical code and 
> Spark shell setup):
> {code}
> res0: Array[(String, (String, Array[String]))] = Array((gpu,(gpu,Array(0))), 
> (gpu,(gpu,Array(1
> {code}
> Happy to contribute a patch if this is an accepted bug.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32115) Incorrect results for SUBSTRING when overflow

2020-06-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32115:
--
Affects Version/s: 1.6.3

> Incorrect results for SUBSTRING when overflow
> -
>
> Key: SPARK-32115
> URL: https://issues.apache.org/jira/browse/SPARK-32115
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.6, 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>  Labels: correctness
>
> SQL query SELECT SUBSTRING("abc", -1207959552, -1207959552) incorrectly 
> returns "abc" against expected output of "".
>  This is a result of integer overflow in addition 
> [https://github.com/apache/spark/blob/8c44d744631516a5cdaf63406e69a9dd11e5b878/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L345]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32115) Incorrect results for SUBSTRING when overflow

2020-06-28 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147416#comment-17147416
 ] 

Dongjoon Hyun commented on SPARK-32115:
---

I also verified that this is a long-standing bug affecting 1.6.3 through 3.0.0, 
while Apache Hive 2.3.7 has no problem.
{code}
hive> SELECT SUBSTRING("abc", -1207959552, -1207959552);
OK
Time taken: 4.291 seconds, Fetched: 1 row(s)
{code}
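
For comparison, the Spark side of the same check can be run from spark-shell (sketch; per this report it incorrectly returns "abc" before the fix):
{code}
// Same query as in the report, executed through Spark SQL.
spark.sql("""SELECT SUBSTRING("abc", -1207959552, -1207959552)""").show()
{code}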

> Incorrect results for SUBSTRING when overflow
> -
>
> Key: SPARK-32115
> URL: https://issues.apache.org/jira/browse/SPARK-32115
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.6, 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>  Labels: correctness
>
> SQL query SELECT SUBSTRING("abc", -1207959552, -1207959552) incorrectly 
> returns "abc" against expected output of "".
>  This is a result of integer overflow in addition 
> [https://github.com/apache/spark/blob/8c44d744631516a5cdaf63406e69a9dd11e5b878/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L345]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32115) Incorrect results for SUBSTRING when overflow

2020-06-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32115:
--
Target Version/s: 2.4.7, 3.0.1

> Incorrect results for SUBSTRING when overflow
> -
>
> Key: SPARK-32115
> URL: https://issues.apache.org/jira/browse/SPARK-32115
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.6, 3.0.0
>Reporter: Yuanjian Li
>Priority: Blocker
>  Labels: correctness
>
> SQL query SELECT SUBSTRING("abc", -1207959552, -1207959552) incorrectly 
> returns "abc" against expected output of "".
>  This is a result of integer overflow in addition 
> [https://github.com/apache/spark/blob/8c44d744631516a5cdaf63406e69a9dd11e5b878/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L345]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32115) Incorrect results for SUBSTRING when overflow

2020-06-28 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147417#comment-17147417
 ] 

Dongjoon Hyun commented on SPARK-32115:
---

Although this might be a rare case, I am raising this as a Blocker because it is 
a correctness issue.

> Incorrect results for SUBSTRING when overflow
> -
>
> Key: SPARK-32115
> URL: https://issues.apache.org/jira/browse/SPARK-32115
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.6, 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>  Labels: correctness
>
> SQL query SELECT SUBSTRING("abc", -1207959552, -1207959552) incorrectly 
> returns "abc" against expected output of "".
>  This is a result of integer overflow in addition 
> [https://github.com/apache/spark/blob/8c44d744631516a5cdaf63406e69a9dd11e5b878/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L345]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32120) Single GPU is allocated multiple times

2020-06-28 Thread Enrico Minack (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enrico Minack updated SPARK-32120:
--
Description: 
I am running Spark in a {{local-cluster[2,1,1024]}} with one GPU per worker, 
task and executor, and two GPUs provided through a GPU discovery script. The 
same GPU is allocated to both executors.

Discovery script output:
{code:java}
{"name": "gpu", "addresses": ["0", "1"]}
{code}
Spark local cluster setup through {{spark-shell}}:
{code:java}
./spark-3.0.0-bin-hadoop2.7/bin/spark-shell --master "local-cluster[2,1,1024]" 
--conf spark.worker.resource.gpu.discoveryScript=/tmp/gpu.json --conf 
spark.worker.resource.gpu.amount=1 --conf spark.task.resource.gpu.amount=1 
--conf spark.executor.resource.gpu.amount=1
{code}
Executor page of this cluster:
 !screenshot-2.png!

You can see that both executors have the same GPU allocated: {{[1]}}

Code run in the Spark shell:
{code:java}
scala> import org.apache.spark.TaskContext
import org.apache.spark.TaskContext

scala> def fn(it: Iterator[java.lang.Long]): Iterator[(String, (String, 
Array[String]))] = { TaskContext.get().resources().mapValues(v => (v.name, 
v.addresses)).iterator }
fn: (it: Iterator[Long])Iterator[(String, (String, Array[String]))]

scala> spark.range(0,2,1,2).mapPartitions(fn).collect
res0: Array[(String, (String, Array[String]))] = Array((gpu,(gpu,Array(1))), 
(gpu,(gpu,Array(1
{code}
The result shows that each task got GPU {{1}}. The executor page shows that 
each task has been run on different executors (see above screenshot).

The expected behaviour would have been to have GPU {{0}} assigned to one 
executor and GPU {{1}} to the other executor. Consequently, each partition / 
task should then see a different GPU.

With Spark 3.0.0-preview2 the allocation was as expected (identical code and 
Spark shell setup):
{code:java}
res0: Array[(String, (String, Array[String]))] = Array((gpu,(gpu,Array(0))), 
(gpu,(gpu,Array(1
{code}
!screenshot-3.png!

Happy to contribute a patch if this is an accepted bug.

  was:
Running spark in a {{local-cluster[2,1,1024]}} with one GPU per worker, task 
and executor and two GPUs provided through a GPU discovery script, the same GPU 
is allocated to both executors.

Discovery script output:
{code}
{"name": "gpu", "addresses": ["0", "1"]}
{code}

Spark local cluster setup through `spark-shell`:
{code}
./spark-3.0.0-bin-hadoop2.7/bin/spark-shell --master "local-cluster[2,1,1024]" 
--conf spark.worker.resource.gpu.discoveryScript=/tmp/gpu.json --conf 
spark.worker.resource.gpu.amount=1 --conf spark.task.resource.gpu.amount=1 
--conf spark.executor.resource.gpu.amount=1
{code}

Executor of this cluster:

Code run in the Spark shell:
{code}
scala> import org.apache.spark.TaskContext
import org.apache.spark.TaskContext

scala> def fn(it: Iterator[java.lang.Long]): Iterator[(String, (String, 
Array[String]))] = { TaskContext.get().resources().mapValues(v => (v.name, 
v.addresses)).iterator }
fn: (it: Iterator[Long])Iterator[(String, (String, Array[String]))]

scala> spark.range(0,2,1,2).mapPartitions(fn).collect
res0: Array[(String, (String, Array[String]))] = Array((gpu,(gpu,Array(1))), 
(gpu,(gpu,Array(1
{code}

The result shows that each task got GPU {{1}}. The executor page shows that 
each task has been run on different executors:


The expected behaviour would have been to have GPU `0` assigned to one executor 
and GPU {{1}} to the other executor. Consequently, each partition / task should 
then see a different GPU.

With Spark 3.0.0-preview2 the allocation was as expected (identical code and 
Spark shell setup):
{code}
res0: Array[(String, (String, Array[String]))] = Array((gpu,(gpu,Array(0))), 
(gpu,(gpu,Array(1
{code}


Happy to contribute a patch if this is an accepted bug.


> Single GPU is allocated multiple times
> --
>
> Key: SPARK-32120
> URL: https://issues.apache.org/jira/browse/SPARK-32120
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 3.0.0
>Reporter: Enrico Minack
>Priority: Major
> Attachments: screenshot-2.png, screenshot-3.png
>
>
> I am running Spark in a {{local-cluster[2,1,1024]}} with one GPU per worker, 
> task and executor, and two GPUs provided through a GPU discovery script. The 
> same GPU is allocated to both executors.
> Discovery script output:
> {code:java}
> {"name": "gpu", "addresses": ["0", "1"]}
> {code}
> Spark local cluster setup through {{spark-shell}}:
> {code:java}
> ./spark-3.0.0-bin-hadoop2.7/bin/spark-shell --master 
> "local-cluster[2,1,1024]" --conf 
> spark.worker.resource.gpu.discoveryScript=/tmp/gpu.json --conf 
> spark.worker.resource.gpu.amount=1 --conf spark.task.resource.gpu.amount=1 
> --conf spark.executor.resource.gpu.amount=1
> {code}
> Executor page of this cluster:
> 

[jira] [Updated] (SPARK-32120) Single GPU is allocated multiple times

2020-06-28 Thread Enrico Minack (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enrico Minack updated SPARK-32120:
--
Attachment: (was: screenshot-1.png)

> Single GPU is allocated multiple times
> --
>
> Key: SPARK-32120
> URL: https://issues.apache.org/jira/browse/SPARK-32120
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 3.0.0
>Reporter: Enrico Minack
>Priority: Major
> Attachments: screenshot-2.png, screenshot-3.png
>
>
> I am running Spark in a {{local-cluster[2,1,1024]}} with one GPU per worker, 
> task and executor, and two GPUs provided through a GPU discovery script. The 
> same GPU is allocated to both executors.
> Discovery script output:
> {code:java}
> {"name": "gpu", "addresses": ["0", "1"]}
> {code}
> Spark local cluster setup through {{spark-shell}}:
> {code:java}
> ./spark-3.0.0-bin-hadoop2.7/bin/spark-shell --master 
> "local-cluster[2,1,1024]" --conf 
> spark.worker.resource.gpu.discoveryScript=/tmp/gpu.json --conf 
> spark.worker.resource.gpu.amount=1 --conf spark.task.resource.gpu.amount=1 
> --conf spark.executor.resource.gpu.amount=1
> {code}
> Executor page of this cluster:
>  !screenshot-2.png!
> You can see that both executors have the same GPU allocated: {{[1]}}
> Code run in the Spark shell:
> {code:java}
> scala> import org.apache.spark.TaskContext
> import org.apache.spark.TaskContext
> scala> def fn(it: Iterator[java.lang.Long]): Iterator[(String, (String, 
> Array[String]))] = { TaskContext.get().resources().mapValues(v => (v.name, 
> v.addresses)).iterator }
> fn: (it: Iterator[Long])Iterator[(String, (String, Array[String]))]
> scala> spark.range(0,2,1,2).mapPartitions(fn).collect
> res0: Array[(String, (String, Array[String]))] = Array((gpu,(gpu,Array(1))), 
> (gpu,(gpu,Array(1
> {code}
> The result shows that each task got GPU {{1}}. The executor page shows that 
> each task has been run on different executors (see above screenshot).
> The expected behaviour would have been to have GPU {{0}} assigned to one 
> executor and GPU {{1}} to the other executor. Consequently, each partition / 
> task should then see a different GPU.
> With Spark 3.0.0-preview2 the allocation was as expected (identical code and 
> Spark shell setup):
> {code:java}
> res0: Array[(String, (String, Array[String]))] = Array((gpu,(gpu,Array(0))), 
> (gpu,(gpu,Array(1
> {code}
> !screenshot-3.png!
> Happy to contribute a patch if this is an accepted bug.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32115) Incorrect results for SUBSTRING when overflow

2020-06-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32115:
--
Priority: Blocker  (was: Major)

> Incorrect results for SUBSTRING when overflow
> -
>
> Key: SPARK-32115
> URL: https://issues.apache.org/jira/browse/SPARK-32115
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.6, 3.0.0
>Reporter: Yuanjian Li
>Priority: Blocker
>  Labels: correctness
>
> SQL query SELECT SUBSTRING("abc", -1207959552, -1207959552) incorrectly 
> returns "abc" against expected output of "".
>  This is a result of integer overflow in addition 
> [https://github.com/apache/spark/blob/8c44d744631516a5cdaf63406e69a9dd11e5b878/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L345]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32109) SQL hash function handling of nulls makes collision too likely

2020-06-28 Thread koert kuipers (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147358#comment-17147358
 ] 

koert kuipers edited comment on SPARK-32109 at 6/28/20, 2:58 PM:
-

The issue is that Row here isn't really a sequence; it represents an object.

If you have, say, an object Person(name: String, nickname: String), you would not 
want Person("john", null) and Person(null, "john") to have the same hashCode.

See, for example, the suggested hashCode implementations in Effective Java by 
Joshua Bloch. They do something similar to what you suggest to solve this 
problem. So unfortunately I think our current implementation is flawed :(

P.S. Even for pure sequences I do not think this implementation as it stands is 
acceptable, but that is less of a worry than the object representation of Row.


was (Author: koert):
the issue is that Row here isnt really a sequence. it represent an object.

if you have say an object Person(name: String, nickname: String) you would not 
want Person("john", null) and Person(null, "john") to have same hashCode.

see for example the suggested hashcode implementations in effective java by 
joshua bloch. they do something similar to what you suggest to solve this 
problem. so unfortunately i think our current implementation is flawed :(

PS even for pure sequences i do not think this implementation as it is right 
now is acceptable. but that is less of a worry than the object represenation of 
row.

> SQL hash function handling of nulls makes collision too likely
> --
>
> Key: SPARK-32109
> URL: https://issues.apache.org/jira/browse/SPARK-32109
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: koert kuipers
>Priority: Minor
>
> this ticket is about org.apache.spark.sql.functions.hash and sparks handling 
> of nulls when hashing sequences.
> {code:java}
> scala> spark.sql("SELECT hash('bar', null)").show()
> +---+
> |hash(bar, NULL)|
> +---+
> |-1808790533|
> +---+
> scala> spark.sql("SELECT hash(null, 'bar')").show()
> +---+
> |hash(NULL, bar)|
> +---+
> |-1808790533|
> +---+
>  {code}
> these are different sequences, e.g. they could be positions 0 and 1 in a 
> dataframe, which are different columns with entirely different meanings. the 
> hashes should not be the same.
> another example:
> {code:java}
> scala> Seq(("john", null), (null, "john")).toDF("name", 
> "alias").withColumn("hash", hash(col("name"), col("alias"))).show
> ++-+-+
> |name|alias| hash|
> ++-+-+
> |john| null|487839701|
> |null| john|487839701|
> ++-+-+ {code}
> instead of ignoring nulls, each null should apply a transform to the hash so 
> that the order of elements, including the nulls, matters for the outcome.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32109) SQL hash function handling of nulls makes collision too likely

2020-06-28 Thread koert kuipers (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147358#comment-17147358
 ] 

koert kuipers commented on SPARK-32109:
---

The issue is that Row here isn't really a sequence; it represents an object.

If you have, say, an object Person(name: String, nickname: String), you would not 
want Person("john", null) and Person(null, "john") to have the same hashCode.

See, for example, the suggested hashCode implementations in Effective Java by 
Joshua Bloch. They do something similar to what you suggest to solve this 
problem. So unfortunately I think our current implementation is flawed :(

PS: Even for pure sequences I do not think this implementation as it stands is 
acceptable, but that is less of a worry than the object representation of Row.
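
A sketch of the kind of order-sensitive null handling described above (plain Scala for illustration, not Spark's actual hash implementation; the function name and seed are made up):
{code}
// Fold each element into a running hash; a null contributes a fixed marker,
// so the position of the null still changes the result.
def orderSensitiveHash(values: Seq[Any], seed: Int = 42): Int =
  values.foldLeft(seed) { (h, v) =>
    31 * h + (if (v == null) 0 else v.hashCode())
  }

orderSensitiveHash(Seq("john", null))  // differs from orderSensitiveHash(Seq(null, "john"))
{code}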

> SQL hash function handling of nulls makes collision too likely
> --
>
> Key: SPARK-32109
> URL: https://issues.apache.org/jira/browse/SPARK-32109
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: koert kuipers
>Priority: Minor
>
> this ticket is about org.apache.spark.sql.functions.hash and sparks handling 
> of nulls when hashing sequences.
> {code:java}
> scala> spark.sql("SELECT hash('bar', null)").show()
> +---------------+
> |hash(bar, NULL)|
> +---------------+
> |    -1808790533|
> +---------------+
> scala> spark.sql("SELECT hash(null, 'bar')").show()
> +---------------+
> |hash(NULL, bar)|
> +---------------+
> |    -1808790533|
> +---------------+
>  {code}
> These are different sequences: e.g. they could be positions 0 and 1 in a 
> DataFrame, which are different columns with entirely different meanings. The 
> hashes should not be the same.
> Another example:
> {code:java}
> scala> Seq(("john", null), (null, "john")).toDF("name", 
> "alias").withColumn("hash", hash(col("name"), col("alias"))).show
> +----+-----+---------+
> |name|alias|     hash|
> +----+-----+---------+
> |john| null|487839701|
> |null| john|487839701|
> +----+-----+---------+ {code}
> Instead of ignoring nulls, each null should apply a transform to the hash so that 
> the order of the elements, including the nulls, matters for the outcome.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32121) ExternalShuffleBlockResolverSuite failed on windows

2020-06-28 Thread Cheng Pan (Jira)
Cheng Pan created SPARK-32121:
-

 Summary: ExternalShuffleBlockResolverSuite failed on windows
 Key: SPARK-32121
 URL: https://issues.apache.org/jira/browse/SPARK-32121
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 3.0.0, 3.0.1
 Environment: Windows 10
Reporter: Cheng Pan


The method {code}ExecutorDiskUtils.createNormalizedInternedPathname{code} 
should consider the Windows file separator.
{code}
[ERROR] Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.132 s 
<<< FAILURE! - in 
org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite
[ERROR] 
testNormalizeAndInternPathname(org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite)
  Time elapsed: 0 s  <<< FAILURE!
org.junit.ComparisonFailure: expected: but was:
at 
org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite.assertPathsMatch(ExternalShuffleBlockResolverSuite.java:160)
at 
org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite.testNormalizeAndInternPathname(ExternalShuffleBlockResolverSuite.java:149)
{code}
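
As a rough illustration (a hypothetical helper, not Spark's actual ExecutorDiskUtils code), delegating the normalization to java.io.File already yields platform-separator-aware behaviour, which is the kind of handling the ticket asks for on Windows:
{code}
// Hypothetical sketch, not Spark's ExecutorDiskUtils: normalize a path by
// delegating to java.io.File, which collapses duplicate separators and uses
// the platform separator ("\" on Windows, "/" elsewhere). Tests comparing
// against hard-coded "/" strings need to account for File.separator.
import java.io.File

object NormalizedPathDemo {
  def normalizedInternedPath(parent: String, child: String): String =
    new File(parent, child).getPath.intern()

  def main(args: Array[String]): Unit = {
    // e.g. "/foo/bar/baz" on Unix, "\foo\bar\baz" on Windows
    println(normalizedInternedPath("/foo//bar", "baz"))
  }
}
{code}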



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32115) Incorrect results for SUBSTRING when overflow

2020-06-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32115.
---
Fix Version/s: 3.1.0
   2.4.7
   3.0.1
   Resolution: Fixed

Issue resolved by pull request 28937
[https://github.com/apache/spark/pull/28937]

> Incorrect results for SUBSTRING when overflow
> -
>
> Key: SPARK-32115
> URL: https://issues.apache.org/jira/browse/SPARK-32115
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.6, 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Blocker
>  Labels: correctness
> Fix For: 3.0.1, 2.4.7, 3.1.0
>
>
> The SQL query SELECT SUBSTRING("abc", -1207959552, -1207959552) incorrectly 
> returns "abc" instead of the expected output of "".
> This is a result of integer overflow in the addition at 
> [https://github.com/apache/spark/blob/8c44d744631516a5cdaf63406e69a9dd11e5b878/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L345]
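
The wrap-around is easy to reproduce outside Spark; below is a simplified sketch (hypothetical code, not the actual UTF8String implementation) of 1-based substring logic with the same Int addition, plus the Long-based addition that avoids the overflow:
{code}
// Simplified sketch of 1-based, SQL-style substring logic (not Spark's actual
// UTF8String code), showing how the Int addition start + length wraps around.
object SubstringOverflowDemo {
  def substringSQL(s: String, pos: Int, length: Int): String = {
    val start = if (pos > 0) pos - 1 else s.length + pos  // negative pos counts from the end
    val end   = start + length                            // Int overflow: wraps to a large positive value
    val from  = math.max(start, 0)
    val until = math.min(end, s.length)
    if (from >= until) "" else s.substring(from, until)
  }

  def main(args: Array[String]): Unit = {
    // start = 3 - 1207959552 = -1207959549; start + length wraps to +1879048195,
    // so the clamped range becomes [0, 3) and the whole string is returned.
    println(substringSQL("abc", -1207959552, -1207959552))  // "abc" instead of ""

    // Doing the addition in Long keeps the sign, so the result is correctly empty.
    val safeEnd = (3L - 1207959552L) + -1207959552L
    println(if (safeEnd <= 0) "" else "unexpected")
  }
}
{code}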



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32115) Incorrect results for SUBSTRING when overflow

2020-06-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-32115:
-

Assignee: Yuanjian Li

> Incorrect results for SUBSTRING when overflow
> -
>
> Key: SPARK-32115
> URL: https://issues.apache.org/jira/browse/SPARK-32115
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.6, 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Blocker
>  Labels: correctness
>
> The SQL query SELECT SUBSTRING("abc", -1207959552, -1207959552) incorrectly 
> returns "abc" instead of the expected output of "".
> This is a result of integer overflow in the addition at 
> [https://github.com/apache/spark/blob/8c44d744631516a5cdaf63406e69a9dd11e5b878/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L345]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32121) ExternalShuffleBlockResolverSuite failed on Windows

2020-06-28 Thread Cheng Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Pan updated SPARK-32121:
--
Summary: ExternalShuffleBlockResolverSuite failed on Windows  (was: 
ExternalShuffleBlockResolverSuite failed on windows)

> ExternalShuffleBlockResolverSuite failed on Windows
> ---
>
> Key: SPARK-32121
> URL: https://issues.apache.org/jira/browse/SPARK-32121
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.0.0, 3.0.1
> Environment: Windows 10
>Reporter: Cheng Pan
>Priority: Minor
>
> The method {code}ExecutorDiskUtils.createNormalizedInternedPathname{code} 
> should consider the Windows file separator.
> {code}
> [ERROR] Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.132 
> s <<< FAILURE! - in 
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite
> [ERROR] 
> testNormalizeAndInternPathname(org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite)
>   Time elapsed: 0 s  <<< FAILURE!
> org.junit.ComparisonFailure: expected: but 
> was:
> at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite.assertPathsMatch(ExternalShuffleBlockResolverSuite.java:160)
> at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite.testNormalizeAndInternPathname(ExternalShuffleBlockResolverSuite.java:149)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32119) ExecutorPlugin doesn't work with Standalone Cluster

2020-06-28 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147422#comment-17147422
 ] 

Dongjoon Hyun commented on SPARK-32119:
---

Hi, [~sarutak]. This sounds like a bug for `Standalone Cluster`. Can we switch 
this to `BUG` instead of `Improvement` for 3.1.0?

> ExecutorPlugin doesn't work with Standalone Cluster
> ---
>
> Key: SPARK-32119
> URL: https://issues.apache.org/jira/browse/SPARK-32119
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> ExecutorPlugin doesn't work with Standalone Cluster (and possibly with other 
> cluster managers too, except YARN) 
> when a jar which contains the plugins, and files used by the plugins, are added 
> with the --jars and --files options of spark-submit.
> This is because jars and files added with --jars and --files are not yet loaded 
> when the Executor is initialized.
> I confirmed it works with YARN because jars/files are distributed there via the 
> distributed cache.
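
For context, a minimal sketch of an executor-side plugin on the Spark 3.x plugin API (the class name MySparkPlugin and the file plugin-config.properties are hypothetical); per the description above, the jar carrying such a class and the file it reads are what --jars/--files fail to make available on Standalone when executors initialize:
{code}
// Minimal sketch of a Spark 3.x plugin (org.apache.spark.api.plugin); the class
// and file names are hypothetical. The plugin jar would be passed with --jars
// and the properties file with --files.
import java.util.{Map => JMap}

import org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, PluginContext, SparkPlugin}

class MySparkPlugin extends SparkPlugin {
  override def driverPlugin(): DriverPlugin = null  // driver side not needed for this sketch

  override def executorPlugin(): ExecutorPlugin = new ExecutorPlugin {
    override def init(ctx: PluginContext, extraConf: JMap[String, String]): Unit = {
      // Reads a file shipped with --files; on YARN it is already localized at this point.
      val src = scala.io.Source.fromFile("plugin-config.properties")
      try println(s"executor ${ctx.executorID()}: loaded ${src.mkString.length} chars")
      finally src.close()
    }
  }
}
{code}
Such a plugin would typically be enabled with --conf spark.plugins=MySparkPlugin.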



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32119) ExecutorPlugin doesn't work with Standalone Cluster

2020-06-28 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-32119:
---
Issue Type: Bug  (was: Improvement)

> ExecutorPlugin doesn't work with Standalone Cluster
> ---
>
> Key: SPARK-32119
> URL: https://issues.apache.org/jira/browse/SPARK-32119
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> ExecutorPlugin doesn't work with Standalone Cluster (and possibly with other 
> cluster managers too, except YARN) 
> when a jar which contains the plugins, and files used by the plugins, are added 
> with the --jars and --files options of spark-submit.
> This is because jars and files added with --jars and --files are not yet loaded 
> when the Executor is initialized.
> I confirmed it works with YARN because jars/files are distributed there via the 
> distributed cache.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32121) ExternalShuffleBlockResolverSuite failed on Windows

2020-06-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147424#comment-17147424
 ] 

Apache Spark commented on SPARK-32121:
--

User 'pan3793' has created a pull request for this issue:
https://github.com/apache/spark/pull/28940

> ExternalShuffleBlockResolverSuite failed on Windows
> ---
>
> Key: SPARK-32121
> URL: https://issues.apache.org/jira/browse/SPARK-32121
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.0.0, 3.0.1
> Environment: Windows 10
>Reporter: Cheng Pan
>Priority: Minor
>
> The method {code}ExecutorDiskUtils.createNormalizedInternedPathname{code} 
> should consider the Windows file separator.
> {code}
> [ERROR] Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.132 
> s <<< FAILURE! - in 
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite
> [ERROR] 
> testNormalizeAndInternPathname(org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite)
>   Time elapsed: 0 s  <<< FAILURE!
> org.junit.ComparisonFailure: expected: but 
> was:
> at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite.assertPathsMatch(ExternalShuffleBlockResolverSuite.java:160)
> at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite.testNormalizeAndInternPathname(ExternalShuffleBlockResolverSuite.java:149)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32121) ExternalShuffleBlockResolverSuite failed on Windows

2020-06-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32121:


Assignee: Apache Spark

> ExternalShuffleBlockResolverSuite failed on Windows
> ---
>
> Key: SPARK-32121
> URL: https://issues.apache.org/jira/browse/SPARK-32121
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.0.0, 3.0.1
> Environment: Windows 10
>Reporter: Cheng Pan
>Assignee: Apache Spark
>Priority: Minor
>
> The method {code}ExecutorDiskUtils.createNormalizedInternedPathname{code} 
> should consider the Windows file separator.
> {code}
> [ERROR] Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.132 
> s <<< FAILURE! - in 
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite
> [ERROR] 
> testNormalizeAndInternPathname(org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite)
>   Time elapsed: 0 s  <<< FAILURE!
> org.junit.ComparisonFailure: expected: but 
> was:
> at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite.assertPathsMatch(ExternalShuffleBlockResolverSuite.java:160)
> at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite.testNormalizeAndInternPathname(ExternalShuffleBlockResolverSuite.java:149)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32119) ExecutorPlugin doesn't work with Standalone Cluster

2020-06-28 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-32119:
---
Affects Version/s: 3.0.1

> ExecutorPlugin doesn't work with Standalone Cluster
> ---
>
> Key: SPARK-32119
> URL: https://issues.apache.org/jira/browse/SPARK-32119
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> ExecutorPlugin doesn't work with Standalone Cluster (and possibly with other 
> cluster managers too, except YARN) 
> when a jar which contains the plugins, and files used by the plugins, are added 
> with the --jars and --files options of spark-submit.
> This is because jars and files added with --jars and --files are not yet loaded 
> when the Executor is initialized.
> I confirmed it works with YARN because jars/files are distributed there via the 
> distributed cache.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32119) ExecutorPlugin doesn't work with Standalone Cluster

2020-06-28 Thread Kousuke Saruta (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147427#comment-17147427
 ] 

Kousuke Saruta commented on SPARK-32119:


Sorry, it's just a mistake. I've modified it.

> ExecutorPlugin doesn't work with Standalone Cluster
> ---
>
> Key: SPARK-32119
> URL: https://issues.apache.org/jira/browse/SPARK-32119
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> ExecutorPlugin doesn't work with Standalone Cluster (and possibly with other 
> cluster managers too, except YARN) 
> when a jar which contains the plugins, and files used by the plugins, are added 
> with the --jars and --files options of spark-submit.
> This is because jars and files added with --jars and --files are not yet loaded 
> when the Executor is initialized.
> I confirmed it works with YARN because jars/files are distributed there via the 
> distributed cache.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32121) ExternalShuffleBlockResolverSuite failed on Windows

2020-06-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32121:


Assignee: (was: Apache Spark)

> ExternalShuffleBlockResolverSuite failed on Windows
> ---
>
> Key: SPARK-32121
> URL: https://issues.apache.org/jira/browse/SPARK-32121
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.0.0, 3.0.1
> Environment: Windows 10
>Reporter: Cheng Pan
>Priority: Minor
>
> The method {code}ExecutorDiskUtils.createNormalizedInternedPathname{code} 
> should consider the Windows file separator.
> {code}
> [ERROR] Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.132 
> s <<< FAILURE! - in 
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite
> [ERROR] 
> testNormalizeAndInternPathname(org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite)
>   Time elapsed: 0 s  <<< FAILURE!
> org.junit.ComparisonFailure: expected: but 
> was:
> at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite.assertPathsMatch(ExternalShuffleBlockResolverSuite.java:160)
> at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite.testNormalizeAndInternPathname(ExternalShuffleBlockResolverSuite.java:149)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25341) Support rolling back a shuffle map stage and re-generate the shuffle files

2020-06-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147434#comment-17147434
 ] 

Apache Spark commented on SPARK-25341:
--

User 'warrenzhu25' has created a pull request for this issue:
https://github.com/apache/spark/pull/28941

> Support rolling back a shuffle map stage and re-generate the shuffle files
> --
>
> Key: SPARK-25341
> URL: https://issues.apache.org/jira/browse/SPARK-25341
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 3.0.0
>
>
> This is a follow-up of https://issues.apache.org/jira/browse/SPARK-23243
> To completely fix that problem, Spark needs to be able to roll back a shuffle 
> map stage and rerun all the map tasks.
> According to https://github.com/apache/spark/pull/9214 , Spark doesn't 
> support it currently, as in shuffle writing "first write wins".
> Since overwriting shuffle files is hard, we can extend the shuffle id to 
> include a "shuffle generation number". Then the reduce task can specify which 
> generation of shuffle it wants to read. 
> https://github.com/apache/spark/pull/6648 seems in the right direction.
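
As a rough sketch of the proposal (all names below are hypothetical, not an existing Spark API), the shuffle identity would carry a generation number that is bumped when a map stage is rolled back, and reduce tasks would read blocks for the generation they were scheduled against:
{code}
// Rough sketch of the "shuffle generation number" idea; all names are hypothetical.
final case class ShuffleGenerationId(shuffleId: Int, generation: Int)

class ShuffleGenerations {
  private val latest = scala.collection.mutable.Map.empty[Int, Int].withDefaultValue(0)

  // Called when a shuffle map stage is rolled back: new map output goes to a new
  // generation, so stale files from the previous attempt are never mixed with the
  // re-generated ones.
  def rollback(shuffleId: Int): ShuffleGenerationId = {
    latest(shuffleId) += 1
    ShuffleGenerationId(shuffleId, latest(shuffleId))
  }

  // Reduce tasks pin the generation they were scheduled against.
  def current(shuffleId: Int): ShuffleGenerationId =
    ShuffleGenerationId(shuffleId, latest(shuffleId))
}
{code}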



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25341) Support rolling back a shuffle map stage and re-generate the shuffle files

2020-06-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147435#comment-17147435
 ] 

Apache Spark commented on SPARK-25341:
--

User 'warrenzhu25' has created a pull request for this issue:
https://github.com/apache/spark/pull/28941

> Support rolling back a shuffle map stage and re-generate the shuffle files
> --
>
> Key: SPARK-25341
> URL: https://issues.apache.org/jira/browse/SPARK-25341
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 3.0.0
>
>
> This is a follow-up of https://issues.apache.org/jira/browse/SPARK-23243
> To completely fix that problem, Spark needs to be able to roll back a shuffle 
> map stage and rerun all the map tasks.
> According to https://github.com/apache/spark/pull/9214 , Spark doesn't 
> support it currently, as in shuffle writing "first write wins".
> Since overwriting shuffle files is hard, we can extend the shuffle id to 
> include a "shuffle generation number". Then the reduce task can specify which 
> generation of shuffle it wants to read. 
> https://github.com/apache/spark/pull/6648 seems in the right direction.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org