[jira] [Commented] (SPARK-30049) SQL fails to parse when comment contains an unmatched quote character

2020-01-22 Thread Oleg Bonar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021832#comment-17021832
 ] 

Oleg Bonar commented on SPARK-30049:


[~tgraves], no, I'm not.

 

> SQL fails to parse when comment contains an unmatched quote character
> -
>
> Key: SPARK-30049
> URL: https://issues.apache.org/jira/browse/SPARK-30049
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jason Darrell Lowe
>Priority: Major
> Attachments: Screen Shot 2019-12-18 at 9.26.29 AM.png
>
>
> A SQL statement that contains a comment with an unmatched quote character can 
> lead to a parse error.  These queries parsed correctly in older versions of 
> Spark.  For example, here's an excerpt from an interactive spark-sql session 
> on a recent Spark-3.0.0-SNAPSHOT build (commit 
> e23c135e568d4401a5659bc1b5ae8fc8bf147693):
> {noformat}
> spark-sql> SELECT 1 -- someone's comment here
>  > ;
> Error in query: 
> extraneous input ';' expecting (line 2, pos 0)
> == SQL ==
> SELECT 1 -- someone's comment here
> ;
> ^^^
> {noformat}
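
For reference, the same statement can also be submitted through the programmatic API. The sketch below is a hypothetical check (a local SparkSession is assumed), not part of the report above, which reproduces the failure specifically in the interactive spark-sql shell where statements are terminated by ';'.
{code:scala}
// Hypothetical check, not taken from the report above: submit the same statement
// through the programmatic API. A local SparkSession is assumed.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-30049-check")
  .master("local[*]")
  .getOrCreate()

// The trailing comment contains an unmatched quote character, as in the report.
spark.sql("SELECT 1 -- someone's comment here").show()
{code}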



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30607) overlay wrappers for SparkR and PySpark

2020-01-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30607.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27325
[https://github.com/apache/spark/pull/27325]

> overlay wrappers for SparkR and PySpark
> ---
>
> Key: SPARK-30607
> URL: https://issues.apache.org/jira/browse/SPARK-30607
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SparkR, SQL
>Affects Versions: 3.0.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Minor
> Fix For: 3.0.0
>
>
> SparkR and PySpark are missing wrappers for {{o.a.s.sql.functions.overlay.}}
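
For context, a minimal sketch of the existing Scala API that the requested wrappers would mirror; the data and names below are illustrative only, and a local SparkSession is assumed.
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, overlay}

val spark = SparkSession.builder()
  .appName("overlay-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = Seq(("SPARK_SQL", "CORE")).toDF("src", "replace")

// Replace the characters of `src` starting at position 7 with `replace`;
// under standard OVERLAY semantics this yields "SPARK_CORE".
df.select(overlay(col("src"), col("replace"), lit(7)).as("overlaid")).show()
{code}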



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30607) overlay wrappers for SparkR and PySpark

2020-01-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30607:


Assignee: Maciej Szymkiewicz

> overlay wrappers for SparkR and PySpark
> ---
>
> Key: SPARK-30607
> URL: https://issues.apache.org/jira/browse/SPARK-30607
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SparkR, SQL
>Affects Versions: 3.0.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Minor
>
> SparkR and PySpark are missing wrappers for {{o.a.s.sql.functions.overlay.}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30590) can't use more than five type-safe user-defined aggregation in select statement

2020-01-22 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021823#comment-17021823
 ] 

Hyukjin Kwon commented on SPARK-30590:
--

Ah, thanks. I rushed to read. This issue still persists in the master as well.

> can't use more than five type-safe user-defined aggregation in select 
> statement
> ---
>
> Key: SPARK-30590
> URL: https://issues.apache.org/jira/browse/SPARK-30590
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.3, 2.3.4, 2.4.4
>Reporter: Daniel Mantovani
>Priority: Major
>
>  How to reproduce:
> {code:scala}
> val df = Seq((1,2,3,4,5,6)).toDF("a","b","c","d","e","f")
> import org.apache.spark.sql.expressions.Aggregator
> import org.apache.spark.sql.Encoder
> import org.apache.spark.sql.Encoders
> import org.apache.spark.sql.Row
> case class FooAgg(s:Int) extends Aggregator[Row, Int, Int] {
>   def zero:Int = s
>   def reduce(b: Int, r: Row): Int = b + r.getAs[Int](0)
>   def merge(b1: Int, b2: Int): Int = b1 + b2
>   def finish(b: Int): Int = b
>   def bufferEncoder: Encoder[Int] = Encoders.scalaInt
>   def outputEncoder: Encoder[Int] = Encoders.scalaInt
> }
> val fooAgg = (i:Int) => FooAgg(i).toColumn.name(s"foo_agg_$i")
> scala> df.select(fooAgg(1),fooAgg(2),fooAgg(3),fooAgg(4),fooAgg(5)).show
> +---------+---------+---------+---------+---------+
> |foo_agg_1|foo_agg_2|foo_agg_3|foo_agg_4|foo_agg_5|
> +---------+---------+---------+---------+---------+
> |        3|        5|        7|        9|       11|
> +---------+---------+---------+---------+---------+
> {code}
> With 6 arguments we have error:
> {code:scala}
> scala> 
> df.select(fooAgg(1),fooAgg(2),fooAgg(3),fooAgg(4),fooAgg(5),fooAgg(6)).show
> org.apache.spark.sql.AnalysisException: unresolved operator 'Aggregate 
> [fooagg(FooAgg(1), None, None, None, input[0, int, false] AS value#114, 
> assertnotnull(cast(value#114 as int)), input[0, int, false] AS value#113, 
> IntegerType, IntegerType, false) AS foo_agg_1#116, fooagg(FooAgg(2), None, 
> None, None, input[0, int, false] AS value#119, assertnotnull(cast(value#119 
> as int)), input[0, int, false] AS value#118, IntegerType, IntegerType, false) 
> AS foo_agg_2#121, fooagg(FooAgg(3), None, None, None, input[0, int, false] AS 
> value#124, assertnotnull(cast(value#124 as int)), input[0, int, false] AS 
> value#123, IntegerType, IntegerType, false) AS foo_agg_3#126, 
> fooagg(FooAgg(4), None, None, None, input[0, int, false] AS value#129, 
> assertnotnull(cast(value#129 as int)), input[0, int, false] AS value#128, 
> IntegerType, IntegerType, false) AS foo_agg_4#131, fooagg(FooAgg(5), None, 
> None, None, input[0, int, false] AS value#134, assertnotnull(cast(value#134 
> as int)), input[0, int, false] AS value#133, IntegerType, IntegerType, false) 
> AS foo_agg_5#136, fooagg(FooAgg(6), None, None, None, input[0, int, false] AS 
> value#139, assertnotnull(cast(value#139 as int)), input[0, int, false] AS 
> value#138, IntegerType, IntegerType, false) AS foo_agg_6#141];;
> 'Aggregate [fooagg(FooAgg(1), None, None, None, input[0, int, false] AS 
> value#114, assertnotnull(cast(value#114 as int)), input[0, int, false] AS 
> value#113, IntegerType, IntegerType, false) AS foo_agg_1#116, 
> fooagg(FooAgg(2), None, None, None, input[0, int, false] AS value#119, 
> assertnotnull(cast(value#119 as int)), input[0, int, false] AS value#118, 
> IntegerType, IntegerType, false) AS foo_agg_2#121, fooagg(FooAgg(3), None, 
> None, None, input[0, int, false] AS value#124, assertnotnull(cast(value#124 
> as int)), input[0, int, false] AS value#123, IntegerType, IntegerType, false) 
> AS foo_agg_3#126, fooagg(FooAgg(4), None, None, None, input[0, int, false] AS 
> value#129, assertnotnull(cast(value#129 as int)), input[0, int, false] AS 
> value#128, IntegerType, IntegerType, false) AS foo_agg_4#131, 
> fooagg(FooAgg(5), None, None, None, input[0, int, false] AS value#134, 
> assertnotnull(cast(value#134 as int)), input[0, int, false] AS value#133, 
> IntegerType, IntegerType, false) AS foo_agg_5#136, fooagg(FooAgg(6), None, 
> None, None, input[0, int, false] AS value#139, assertnotnull(cast(value#139 
> as int)), input[0, int, false] AS value#138, IntegerType, IntegerType, false) 
> AS foo_agg_6#141]
> +- Project [_1#6 AS a#13, _2#7 AS b#14, _3#8 AS c#15, _4#9 AS d#16, _5#10 AS 
> e#17, _6#11 AS F#18]
>  +- LocalRelation [_1#6, _2#7, _3#8, _4#9, _5#10, _6#11]
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:43)
>  at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:95)
>  at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$3.apply(CheckAnalysis.scala:431)
>  at 
> 

[jira] [Updated] (SPARK-30590) can't use more than five type-safe user-defined aggregation in select statement

2020-01-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30590:
-
Affects Version/s: 3.0.0

> can't use more than five type-safe user-defined aggregation in select 
> statement
> ---
>
> Key: SPARK-30590
> URL: https://issues.apache.org/jira/browse/SPARK-30590
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: Daniel Mantovani
>Priority: Major
>
>  How to reproduce:
> {code:scala}
> val df = Seq((1,2,3,4,5,6)).toDF("a","b","c","d","e","f")
> import org.apache.spark.sql.expressions.Aggregator
> import org.apache.spark.sql.Encoder
> import org.apache.spark.sql.Encoders
> import org.apache.spark.sql.Row
> case class FooAgg(s:Int) extends Aggregator[Row, Int, Int] {
>   def zero:Int = s
>   def reduce(b: Int, r: Row): Int = b + r.getAs[Int](0)
>   def merge(b1: Int, b2: Int): Int = b1 + b2
>   def finish(b: Int): Int = b
>   def bufferEncoder: Encoder[Int] = Encoders.scalaInt
>   def outputEncoder: Encoder[Int] = Encoders.scalaInt
> }
> val fooAgg = (i:Int) => FooAgg(i).toColumn.name(s"foo_agg_$i")
> scala> df.select(fooAgg(1),fooAgg(2),fooAgg(3),fooAgg(4),fooAgg(5)).show
> +---------+---------+---------+---------+---------+
> |foo_agg_1|foo_agg_2|foo_agg_3|foo_agg_4|foo_agg_5|
> +---------+---------+---------+---------+---------+
> |        3|        5|        7|        9|       11|
> +---------+---------+---------+---------+---------+
> {code}
> With 6 arguments we have error:
> {code:scala}
> scala> 
> df.select(fooAgg(1),fooAgg(2),fooAgg(3),fooAgg(4),fooAgg(5),fooAgg(6)).show
> org.apache.spark.sql.AnalysisException: unresolved operator 'Aggregate 
> [fooagg(FooAgg(1), None, None, None, input[0, int, false] AS value#114, 
> assertnotnull(cast(value#114 as int)), input[0, int, false] AS value#113, 
> IntegerType, IntegerType, false) AS foo_agg_1#116, fooagg(FooAgg(2), None, 
> None, None, input[0, int, false] AS value#119, assertnotnull(cast(value#119 
> as int)), input[0, int, false] AS value#118, IntegerType, IntegerType, false) 
> AS foo_agg_2#121, fooagg(FooAgg(3), None, None, None, input[0, int, false] AS 
> value#124, assertnotnull(cast(value#124 as int)), input[0, int, false] AS 
> value#123, IntegerType, IntegerType, false) AS foo_agg_3#126, 
> fooagg(FooAgg(4), None, None, None, input[0, int, false] AS value#129, 
> assertnotnull(cast(value#129 as int)), input[0, int, false] AS value#128, 
> IntegerType, IntegerType, false) AS foo_agg_4#131, fooagg(FooAgg(5), None, 
> None, None, input[0, int, false] AS value#134, assertnotnull(cast(value#134 
> as int)), input[0, int, false] AS value#133, IntegerType, IntegerType, false) 
> AS foo_agg_5#136, fooagg(FooAgg(6), None, None, None, input[0, int, false] AS 
> value#139, assertnotnull(cast(value#139 as int)), input[0, int, false] AS 
> value#138, IntegerType, IntegerType, false) AS foo_agg_6#141];;
> 'Aggregate [fooagg(FooAgg(1), None, None, None, input[0, int, false] AS 
> value#114, assertnotnull(cast(value#114 as int)), input[0, int, false] AS 
> value#113, IntegerType, IntegerType, false) AS foo_agg_1#116, 
> fooagg(FooAgg(2), None, None, None, input[0, int, false] AS value#119, 
> assertnotnull(cast(value#119 as int)), input[0, int, false] AS value#118, 
> IntegerType, IntegerType, false) AS foo_agg_2#121, fooagg(FooAgg(3), None, 
> None, None, input[0, int, false] AS value#124, assertnotnull(cast(value#124 
> as int)), input[0, int, false] AS value#123, IntegerType, IntegerType, false) 
> AS foo_agg_3#126, fooagg(FooAgg(4), None, None, None, input[0, int, false] AS 
> value#129, assertnotnull(cast(value#129 as int)), input[0, int, false] AS 
> value#128, IntegerType, IntegerType, false) AS foo_agg_4#131, 
> fooagg(FooAgg(5), None, None, None, input[0, int, false] AS value#134, 
> assertnotnull(cast(value#134 as int)), input[0, int, false] AS value#133, 
> IntegerType, IntegerType, false) AS foo_agg_5#136, fooagg(FooAgg(6), None, 
> None, None, input[0, int, false] AS value#139, assertnotnull(cast(value#139 
> as int)), input[0, int, false] AS value#138, IntegerType, IntegerType, false) 
> AS foo_agg_6#141]
> +- Project [_1#6 AS a#13, _2#7 AS b#14, _3#8 AS c#15, _4#9 AS d#16, _5#10 AS 
> e#17, _6#11 AS F#18]
>  +- LocalRelation [_1#6, _2#7, _3#8, _4#9, _5#10, _6#11]
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:43)
>  at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:95)
>  at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$3.apply(CheckAnalysis.scala:431)
>  at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$3.apply(CheckAnalysis.scala:430)

[jira] [Assigned] (SPARK-30601) Add a Google Maven Central as a primary repository

2020-01-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30601:


Assignee: Hyukjin Kwon

> Add a Google Maven Central as a primary repository
> --
>
> Key: SPARK-30601
> URL: https://issues.apache.org/jira/browse/SPARK-30601
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> See 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/Adding-Maven-Central-mirror-from-Google-to-the-build-td28728.html]
> This Jira aims to switch the main repo to Google Maven Central.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30601) Add a Google Maven Central as a primary repository

2020-01-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30601.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27307
[https://github.com/apache/spark/pull/27307]

> Add a Google Maven Central as a primary repository
> --
>
> Key: SPARK-30601
> URL: https://issues.apache.org/jira/browse/SPARK-30601
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> See 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/Adding-Maven-Central-mirror-from-Google-to-the-build-td28728.html]
> This Jira aims to switch the main repo to Google Maven Central.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-30535) Migrate ALTER TABLE commands to the new resolution framework

2020-01-22 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reopened SPARK-30535:
-
  Assignee: (was: Terry Kim)

> Migrate ALTER TABLE commands to the new resolution framework
> 
>
> Key: SPARK-30535
> URL: https://issues.apache.org/jira/browse/SPARK-30535
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Terry Kim
>Priority: Major
> Fix For: 3.0.0
>
>
> Migrate ALTER TABLE commands to the new resolution framework introduced in 
> SPARK-30214



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30546) Make interval type more future-proofing

2020-01-22 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021806#comment-17021806
 ] 

Kent Yao commented on SPARK-30546:
--

Thanks [~dongjoon]

> Make interval type more future-proofing
> ---
>
> Key: SPARK-30546
> URL: https://issues.apache.org/jira/browse/SPARK-30546
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Priority: Major
>
> Before 3.0 we may want to make some effort on the current interval type to
> make it more future-proof, e.g.
> 1. Add an unstable annotation to the CalendarInterval class. People already
> use it as UDF input, so it's better to make it clear that it's unstable.
> 2. Add a schema checker to prohibit creating v2 custom catalog tables with
> intervals, the same as what we do for the built-in catalog.
> 3. Add a schema checker for DataFrameWriterV2 too.
> 4. Make the interval type incomparable, as in version 2.4, to avoid ambiguity
> when comparing year-month and day-time fields.
> 5. The newly added to_csv in 3.0 should not support outputting intervals, the
> same as the CSV file format, or it should fully support them as normal strings.
> 6. The function to_json should not allow intervals as key fields, the same as
> for value fields and the JSON data source, with a legacy config to restore the
> old behavior, or it should fully support them as normal strings.
> 7. Revert the interval ISO/ANSI SQL Standard output, since we decided not to
> follow ANSI, so there is no round trip.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30616) Introduce TTL config option for SQL Parquet Metadata Cache

2020-01-22 Thread Yaroslav Tkachenko (Jira)
Yaroslav Tkachenko created SPARK-30616:
--

 Summary: Introduce TTL config option for SQL Parquet Metadata Cache
 Key: SPARK-30616
 URL: https://issues.apache.org/jira/browse/SPARK-30616
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.4, 3.0.0
Reporter: Yaroslav Tkachenko


From 
[documentation|https://spark.apache.org/docs/2.4.4/sql-data-sources-parquet.html#metadata-refreshing]:
{quote}Spark SQL caches Parquet metadata for better performance. When Hive 
metastore Parquet table conversion is enabled, metadata of those converted 
tables are also cached. If these tables are updated by Hive or other external 
tools, you need to refresh them manually to ensure consistent metadata.
{quote}
Unfortunately, simply submitting "REFRESH TABLE" commands can be very 
cumbersome. With frequently generated new Parquet files, hundreds of tables, 
and dozens of users querying the data (and expecting up-to-date results), 
manually refreshing the metadata for each table is not an optimal solution, and 
this is a pretty common use case for streaming ingestion of data.

I propose to introduce a new option in Spark (something like 
"spark.sql.parquet.metadataCache.refreshInterval") that controls the TTL of 
this metadata cache. Its default value can be pretty high (an hour? a few 
hours?), so it doesn't alter the existing behavior much. When it's set to 0, 
the cache is effectively disabled (which could be useful for testing or some 
edge cases). 
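
To make the proposal concrete, a hypothetical sketch of how such an option might be used. The config key below is only the name suggested in this ticket (it does not exist in current Spark releases), "1h" is an arbitrary example TTL, and the table name is a placeholder.
{code:scala}
import org.apache.spark.sql.SparkSession

// "spark.sql.parquet.metadataCache.refreshInterval" is the option proposed in
// this ticket, not an existing Spark config; "1h" is an arbitrary example TTL.
val spark = SparkSession.builder()
  .appName("parquet-metadata-ttl-proposal")
  .master("local[*]")
  .config("spark.sql.parquet.metadataCache.refreshInterval", "1h")
  .getOrCreate()

// Today, the only way to pick up Parquet files written by external tools is an
// explicit, per-table refresh:
spark.catalog.refreshTable("my_db.my_table") // placeholder table name
{code}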



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30590) can't use more than five type-safe user-defined aggregation in select statement

2020-01-22 Thread Daniel Mantovani (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021792#comment-17021792
 ] 

Daniel Mantovani edited comment on SPARK-30590 at 1/23/20 5:48 AM:
---

[~hyukjin.kwon] You tried with 5 parameters, which works; you should try with 6 
to get the exception:

 
{code:java}
scala> 
df.select(fooAgg(1),fooAgg(2),fooAgg(3),fooAgg(4),fooAgg(5),fooAgg(6)).show
org.apache.spark.sql.AnalysisException: unresolved operator 'Aggregate 
[fooagg(FooAgg(1), None, None, None, input[0, int, false] AS value#114, 
assertnotnull(cast(value#114 as int)), input[0, int, false] AS value#113, 
IntegerType, IntegerType, false) AS foo_agg_1#116, fooagg(FooAgg(2), None, 
None, None, input[0, int, false] AS value#119, assertnotnull(cast(value#119 as 
int)), input[0, int, false] AS value#118, IntegerType, IntegerType, false) AS 
foo_agg_2#121, fooagg(FooAgg(3), None, None, None, input[0, int, false] AS 
value#124, assertnotnull(cast(value#124 as int)), input[0, int, false] AS 
value#123, IntegerType, IntegerType, false) AS foo_agg_3#126, fooagg(FooAgg(4), 
None, None, None, input[0, int, false] AS value#129, 
assertnotnull(cast(value#129 as int)), input[0, int, false] AS value#128, 
IntegerType, IntegerType, false) AS foo_agg_4#131, fooagg(FooAgg(5), None, 
None, None, input[0, int, false] AS value#134, assertnotnull(cast(value#134 as 
int)), input[0, int, false] AS value#133, IntegerType, IntegerType, false) AS 
foo_agg_5#136, fooagg(FooAgg(6), None, None, None, input[0, int, false] AS 
value#139, assertnotnull(cast(value#139 as int)), input[0, int, false] AS 
value#138, IntegerType, IntegerType, false) AS foo_agg_6#141];;
 'Aggregate [fooagg(FooAgg(1), None, None, None, input[0, int, false] AS 
value#114, assertnotnull(cast(value#114 as int)), input[0, int, false] AS 
value#113, IntegerType, IntegerType, false) AS foo_agg_1#116, fooagg(FooAgg(2), 
None, None, None, input[0, int, false] AS value#119, 
assertnotnull(cast(value#119 as int)), input[0, int, false] AS value#118, 
IntegerType, IntegerType, false) AS foo_agg_2#121, fooagg(FooAgg(3), None, 
None, None, input[0, int, false] AS value#124, assertnotnull(cast(value#124 as 
int)), input[0, int, false] AS value#123, IntegerType, IntegerType, false) AS 
foo_agg_3#126, fooagg(FooAgg(4), None, None, None, input[0, int, false] AS 
value#129, assertnotnull(cast(value#129 as int)), input[0, int, false] AS 
value#128, IntegerType, IntegerType, false) AS foo_agg_4#131, fooagg(FooAgg(5), 
None, None, None, input[0, int, false] AS value#134, 
assertnotnull(cast(value#134 as int)), input[0, int, false] AS value#133, 
IntegerType, IntegerType, false) AS foo_agg_5#136, fooagg(FooAgg(6), None, 
None, None, input[0, int, false] AS value#139, assertnotnull(cast(value#139 as 
int)), input[0, int, false] AS value#138, IntegerType, IntegerType, false) AS 
foo_agg_6#141]
{code}
 


was (Author: mantovani):
[~hyukjin.kwon] You tried with 5 parameters which works, you should try with 6 
to get the exception:

 
scala> 
df.select(fooAgg(1),fooAgg(2),fooAgg(3),fooAgg(4),fooAgg(5),fooAgg(6)).show

org.apache.spark.sql.AnalysisException: unresolved operator 'Aggregate 
[fooagg(FooAgg(1), None, None, None, input[0, int, false] AS value#114, 
assertnotnull(cast(value#114 as int)), input[0, int, false] AS value#113, 
IntegerType, IntegerType, false) AS foo_agg_1#116, fooagg(FooAgg(2), None, 
None, None, input[0, int, false] AS value#119, assertnotnull(cast(value#119 as 
int)), input[0, int, false] AS value#118, IntegerType, IntegerType, false) AS 
foo_agg_2#121, fooagg(FooAgg(3), None, None, None, input[0, int, false] AS 
value#124, assertnotnull(cast(value#124 as int)), input[0, int, false] AS 
value#123, IntegerType, IntegerType, false) AS foo_agg_3#126, fooagg(FooAgg(4), 
None, None, None, input[0, int, false] AS value#129, 
assertnotnull(cast(value#129 as int)), input[0, int, false] AS value#128, 
IntegerType, IntegerType, false) AS foo_agg_4#131, fooagg(FooAgg(5), None, 
None, None, input[0, int, false] AS value#134, assertnotnull(cast(value#134 as 
int)), input[0, int, false] AS value#133, IntegerType, IntegerType, false) AS 
foo_agg_5#136, fooagg(FooAgg(6), None, None, None, input[0, int, false] AS 
value#139, assertnotnull(cast(value#139 as int)), input[0, int, false] AS 
value#138, IntegerType, IntegerType, false) AS foo_agg_6#141];;
'Aggregate [fooagg(FooAgg(1), None, None, None, input[0, int, false] AS 
value#114, assertnotnull(cast(value#114 as int)), input[0, int, false] AS 
value#113, IntegerType, IntegerType, false) AS foo_agg_1#116, fooagg(FooAgg(2), 
None, None, None, input[0, int, false] AS value#119, 
assertnotnull(cast(value#119 as int)), input[0, int, false] AS value#118, 
IntegerType, IntegerType, false) AS foo_agg_2#121, fooagg(FooAgg(3), None, 
None, None, input[0, int, false] AS value#124, assertnotnull(cast(value#124 as 
int)), input[0, int, 

[jira] [Reopened] (SPARK-30590) can't use more than five type-safe user-defined aggregation in select statement

2020-01-22 Thread Daniel Mantovani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Mantovani reopened SPARK-30590:
--

[~hyukjin.kwon] You tried with 5 parameters, which works; you should try with 6 
to get the exception:

 
scala> 
df.select(fooAgg(1),fooAgg(2),fooAgg(3),fooAgg(4),fooAgg(5),fooAgg(6)).show

org.apache.spark.sql.AnalysisException: unresolved operator 'Aggregate 
[fooagg(FooAgg(1), None, None, None, input[0, int, false] AS value#114, 
assertnotnull(cast(value#114 as int)), input[0, int, false] AS value#113, 
IntegerType, IntegerType, false) AS foo_agg_1#116, fooagg(FooAgg(2), None, 
None, None, input[0, int, false] AS value#119, assertnotnull(cast(value#119 as 
int)), input[0, int, false] AS value#118, IntegerType, IntegerType, false) AS 
foo_agg_2#121, fooagg(FooAgg(3), None, None, None, input[0, int, false] AS 
value#124, assertnotnull(cast(value#124 as int)), input[0, int, false] AS 
value#123, IntegerType, IntegerType, false) AS foo_agg_3#126, fooagg(FooAgg(4), 
None, None, None, input[0, int, false] AS value#129, 
assertnotnull(cast(value#129 as int)), input[0, int, false] AS value#128, 
IntegerType, IntegerType, false) AS foo_agg_4#131, fooagg(FooAgg(5), None, 
None, None, input[0, int, false] AS value#134, assertnotnull(cast(value#134 as 
int)), input[0, int, false] AS value#133, IntegerType, IntegerType, false) AS 
foo_agg_5#136, fooagg(FooAgg(6), None, None, None, input[0, int, false] AS 
value#139, assertnotnull(cast(value#139 as int)), input[0, int, false] AS 
value#138, IntegerType, IntegerType, false) AS foo_agg_6#141];;
'Aggregate [fooagg(FooAgg(1), None, None, None, input[0, int, false] AS 
value#114, assertnotnull(cast(value#114 as int)), input[0, int, false] AS 
value#113, IntegerType, IntegerType, false) AS foo_agg_1#116, fooagg(FooAgg(2), 
None, None, None, input[0, int, false] AS value#119, 
assertnotnull(cast(value#119 as int)), input[0, int, false] AS value#118, 
IntegerType, IntegerType, false) AS foo_agg_2#121, fooagg(FooAgg(3), None, 
None, None, input[0, int, false] AS value#124, assertnotnull(cast(value#124 as 
int)), input[0, int, false] AS value#123, IntegerType, IntegerType, false) AS 
foo_agg_3#126, fooagg(FooAgg(4), None, None, None, input[0, int, false] AS 
value#129, assertnotnull(cast(value#129 as int)), input[0, int, false] AS 
value#128, IntegerType, IntegerType, false) AS foo_agg_4#131, fooagg(FooAgg(5), 
None, None, None, input[0, int, false] AS value#134, 
assertnotnull(cast(value#134 as int)), input[0, int, false] AS value#133, 
IntegerType, IntegerType, false) AS foo_agg_5#136, fooagg(FooAgg(6), None, 
None, None, input[0, int, false] AS value#139, assertnotnull(cast(value#139 as 
int)), input[0, int, false] AS value#138, IntegerType, IntegerType, false) AS 
foo_agg_6#141]

> can't use more than five type-safe user-defined aggregation in select 
> statement
> ---
>
> Key: SPARK-30590
> URL: https://issues.apache.org/jira/browse/SPARK-30590
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.3, 2.3.4, 2.4.4
>Reporter: Daniel Mantovani
>Priority: Major
>
>  How to reproduce:
> {code:scala}
> val df = Seq((1,2,3,4,5,6)).toDF("a","b","c","d","e","f")
> import org.apache.spark.sql.expressions.Aggregator
> import org.apache.spark.sql.Encoder
> import org.apache.spark.sql.Encoders
> import org.apache.spark.sql.Row
> case class FooAgg(s:Int) extends Aggregator[Row, Int, Int] {
>   def zero:Int = s
>   def reduce(b: Int, r: Row): Int = b + r.getAs[Int](0)
>   def merge(b1: Int, b2: Int): Int = b1 + b2
>   def finish(b: Int): Int = b
>   def bufferEncoder: Encoder[Int] = Encoders.scalaInt
>   def outputEncoder: Encoder[Int] = Encoders.scalaInt
> }
> val fooAgg = (i:Int) => FooAgg(i).toColumn.name(s"foo_agg_$i")
> scala> df.select(fooAgg(1),fooAgg(2),fooAgg(3),fooAgg(4),fooAgg(5)).show
> +---------+---------+---------+---------+---------+
> |foo_agg_1|foo_agg_2|foo_agg_3|foo_agg_4|foo_agg_5|
> +---------+---------+---------+---------+---------+
> |        3|        5|        7|        9|       11|
> +---------+---------+---------+---------+---------+
> {code}
> With 6 arguments we have error:
> {code:scala}
> scala> 
> df.select(fooAgg(1),fooAgg(2),fooAgg(3),fooAgg(4),fooAgg(5),fooAgg(6)).show
> org.apache.spark.sql.AnalysisException: unresolved operator 'Aggregate 
> [fooagg(FooAgg(1), None, None, None, input[0, int, false] AS value#114, 
> assertnotnull(cast(value#114 as int)), input[0, int, false] AS value#113, 
> IntegerType, IntegerType, false) AS foo_agg_1#116, fooagg(FooAgg(2), None, 
> None, None, input[0, int, false] AS value#119, assertnotnull(cast(value#119 
> as int)), input[0, int, false] AS value#118, IntegerType, IntegerType, false) 
> AS foo_agg_2#121, 

[jira] [Issue Comment Deleted] (SPARK-30590) can't use more than five type-safe user-defined aggregation in select statement

2020-01-22 Thread Daniel Mantovani (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Mantovani updated SPARK-30590:
-
Comment: was deleted

(was: [~hyukjin.kwon] You tried the wrong thing, with 5 parameters works!

 

This doesn't work:
{code:scala}
scala> 
df.select(fooAgg(1),fooAgg(2),fooAgg(3),fooAgg(4),fooAgg(5),fooAgg(6)).show

org.apache.spark.sql.AnalysisException: unresolved operator 'Aggregate 
[fooagg(FooAgg(1), None, None, None, input[0, int, false] AS value#114, 
assertnotnull(cast(value#114 as int)), input[0, int, false] AS value#113, 
IntegerType, IntegerType, false) AS foo_agg_1#116, fooagg(FooAgg(2), None, 
None, None, input[0, int, false] AS value#119, assertnotnull(cast(value#119 as 
int)), input[0, int, false] AS value#118, IntegerType, IntegerType, false) AS 
foo_agg_2#121, fooagg(FooAgg(3), None, None, None, input[0, int, false] AS 
value#124, assertnotnull(cast(value#124 as int)), input[0, int, false] AS 
value#123, IntegerType, IntegerType, false) AS foo_agg_3#126, fooagg(FooAgg(4), 
None, None, None, input[0, int, false] AS value#129, 
assertnotnull(cast(value#129 as int)), input[0, int, false] AS value#128, 
IntegerType, IntegerType, false) AS foo_agg_4#131, fooagg(FooAgg(5), None, 
None, None, input[0, int, false] AS value#134, assertnotnull(cast(value#134 as 
int)), input[0, int, false] AS value#133, IntegerType, IntegerType, false) AS 
foo_agg_5#136, fooagg(FooAgg(6), None, None, None, input[0, int, false] AS 
value#139, assertnotnull(cast(value#139 as int)), input[0, int, false] AS 
value#138, IntegerType, IntegerType, false) AS foo_agg_6#141];;
'Aggregate [fooagg(FooAgg(1), None, None, None, input[0, int, false] AS 
value#114, assertnotnull(cast(value#114 as int)), input[0, int, false] AS 
value#113, IntegerType, IntegerType, false) AS foo_agg_1#116, fooagg(FooAgg(2), 
None, None, None, input[0, int, false] AS value#119, 
assertnotnull(cast(value#119 as int)), input[0, int, false] AS value#118, 
IntegerType, IntegerType, false) AS foo_agg_2#121, fooagg(FooAgg(3), None, 
None, None, input[0, int, false] AS value#124, assertnotnull(cast(value#124 as 
int)), input[0, int, false] AS value#123, IntegerType, IntegerType, false) AS 
foo_agg_3#126, fooagg(FooAgg(4), None, None, None, input[0, int, false] AS 
value#129, assertnotnull(cast(value#129 as int)), input[0, int, false] AS 
value#128, IntegerType, IntegerType, false) AS foo_agg_4#131, fooagg(FooAgg(5), 
None, None, None, input[0, int, false] AS value#134, 
assertnotnull(cast(value#134 as int)), input[0, int, false] AS value#133, 
IntegerType, IntegerType, false) AS foo_agg_5#136, fooagg(FooAgg(6), None, 
None, None, input[0, int, false] AS value#139, assertnotnull(cast(value#139 as 
int)), input[0, int, false] AS value#138, IntegerType, IntegerType, false) AS 
foo_agg_6#141]
+- Project [_1#6 AS a#13, _2#7 AS b#14, _3#8 AS c#15, _4#9 AS d#16, _5#10 AS 
e#17, _6#11 AS F#18]
 +- LocalRelation [_1#6, _2#7, _3#8, _4#9, _5#10, _6#11]

at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:43)
 at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:95)
 at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$3.apply(CheckAnalysis.scala:431)
 at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$3.apply(CheckAnalysis.scala:430)
 at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
 at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:430)
 at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:95)
 at 
org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:108)
 at 
org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:105)
 at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
 at 
org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:105)
 at 
org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)

 at 
org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
 at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
 at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:78)
 at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:3412)
 at org.apache.spark.sql.Dataset.select(Dataset.scala:1340)
 ... 50 elided
{code}
 )

> can't use more than five type-safe user-defined aggregation in select 
> statement
> ---
>
> Key: SPARK-30590
> URL: https://issues.apache.org/jira/browse/SPARK-30590
> Project: Spark
>  Issue Type: Bug
>  

[jira] [Commented] (SPARK-30590) can't use more than five type-safe user-defined aggregation in select statement

2020-01-22 Thread Daniel Mantovani (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021791#comment-17021791
 ] 

Daniel Mantovani commented on SPARK-30590:
--

[~hyukjin.kwon] You tried the wrong thing, with 5 parameters works!

 

This doesn't work:
{code:scala}
scala> 
df.select(fooAgg(1),fooAgg(2),fooAgg(3),fooAgg(4),fooAgg(5),fooAgg(6)).show

org.apache.spark.sql.AnalysisException: unresolved operator 'Aggregate 
[fooagg(FooAgg(1), None, None, None, input[0, int, false] AS value#114, 
assertnotnull(cast(value#114 as int)), input[0, int, false] AS value#113, 
IntegerType, IntegerType, false) AS foo_agg_1#116, fooagg(FooAgg(2), None, 
None, None, input[0, int, false] AS value#119, assertnotnull(cast(value#119 as 
int)), input[0, int, false] AS value#118, IntegerType, IntegerType, false) AS 
foo_agg_2#121, fooagg(FooAgg(3), None, None, None, input[0, int, false] AS 
value#124, assertnotnull(cast(value#124 as int)), input[0, int, false] AS 
value#123, IntegerType, IntegerType, false) AS foo_agg_3#126, fooagg(FooAgg(4), 
None, None, None, input[0, int, false] AS value#129, 
assertnotnull(cast(value#129 as int)), input[0, int, false] AS value#128, 
IntegerType, IntegerType, false) AS foo_agg_4#131, fooagg(FooAgg(5), None, 
None, None, input[0, int, false] AS value#134, assertnotnull(cast(value#134 as 
int)), input[0, int, false] AS value#133, IntegerType, IntegerType, false) AS 
foo_agg_5#136, fooagg(FooAgg(6), None, None, None, input[0, int, false] AS 
value#139, assertnotnull(cast(value#139 as int)), input[0, int, false] AS 
value#138, IntegerType, IntegerType, false) AS foo_agg_6#141];;
'Aggregate [fooagg(FooAgg(1), None, None, None, input[0, int, false] AS 
value#114, assertnotnull(cast(value#114 as int)), input[0, int, false] AS 
value#113, IntegerType, IntegerType, false) AS foo_agg_1#116, fooagg(FooAgg(2), 
None, None, None, input[0, int, false] AS value#119, 
assertnotnull(cast(value#119 as int)), input[0, int, false] AS value#118, 
IntegerType, IntegerType, false) AS foo_agg_2#121, fooagg(FooAgg(3), None, 
None, None, input[0, int, false] AS value#124, assertnotnull(cast(value#124 as 
int)), input[0, int, false] AS value#123, IntegerType, IntegerType, false) AS 
foo_agg_3#126, fooagg(FooAgg(4), None, None, None, input[0, int, false] AS 
value#129, assertnotnull(cast(value#129 as int)), input[0, int, false] AS 
value#128, IntegerType, IntegerType, false) AS foo_agg_4#131, fooagg(FooAgg(5), 
None, None, None, input[0, int, false] AS value#134, 
assertnotnull(cast(value#134 as int)), input[0, int, false] AS value#133, 
IntegerType, IntegerType, false) AS foo_agg_5#136, fooagg(FooAgg(6), None, 
None, None, input[0, int, false] AS value#139, assertnotnull(cast(value#139 as 
int)), input[0, int, false] AS value#138, IntegerType, IntegerType, false) AS 
foo_agg_6#141]
+- Project [_1#6 AS a#13, _2#7 AS b#14, _3#8 AS c#15, _4#9 AS d#16, _5#10 AS 
e#17, _6#11 AS F#18]
 +- LocalRelation [_1#6, _2#7, _3#8, _4#9, _5#10, _6#11]

at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:43)
 at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:95)
 at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$3.apply(CheckAnalysis.scala:431)
 at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$3.apply(CheckAnalysis.scala:430)
 at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
 at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:430)
 at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:95)
 at 
org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:108)
 at 
org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:105)
 at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
 at 
org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:105)
 at 
org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)

 at 
org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
 at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
 at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:78)
 at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:3412)
 at org.apache.spark.sql.Dataset.select(Dataset.scala:1340)
 ... 50 elided
{code}
 

> can't use more than five type-safe user-defined aggregation in select 
> statement
> ---
>
> Key: SPARK-30590
> URL: https://issues.apache.org/jira/browse/SPARK-30590
> Project: Spark
>  Issue Type: Bug
>

[jira] [Commented] (SPARK-30615) normalize the column name in AlterTable

2020-01-22 Thread Terry Kim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021783#comment-17021783
 ] 

Terry Kim commented on SPARK-30615:
---

[~cloud_fan] Yes, I will work on this. Thanks!

> normalize the column name in AlterTable
> ---
>
> Key: SPARK-30615
> URL: https://issues.apache.org/jira/browse/SPARK-30615
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Because of case insensitive resolution, the column name in AlterTable may 
> match the table schema but not exactly the same. To ease DS v2 
> implementations, Spark should normalize the column name before passing them 
> to v2 catalogs, so that users don't need to care about the case sensitive 
> config.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30614) The native ALTER COLUMN syntax should change one thing at a time

2020-01-22 Thread Terry Kim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021781#comment-17021781
 ] 

Terry Kim commented on SPARK-30614:
---

[~cloud_fan] Yes, I will work on this. Thanks!

> The native ALTER COLUMN syntax should change one thing at a time
> 
>
> Key: SPARK-30614
> URL: https://issues.apache.org/jira/browse/SPARK-30614
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Our native ALTER COLUMN syntax is newly added in 3.0 and almost follows the 
> SQL standard.
> {code}
> ALTER TABLE table=multipartIdentifier
>   (ALTER | CHANGE) COLUMN? column=multipartIdentifier
>   (TYPE dataType)?
>   (COMMENT comment=STRING)?
>   colPosition?   
> {code}
> The SQL standard (section 11.12) only allows changing one property at a time. 
> This is also true of other recent SQL systems, like Snowflake 
> (https://docs.snowflake.net/manuals/sql-reference/sql/alter-table-column.html) 
> and Redshift 
> (https://docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE.html).
> Snowflake has an extension that allows changing multiple columns at a 
> time, like ALTER COLUMN c1 TYPE int, c2 TYPE int. If we want to extend the 
> SQL standard, I think this syntax is better. 
> For now, let's be conservative and only allow changing one property at a time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30360) Avoid Redact classpath entries in History Server UI

2020-01-22 Thread Ajith S (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ajith S updated SPARK-30360:

Description: Currently the Spark History Server displays the classpath entries 
in the Environment tab with the classpaths redacted, because the event log file 
has the entry values redacted while writing. But when the same is viewed from a 
running application's UI, it is not redacted. Redacting classpath entries is 
not needed and can be avoided  (was: Currently the Spark History Server 
displays the classpath entries in the Environment tab with the classpaths 
redacted, because the event log file has the entry values redacted while 
writing. But when the same is viewed from a running application's UI, it is 
not redacted. )

> Avoid Redact classpath entries in History Server UI
> ---
>
> Key: SPARK-30360
> URL: https://issues.apache.org/jira/browse/SPARK-30360
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Ajith S
>Assignee: Ajith S
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently the Spark History Server displays the classpath entries in the 
> Environment tab with the classpaths redacted, because the event log file has 
> the entry values redacted while writing. But when the same is viewed from a 
> running application's UI, it is not redacted. Redacting classpath entries is 
> not needed and can be avoided



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30360) Avoid Redact classpath entries in History Server UI

2020-01-22 Thread Ajith S (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ajith S updated SPARK-30360:

Summary: Avoid Redact classpath entries in History Server UI  (was: Redact 
classpath entries in Spark UI)

> Avoid Redact classpath entries in History Server UI
> ---
>
> Key: SPARK-30360
> URL: https://issues.apache.org/jira/browse/SPARK-30360
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Ajith S
>Assignee: Ajith S
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently the Spark History Server displays the classpath entries in the 
> Environment tab with the classpaths redacted, because the event log file has 
> the entry values redacted while writing. But when the same is viewed from a 
> running application's UI, it is not redacted. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30275) Add gitlab-ci.yml file for reproducible builds

2020-01-22 Thread Jim Kleckner (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021766#comment-17021766
 ] 

Jim Kleckner commented on SPARK-30275:
--

I sent a message to [d...@spark.apache.org|mailto:d...@spark.apache.org] but 
haven't seen it get approved yet.

> Add gitlab-ci.yml file for reproducible builds
> --
>
> Key: SPARK-30275
> URL: https://issues.apache.org/jira/browse/SPARK-30275
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Jim Kleckner
>Priority: Minor
>
> It would be desirable to have public reproducible builds such as provided by 
> gitlab or others.
>  
> Here is a candidate patch set to build spark using gitlab-ci:
> * https://gitlab.com/jkleckner/spark/tree/add-gitlab-ci-yml
> Let me know if there is interest in a PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27871) LambdaVariable should use per-query unique IDs instead of globally unique IDs

2020-01-22 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021765#comment-17021765
 ] 

Wenchen Fan commented on SPARK-27871:
-

This is done with an optimizer rule, so we added a public config. Maybe we can 
remove the conf, as users can already exclude any optimizer rule with a config.
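
For reference, a minimal sketch of the existing exclusion mechanism referred to above; the rule name below is only a placeholder.
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("exclude-rule-demo")
  .master("local[*]")
  .getOrCreate()

// Illustrative only: exclude an optimizer rule by its fully qualified name via
// the existing spark.sql.optimizer.excludedRules setting; the rule name here is
// a placeholder, not a specific rule recommended by this ticket.
spark.conf.set(
  "spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.SomeOptionalRule")
{code}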

> LambdaVariable should use per-query unique IDs instead of globally unique IDs
> -
>
> Key: SPARK-27871
> URL: https://issues.apache.org/jira/browse/SPARK-27871
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30615) normalize the column name in AlterTable

2020-01-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-30615:

Description: Because of case insensitive resolution, the column name in 
AlterTable may match the table schema but not exactly the same. To ease DS v2 
implementations, Spark should normalize the column name before passing them to 
v2 catalogs, so that users don't need to care about the case sensitive config.  
(was: Because of case insensitive resolution, the column name in AlterTable may 
match the table schema but not exactly the same. The ease DS v2 
implementations, Spark should normalize the column name before passing them to 
v2 catalogs, so that users don't need to care about the case sensitive config.)

> normalize the column name in AlterTable
> ---
>
> Key: SPARK-30615
> URL: https://issues.apache.org/jira/browse/SPARK-30615
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Because of case insensitive resolution, the column name in AlterTable may 
> match the table schema but not exactly the same. To ease DS v2 
> implementations, Spark should normalize the column name before passing them 
> to v2 catalogs, so that users don't need to care about the case sensitive 
> config.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30615) normalize the column name in AlterTable

2020-01-22 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021763#comment-17021763
 ] 

Wenchen Fan commented on SPARK-30615:
-

Hi [~imback82], do you have time to work on it? Thanks!

> normalize the column name in AlterTable
> ---
>
> Key: SPARK-30615
> URL: https://issues.apache.org/jira/browse/SPARK-30615
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Because of case insensitive resolution, the column name in AlterTable may 
> match the table schema but not exactly the same. The ease DS v2 
> implementations, Spark should normalize the column name before passing them 
> to v2 catalogs, so that users don't need to care about the case sensitive 
> config.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30615) normalize the column name in AlterTable

2020-01-22 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-30615:
---

 Summary: normalize the column name in AlterTable
 Key: SPARK-30615
 URL: https://issues.apache.org/jira/browse/SPARK-30615
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan


Because of case insensitive resolution, the column name in AlterTable may match 
the table schema but not exactly the same. The ease DS v2 implementations, 
Spark should normalize the column name before passing them to v2 catalogs, so 
that users don't need to care about the case sensitive config.
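
A hypothetical sketch (not Spark source code) of the normalization being described: resolve the user-supplied column name case-insensitively against the table schema and pass the schema's exact spelling on to the v2 catalog.
{code:scala}
import org.apache.spark.sql.types.StructType

// Resolve `userName` against the schema ignoring case and return the schema's
// exact spelling, which is what should reach the v2 catalog.
def normalizeColumn(schema: StructType, userName: String): Option[String] =
  schema.fields.map(_.name).find(_.equalsIgnoreCase(userName))

val schema = new StructType().add("id", "int").add("userName", "string")
assert(normalizeColumn(schema, "ID") == Some("id"))
assert(normalizeColumn(schema, "username") == Some("userName"))
{code}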



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30614) The native ALTER COLUMN syntax should change one thing at a time

2020-01-22 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021761#comment-17021761
 ] 

Wenchen Fan edited comment on SPARK-30614 at 1/23/20 4:22 AM:
--

Hi [~imback82], do you have time to work on it? thanks!


was (Author: cloud_fan):
Hi [~imback82], do you have time to work in it? thanks!

> The native ALTER COLUMN syntax should change one thing at a time
> 
>
> Key: SPARK-30614
> URL: https://issues.apache.org/jira/browse/SPARK-30614
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Our native ALTER COLUMN syntax is newly added in 3.0 and almost follows the 
> SQL standard.
> {code}
> ALTER TABLE table=multipartIdentifier
>   (ALTER | CHANGE) COLUMN? column=multipartIdentifier
>   (TYPE dataType)?
>   (COMMENT comment=STRING)?
>   colPosition?   
> {code}
> The SQL standard (section 11.12) only allows changing one property at a time. 
> This is also true of other recent SQL systems, like Snowflake 
> (https://docs.snowflake.net/manuals/sql-reference/sql/alter-table-column.html) 
> and Redshift 
> (https://docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE.html).
> Snowflake has an extension that allows changing multiple columns at a 
> time, like ALTER COLUMN c1 TYPE int, c2 TYPE int. If we want to extend the 
> SQL standard, I think this syntax is better. 
> For now, let's be conservative and only allow changing one property at a time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30614) The native ALTER COLUMN syntax should change one thing at a time

2020-01-22 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021761#comment-17021761
 ] 

Wenchen Fan commented on SPARK-30614:
-

Hi [~imback82], do you have time to work in it? thanks!

> The native ALTER COLUMN syntax should change one thing at a time
> 
>
> Key: SPARK-30614
> URL: https://issues.apache.org/jira/browse/SPARK-30614
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Our native ALTER COLUMN syntax is newly added in 3.0 and almost follows the 
> SQL standard.
> {code}
> ALTER TABLE table=multipartIdentifier
>   (ALTER | CHANGE) COLUMN? column=multipartIdentifier
>   (TYPE dataType)?
>   (COMMENT comment=STRING)?
>   colPosition?   
> {code}
> The SQL standard (section 11.12) only allows changing one property at a time. 
> This is also true of other recent SQL systems, like Snowflake 
> (https://docs.snowflake.net/manuals/sql-reference/sql/alter-table-column.html) 
> and Redshift 
> (https://docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE.html).
> Snowflake has an extension that allows changing multiple columns at a 
> time, like ALTER COLUMN c1 TYPE int, c2 TYPE int. If we want to extend the 
> SQL standard, I think this syntax is better. 
> For now, let's be conservative and only allow changing one property at a time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30614) The native ALTER COLUMN syntax should change one thing at a time

2020-01-22 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-30614:
---

 Summary: The native ALTER COLUMN syntax should change one thing at 
a time
 Key: SPARK-30614
 URL: https://issues.apache.org/jira/browse/SPARK-30614
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan


Our native ALTER COLUMN syntax is newly added in 3.0 and almost follows the SQL 
standard.

{{{
ALTER TABLE table=multipartIdentifier
  (ALTER | CHANGE) COLUMN? column=multipartIdentifier
  (TYPE dataType)?
  (COMMENT comment=STRING)?
  colPosition?   
}}}

The SQL standard (section 11.12) only allows changing one property at a time. 
This is also true of other recent SQL systems, like Snowflake 
(https://docs.snowflake.net/manuals/sql-reference/sql/alter-table-column.html) 
and Redshift (https://docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE.html).

Snowflake has an extension that allows changing multiple columns at a 
time, like ALTER COLUMN c1 TYPE int, c2 TYPE int. If we want to extend the SQL 
standard, I think this syntax is better. 

For now, let's be conservative and only allow changing one property at a time.
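
To illustrate the rule, a hedged sketch of what the restriction means in practice; the table and column names are placeholders, and whether a particular change (e.g. TYPE) is accepted also depends on the catalog backing the table.
{code:scala}
// Each statement changes exactly one property of one column, matching the
// grammar above; table `t` and columns c1/c2 are placeholders.
val oneChangeAtATime = Seq(
  "ALTER TABLE t ALTER COLUMN c1 TYPE bigint",           // type only
  "ALTER TABLE t ALTER COLUMN c1 COMMENT 'row counter'", // comment only
  "ALTER TABLE t ALTER COLUMN c1 AFTER c2"               // position only
)
// Under this proposal, combining the clauses in one ALTER COLUMN, or Snowflake's
// multi-column form (ALTER COLUMN c1 TYPE int, c2 TYPE int), would be rejected.
oneChangeAtATime.foreach(println)
{code}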



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30614) The native ALTER COLUMN syntax should change one thing at a time

2020-01-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-30614:

Description: 
Our native ALTER COLUMN syntax is newly added in 3.0 and almost follows the SQL 
standard.

{code}
ALTER TABLE table=multipartIdentifier
  (ALTER | CHANGE) COLUMN? column=multipartIdentifier
  (TYPE dataType)?
  (COMMENT comment=STRING)?
  colPosition?   
{code}

The SQL standard (section 11.12) only allows changing one property at a time. 
This is also true of other recent SQL systems, like Snowflake 
(https://docs.snowflake.net/manuals/sql-reference/sql/alter-table-column.html) 
and Redshift (https://docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE.html).

Snowflake has an extension that allows changing multiple columns at a 
time, like ALTER COLUMN c1 TYPE int, c2 TYPE int. If we want to extend the SQL 
standard, I think this syntax is better. 

For now, let's be conservative and only allow changing one property at a time.

  was:
Our native ALTER COLUMN syntax is newly added in 3.0 and almost follows the SQL 
standard.

{{{
ALTER TABLE table=multipartIdentifier
  (ALTER | CHANGE) COLUMN? column=multipartIdentifier
  (TYPE dataType)?
  (COMMENT comment=STRING)?
  colPosition?   
}}}

The SQL standard (section 11.12) only allows changing one property at a time. 
This is also true of other recent SQL systems such as 
Snowflake (https://docs.snowflake.net/manuals/sql-reference/sql/alter-table-column.html)
 and Redshift (https://docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE.html).

Snowflake has an extension that allows changing multiple columns at a 
time, like ALTER COLUMN c1 TYPE int, c2 TYPE int. If we want to extend the SQL 
standard, I think this syntax is better. 

For now, let's be conservative and only allow changing one property at a time.


> The native ALTER COLUMN syntax should change one thing at a time
> 
>
> Key: SPARK-30614
> URL: https://issues.apache.org/jira/browse/SPARK-30614
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Our native ALTER COLUMN syntax is newly added in 3.0 and almost follows the 
> SQL standard.
> {code}
> ALTER TABLE table=multipartIdentifier
>   (ALTER | CHANGE) COLUMN? column=multipartIdentifier
>   (TYPE dataType)?
>   (COMMENT comment=STRING)?
>   colPosition?   
> {code}
> The SQL standard (section 11.12) only allows changing one property at a time. 
> This is also true of other recent SQL systems such as 
> Snowflake (https://docs.snowflake.net/manuals/sql-reference/sql/alter-table-column.html)
>  and 
> Redshift (https://docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE.html).
> Snowflake has an extension that allows changing multiple columns at a 
> time, like ALTER COLUMN c1 TYPE int, c2 TYPE int. If we want to extend the 
> SQL standard, I think this syntax is better. 
> For now, let's be conservative and only allow changing one property at a time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30613) support hive style REPLACE COLUMN syntax

2020-01-22 Thread Terry Kim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021753#comment-17021753
 ] 

Terry Kim commented on SPARK-30613:
---

[~cloud_fan] Yes, I will work on this. Thanks!

> support hive style REPLACE COLUMN syntax
> 
>
> Key: SPARK-30613
> URL: https://issues.apache.org/jira/browse/SPARK-30613
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> We already support the Hive-style CHANGE COLUMN syntax; I think it's better 
> to also support the Hive-style REPLACE COLUMN syntax. Please refer to the doc: 
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30612) can't resolve qualified column name with v2 tables

2020-01-22 Thread Terry Kim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021752#comment-17021752
 ] 

Terry Kim commented on SPARK-30612:
---

[~cloud_fan] Yes, I will work on this.

> can't resolve qualified column name with v2 tables
> --
>
> Key: SPARK-30612
> URL: https://issues.apache.org/jira/browse/SPARK-30612
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> When running queries with qualified columns like `SELECT t.a FROM t`, the 
> qualified column fails to resolve for v2 tables.
> A v1 table is fine because we always wrap the v1 relation in a `SubqueryAlias`. We 
> should do the same for v2 tables.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30613) support hive style REPLACE COLUMN syntax

2020-01-22 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021737#comment-17021737
 ] 

Wenchen Fan commented on SPARK-30613:
-

Hi [~imback82] do you have time to work on it? Thanks!

> support hive style REPLACE COLUMN syntax
> 
>
> Key: SPARK-30613
> URL: https://issues.apache.org/jira/browse/SPARK-30613
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> We already support the Hive-style CHANGE COLUMN syntax; I think it's better 
> to also support the Hive-style REPLACE COLUMN syntax. Please refer to the doc: 
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30613) support hive style REPLACE COLUMN syntax

2020-01-22 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-30613:
---

 Summary: support hive style REPLACE COLUMN syntax
 Key: SPARK-30613
 URL: https://issues.apache.org/jira/browse/SPARK-30613
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan


We already support the Hive-style CHANGE COLUMN syntax; I think it's better to 
also support the Hive-style REPLACE COLUMN syntax. Please refer to the doc: 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
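
For reference, a hedged sketch of what the Hive-style syntax looks like; the database, table, and columns below are made up, and the statement assumes a build where REPLACE COLUMNS is supported:

{code:python}
# Hedged sketch; db.events and its columns are hypothetical, and this assumes
# a Spark build in which Hive-style REPLACE COLUMNS is implemented.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# REPLACE COLUMNS drops all existing columns and adds the new set in one statement.
spark.sql("""
  ALTER TABLE db.events REPLACE COLUMNS (
    id   BIGINT COMMENT 'event id',
    name STRING COMMENT 'event name'
  )
""")
{code}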



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30612) can't resolve qualified column name with v2 tables

2020-01-22 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021729#comment-17021729
 ] 

Wenchen Fan commented on SPARK-30612:
-

Hi [~imback82] do you have time to work on it? Thanks

> can't resolve qualified column name with v2 tables
> --
>
> Key: SPARK-30612
> URL: https://issues.apache.org/jira/browse/SPARK-30612
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> When running queries with qualified columns like `SELECT t.a FROM t`, the 
> qualified column fails to resolve for v2 tables.
> A v1 table is fine because we always wrap the v1 relation in a `SubqueryAlias`. We 
> should do the same for v2 tables.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30612) can't resolve qualified column name with v2 tables

2020-01-22 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-30612:
---

 Summary: can't resolve qualified column name with v2 tables
 Key: SPARK-30612
 URL: https://issues.apache.org/jira/browse/SPARK-30612
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan


When running queries with qualified columns like `SELECT t.a FROM t`, the 
qualified column fails to resolve for v2 tables.

A v1 table is fine because we always wrap the v1 relation in a `SubqueryAlias`. We 
should do the same for v2 tables.
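
A minimal sketch of the reported behaviour, assuming a hypothetical configured v2 catalog named testcat (illustrative rather than a verified reproducer):

{code:python}
# Illustrative only; testcat.db.t is a made-up v2 table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").getOrCreate()

spark.sql("SELECT a FROM testcat.db.t").show()    # unqualified column resolves
spark.sql("SELECT t.a FROM testcat.db.t").show()  # qualified column fails on v2 tables
{code}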



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30609) Allow default merge command resolution to be bypassed by DSv2 sources

2020-01-22 Thread Tathagata Das (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-30609.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27326
[https://github.com/apache/spark/pull/27326]

> Allow default merge command resolution to be bypassed by DSv2 sources
> -
>
> Key: SPARK-30609
> URL: https://issues.apache.org/jira/browse/SPARK-30609
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
> Fix For: 3.0.0
>
>
> Problem: Some DSv2 sources may want to customize the merge resolution logic. 
> For example, a table that can accept any schema 
> (TableCapability.ACCEPT_ANY_SCHEMA) may want to allow certain merge queries 
> that are blocked (that is, throws AnalysisError) by the default resolution 
> logic. So there should be a way to completely bypass the merge resolution 
> logic in the Analyzer. 
> Potential solution: Skip resolving the merge expressions if the target is a 
> DSv2 table with  ACCEPT_ANY_SCHEMA capability.
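
For context, the kind of statement involved is a MERGE such as the hedged sketch below; target and source are made-up tables, and whether the default resolution logic accepts the query depends on the target table's capabilities:

{code:python}
# Hedged sketch; target and source are hypothetical tables. For a DSv2 target
# with ACCEPT_ANY_SCHEMA, the proposal is to skip resolving the merge expressions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").getOrCreate()

spark.sql("""
  MERGE INTO target t
  USING source s
  ON t.key = s.key
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")
{code}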



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30609) Allow default merge command resolution to be bypassed by DSv2 sources

2020-01-22 Thread Tathagata Das (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das reassigned SPARK-30609:
-

Assignee: Tathagata Das

> Allow default merge command resolution to be bypassed by DSv2 sources
> -
>
> Key: SPARK-30609
> URL: https://issues.apache.org/jira/browse/SPARK-30609
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
>
> Problem: Some DSv2 sources may want to customize the merge resolution logic. 
> For example, a table that can accept any schema 
> (TableCapability.ACCEPT_ANY_SCHEMA) may want to allow certain merge queries 
> that are blocked (that is, throws AnalysisError) by the default resolution 
> logic. So there should be a way to completely bypass the merge resolution 
> logic in the Analyzer. 
> Potential solution: Skip resolving the merge expressions if the target is a 
> DSv2 table with  ACCEPT_ANY_SCHEMA capability.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30476) NullPointerException when Insert data to hive mongo external table by spark-sql

2020-01-22 Thread XiongCheng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021709#comment-17021709
 ] 

XiongCheng edited comment on SPARK-30476 at 1/23/20 2:49 AM:
-

[~hyukjin.kwon] Thanks for your reply. However, my doubt is why 
mapreduce.task.attempt.id is not passed into the HiveOutputWriter, so that 
HiveMongoOutputFormat cannot get mapreduce.task.attempt.id. I also tried to fix 
this myself: I extracted mapreduce.task.attempt.id and merged it with 
jobConf.value, and it works.


was (Author: bro-xiong):
[~hyukjin.kwon] Thanks for your reply. However, my doubt is why 
mapreduce.task.attempt.id is not passed into the HiveOutputWriter, so that 
mongo-hadoop cannot get mapreduce.task.attempt.id. I also tried to fix this 
myself: I extracted mapreduce.task.attempt.id and merged it with jobConf.value, 
and it works.

> NullPointerException when Insert data to hive mongo external table by 
> spark-sql
> ---
>
> Key: SPARK-30476
> URL: https://issues.apache.org/jira/browse/SPARK-30476
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
> Environment: mongo-hadoop: 2.0.2
> spark-version: 2.4.3
> scala-version: 2.11
> hive-version: 1.2.1
> hadoop-version: 2.6.0
>Reporter: XiongCheng
>Priority: Major
>
> I executed the SQL, but I got an NPE.
> result_data_mongo is a MongoDB Hive external table.
> {code:java}
> insert into result_data_mongo 
> values("15","15","15",15,"15",15,15,15,15,15,15,15,15,15,15);
> {code}
> NPE detail:
> {code:java}
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249)
>   at 
> org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:123)
>   at 
> org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103)
>   at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120)
>   at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:236)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:121)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException
>   at 
> com.mongodb.hadoop.output.MongoOutputCommitter.getTaskAttemptPath(MongoOutputCommitter.java:264)
>   at 
> com.mongodb.hadoop.output.MongoRecordWriter.(MongoRecordWriter.java:59)
>   at 
> com.mongodb.hadoop.hive.output.HiveMongoOutputFormat$HiveMongoRecordWriter.(HiveMongoOutputFormat.java:80)
>   at 
> com.mongodb.hadoop.hive.output.HiveMongoOutputFormat.getHiveRecordWriter(HiveMongoOutputFormat.java:52)
>   at 
> org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:261)
>   at 
> org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:246)
>   ... 15 more
> {code}
> I know mongo-hadoop uses the incorrect key to get the TaskAttemptID, so I modified 
> the source code of mongo-hadoop to get the correct properties 
> ('mapreduce.task.id' and 'mapreduce.task.attempt.id'), but I still can't get 
> the value. I found that these parameters are stored in Spark's 
> TaskAttemptContext, but TaskAttemptContext is not passed into 
> HiveOutputWriter; is this a design flaw?
> Here are the two key points.
> mongo-hadoop: 
> [https://github.com/mongodb/mongo-hadoop/blob/cdcd0f15503f2d1c5a1a2e3941711d850d1e427b/hive/src/main/java/com/mongodb/hadoop/hive/output/HiveMongoOutputFormat.java#L80]
> 

[jira] [Commented] (SPARK-30476) NullPointerException when Insert data to hive mongo external table by spark-sql

2020-01-22 Thread XiongCheng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021709#comment-17021709
 ] 

XiongCheng commented on SPARK-30476:


[~hyukjin.kwon] Thanks for your reply. However, my doubt is why 
mapreduce.task.attempt.id is not passed into the HiveOutputWriter, so that 
mongo-hadoop cannot get mapreduce.task.attempt.id. I also tried to fix this 
myself: I extracted mapreduce.task.attempt.id and merged it with jobConf.value, 
and it works.

> NullPointerException when Insert data to hive mongo external table by 
> spark-sql
> ---
>
> Key: SPARK-30476
> URL: https://issues.apache.org/jira/browse/SPARK-30476
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
> Environment: mongo-hadoop: 2.0.2
> spark-version: 2.4.3
> scala-version: 2.11
> hive-version: 1.2.1
> hadoop-version: 2.6.0
>Reporter: XiongCheng
>Priority: Major
>
> I executed the SQL, but I got an NPE.
> result_data_mongo is a MongoDB Hive external table.
> {code:java}
> insert into result_data_mongo 
> values("15","15","15",15,"15",15,15,15,15,15,15,15,15,15,15);
> {code}
> NPE detail:
> {code:java}
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249)
>   at 
> org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:123)
>   at 
> org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103)
>   at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120)
>   at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:236)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:121)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException
>   at 
> com.mongodb.hadoop.output.MongoOutputCommitter.getTaskAttemptPath(MongoOutputCommitter.java:264)
>   at 
> com.mongodb.hadoop.output.MongoRecordWriter.(MongoRecordWriter.java:59)
>   at 
> com.mongodb.hadoop.hive.output.HiveMongoOutputFormat$HiveMongoRecordWriter.(HiveMongoOutputFormat.java:80)
>   at 
> com.mongodb.hadoop.hive.output.HiveMongoOutputFormat.getHiveRecordWriter(HiveMongoOutputFormat.java:52)
>   at 
> org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:261)
>   at 
> org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:246)
>   ... 15 more
> {code}
> I know mongo-hadoop uses the incorrect key to get the TaskAttemptID, so I modified 
> the source code of mongo-hadoop to get the correct properties 
> ('mapreduce.task.id' and 'mapreduce.task.attempt.id'), but I still can't get 
> the value. I found that these parameters are stored in Spark's 
> TaskAttemptContext, but TaskAttemptContext is not passed into 
> HiveOutputWriter; is this a design flaw?
> Here are the two key points.
> mongo-hadoop: 
> [https://github.com/mongodb/mongo-hadoop/blob/cdcd0f15503f2d1c5a1a2e3941711d850d1e427b/hive/src/main/java/com/mongodb/hadoop/hive/output/HiveMongoOutputFormat.java#L80]
> spark-hive:[https://github.com/apache/spark/blob/7c7d7f6a878b02ece881266ee538f3e1443aa8c1/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveFileFormat.scala#L103]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30229) java.lang.NullPointerException at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:1783)

2020-01-22 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021704#comment-17021704
 ] 

Hyukjin Kwon commented on SPARK-30229:
--

[~Ankitraj] were you able to reproduce?

[~SeaAndHill] can you show the exact code and steps to reproduce this? Ideally it 
should be copy-and-paste-able.

> java.lang.NullPointerException at 
> org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:1783)
> -
>
> Key: SPARK-30229
> URL: https://issues.apache.org/jira/browse/SPARK-30229
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: SeaAndHill
>Priority: Major
>
> 2019-12-12 11:52:00 INFO JobScheduler:54 - Added jobs for time 157612272 
> ms
> 2019-12-12 11:52:00 INFO JobScheduler:54 - Starting job streaming job 
> 157612272 ms.0 from job set of time 157612272 ms
> 2019-12-12 11:52:00 INFO CarbonSparkSqlParser:54 - Parsing command: 
> event_detail_temp
> 2019-12-12 11:52:00 INFO CarbonLateDecodeRule:95 - skip CarbonOptimizer
> 2019-12-12 11:52:00 INFO CarbonLateDecodeRule:72 - Skip CarbonOptimizer
> 2019-12-12 11:52:00 INFO CarbonLateDecodeRule:95 - skip CarbonOptimizer
> 2019-12-12 11:52:00 INFO CarbonLateDecodeRule:72 - Skip CarbonOptimizer
> 2019-12-12 11:52:00 INFO JobScheduler:54 - Finished job streaming job 
> 157612272 ms.0 from job set of time 157612272 ms
> 2019-12-12 11:52:00 ERROR JobScheduler:91 - Error running job streaming job 
> 157612272 ms.0
> java.lang.NullPointerException
>  at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:1783)
>  at 
> org.apache.spark.rdd.DefaultPartitionCoalescer.currPrefLocs(CoalescedRDD.scala:178)
>  at 
> org.apache.spark.rdd.DefaultPartitionCoalescer$PartitionLocations$$anonfun$getAllPrefLocs$2.apply(CoalescedRDD.scala:196)
>  at 
> org.apache.spark.rdd.DefaultPartitionCoalescer$PartitionLocations$$anonfun$getAllPrefLocs$2.apply(CoalescedRDD.scala:195)
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>  at 
> org.apache.spark.rdd.DefaultPartitionCoalescer$PartitionLocations.getAllPrefLocs(CoalescedRDD.scala:195)
>  at 
> org.apache.spark.rdd.DefaultPartitionCoalescer$PartitionLocations.(CoalescedRDD.scala:188)
>  at 
> org.apache.spark.rdd.DefaultPartitionCoalescer.coalesce(CoalescedRDD.scala:391)
>  at org.apache.spark.rdd.CoalescedRDD.getPartitions(CoalescedRDD.scala:91)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
>  at scala.Option.getOrElse(Option.scala:121)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
>  at scala.Option.getOrElse(Option.scala:121)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
>  at scala.Option.getOrElse(Option.scala:121)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
>  at scala.Option.getOrElse(Option.scala:121)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
>  at scala.Option.getOrElse(Option.scala:121)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
>  at scala.Option.getOrElse(Option.scala:121)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2094)
>  at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> 

[jira] [Resolved] (SPARK-30239) Creating a dataframe with Pandas rather than Numpy datatypes fails

2020-01-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30239.
--
Resolution: Incomplete

Resolving due to no feedback from the reporter.

> Creating a dataframe with Pandas rather than Numpy datatypes fails
> --
>
> Key: SPARK-30239
> URL: https://issues.apache.org/jira/browse/SPARK-30239
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
> Environment: DataBricks: 48.00 GB | 24 Cores | DBR 6.0 | Spark 2.4.3 
> | Scala 2.11
>Reporter: Philip Kahn
>Priority: Minor
>
> It's possible to work with DataFrames in Pandas and shuffle them back over to 
> Spark DataFrames for processing; however, using Pandas extension datatypes 
> like {{Int64}} 
> ([https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html]) 
> throws an error (that long / float can't be converted).
> Internally, this is because {{np.nan}} is a float, and {{pd.Int64Dtype()}} 
> allows only integers plus the single float value {{np.nan}}.
>  
> The current workaround for this is to use the columns as floats, and after 
> conversion to the Spark DataFrame, to recast the column as {{LongType()}}. 
> For example:
>  
> {{sdfC = spark.createDataFrame(kgridCLinked)}}
> {{sdfC = sdfC.withColumn("gridID", sdfC["gridID"].cast(LongType()))}}
>  
> However, this is awkward and redundant.
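
A self-contained sketch of the workaround described above (the gridID column name is kept from the report; the data values are made up):

{code:python}
# Minimal sketch of the current workaround; data values are illustrative.
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import LongType

spark = SparkSession.builder.master("local").getOrCreate()

# Keep the nullable integer column as float64 (NaN marks missing values)
# instead of the pandas Int64 extension dtype before handing it to Spark...
pdf = pd.DataFrame({"gridID": [1.0, 2.0, np.nan]})
sdfC = spark.createDataFrame(pdf)

# ...then recast it to a long column on the Spark side.
sdfC = sdfC.withColumn("gridID", sdfC["gridID"].cast(LongType()))
sdfC.show()
{code}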



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30275) Add gitlab-ci.yml file for reproducible builds

2020-01-22 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021701#comment-17021701
 ] 

Hyukjin Kwon commented on SPARK-30275:
--

Can you send an email to the dev list and ask for some feedback?
You should probably show and list the exact steps and the exact benefits of adding it.

> Add gitlab-ci.yml file for reproducible builds
> --
>
> Key: SPARK-30275
> URL: https://issues.apache.org/jira/browse/SPARK-30275
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Jim Kleckner
>Priority: Minor
>
> It would be desirable to have public reproducible builds such as provided by 
> gitlab or others.
>  
> Here is a candidate patch set to build spark using gitlab-ci:
> * https://gitlab.com/jkleckner/spark/tree/add-gitlab-ci-yml
> Let me know if there is interest in a PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30327) Caused by: java.lang.ArrayIndexOutOfBoundsException: -1

2020-01-22 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021700#comment-17021700
 ] 

Hyukjin Kwon commented on SPARK-30327:
--

Can you show full, self-contained reproducer?

> Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
> ---
>
> Key: SPARK-30327
> URL: https://issues.apache.org/jira/browse/SPARK-30327
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.3.2
>Reporter: lujun
>Priority: Major
>
> val edgeRdd: RDD[Edge[Int]] = rdd.map(rec => {
>   Edge(rec._2._1.getOldcid, rec._2._1.getNewcid, 0)
> })
> val vertexRdd: RDD[(Long, String)] = rdd.map(rec => {
>   (rec._2._1.getOldcid, rec._2._1.getCustomer_id)
> })
> val returnRdd = Graph(vertexRdd, edgeRdd).connectedComponents().vertices
>   .join(vertexRdd)
>   .map { case (cid, (groupid, cus)) => (cus, groupid) }
>  
> For the same batch of data, sometimes it succeeds, and sometimes the following 
> errors are reported!
>  
> Exception in thread "main" java.lang.reflect.InvocationTargetException
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:65)
>  at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 2 in stage 24374.0 failed 4 times, most recent failure: Lost task 2.3 in 
> stage 24374.0 (TID 133352, lx-es-04, executor 0): 
> java.lang.ArrayIndexOutOfBoundsException: -1
>  at 
> org.apache.spark.graphx.util.collection.GraphXPrimitiveKeyOpenHashMap$mcJI$sp.apply$mcJI$sp(GraphXPrimitiveKeyOpenHashMap.scala:64)
>  at 
> org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:91)
>  at 
> org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73)
>  at 
> org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:71)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>  at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>  at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>  at org.apache.spark.scheduler.Task.run(Task.scala:109)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>  at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1651)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1639)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1638)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>  at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1638)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
>  at scala.Option.foreach(Option.scala:257)
>  at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1872)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1821)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1810)
>  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2131)
>  at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1035)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> 

[jira] [Resolved] (SPARK-30328) Fail to write local files with RDD.saveTextFile when setting the incorrect Hadoop configuration files

2020-01-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30328.
--
Resolution: Invalid

> Fail to write local files with RDD.saveTextFile when setting the incorrect 
> Hadoop configuration files
> -
>
> Key: SPARK-30328
> URL: https://issues.apache.org/jira/browse/SPARK-30328
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: chendihao
>Priority: Major
>
> We find that incorrect Hadoop configuration files cause saving an RDD to the 
> local file system to fail. This is not expected because we have specified a 
> local URL, and the DataFrame.write.text API does not have this issue. 
> It is easy to reproduce and verify with Spark 2.3.0.
> 1. Do not set the `HADOOP_CONF_DIR` environment variable.
> 2. Install pyspark and run the local Python script below. This should work and save 
> files to the local file system.
> {code:java}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.master("local").getOrCreate()
> sc = spark.sparkContext
> rdd = sc.parallelize([1, 2, 3])
> rdd.saveAsTextFile("file:///tmp/rdd.text")
> {code}
> 3. Set the `HADOOP_CONF_DIR` environment variable and put the Hadoop 
> configuration files there. Make sure the format of `core-site.xml` is right 
> but that it contains an unresolved host name.
> 4. Run the same Python script again. If it tries to connect to HDFS and finds the 
> unresolved host name, a Java exception happens.
> We think `saveAsTextFile("file:///...")` should not attempt to connect to HDFS, 
> whether `HADOOP_CONF_DIR` is set or not. Actually, the following DataFrame code 
> will work with the same incorrect Hadoop configuration files.
> {code:java}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.master("local").getOrCreate()
> df = spark.createDataFrame(rows, ["attribute", "value"])
> df.write.parquet("file:///tmp/df.parquet")
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30328) Fail to write local files with RDD.saveTextFile when setting the incorrect Hadoop configuration files

2020-01-22 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021699#comment-17021699
 ] 

Hyukjin Kwon commented on SPARK-30328:
--

If Hadoop configuration path is set, it should be correct. Otherwise, you 
shouldn't set it at all.

> Fail to write local files with RDD.saveTextFile when setting the incorrect 
> Hadoop configuration files
> -
>
> Key: SPARK-30328
> URL: https://issues.apache.org/jira/browse/SPARK-30328
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: chendihao
>Priority: Major
>
> We find that incorrect Hadoop configuration files cause saving an RDD to the 
> local file system to fail. This is not expected because we have specified a 
> local URL, and the DataFrame.write.text API does not have this issue. 
> It is easy to reproduce and verify with Spark 2.3.0.
> 1. Do not set the `HADOOP_CONF_DIR` environment variable.
> 2. Install pyspark and run the local Python script below. This should work and save 
> files to the local file system.
> {code:java}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.master("local").getOrCreate()
> sc = spark.sparkContext
> rdd = sc.parallelize([1, 2, 3])
> rdd.saveAsTextFile("file:///tmp/rdd.text")
> {code}
> 3. Set the `HADOOP_CONF_DIR` environment variable and put the Hadoop 
> configuration files there. Make sure the format of `core-site.xml` is right 
> but that it contains an unresolved host name.
> 4. Run the same Python script again. If it tries to connect to HDFS and finds the 
> unresolved host name, a Java exception happens.
> We think `saveAsTextFile("file:///...")` should not attempt to connect to HDFS, 
> whether `HADOOP_CONF_DIR` is set or not. Actually, the following DataFrame code 
> will work with the same incorrect Hadoop configuration files.
> {code:java}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.master("local").getOrCreate()
> df = spark.createDataFrame(rows, ["attribute", "value"])
> df.write.parquet("file:///tmp/df.parquet")
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30332) When running sql query with limit catalyst throw StackOverFlow exception

2020-01-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30332.
--
Resolution: Incomplete

> When running sql query with limit catalyst throw StackOverFlow exception 
> -
>
> Key: SPARK-30332
> URL: https://issues.apache.org/jira/browse/SPARK-30332
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: spark version 3.0.0-preview
>Reporter: Izek Greenfield
>Priority: Major
>
> Running that SQL:
> {code:sql}
> SELECT  BT_capital.asof_date,
> BT_capital.run_id,
> BT_capital.v,
> BT_capital.id,
> BT_capital.entity,
> BT_capital.level_1,
> BT_capital.level_2,
> BT_capital.level_3,
> BT_capital.level_4,
> BT_capital.level_5,
> BT_capital.level_6,
> BT_capital.path_bt_capital,
> BT_capital.line_item,
> t0.target_line_item,
> t0.line_description,
> BT_capital.col_item,
> BT_capital.rep_amount,
> root.orgUnitId,
> root.cptyId,
> root.instId,
> root.startDate,
> root.maturityDate,
> root.amount,
> root.nominalAmount,
> root.quantity,
> root.lkupAssetLiability,
> root.lkupCurrency,
> root.lkupProdType,
> root.interestResetDate,
> root.interestResetTerm,
> root.noticePeriod,
> root.historicCostAmount,
> root.dueDate,
> root.lkupResidence,
> root.lkupCountryOfUltimateRisk,
> root.lkupSector,
> root.lkupIndustry,
> root.lkupAccountingPortfolioType,
> root.lkupLoanDepositTerm,
> root.lkupFixedFloating,
> root.lkupCollateralType,
> root.lkupRiskType,
> root.lkupEligibleRefinancing,
> root.lkupHedging,
> root.lkupIsOwnIssued,
> root.lkupIsSubordinated,
> root.lkupIsQuoted,
> root.lkupIsSecuritised,
> root.lkupIsSecuritisedServiced,
> root.lkupIsSyndicated,
> root.lkupIsDeRecognised,
> root.lkupIsRenegotiated,
> root.lkupIsTransferable,
> root.lkupIsNewBusiness,
> root.lkupIsFiduciary,
> root.lkupIsNonPerforming,
> root.lkupIsInterGroup,
> root.lkupIsIntraGroup,
> root.lkupIsRediscounted,
> root.lkupIsCollateral,
> root.lkupIsExercised,
> root.lkupIsImpaired,
> root.facilityId,
> root.lkupIsOTC,
> root.lkupIsDefaulted,
> root.lkupIsSavingsPosition,
> root.lkupIsForborne,
> root.lkupIsDebtRestructuringLoan,
> root.interestRateAAR,
> root.interestRateAPRC,
> root.custom1,
> root.custom2,
> root.custom3,
> root.lkupSecuritisationType,
> root.lkupIsCashPooling,
> root.lkupIsEquityParticipationGTE10,
> root.lkupIsConvertible,
> root.lkupEconomicHedge,
> root.lkupIsNonCurrHeldForSale,
> root.lkupIsEmbeddedDerivative,
> root.lkupLoanPurpose,
> root.lkupRegulated,
> root.lkupRepaymentType,
> root.glAccount,
> root.lkupIsRecourse,
> root.lkupIsNotFullyGuaranteed,
> root.lkupImpairmentStage,
> root.lkupIsEntireAmountWrittenOff,
> root.lkupIsLowCreditRisk,
> root.lkupIsOBSWithinIFRS9,
> root.lkupIsUnderSpecialSurveillance,
> root.lkupProtection,
> root.lkupIsGeneralAllowance,
> root.lkupSectorUltimateRisk,
> root.cptyOrgUnitId,
> root.name,
> root.lkupNationality,
> root.lkupSize,
> root.lkupIsSPV,
> root.lkupIsCentralCounterparty,
> root.lkupIsMMRMFI,
> root.lkupIsKeyManagement,
> root.lkupIsOtherRelatedParty,
> root.lkupResidenceProvince,
> root.lkupIsTradingBook,
> root.entityHierarchy_entityId,
> root.entityHierarchy_Residence,
> root.lkupLocalCurrency,
> root.cpty_entityhierarchy_entityId,
> root.lkupRelationship,
> root.cpty_lkupRelationship,
> root.entityNationality,
> root.lkupRepCurrency,
> root.startDateFinancialYear,
> root.numEmployees,
> root.numEmployeesTotal,
> root.collateralAmount,
> root.guaranteeAmount,
> root.impairmentSpecificIndividual,
> root.impairmentSpecificCollective,
> root.impairmentGeneral,
> root.creditRiskAmount,
> root.provisionSpecificIndividual,
> root.provisionSpecificCollective,
> root.provisionGeneral,
> root.writeOffAmount,
> root.interest,
> root.fairValueAmount,
> root.grossCarryingAmount,
> root.carryingAmount,
> root.code,
> root.lkupInstrumentType,
> root.price,
> root.amountAtIssue,
> root.yield,
> root.totalFacilityAmount,
> root.facility_rate,
> root.spec_indiv_est,
> root.spec_coll_est,
> root.coll_inc_loss,
> root.impairment_amount,
> root.provision_amount,
> root.accumulated_impairment,
> root.exclusionFlag,
> root.lkupIsHoldingCompany,
> root.instrument_startDate,
> root.entityResidence,
> fxRate.enumerator,
> fxRate.lkupFromCurrency,
> fxRate.rate,
> fxRate.custom1,
> fxRate.custom2,
> fxRate.custom3,
> GB_position.lkupIsECGDGuaranteed,
> GB_position.lkupIsMultiAcctOffsetMortgage,
> GB_position.lkupIsIndexLinked,
> GB_position.lkupIsRetail,
> GB_position.lkupCollateralLocation,
> GB_position.percentAboveBBR,
> GB_position.lkupIsMoreInArrears,
> GB_position.lkupIsArrearsCapitalised,
> GB_position.lkupCollateralPossession,
> GB_position.lkupIsLifetimeMortgage,
> GB_position.lkupLoanConcessionType,
> GB_position.lkupIsMultiCurrency,
> 

[jira] [Resolved] (SPARK-30442) Write mode ignored when using CodecStreams

2020-01-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30442.
--
Resolution: Incomplete

Resolving due to no feedback from the reporter.

> Write mode ignored when using CodecStreams
> --
>
> Key: SPARK-30442
> URL: https://issues.apache.org/jira/browse/SPARK-30442
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.4
>Reporter: Jesse Collins
>Priority: Major
>
> Overwrite is hardcoded to false in the codec stream. This can cause issues, 
> particularly with aws tools, that make it impossible to retry.
> Ideally, this should be read from the write mode set for the DataWriter that 
> is writing through this codec class.
> [https://github.com/apache/spark/blame/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/CodecStreams.scala#L81]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30444) The same job will be computated for many times when using Dataset.show()

2020-01-22 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021697#comment-17021697
 ] 

Hyukjin Kwon commented on SPARK-30444:
--

[~aman_omer] have you made some progress on this?

> The same job will be computated for many times when using Dataset.show()
> 
>
> Key: SPARK-30444
> URL: https://issues.apache.org/jira/browse/SPARK-30444
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3, 2.4.4
>Reporter: Dong Wang
>Priority: Major
>
> When I run the example sql.SparkSQLExample, df.show() at line 60 triggers 
> an action. On the WebUI, I noticed that this API creates 5 jobs, all of 
> which have the same lineage graph with the same RDDs and the same call 
> stacks. That means Spark recomputes the job 5 times. But strangely, 
> sqlDF.show() at line 123 only creates 1 job.
> I don't know what happens at show() at line 60.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30462) Structured Streaming _spark_metadata fills up Spark Driver memory when having lots of objects

2020-01-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30462:
-
Priority: Major  (was: Critical)

> Structured Streaming _spark_metadata fills up Spark Driver memory when having 
> lots of objects
> -
>
> Key: SPARK-30462
> URL: https://issues.apache.org/jira/browse/SPARK-30462
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3, 2.4.4, 3.0.0
>Reporter: Vladimir Yankov
>Priority: Major
>
> Hi,
> With the current implementation of Spark Structured Streaming it does not 
> seem to be possible to have a constantly running stream, writing millions of 
> files, without increasing the Spark driver's memory to dozens of GBs.
> In our scenario we are using Spark Structured Streaming to consume messages 
> from a Kafka cluster, transform them, and write them as compressed Parquet 
> files in an S3 object store service.
> Every 30 seconds a new batch of the streaming query writes hundreds of 
> objects, which over time results in millions of objects in S3.
> As all written objects are recorded in _spark_metadata, the size of the 
> compact files there grows to GBs, which eventually fill up the Spark driver's 
> memory and lead to OOM errors.
> We need the functionality to configure Spark Structured Streaming to run 
> without loading all the historically accumulated metadata in its memory. 
> Regularly resetting the _spark_metadata and the checkpoint folders is not an 
> option in our use-case, as we are using the information from the 
> _spark_metadata to have a register of the objects for faster querying and 
> search of the written objects.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30473) PySpark enum subclass crashes when used inside UDF

2020-01-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30473.
--
Resolution: Cannot Reproduce

> PySpark enum subclass crashes when used inside UDF
> --
>
> Key: SPARK-30473
> URL: https://issues.apache.org/jira/browse/SPARK-30473
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4
> Environment: Databricks Runtime 6.2 (includes Apache Spark 2.4.4, 
> Scala 2.11)
>Reporter: Max Härtwig
>Priority: Major
>
> PySpark enum subclass crashes when used inside a UDF.
>  
> Example:
> {code:java}
> from enum import Enum
> class Direction(Enum):
>     NORTH = 0
>     SOUTH = 1
> {code}
>  
> Working:
> {code:java}
> Direction.NORTH{code}
>  
> Crashing:
> {code:java}
> @udf
> def fn(a):
> Direction.NORTH
> return ""
> df.withColumn("test", fn("a")){code}
>  
> Stacktrace:
> {noformat}
> SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 
> 4 times, most recent failure: Lost task 0.3 in stage 9.0 (TID 235, 
> 10.139.64.21, executor 0): org.apache.spark.api.python.PythonException: 
> Traceback (most recent call last):
> File "/databricks/spark/python/pyspark/serializers.py", line 182, in 
> _read_with_length return self.loads(obj)
> File "/databricks/spark/python/pyspark/serializers.py", line 695, in 
> loads return pickle.loads(obj, encoding=encoding)
> File "/databricks/python/lib/python3.7/enum.py", line 152, in __new__ 
> enum_members = {k: classdict[k] for k in classdict._member_names}
> AttributeError: 'dict' object has no attribute '_member_names'{noformat}
>  
> I suspect the problem is in *python/pyspark/cloudpickle.py*. On line 586 in 
> the function *_save_dynamic_enum*, the attribute *_member_names* is removed 
> from the enum. Yet, this attribute is required by the *Enum* class. This 
> results in all Enum subclasses crashing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30473) PySpark enum subclass crashes when used inside UDF

2020-01-22 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021695#comment-17021695
 ] 

Hyukjin Kwon commented on SPARK-30473:
--

This was fixed in the upstream master by upgrading cloudpickle.

> PySpark enum subclass crashes when used inside UDF
> --
>
> Key: SPARK-30473
> URL: https://issues.apache.org/jira/browse/SPARK-30473
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4
> Environment: Databricks Runtime 6.2 (includes Apache Spark 2.4.4, 
> Scala 2.11)
>Reporter: Max Härtwig
>Priority: Major
>
> PySpark enum subclass crashes when used inside a UDF.
>  
> Example:
> {code:java}
> from enum import Enum
> class Direction(Enum):
>     NORTH = 0
>     SOUTH = 1
> {code}
>  
> Working:
> {code:java}
> Direction.NORTH{code}
>  
> Crashing:
> {code:java}
> @udf
> def fn(a):
> Direction.NORTH
> return ""
> df.withColumn("test", fn("a")){code}
>  
> Stacktrace:
> {noformat}
> SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 
> 4 times, most recent failure: Lost task 0.3 in stage 9.0 (TID 235, 
> 10.139.64.21, executor 0): org.apache.spark.api.python.PythonException: 
> Traceback (most recent call last):
> File "/databricks/spark/python/pyspark/serializers.py", line 182, in 
> _read_with_length return self.loads(obj)
> File "/databricks/spark/python/pyspark/serializers.py", line 695, in 
> loads return pickle.loads(obj, encoding=encoding)
> File "/databricks/python/lib/python3.7/enum.py", line 152, in __new__ 
> enum_members = {k: classdict[k] for k in classdict._member_names}
> AttributeError: 'dict' object has no attribute '_member_names'{noformat}
>  
> I suspect the problem is in *python/pyspark/cloudpickle.py*. On line 586 in 
> the function *_save_dynamic_enum*, the attribute *_member_names* is removed 
> from the enum. Yet, this attribute is required by the *Enum* class. This 
> results in all Enum subclasses crashing.
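
For anyone hitting this on 2.4.x, here is a self-contained sketch of the reproducer described above; the DataFrame and column name are made up, and on builds with the upgraded cloudpickle the same UDF runs fine:

{code:python}
# Hedged, self-contained version of the reproducer; the data is illustrative.
from enum import Enum

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf


class Direction(Enum):
    NORTH = 0
    SOUTH = 1


spark = SparkSession.builder.master("local").getOrCreate()
df = spark.createDataFrame([("x",)], ["a"])


@udf
def fn(a):
    Direction.NORTH  # referencing the Enum subclass inside the UDF triggers the pickling error
    return ""


df.withColumn("test", fn("a")).show()
{code}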



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30476) NullPointerException when Insert data to hive mongo external table by spark-sql

2020-01-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30476.
--
Resolution: Not A Problem

So it's an issue in the MongoDB connector, as you described. It seems better for 
them to fix it.

> NullPointerException when Insert data to hive mongo external table by 
> spark-sql
> ---
>
> Key: SPARK-30476
> URL: https://issues.apache.org/jira/browse/SPARK-30476
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
> Environment: mongo-hadoop: 2.0.2
> spark-version: 2.4.3
> scala-version: 2.11
> hive-version: 1.2.1
> hadoop-version: 2.6.0
>Reporter: XiongCheng
>Priority: Major
>
> I executed the SQL, but I got an NPE.
> result_data_mongo is a MongoDB Hive external table.
> {code:java}
> insert into result_data_mongo 
> values("15","15","15",15,"15",15,15,15,15,15,15,15,15,15,15);
> {code}
> NPE detail:
> {code:java}
> org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249)
>   at 
> org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:123)
>   at 
> org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103)
>   at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120)
>   at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:236)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:121)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException
>   at 
> com.mongodb.hadoop.output.MongoOutputCommitter.getTaskAttemptPath(MongoOutputCommitter.java:264)
>   at 
> com.mongodb.hadoop.output.MongoRecordWriter.(MongoRecordWriter.java:59)
>   at 
> com.mongodb.hadoop.hive.output.HiveMongoOutputFormat$HiveMongoRecordWriter.(HiveMongoOutputFormat.java:80)
>   at 
> com.mongodb.hadoop.hive.output.HiveMongoOutputFormat.getHiveRecordWriter(HiveMongoOutputFormat.java:52)
>   at 
> org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:261)
>   at 
> org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:246)
>   ... 15 more
> {code}
> I know mongo-hadoop uses the incorrect key to get the TaskAttemptID, so I modified 
> the source code of mongo-hadoop to get the correct properties 
> ('mapreduce.task.id' and 'mapreduce.task.attempt.id'), but I still can't get 
> the value. I found that these parameters are stored in Spark's 
> TaskAttemptContext, but TaskAttemptContext is not passed into 
> HiveOutputWriter; is this a design flaw?
> Here are the two key points.
> mongo-hadoop: 
> [https://github.com/mongodb/mongo-hadoop/blob/cdcd0f15503f2d1c5a1a2e3941711d850d1e427b/hive/src/main/java/com/mongodb/hadoop/hive/output/HiveMongoOutputFormat.java#L80]
> spark-hive:[https://github.com/apache/spark/blob/7c7d7f6a878b02ece881266ee538f3e1443aa8c1/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveFileFormat.scala#L103]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30483) Job History does not show pool properties table

2020-01-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30483.
--
Resolution: Duplicate

> Job History does not show pool properties table
> ---
>
> Key: SPARK-30483
> URL: https://issues.apache.org/jira/browse/SPARK-30483
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> Stage will show the Pool Name column, but when the user clicks the pool name hyperlink it will not redirect to the Pool Properties table.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30556) Copy sparkContext.localProperties to child thread in SubqueryExec.executionContext

2020-01-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30556:
--
Summary: Copy sparkContext.localProperties to child thread 
in SubqueryExec.executionContext  (was: SubqueryExec passes local properties to 
SubqueryExec.executionContext)

> Copy sparkContext.localProperties to child thread 
> in SubqueryExec.executionContext
> -
>
> Key: SPARK-30556
> URL: https://issues.apache.org/jira/browse/SPARK-30556
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Ajith S
>Assignee: Ajith S
>Priority: Major
> Fix For: 3.0.0
>
>
> Local properties set via sparkContext are not available as TaskContext 
> properties when jobs are executed and the thread pools have idle threads that 
> are reused.
> Explanation:
> In SubqueryExec, the {{relationFuture}} is evaluated via a separate thread. 
> The threads inherit the {{localProperties}} from sparkContext because they are 
> child threads.
> These threads are controlled via the executionContext (thread pools). Each 
> thread pool has a default {{keepAliveSeconds}} of 60 seconds for idle threads.
> In scenarios where the thread pool has idle threads that are reused for a 
> subsequent new query, the thread-local properties will not be inherited from 
> the spark context (thread properties are inherited only on thread creation), 
> so the threads end up with old or no properties set. This causes taskset 
> properties to be missing when properties are transferred by the child thread 
> via {{sparkContext.runJob/submitJob}}.
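For illustration, a minimal sketch of the same pitfall from the user side, assuming a custom thread pool: the property has to be captured on the submitting thread and re-applied on the (possibly reused) pool thread, which is essentially what the fix does for SubqueryExec's executionContext. The property key, pool name, and job below are illustrative only.

{code:scala}
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("local-props-demo").getOrCreate()
val sc = spark.sparkContext
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(2))

def submitWithProperty(key: String)(body: => Long): Future[Long] = {
  val value = sc.getLocalProperty(key)   // read on the submitting (parent) thread
  Future {
    sc.setLocalProperty(key, value)      // re-applied on the possibly reused pool thread
    body
  }
}

sc.setLocalProperty("spark.scheduler.pool", "pool1")
val result = submitWithProperty("spark.scheduler.pool") {
  sc.parallelize(1 to 100, 2).count()    // the job now carries the intended property
}
{code}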



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30487) Hive MetaException

2020-01-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30487.
--
Resolution: Incomplete

>  Hive MetaException
> ---
>
> Key: SPARK-30487
> URL: https://issues.apache.org/jira/browse/SPARK-30487
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.4.4
>Reporter: Rakesh yadav
>Priority: Major
>
> Hi,
> I am getting the error below:
> INFO TransactionTableCreation: Exception Occurred - 
> [Ljava.lang.StackTraceElement;@4fd7c296
> 20/01/10 14:09:07 INFO TransactionTableCreation: Exception Occurred - Caught 
> Hive MetaException attempting to get partition metadata by filter from Hive. 
> You can set the Spark configuration setting 
> spark.sql.hive.manageFilesourcePartitions to false to work around this 
> problem, however this will result in degraded performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30484) Job History Storage Tab does not display RDD Table

2020-01-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30484.
--
Resolution: Not A Problem

> Job History Storage Tab does not display RDD Table
> --
>
> Key: SPARK-30484
> URL: https://issues.apache.org/jira/browse/SPARK-30484
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> scala> import org.apache.spark.storage.StorageLevel._
> import org.apache.spark.storage.StorageLevel._
> scala> val rdd = sc.range(0, 100, 1, 5).setName("rdd")
> rdd: org.apache.spark.rdd.RDD[Long] = rdd MapPartitionsRDD[1] at range at 
> <console>:27
> scala> rdd.persist(MEMORY_ONLY_SER)
> res0: rdd.type = rdd MapPartitionsRDD[1] at range at <console>:27
> scala> rdd.count
> res1: Long = 100  
>   
> scala> val df = Seq((1, "andy"), (2, "bob"), (2, "andy")).toDF("count", 
> "name")
> df: org.apache.spark.sql.DataFrame = [count: int, name: string]
> scala> df.persist(DISK_ONLY)
> res2: df.type = [count: int, name: string]
> scala> df.count
> res3: Long = 3
> Open the Storage tab under Incomplete Jobs in the Job History page.
> The UI will not display the RDD table.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30488) Deadlock between block-manager-slave-async-thread-pool and spark context cleaner

2020-01-22 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021692#comment-17021692
 ] 

Hyukjin Kwon commented on SPARK-30488:
--

Is that the only place where it is created? Can you show a full reproducer and 
the code if possible? Otherwise it seems impossible to investigate further or 
reproduce this.

> Deadlock between block-manager-slave-async-thread-pool and spark context 
> cleaner
> 
>
> Key: SPARK-30488
> URL: https://issues.apache.org/jira/browse/SPARK-30488
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Rohit Agrawal
>Priority: Major
>
> Deadlock happens while cleaning up the spark context. Here is the full thread 
> dump:
>  
>   
>   2020-01-10T20:13:16.2884057Z Full thread dump Java HotSpot(TM) 64-Bit 
> Server VM (25.221-b11 mixed mode):
> 2020-01-10T20:13:16.2884392Z 
> 2020-01-10T20:13:16.2884660Z "SIGINT handler" #488 daemon prio=9 os_prio=2 
> tid=0x111fa000 nid=0x4794 waiting for monitor entry 
> [0x1c86e000]
> 2020-01-10T20:13:16.2884807Z java.lang.Thread.State: BLOCKED (on object 
> monitor)
> 2020-01-10T20:13:16.2884879Z at java.lang.Shutdown.exit(Shutdown.java:212)
> 2020-01-10T20:13:16.2885693Z - waiting to lock <0xc0155de0> (a 
> java.lang.Class for java.lang.Shutdown)
> 2020-01-10T20:13:16.2885840Z at 
> java.lang.Terminator$1.handle(Terminator.java:52)
> 2020-01-10T20:13:16.2885965Z at sun.misc.Signal$1.run(Signal.java:212)
> 2020-01-10T20:13:16.2886329Z at java.lang.Thread.run(Thread.java:748)
> 2020-01-10T20:13:16.2886430Z 
> 2020-01-10T20:13:16.2886752Z "Thread-3" #108 prio=5 os_prio=0 
> tid=0x111f7800 nid=0x48cc waiting for monitor entry 
> [0x2c33f000]
> 2020-01-10T20:13:16.2886881Z java.lang.Thread.State: BLOCKED (on object 
> monitor)
> 2020-01-10T20:13:16.2886999Z at 
> org.apache.hadoop.util.ShutdownHookManager.getShutdownHooksInOrder(ShutdownHookManager.java:273)
> 2020-01-10T20:13:16.2887107Z at 
> org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:121)
> 2020-01-10T20:13:16.2887212Z at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95)
> 2020-01-10T20:13:16.2887421Z 
> 2020-01-10T20:13:16.2887798Z "block-manager-slave-async-thread-pool-81" #486 
> daemon prio=5 os_prio=0 tid=0x111fe800 nid=0x2e34 waiting for monitor 
> entry [0x2bf3d000]
> 2020-01-10T20:13:16.2889192Z java.lang.Thread.State: BLOCKED (on object 
> monitor)
> 2020-01-10T20:13:16.2889305Z at 
> java.lang.ClassLoader.loadClass(ClassLoader.java:404)
> 2020-01-10T20:13:16.2889405Z - waiting to lock <0xc1f359f0> (a 
> sbt.internal.LayeredClassLoader)
> 2020-01-10T20:13:16.2889482Z at 
> java.lang.ClassLoader.loadClass(ClassLoader.java:411)
> 2020-01-10T20:13:16.2889582Z - locked <0xca33e4c8> (a 
> sbt.internal.ManagedClassLoader$ZombieClassLoader)
> 2020-01-10T20:13:16.2889659Z at 
> java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> 2020-01-10T20:13:16.2890881Z at 
> org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply$mcZ$sp(BlockManagerSlaveEndpoint.scala:58)
> 2020-01-10T20:13:16.2891006Z at 
> org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(BlockManagerSlaveEndpoint.scala:57)
> 2020-01-10T20:13:16.2891142Z at 
> org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(BlockManagerSlaveEndpoint.scala:57)
> 2020-01-10T20:13:16.2891260Z at 
> org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$1.apply(BlockManagerSlaveEndpoint.scala:86)
> 2020-01-10T20:13:16.2891375Z at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
> 2020-01-10T20:13:16.2891624Z at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
> 2020-01-10T20:13:16.2891737Z at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 2020-01-10T20:13:16.2891833Z at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 2020-01-10T20:13:16.2891925Z at java.lang.Thread.run(Thread.java:748)
> 2020-01-10T20:13:16.2891967Z 
> 2020-01-10T20:13:16.2892066Z "pool-31-thread-16" #335 prio=5 os_prio=0 
> tid=0x153b2000 nid=0x1aac waiting on condition [0x4b2ff000]
> 2020-01-10T20:13:16.2892147Z java.lang.Thread.State: WAITING (parking)
> 2020-01-10T20:13:16.2892241Z at sun.misc.Unsafe.park(Native Method)
> 2020-01-10T20:13:16.2892328Z - parking to wait for <0xc9cad078> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> 2020-01-10T20:13:16.2892437Z at 
> 

[jira] [Assigned] (SPARK-30556) SubqueryExec passes local properties to SubqueryExec.executionContext

2020-01-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30556:
-

Assignee: Ajith S

> SubqueryExec passes local properties to SubqueryExec.executionContext
> -
>
> Key: SPARK-30556
> URL: https://issues.apache.org/jira/browse/SPARK-30556
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Ajith S
>Assignee: Ajith S
>Priority: Major
>
> Local properties set via sparkContext are not available as TaskContext 
> properties when jobs are executed and the thread pools have idle threads that 
> are reused.
> Explanation:
> In SubqueryExec, the {{relationFuture}} is evaluated via a separate thread. 
> The threads inherit the {{localProperties}} from sparkContext because they are 
> child threads.
> These threads are controlled via the executionContext (thread pools). Each 
> thread pool has a default {{keepAliveSeconds}} of 60 seconds for idle threads.
> In scenarios where the thread pool has idle threads that are reused for a 
> subsequent new query, the thread-local properties will not be inherited from 
> the spark context (thread properties are inherited only on thread creation), 
> so the threads end up with old or no properties set. This causes taskset 
> properties to be missing when properties are transferred by the child thread 
> via {{sparkContext.runJob/submitJob}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30513) Question about spark on k8s

2020-01-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30513.
--
Resolution: Invalid

Please ask questions on the dev mailing list or Stack Overflow. See 
https://spark.apache.org/community.html

> Question about spark on k8s
> ---
>
> Key: SPARK-30513
> URL: https://issues.apache.org/jira/browse/SPARK-30513
> Project: Spark
>  Issue Type: Question
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Jackey Lee
>Priority: Major
>
> My question is: why is the Kube-DNS domain name written into the code? Isn't
> it better to read the domain name from the service, or just use the hostname?
> In our scenario, we run Spark on Kata-like containers and found that the code
> had the Kube-DNS domain written in. If Kube-DNS is not configured in the
> environment, tasks fail.
> Besides, kube-dns is just a plugin for k8s, not a required component. We can 
> use better DNS services without necessarily using this one.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30556) SubqueryExec passes local properties to SubqueryExec.executionContext

2020-01-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30556.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27267
[https://github.com/apache/spark/pull/27267]

> SubqueryExec passes local properties to SubqueryExec.executionContext
> -
>
> Key: SPARK-30556
> URL: https://issues.apache.org/jira/browse/SPARK-30556
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Ajith S
>Assignee: Ajith S
>Priority: Major
> Fix For: 3.0.0
>
>
> Local properties set via sparkContext are not available as TaskContext 
> properties when jobs are executed and the thread pools have idle threads that 
> are reused.
> Explanation:
> In SubqueryExec, the {{relationFuture}} is evaluated via a separate thread. 
> The threads inherit the {{localProperties}} from sparkContext because they are 
> child threads.
> These threads are controlled via the executionContext (thread pools). Each 
> thread pool has a default {{keepAliveSeconds}} of 60 seconds for idle threads.
> In scenarios where the thread pool has idle threads that are reused for a 
> subsequent new query, the thread-local properties will not be inherited from 
> the spark context (thread properties are inherited only on thread creation), 
> so the threads end up with old or no properties set. This causes taskset 
> properties to be missing when properties are transferred by the child thread 
> via {{sparkContext.runJob/submitJob}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30526) Can I translate Spark documents into Chinese ?

2020-01-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30526.
--
Resolution: Won't Fix

I think it's better to do it in a separate third-party repository, not in Spark.

> Can I translate Spark documents into Chinese ?
> --
>
> Key: SPARK-30526
> URL: https://issues.apache.org/jira/browse/SPARK-30526
> Project: Spark
>  Issue Type: Question
>  Components: Documentation
>Affects Versions: 2.4.4
>Reporter: WangQiang Yang
>Priority: Major
>
> Can I translate the Spark documentation into Chinese for everyone to learn 
> from? Will I face legal risks?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30529) Improve error messages when Executor dies before registering with driver

2020-01-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30529:
-
Description: 
Currently, when you give a bad configuration for accelerator-aware scheduling to 
the executor, the executors can die but it's hard for the user to know why.  The 
executor dies and logs what went wrong in its log files, but it is often hard 
to find those logs because the executor hasn't registered yet.  Since it hasn't 
registered, the executor doesn't show up in the UI, so its log files can't be 
reached from there.

One specific example is giving a discovery script that doesn't find all 
the GPUs:

{code}
20/01/16 08:59:24 INFO YarnCoarseGrainedExecutorBackend: Connecting to driver: 
spark://CoarseGrainedScheduler@10.28.9.112:44403
20/01/16 08:59:24 ERROR Inbox: Ignoring error
java.lang.IllegalArgumentException: requirement failed: Resource: gpu, with 
addresses: 0 is less than what the user requested: 2)
 at scala.Predef$.require(Predef.scala:281)
 at 
org.apache.spark.resource.ResourceUtils$.$anonfun$assertAllResourceAllocationsMatchResourceProfile$1(ResourceUtils.scala:251)
 at 
org.apache.spark.resource.ResourceUtils$.$anonfun$assertAllResourceAllocationsMatchResourceProfile$1$adapted(ResourceUtils.scala:248)
{code}

 

Figure out a better way of logging, or of letting the user know what error 
occurred, when the executor dies before registering.

  was:
Currently, when you give a bad configuration for accelerator-aware scheduling to 
the executor, the executors can die but it's hard for the user to know why.  The 
executor dies and logs what went wrong in its log files, but it is often hard 
to find those logs because the executor hasn't registered yet.  Since it hasn't 
registered, the executor doesn't show up in the UI, so its log files can't be 
reached from there.

One specific example is giving a discovery script that doesn't find all 
the GPUs:

20/01/16 08:59:24 INFO YarnCoarseGrainedExecutorBackend: Connecting to driver: 
spark://CoarseGrainedScheduler@10.28.9.112:44403
20/01/16 08:59:24 ERROR Inbox: Ignoring error
java.lang.IllegalArgumentException: requirement failed: Resource: gpu, with 
addresses: 0 is less than what the user requested: 2)
 at scala.Predef$.require(Predef.scala:281)
 at 
org.apache.spark.resource.ResourceUtils$.$anonfun$assertAllResourceAllocationsMatchResourceProfile$1(ResourceUtils.scala:251)
 at 
org.apache.spark.resource.ResourceUtils$.$anonfun$assertAllResourceAllocationsMatchResourceProfile$1$adapted(ResourceUtils.scala:248)

 

Figure out a better way of logging, or of letting the user know what error 
occurred, when the executor dies before registering.


> Improve error messages when Executor dies before registering with driver
> 
>
> Key: SPARK-30529
> URL: https://issues.apache.org/jira/browse/SPARK-30529
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> Currently, when you give a bad configuration for accelerator-aware scheduling 
> to the executor, the executors can die but it's hard for the user to know why. 
>  The executor dies and logs what went wrong in its log files, but it is often 
> hard to find those logs because the executor hasn't registered yet.  Since 
> it hasn't registered, the executor doesn't show up in the UI, so its log files 
> can't be reached from there.
> One specific example is giving a discovery script that doesn't find 
> all the GPUs:
> {code}
> 20/01/16 08:59:24 INFO YarnCoarseGrainedExecutorBackend: Connecting to 
> driver: spark://CoarseGrainedScheduler@10.28.9.112:44403
> 20/01/16 08:59:24 ERROR Inbox: Ignoring error
> java.lang.IllegalArgumentException: requirement failed: Resource: gpu, with 
> addresses: 0 is less than what the user requested: 2)
>  at scala.Predef$.require(Predef.scala:281)
>  at 
> org.apache.spark.resource.ResourceUtils$.$anonfun$assertAllResourceAllocationsMatchResourceProfile$1(ResourceUtils.scala:251)
>  at 
> org.apache.spark.resource.ResourceUtils$.$anonfun$assertAllResourceAllocationsMatchResourceProfile$1$adapted(ResourceUtils.scala:248)
> {code}
>  
> Figure out a better way of logging, or of letting the user know what error 
> occurred, when the executor dies before registering.
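For context, a hedged sketch of the accelerator-aware scheduling configuration involved: the executor-side check in the trace above compares what the discovery script reports against the requested amount, so a script that reports fewer GPU addresses than requested produces this IllegalArgumentException. The script path below is hypothetical.

{code:scala}
import org.apache.spark.SparkConf

// Request 2 GPUs per executor; if the discovery script only reports address "0",
// the executor fails the assertion shown above before it can register.
val conf = new SparkConf()
  .set("spark.executor.resource.gpu.amount", "2")
  .set("spark.executor.resource.gpu.discoveryScript", "/opt/spark/scripts/getGpusResources.sh") // hypothetical path
  .set("spark.task.resource.gpu.amount", "1")
{code}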



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30542) Two Spark structured streaming jobs cannot write to same base path

2020-01-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30542.
--
Resolution: Invalid

Please ask questions on the mailing list; you could get a better answer there. See 
https://spark.apache.org/community.html

Also, 2.3.0 is an EOL release. You should try a newer version.

> Two Spark structured streaming jobs cannot write to same base path
> --
>
> Key: SPARK-30542
> URL: https://issues.apache.org/jira/browse/SPARK-30542
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Sivakumar
>Priority: Major
>
> Hi All,
> Spark Structured Streaming doesn't allow two structured streaming jobs to 
> write data to the same base directory, which is possible with DStreams.
> Because a _spark_metadata directory is created by default for one job, the 
> second job cannot use the same directory as its base path; since the 
> _spark_metadata directory was already created by the other job, it throws an 
> exception.
> Is there any workaround for this, other than creating separate base paths for 
> both jobs?
> Is it possible to create the _spark_metadata directory elsewhere, or to 
> disable it without any data loss?
> If I had to change the base path for both jobs, my whole framework would be 
> impacted, so I don't want to do that.
>  
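For reference, a minimal sketch of the usual workaround, with hypothetical paths: give each query its own output directory (and checkpoint location) under the shared base path, so that each file sink gets its own _spark_metadata directory.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("two-streaming-sinks").getOrCreate()

// Hypothetical sources, just to make the example self-contained.
val streamA = spark.readStream.format("rate").load()
val streamB = spark.readStream.format("rate").load()

val queryA = streamA.writeStream
  .format("parquet")
  .option("path", "/data/base/jobA")                         // separate output dir -> separate _spark_metadata
  .option("checkpointLocation", "/data/base/_checkpoints/jobA")
  .start()

val queryB = streamB.writeStream
  .format("parquet")
  .option("path", "/data/base/jobB")
  .option("checkpointLocation", "/data/base/_checkpoints/jobB")
  .start()
{code}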



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30550) Random pyspark-shell applications being generated

2020-01-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30550.
--
Resolution: Cannot Reproduce

Seems like no one can reproduce this.

> Random pyspark-shell applications being generated
> -
>
> Key: SPARK-30550
> URL: https://issues.apache.org/jira/browse/SPARK-30550
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Shell, Spark Submit, Web UI
>Affects Versions: 2.3.2, 2.4.4
>Reporter: Ram
>Priority: Major
> Attachments: Screenshot from 2020-01-17 15-43-33.png
>
>
>  
> When we submit a particular Spark job, this happens. We're not sure where 
> these pyspark-shell applications get generated from, but they persist for 
> about 5s and then get killed. We're not able to figure out why this happens.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30557) Add public documentation for SPARK_SUBMIT_OPTS

2020-01-22 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021684#comment-17021684
 ] 

Hyukjin Kwon commented on SPARK-30557:
--

Yup, and {{spark.driver.extraJavaOptions}} and 
{{spark.executor.extraJavaOptions}}. I vaguely remember it's preferred to use 
Spark configuration over environment variables.
Both are documented under 
https://spark.apache.org/docs/latest/configuration.html.
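For illustration, a hedged sketch of the configuration route with illustrative JVM flags; in practice these are usually passed with --conf on spark-submit or in spark-defaults.conf, since the driver JVM is already running by the time a SparkConf is built in client mode.

{code:scala}
import org.apache.spark.SparkConf

// Extra JVM options expressed as documented Spark configuration instead of
// SPARK_SUBMIT_OPTS; the flag values here are only examples.
val conf = new SparkConf()
  .set("spark.driver.extraJavaOptions", "-XX:+UseG1GC -Dlog4j.debug=true")
  .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
{code}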

> Add public documentation for SPARK_SUBMIT_OPTS
> --
>
> Key: SPARK-30557
> URL: https://issues.apache.org/jira/browse/SPARK-30557
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Documentation
>Affects Versions: 2.4.4
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Is `SPARK_SUBMIT_OPTS` part of Spark's public interface? If so, it needs some 
> documentation. I cannot see it documented 
> [anywhere|https://github.com/apache/spark/search?q=SPARK_SUBMIT_OPTS_q=SPARK_SUBMIT_OPTS]
>  in the docs.
> How do you use it? What is it useful for? What's an example usage? etc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30561) start spark applications without a 30second startup penalty

2020-01-22 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021682#comment-17021682
 ] 

Hyukjin Kwon commented on SPARK-30561:
--

Is any of them safe to remove?

> start spark applications without a 30second startup penalty
> ---
>
> Key: SPARK-30561
> URL: https://issues.apache.org/jira/browse/SPARK-30561
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: t oo
>Priority: Major
>
> see 
> https://stackoverflow.com/questions/57610138/how-to-start-spark-applications-without-a-30second-startup-penalty
> using spark standalone.
> There are several sleeps that can be removed:
> grep -i 'sleep(' -R * | grep -v 'src/test/' | grep -E '^core' | grep -ivE 
> 'mesos|yarn|python|HistoryServer|spark/ui/'
> core/src/main/scala/org/apache/spark/util/Clock.scala:  
> Thread.sleep(sleepTime)
> core/src/main/scala/org/apache/spark/SparkContext.scala:   * sc.parallelize(1 
> to 1, 2).map { i => Thread.sleep(10); i }.count()
> core/src/main/scala/org/apache/spark/deploy/FaultToleranceTest.scala:  
> private def delay(secs: Duration = 5.seconds) = Thread.sleep(secs.toMillis)
> core/src/main/scala/org/apache/spark/deploy/FaultToleranceTest.scala: 
>  Thread.sleep(1000)
> core/src/main/scala/org/apache/spark/deploy/Client.scala:
> Thread.sleep(5000)
> core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala:  
> Thread.sleep(100)
> core/src/main/scala/org/apache/spark/deploy/StandaloneResourceUtils.scala:
>   Thread.sleep(duration)
> core/src/main/scala/org/apache/spark/deploy/worker/DriverRunner.scala:def 
> sleep(seconds: Int): Unit = (0 until seconds).takeWhile { _ =>
> core/src/main/scala/org/apache/spark/deploy/worker/DriverRunner.scala:  
> Thread.sleep(1000)
> core/src/main/scala/org/apache/spark/deploy/worker/DriverRunner.scala:
> sleeper.sleep(waitSeconds)
> core/src/main/scala/org/apache/spark/deploy/worker/DriverRunner.scala:  def 
> sleep(seconds: Int): Unit
> core/src/main/scala/org/apache/spark/deploy/rest/RestSubmissionClient.scala:  
> Thread.sleep(REPORT_DRIVER_STATUS_INTERVAL)
> core/src/main/scala/org/apache/spark/scheduler/AsyncEventQueue.scala:  
> Thread.sleep(10)
> core/src/main/scala/org/apache/spark/storage/BlockManager.scala:  
> Thread.sleep(SLEEP_TIME_SECS * 1000L)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30577) StorageLevel.DISK_ONLY_2 causes the data loss

2020-01-22 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021679#comment-17021679
 ] 

Hyukjin Kwon commented on SPARK-30577:
--

Spark 2.3 is EOL. Can you try it on a higher version? Also, it would be great 
to know how the table was created; otherwise, no one can reproduce this.

> StorageLevel.DISK_ONLY_2 causes the data loss
> -
>
> Key: SPARK-30577
> URL: https://issues.apache.org/jira/browse/SPARK-30577
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: zero222
>Priority: Major
> Attachments: DISK_ONLY_2.png
>
>
> As shown in the attachment, after I load the data of the Hive table (which is 
> immutable) and cache it at the DISK_ONLY_2 storage level, the count of the 
> data is different every time.
>  
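For reference, a hedged sketch of the reported scenario with a hypothetical table name, in case it helps others attempt a reproduction:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("disk-only-2-repro").enableHiveSupport().getOrCreate()

val df = spark.table("db.some_immutable_table")   // hypothetical table name
df.persist(StorageLevel.DISK_ONLY_2)              // cached on disk with 2x replication
df.count()                                        // reported to return different values across runs
df.count()
{code}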



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30577) StorageLevel.DISK_ONLY_2 causes the data loss

2020-01-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30577.
--
Resolution: Incomplete

I am resolving this as Incomplete since it targets an EOL release.

> StorageLevel.DISK_ONLY_2 causes the data loss
> -
>
> Key: SPARK-30577
> URL: https://issues.apache.org/jira/browse/SPARK-30577
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: zero222
>Priority: Major
> Attachments: DISK_ONLY_2.png
>
>
> As shown in the attachment, after I load the data of the Hive table (which is 
> immutable) and cache it at the DISK_ONLY_2 storage level, the count of the 
> data is different every time.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30580) Why can PySpark persist data only in serialised format?

2020-01-22 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021677#comment-17021677
 ] 

Hyukjin Kwon commented on SPARK-30580:
--

Let's ask questions on the mailing list; you could get a better answer there. See 
https://spark.apache.org/community.html

> Why can PySpark persist data only in serialised format?
> ---
>
> Key: SPARK-30580
> URL: https://issues.apache.org/jira/browse/SPARK-30580
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Francesco Cavrini
>Priority: Minor
>  Labels: performance
>
> The storage levels in PySpark only allow persisting data in serialised 
> format. There is also [a 
> comment|https://github.com/apache/spark/blob/master/python/pyspark/storagelevel.py#L28]
>  explicitly stating that "Since the data is always serialized on the Python 
> side, all the constants use the serialized formats." While that makes total 
> sense for RDDs, it is not clear to me why it is not possible to persist data 
> without serialisation when using the dataframe/dataset APIs. In theory, in 
> such cases, the persist would only be a directive and the data would never 
> leave the JVM, thus allowing for un-serialised persistence, correct? Many 
> thanks for the feedback!
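For context, a hedged Scala sketch of the contrast on the JVM side, where both deserialized and serialized levels are available for a Dataset; PySpark's StorageLevel constants only expose the serialized variants because Python-side data is always pickled. The session setup and data are illustrative.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().master("local[*]").appName("persist-levels").getOrCreate()
import spark.implicits._

val df = (1 to 1000).toDF("id")
df.persist(StorageLevel.MEMORY_ONLY)       // deserialized objects on the JVM heap
df.count()
df.unpersist()
df.persist(StorageLevel.MEMORY_ONLY_SER)   // serialized byte buffers instead
df.count()
{code}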



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30580) Why can PySpark persist data only in serialised format?

2020-01-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30580.
--
Resolution: Invalid

> Why can PySpark persist data only in serialised format?
> ---
>
> Key: SPARK-30580
> URL: https://issues.apache.org/jira/browse/SPARK-30580
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Francesco Cavrini
>Priority: Minor
>  Labels: performance
>
> The storage levels in PySpark only allow persisting data in serialised 
> format. There is also [a 
> comment|https://github.com/apache/spark/blob/master/python/pyspark/storagelevel.py#L28]
>  explicitly stating that "Since the data is always serialized on the Python 
> side, all the constants use the serialized formats." While that makes total 
> sense for RDDs, it is not clear to me why it is not possible to persist data 
> without serialisation when using the dataframe/dataset APIs. In theory, in 
> such cases, the persist would only be a directive and the data would never 
> leave the JVM, thus allowing for un-serialised persistence, correct? Many 
> thanks for the feedback!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30585) scalatest fails for Apache Spark SQL project

2020-01-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30585.
--
Resolution: Incomplete

> scalatest fails for Apache Spark SQL project
> 
>
> Key: SPARK-30585
> URL: https://issues.apache.org/jira/browse/SPARK-30585
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Rashmi
>Priority: Major
>
> Error logs:-
> 23:36:49.039 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 3.0 (TID 6, localhost, executor driver): TaskKilled (Stage cancelled)
> 23:36:49.039 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 3.0 (TID 7, localhost, executor driver): TaskKilled (Stage cancelled)
> 23:36:51.354 WARN 
> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor: Current 
> batch is falling behind. The trigger interval is 100 milliseconds, but spent 
> 1854 milliseconds
> 23:36:51.381 WARN 
> org.apache.spark.sql.execution.streaming.continuous.ContinuousQueuedDataReader$DataReaderThread:
>  data reader thread failed
> org.apache.spark.SparkException: Exception thrown in awaitResult:
>  at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
>  at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>  at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
>  at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
>  at 
> org.apache.spark.sql.execution.streaming.sources.ContinuousMemoryStreamInputPartitionReader.getRecord(ContinuousMemoryStream.scala:195)
>  at 
> org.apache.spark.sql.execution.streaming.sources.ContinuousMemoryStreamInputPartitionReader.next(ContinuousMemoryStream.scala:181)
>  at 
> org.apache.spark.sql.execution.streaming.continuous.ContinuousQueuedDataReader$DataReaderThread.run(ContinuousQueuedDataReader.scala:143)
> Caused by: org.apache.spark.SparkException: Could not find 
> ContinuousMemoryStreamRecordEndpoint-f7d4460c-9f4e-47ee-a846-258b34964852-9.
>  at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:160)
>  at 
> org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
>  at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
>  at org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
>  at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
>  ... 4 more
> 23:36:51.389 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 4.0 (TID 9, localhost, executor driver): TaskKilled (Stage cancelled)
> 23:36:51.390 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 4.0 (TID 8, localhost, executor driver): TaskKilled (Stage cancelled)
> - flatMap
> 23:36:51.754 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 5.0 (TID 11, localhost, executor driver): TaskKilled (Stage cancelled)
> 23:36:51.754 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 5.0 (TID 10, localhost, executor driver): TaskKilled (Stage cancelled)
> 23:36:52.248 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 6.0 (TID 13, localhost, executor driver): TaskKilled (Stage cancelled)
> 23:36:52.249 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 6.0 (TID 12, localhost, executor driver): TaskKilled (Stage cancelled)
> - filter
> 23:36:52.611 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 7.0 (TID 14, localhost, executor driver): TaskKilled (Stage cancelled)
> 23:36:52.611 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 7.0 (TID 15, localhost, executor driver): TaskKilled (Stage cancelled)
> - deduplicate
> - timestamp
> 23:36:53.015 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 8.0 (TID 16, localhost, executor driver): TaskKilled (Stage cancelled)
> 23:36:53.015 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 8.0 (TID 17, localhost, executor driver): TaskKilled (Stage cancelled)
> - subquery alias
> 23:36:53.572 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 9.0 (TID 19, localhost, executor driver): TaskKilled (Stage cancelled)
> 23:36:53.572 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 9.0 (TID 18, localhost, executor driver): TaskKilled (Stage cancelled)
> 23:36:53.953 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 10.0 (TID 21, localhost, executor driver): TaskKilled (Stage cancelled)
> 23:36:53.953 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 10.0 (TID 20, localhost, executor driver): TaskKilled (Stage cancelled)
> 23:36:54.552 WARN 

[jira] [Commented] (SPARK-30585) scalatest fails for Apache Spark SQL project

2020-01-22 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021675#comment-17021675
 ] 

Hyukjin Kwon commented on SPARK-30585:
--

Also, please don't just copy and paste the logs. You should file an issue 
describing the symptoms, with a reproducer if possible.
I am resolving this until those are provided.

> scalatest fails for Apache Spark SQL project
> 
>
> Key: SPARK-30585
> URL: https://issues.apache.org/jira/browse/SPARK-30585
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Rashmi
>Priority: Major
>
> Error logs:-
> 23:36:49.039 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 3.0 (TID 6, localhost, executor driver): TaskKilled (Stage cancelled)
> 23:36:49.039 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 3.0 (TID 7, localhost, executor driver): TaskKilled (Stage cancelled)
> 23:36:51.354 WARN 
> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor: Current 
> batch is falling behind. The trigger interval is 100 milliseconds, but spent 
> 1854 milliseconds
> 23:36:51.381 WARN 
> org.apache.spark.sql.execution.streaming.continuous.ContinuousQueuedDataReader$DataReaderThread:
>  data reader thread failed
> org.apache.spark.SparkException: Exception thrown in awaitResult:
>  at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
>  at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>  at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
>  at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
>  at 
> org.apache.spark.sql.execution.streaming.sources.ContinuousMemoryStreamInputPartitionReader.getRecord(ContinuousMemoryStream.scala:195)
>  at 
> org.apache.spark.sql.execution.streaming.sources.ContinuousMemoryStreamInputPartitionReader.next(ContinuousMemoryStream.scala:181)
>  at 
> org.apache.spark.sql.execution.streaming.continuous.ContinuousQueuedDataReader$DataReaderThread.run(ContinuousQueuedDataReader.scala:143)
> Caused by: org.apache.spark.SparkException: Could not find 
> ContinuousMemoryStreamRecordEndpoint-f7d4460c-9f4e-47ee-a846-258b34964852-9.
>  at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:160)
>  at 
> org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
>  at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
>  at org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
>  at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
>  ... 4 more
> 23:36:51.389 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 4.0 (TID 9, localhost, executor driver): TaskKilled (Stage cancelled)
> 23:36:51.390 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 4.0 (TID 8, localhost, executor driver): TaskKilled (Stage cancelled)
> - flatMap
> 23:36:51.754 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 5.0 (TID 11, localhost, executor driver): TaskKilled (Stage cancelled)
> 23:36:51.754 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 5.0 (TID 10, localhost, executor driver): TaskKilled (Stage cancelled)
> 23:36:52.248 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 6.0 (TID 13, localhost, executor driver): TaskKilled (Stage cancelled)
> 23:36:52.249 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 6.0 (TID 12, localhost, executor driver): TaskKilled (Stage cancelled)
> - filter
> 23:36:52.611 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 7.0 (TID 14, localhost, executor driver): TaskKilled (Stage cancelled)
> 23:36:52.611 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 7.0 (TID 15, localhost, executor driver): TaskKilled (Stage cancelled)
> - deduplicate
> - timestamp
> 23:36:53.015 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 8.0 (TID 16, localhost, executor driver): TaskKilled (Stage cancelled)
> 23:36:53.015 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 8.0 (TID 17, localhost, executor driver): TaskKilled (Stage cancelled)
> - subquery alias
> 23:36:53.572 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 9.0 (TID 19, localhost, executor driver): TaskKilled (Stage cancelled)
> 23:36:53.572 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 9.0 (TID 18, localhost, executor driver): TaskKilled (Stage cancelled)
> 23:36:53.953 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 10.0 (TID 21, localhost, executor driver): TaskKilled (Stage cancelled)
> 23:36:53.953 WARN 

[jira] [Commented] (SPARK-30590) can't use more than five type-safe user-defined aggregation in select statement

2020-01-22 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021674#comment-17021674
 ] 

Hyukjin Kwon commented on SPARK-30590:
--

Seems fixed in master:

{code}
scala> df.select(fooAgg(1),fooAgg(2),fooAgg(3),fooAgg(4),fooAgg(5)).show
+---------+---------+---------+---------+---------+
|foo_agg_1|foo_agg_2|foo_agg_3|foo_agg_4|foo_agg_5|
+---------+---------+---------+---------+---------+
|        3|        5|        7|        9|       11|
+---------+---------+---------+---------+---------+
{code}

It would be great if we could identify which JIRA fixed it and see whether we 
can backport it.

> can't use more than five type-safe user-defined aggregation in select 
> statement
> ---
>
> Key: SPARK-30590
> URL: https://issues.apache.org/jira/browse/SPARK-30590
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.3, 2.3.4, 2.4.4
>Reporter: Daniel Mantovani
>Priority: Major
>
>  How to reproduce:
> {code:scala}
> val df = Seq((1,2,3,4,5,6)).toDF("a","b","c","d","e","f")
> import org.apache.spark.sql.expressions.Aggregator
> import org.apache.spark.sql.Encoder
> import org.apache.spark.sql.Encoders
> import org.apache.spark.sql.Row
> case class FooAgg(s:Int) extends Aggregator[Row, Int, Int] {
>   def zero:Int = s
>   def reduce(b: Int, r: Row): Int = b + r.getAs[Int](0)
>   def merge(b1: Int, b2: Int): Int = b1 + b2
>   def finish(b: Int): Int = b
>   def bufferEncoder: Encoder[Int] = Encoders.scalaInt
>   def outputEncoder: Encoder[Int] = Encoders.scalaInt
> }
> val fooAgg = (i:Int) => FooAgg(i).toColumn.name(s"foo_agg_$i")
> scala> df.select(fooAgg(1),fooAgg(2),fooAgg(3),fooAgg(4),fooAgg(5)).show
> +---------+---------+---------+---------+---------+
> |foo_agg_1|foo_agg_2|foo_agg_3|foo_agg_4|foo_agg_5|
> +---------+---------+---------+---------+---------+
> |        3|        5|        7|        9|       11|
> +---------+---------+---------+---------+---------+
> {code}
> With 6 arguments we have error:
> {code:scala}
> scala> 
> df.select(fooAgg(1),fooAgg(2),fooAgg(3),fooAgg(4),fooAgg(5),fooAgg(6)).show
> org.apache.spark.sql.AnalysisException: unresolved operator 'Aggregate 
> [fooagg(FooAgg(1), None, None, None, input[0, int, false] AS value#114, 
> assertnotnull(cast(value#114 as int)), input[0, int, false] AS value#113, 
> IntegerType, IntegerType, false) AS foo_agg_1#116, fooagg(FooAgg(2), None, 
> None, None, input[0, int, false] AS value#119, assertnotnull(cast(value#119 
> as int)), input[0, int, false] AS value#118, IntegerType, IntegerType, false) 
> AS foo_agg_2#121, fooagg(FooAgg(3), None, None, None, input[0, int, false] AS 
> value#124, assertnotnull(cast(value#124 as int)), input[0, int, false] AS 
> value#123, IntegerType, IntegerType, false) AS foo_agg_3#126, 
> fooagg(FooAgg(4), None, None, None, input[0, int, false] AS value#129, 
> assertnotnull(cast(value#129 as int)), input[0, int, false] AS value#128, 
> IntegerType, IntegerType, false) AS foo_agg_4#131, fooagg(FooAgg(5), None, 
> None, None, input[0, int, false] AS value#134, assertnotnull(cast(value#134 
> as int)), input[0, int, false] AS value#133, IntegerType, IntegerType, false) 
> AS foo_agg_5#136, fooagg(FooAgg(6), None, None, None, input[0, int, false] AS 
> value#139, assertnotnull(cast(value#139 as int)), input[0, int, false] AS 
> value#138, IntegerType, IntegerType, false) AS foo_agg_6#141];;
> 'Aggregate [fooagg(FooAgg(1), None, None, None, input[0, int, false] AS 
> value#114, assertnotnull(cast(value#114 as int)), input[0, int, false] AS 
> value#113, IntegerType, IntegerType, false) AS foo_agg_1#116, 
> fooagg(FooAgg(2), None, None, None, input[0, int, false] AS value#119, 
> assertnotnull(cast(value#119 as int)), input[0, int, false] AS value#118, 
> IntegerType, IntegerType, false) AS foo_agg_2#121, fooagg(FooAgg(3), None, 
> None, None, input[0, int, false] AS value#124, assertnotnull(cast(value#124 
> as int)), input[0, int, false] AS value#123, IntegerType, IntegerType, false) 
> AS foo_agg_3#126, fooagg(FooAgg(4), None, None, None, input[0, int, false] AS 
> value#129, assertnotnull(cast(value#129 as int)), input[0, int, false] AS 
> value#128, IntegerType, IntegerType, false) AS foo_agg_4#131, 
> fooagg(FooAgg(5), None, None, None, input[0, int, false] AS value#134, 
> assertnotnull(cast(value#134 as int)), input[0, int, false] AS value#133, 
> IntegerType, IntegerType, false) AS foo_agg_5#136, fooagg(FooAgg(6), None, 
> None, None, input[0, int, false] AS value#139, assertnotnull(cast(value#139 
> as int)), input[0, int, false] AS value#138, IntegerType, IntegerType, false) 
> AS foo_agg_6#141]
> +- Project [_1#6 AS a#13, _2#7 AS b#14, _3#8 AS c#15, _4#9 AS d#16, _5#10 AS 
> e#17, _6#11 AS F#18]
>  +- LocalRelation 

[jira] [Resolved] (SPARK-30590) can't use more than five type-safe user-defined aggregation in select statement

2020-01-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30590.
--
Resolution: Cannot Reproduce

> can't use more than five type-safe user-defined aggregation in select 
> statement
> ---
>
> Key: SPARK-30590
> URL: https://issues.apache.org/jira/browse/SPARK-30590
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.3, 2.3.4, 2.4.4
>Reporter: Daniel Mantovani
>Priority: Major
>
>  How to reproduce:
> {code:scala}
> val df = Seq((1,2,3,4,5,6)).toDF("a","b","c","d","e","f")
> import org.apache.spark.sql.expressions.Aggregator
> import org.apache.spark.sql.Encoder
> import org.apache.spark.sql.Encoders
> import org.apache.spark.sql.Row
> case class FooAgg(s:Int) extends Aggregator[Row, Int, Int] {
>   def zero:Int = s
>   def reduce(b: Int, r: Row): Int = b + r.getAs[Int](0)
>   def merge(b1: Int, b2: Int): Int = b1 + b2
>   def finish(b: Int): Int = b
>   def bufferEncoder: Encoder[Int] = Encoders.scalaInt
>   def outputEncoder: Encoder[Int] = Encoders.scalaInt
> }
> val fooAgg = (i:Int) => FooAgg(i).toColumn.name(s"foo_agg_$i")
> scala> df.select(fooAgg(1),fooAgg(2),fooAgg(3),fooAgg(4),fooAgg(5)).show
> +---------+---------+---------+---------+---------+
> |foo_agg_1|foo_agg_2|foo_agg_3|foo_agg_4|foo_agg_5|
> +---------+---------+---------+---------+---------+
> |        3|        5|        7|        9|       11|
> +---------+---------+---------+---------+---------+
> {code}
> With 6 arguments we have error:
> {code:scala}
> scala> 
> df.select(fooAgg(1),fooAgg(2),fooAgg(3),fooAgg(4),fooAgg(5),fooAgg(6)).show
> org.apache.spark.sql.AnalysisException: unresolved operator 'Aggregate 
> [fooagg(FooAgg(1), None, None, None, input[0, int, false] AS value#114, 
> assertnotnull(cast(value#114 as int)), input[0, int, false] AS value#113, 
> IntegerType, IntegerType, false) AS foo_agg_1#116, fooagg(FooAgg(2), None, 
> None, None, input[0, int, false] AS value#119, assertnotnull(cast(value#119 
> as int)), input[0, int, false] AS value#118, IntegerType, IntegerType, false) 
> AS foo_agg_2#121, fooagg(FooAgg(3), None, None, None, input[0, int, false] AS 
> value#124, assertnotnull(cast(value#124 as int)), input[0, int, false] AS 
> value#123, IntegerType, IntegerType, false) AS foo_agg_3#126, 
> fooagg(FooAgg(4), None, None, None, input[0, int, false] AS value#129, 
> assertnotnull(cast(value#129 as int)), input[0, int, false] AS value#128, 
> IntegerType, IntegerType, false) AS foo_agg_4#131, fooagg(FooAgg(5), None, 
> None, None, input[0, int, false] AS value#134, assertnotnull(cast(value#134 
> as int)), input[0, int, false] AS value#133, IntegerType, IntegerType, false) 
> AS foo_agg_5#136, fooagg(FooAgg(6), None, None, None, input[0, int, false] AS 
> value#139, assertnotnull(cast(value#139 as int)), input[0, int, false] AS 
> value#138, IntegerType, IntegerType, false) AS foo_agg_6#141];;
> 'Aggregate [fooagg(FooAgg(1), None, None, None, input[0, int, false] AS 
> value#114, assertnotnull(cast(value#114 as int)), input[0, int, false] AS 
> value#113, IntegerType, IntegerType, false) AS foo_agg_1#116, 
> fooagg(FooAgg(2), None, None, None, input[0, int, false] AS value#119, 
> assertnotnull(cast(value#119 as int)), input[0, int, false] AS value#118, 
> IntegerType, IntegerType, false) AS foo_agg_2#121, fooagg(FooAgg(3), None, 
> None, None, input[0, int, false] AS value#124, assertnotnull(cast(value#124 
> as int)), input[0, int, false] AS value#123, IntegerType, IntegerType, false) 
> AS foo_agg_3#126, fooagg(FooAgg(4), None, None, None, input[0, int, false] AS 
> value#129, assertnotnull(cast(value#129 as int)), input[0, int, false] AS 
> value#128, IntegerType, IntegerType, false) AS foo_agg_4#131, 
> fooagg(FooAgg(5), None, None, None, input[0, int, false] AS value#134, 
> assertnotnull(cast(value#134 as int)), input[0, int, false] AS value#133, 
> IntegerType, IntegerType, false) AS foo_agg_5#136, fooagg(FooAgg(6), None, 
> None, None, input[0, int, false] AS value#139, assertnotnull(cast(value#139 
> as int)), input[0, int, false] AS value#138, IntegerType, IntegerType, false) 
> AS foo_agg_6#141]
> +- Project [_1#6 AS a#13, _2#7 AS b#14, _3#8 AS c#15, _4#9 AS d#16, _5#10 AS 
> e#17, _6#11 AS F#18]
>  +- LocalRelation [_1#6, _2#7, _3#8, _4#9, _5#10, _6#11]
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:43)
>  at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:95)
>  at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$3.apply(CheckAnalysis.scala:431)
>  at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$3.apply(CheckAnalysis.scala:430)
> 

[jira] [Updated] (SPARK-30602) SPIP: Support push-based shuffle to improve shuffle efficiency

2020-01-22 Thread Min Shen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Min Shen updated SPARK-30602:
-
Description: 
In a large deployment of a Spark compute infrastructure, Spark shuffle is 
becoming a potential scaling bottleneck and a source of inefficiency in the 
cluster. When doing Spark on YARN for a large-scale deployment, people usually 
enable Spark external shuffle service and store the intermediate shuffle files 
on HDD. Because the number of blocks generated for a particular shuffle grows 
quadratically compared to the size of shuffled data (# mappers and reducers 
grows linearly with the size of shuffled data, but # blocks is # mappers * # 
reducers), one general trend we have observed is that the more data a Spark 
application processes, the smaller the block size becomes. In a few production 
clusters we have seen, the average shuffle block size is only 10s of KBs. 
Because of the inefficiency of performing random reads on HDD for small amount 
of data, the overall efficiency of the Spark external shuffle services serving 
the shuffle blocks degrades as we see an increasing # of Spark applications 
processing an increasing amount of data. In addition, because Spark external 
shuffle service is a shared service in a multi-tenancy cluster, the 
inefficiency with one Spark application could propagate to other applications 
as well.

In this ticket, we propose a solution to improve Spark shuffle efficiency in 
above mentioned environments with push-based shuffle. With push-based shuffle, 
shuffle is performed at the end of mappers and blocks get pre-merged and move 
towards reducers. In our prototype implementation, we have seen significant 
efficiency improvements when performing large shuffles. We take a Spark-native 
approach to achieve this, i.e., extending Spark’s existing shuffle netty 
protocol, and the behaviors of Spark mappers, reducers and drivers. This way, 
we can bring the benefits of more efficient shuffle in Spark without incurring 
the dependency or overhead of either specialized storage layer or external 
infrastructure pieces.

 

Link to dev mailing list discussion: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html

  was:
In a large deployment of a Spark compute infrastructure, Spark shuffle is 
becoming a potential scaling bottleneck and a source of inefficiency in the 
cluster. When doing Spark on YARN for a large-scale deployment, people usually 
enable Spark external shuffle service and store the intermediate shuffle files 
on HDD. Because the number of blocks generated for a particular shuffle grows 
quadratically compared to the size of shuffled data (# mappers and reducers 
grows linearly with the size of shuffled data, but # blocks is # mappers * # 
reducers), one general trend we have observed is that the more data a Spark 
application processes, the smaller the block size becomes. In a few production 
clusters we have seen, the average shuffle block size is only 10s of KBs. 
Because of the inefficiency of performing random reads on HDD for small amount 
of data, the overall efficiency of the Spark external shuffle services serving 
the shuffle blocks degrades as we see an increasing # of Spark applications 
processing an increasing amount of data. In addition, because Spark external 
shuffle service is a shared service in a multi-tenancy cluster, the 
inefficiency with one Spark application could propagate to other applications 
as well.

In this ticket, we propose a solution to improve Spark shuffle efficiency in 
above mentioned environments with push-based shuffle. With push-based shuffle, 
shuffle is performed at the end of mappers and blocks get pre-merged and move 
towards reducers. In our prototype implementation, we have seen significant 
efficiency improvements when performing large shuffles. We take a Spark-native 
approach to achieve this, i.e., extending Spark’s existing shuffle netty 
protocol, and the behaviors of Spark mappers, reducers and drivers. This way, 
we can bring the benefits of more efficient shuffle in Spark without incurring 
the dependency or overhead of either specialized storage layer or external 
infrastructure pieces.


> SPIP: Support push-based shuffle to improve shuffle efficiency
> --
>
> Key: SPARK-30602
> URL: https://issues.apache.org/jira/browse/SPARK-30602
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: Min Shen
>Priority: Major
>
> In a large deployment of a Spark compute infrastructure, Spark shuffle is 
> becoming a potential scaling bottleneck and a source of inefficiency in the 
> cluster. When doing Spark on YARN for a large-scale deployment, people 
> usually enable 

[jira] [Updated] (SPARK-30597) Unable to load properties fine in SparkStandalone HDFS mode

2020-01-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30597:
-
Description: 
We run the spark application in Yarn HDFS/NFS/WebHDFS and standalone HDFS/NFS 
mode.

When the application is submitted in standalone HDFS mode, the configuration 
jar (properties file) is not read when the application starts, and this makes 
the logger fall back to the default log file and log level.

So when the application is submitted to standalone HDFS, the configuration 
files are not read. 

STD OUT LOGS from Standalone HDFS  - properties file is not found 

{code}
log4j: Trying to find [osa-log.properties] using context classloader 
sun.misc.Launcher$AppClassLoader@4cdf35a9.
log4j: Trying to find [osa-log.properties] using 
sun.misc.Launcher$AppClassLoader@4cdf35a9 class loader.
log4j: Trying to find [osa-log.properties] using 
ClassLoader.getSystemResource().
log4j: Could not find resource: [osa-log.properties].
log4j: Reading configuration from URL 
jar:[file:/osa/spark-2.4.3-bin-hadoop2.7/jars/spark-core_2.11-2.4.3.jar!/org/apache/spark/log4j-defaults.properties|file:///osa/spark-2.4.3-bin-hadoop2.7/jars/spark-core_2.11-2.4.3.jar!/org/apache/spark/log4j-defaults.properties]
 {code}

 

 

STD out from Standalone NFS  - properties files are found and able to load it

{code}
log4j: Trying to find [osa-log.properties] using context classloader 
sun.misc.Launcher$AppClassLoader@4cdf35a9.
log4j: Using URL 
[jar:file:/scratch//gitlocal/soa-osa/out/configdir/ux6m3UQZ/app/sx_DatePipeline_A44C5337_B0D6_4A67_9D60_6BE629DABADA_x0jR7lg5_public/__config__.jar!/osa-log.properties]
 for automatic log4j configuration.
log4j: Preferred configurator class: OSALogPropertyConfigurator
{code}

 

STD out from YARN HDFS  — properties files are found and able to load it

{code}
log4j: Trying to find [osa-log.properties] using context classloader 
sun.misc.Launcher$AppClassLoader@2626b418.
log4j: Using URL 
[file:/tmp/hadoop-yarn/nm-local-dir/usercache/osa/filecache/16/__osa_config__8272345390991627491.jar/osa-log.properties|file:///tmp/hadoop-yarn/nm-local-dir/usercache/osa/filecache/16/__osa_config__8272345390991627491.jar/osa-log.properties]
 for automatic log4j configuration.
log4j: Preferred configurator class: OSALogPropertyConfigurator
log4j: configuration 
[file:/tmp/hadoop-yarn/nm-local-dir/usercache/osa/filecache/16/__osa_config__8272345390991627491.jar/osa-log.properties|file:///tmp/hadoop-yarn/nm-local-dir/usercache/osa/filecache/16/__osa_config__8272345390991627491.jar/osa-log.properties]
{code}
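
For reference, a minimal Scala sketch, not from this ticket, of the classloader lookup that the log lines above describe. It assumes log4j 1.2 (as bundled with Spark 2.4.x) and that osa-log.properties is the resource being searched for.

{code:scala}
import org.apache.log4j.PropertyConfigurator

object ConfigureOsaLogging {
  def main(args: Array[String]): Unit = {
    // Resolve the properties file through the context classloader, the same way the
    // "Trying to find [osa-log.properties]" lines above do.
    val url = Thread.currentThread().getContextClassLoader.getResource("osa-log.properties")
    if (url != null) {
      // Equivalent to the "Using URL ... for automatic log4j configuration" case.
      PropertyConfigurator.configure(url)
      println(s"Configured log4j from $url")
    } else {
      // The standalone HDFS case above: the resource is not on the classpath, so log4j
      // falls back to spark-core's bundled log4j-defaults.properties.
      println("osa-log.properties not found on the classpath")
    }
  }
}
{code}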

 

 

  was:
We run the spark application in Yarn HDFS/NFS/WebHDFS and standalone HDFS/NFS 
mode.

When the application is submitted in standalone HDFS mode, the configuration 
jar (properties file) is not read when the application starts, and this makes 
the logger fall back to the default log file and log level.

So when the application is submitted to standalone HDFS, the configuration 
files are not read. 


STD OUT LOGS from Standalone HDFS  - properties file is not found 

==

log4j: Trying to find [osa-log.properties] using context classloader 
sun.misc.Launcher$AppClassLoader@4cdf35a9.
log4j: Trying to find [osa-log.properties] using 
sun.misc.Launcher$AppClassLoader@4cdf35a9 class loader.
log4j: Trying to find [osa-log.properties] using 
ClassLoader.getSystemResource().
log4j: Could not find resource: [osa-log.properties].
log4j: Reading configuration from URL 
jar:file:/osa/spark-2.4.3-bin-hadoop2.7/jars/spark-core_2.11-2.4.3.jar!/org/apache/spark/log4j-defaults.properties

 

 

STD out from Standalone NFS  - properties files are found and able to load it

===

 

log4j: Trying to find [osa-log.properties] using context classloader 
sun.misc.Launcher$AppClassLoader@4cdf35a9.
log4j: Using URL 
[jar:file:/scratch//gitlocal/soa-osa/out/configdir/ux6m3UQZ/app/sx_DatePipeline_A44C5337_B0D6_4A67_9D60_6BE629DABADA_x0jR7lg5_public/__config__.jar!/osa-log.properties]
 for automatic log4j configuration.
log4j: Preferred configurator class: OSALogPropertyConfigurator

 

STD out from YARN HDFS  — properties files are found and able to load it

===

log4j: Trying to find [osa-log.properties] using context classloader 
sun.misc.Launcher$AppClassLoader@2626b418.
log4j: Using URL 
[file:/tmp/hadoop-yarn/nm-local-dir/usercache/osa/filecache/16/__osa_config__8272345390991627491.jar/osa-log.properties]
 for automatic log4j configuration.
log4j: Preferred configurator class: OSALogPropertyConfigurator
log4j: configuration 
file:/tmp/hadoop-yarn/nm-local-dir/usercache/osa/filecache/16/__osa_config__8272345390991627491.jar/osa-log.properties

 

 

 


> Unable to load properties file in Spark Standalone HDFS mode
> ---
>
> Key: SPARK-30597
> URL: https://issues.apache.org/jira/browse/SPARK-30597
> 

[jira] [Updated] (SPARK-30597) Unable to load properties file in Spark Standalone HDFS mode

2020-01-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30597:
-
Description: 
We run the spark application in Yarn HDFS/NFS/WebHDFS and standalone HDFS/NFS 
mode.

When the application is submitted in standalone HDFS mode, the configuration 
jar (properties file) is not read when the application starts, and this makes 
the logger fall back to the default log file and log level.

So when the application is submitted to standalone HDFS, the configuration 
files are not read. 

STD OUT LOGS from Standalone HDFS  - properties file is not found 

{code}
log4j: Trying to find [osa-log.properties] using context classloader 
sun.misc.Launcher$AppClassLoader@4cdf35a9.
log4j: Trying to find [osa-log.properties] using 
sun.misc.Launcher$AppClassLoader@4cdf35a9 class loader.
log4j: Trying to find [osa-log.properties] using 
ClassLoader.getSystemResource().
log4j: Could not find resource: [osa-log.properties].
log4j: Reading configuration from URL 
jar:[file:/osa/spark-2.4.3-bin-hadoop2.7/jars/spark-core_2.11-2.4.3.jar!/org/apache/spark/log4j-defaults.properties|file:///osa/spark-2.4.3-bin-hadoop2.7/jars/spark-core_2.11-2.4.3.jar!/org/apache/spark/log4j-defaults.properties]
{code}


STD out from Standalone NFS  - properties files are found and able to load it

{code}
log4j: Trying to find [osa-log.properties] using context classloader 
sun.misc.Launcher$AppClassLoader@4cdf35a9.
log4j: Using URL 
[jar:file:/scratch//gitlocal/soa-osa/out/configdir/ux6m3UQZ/app/sx_DatePipeline_A44C5337_B0D6_4A67_9D60_6BE629DABADA_x0jR7lg5_public/__config__.jar!/osa-log.properties]
 for automatic log4j configuration.
log4j: Preferred configurator class: OSALogPropertyConfigurator
{code}


STD out from YARN HDFS  — properties files are found and able to load it

{code}
log4j: Trying to find [osa-log.properties] using context classloader 
sun.misc.Launcher$AppClassLoader@2626b418.
log4j: Using URL 
[file:/tmp/hadoop-yarn/nm-local-dir/usercache/osa/filecache/16/__osa_config__8272345390991627491.jar/osa-log.properties|file:///tmp/hadoop-yarn/nm-local-dir/usercache/osa/filecache/16/__osa_config__8272345390991627491.jar/osa-log.properties]
 for automatic log4j configuration.
log4j: Preferred configurator class: OSALogPropertyConfigurator
log4j: configuration 
[file:/tmp/hadoop-yarn/nm-local-dir/usercache/osa/filecache/16/__osa_config__8272345390991627491.jar/osa-log.properties|file:///tmp/hadoop-yarn/nm-local-dir/usercache/osa/filecache/16/__osa_config__8272345390991627491.jar/osa-log.properties]
{code}

 

 

  was:
We run the spark application in Yarn HDFS/NFS/WebHDFS and standalone HDFS/NFS 
mode.

When the application is submitted in standalone HDFS mode, the configuration 
jar (properties file) is not read when the application starts, and this makes 
the logger fall back to the default log file and log level.

So when the application is submitted to standalone HDFS, the configuration 
files are not read. 

STD OUT LOGS from Standalone HDFS  - properties file is not found 

{code}
log4j: Trying to find [osa-log.properties] using context classloader 
sun.misc.Launcher$AppClassLoader@4cdf35a9.
log4j: Trying to find [osa-log.properties] using 
sun.misc.Launcher$AppClassLoader@4cdf35a9 class loader.
log4j: Trying to find [osa-log.properties] using 
ClassLoader.getSystemResource().
log4j: Could not find resource: [osa-log.properties].
log4j: Reading configuration from URL 
jar:[file:/osa/spark-2.4.3-bin-hadoop2.7/jars/spark-core_2.11-2.4.3.jar!/org/apache/spark/log4j-defaults.properties|file:///osa/spark-2.4.3-bin-hadoop2.7/jars/spark-core_2.11-2.4.3.jar!/org/apache/spark/log4j-defaults.properties]
 {code}

 

 

STD out from Standalone NFS  - properties files are found and able to load it

{code}
log4j: Trying to find [osa-log.properties] using context classloader 
sun.misc.Launcher$AppClassLoader@4cdf35a9.
log4j: Using URL 
[jar:file:/scratch//gitlocal/soa-osa/out/configdir/ux6m3UQZ/app/sx_DatePipeline_A44C5337_B0D6_4A67_9D60_6BE629DABADA_x0jR7lg5_public/__config__.jar!/osa-log.properties]
 for automatic log4j configuration.
log4j: Preferred configurator class: OSALogPropertyConfigurator
{code}

 

STD out from YARN HDFS  — properties files are found and able to load it

{code}
log4j: Trying to find [osa-log.properties] using context classloader 
sun.misc.Launcher$AppClassLoader@2626b418.
log4j: Using URL 
[file:/tmp/hadoop-yarn/nm-local-dir/usercache/osa/filecache/16/__osa_config__8272345390991627491.jar/osa-log.properties|file:///tmp/hadoop-yarn/nm-local-dir/usercache/osa/filecache/16/__osa_config__8272345390991627491.jar/osa-log.properties]
 for automatic log4j configuration.
log4j: Preferred configurator class: OSALogPropertyConfigurator
log4j: configuration 

[jira] [Commented] (SPARK-30602) SPIP: Support push-based shuffle to improve shuffle efficiency

2020-01-22 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021672#comment-17021672
 ] 

Hyukjin Kwon commented on SPARK-30602:
--

Can you send an email to the dev list to discuss this? If you already did, please 
add a link to the mailing thread in the PR description.

 

> SPIP: Support push-based shuffle to improve shuffle efficiency
> --
>
> Key: SPARK-30602
> URL: https://issues.apache.org/jira/browse/SPARK-30602
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: Min Shen
>Priority: Major
>
> In a large deployment of a Spark compute infrastructure, Spark shuffle is 
> becoming a potential scaling bottleneck and a source of inefficiency in the 
> cluster. When doing Spark on YARN for a large-scale deployment, people 
> usually enable Spark external shuffle service and store the intermediate 
> shuffle files on HDD. Because the number of blocks generated for a particular 
> shuffle grows quadratically compared to the size of shuffled data (# mappers 
> and reducers grows linearly with the size of shuffled data, but # blocks is # 
> mappers * # reducers), one general trend we have observed is that the more 
> data a Spark application processes, the smaller the block size becomes. In a 
> few production clusters we have seen, the average shuffle block size is only 
> 10s of KBs. Because of the inefficiency of performing random reads on HDD for 
> small amount of data, the overall efficiency of the Spark external shuffle 
> services serving the shuffle blocks degrades as we see an increasing # of 
> Spark applications processing an increasing amount of data. In addition, 
> because Spark external shuffle service is a shared service in a multi-tenancy 
> cluster, the inefficiency with one Spark application could propagate to other 
> applications as well.
> In this ticket, we propose a solution to improve Spark shuffle efficiency in 
> above mentioned environments with push-based shuffle. With push-based 
> shuffle, shuffle is performed at the end of mappers and blocks get pre-merged 
> and move towards reducers. In our prototype implementation, we have seen 
> significant efficiency improvements when performing large shuffles. We take a 
> Spark-native approach to achieve this, i.e., extending Spark’s existing 
> shuffle netty protocol, and the behaviors of Spark mappers, reducers and 
> drivers. This way, we can bring the benefits of more efficient shuffle in 
> Spark without incurring the dependency or overhead of either specialized 
> storage layer or external infrastructure pieces.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30608) Postgres Column Interval converts to string and cant be written back to postgres

2020-01-22 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021669#comment-17021669
 ] 

Hyukjin Kwon commented on SPARK-30608:
--

I don't see that interval type conversions are supported between Spark and 
PostgreSQL. It seems you read the column as a string and want it to be converted 
automatically to the interval type on the PostgreSQL side.
IntervalType is currently private on the Spark side, so you should save it as a 
string in PostgreSQL too and cast it to interval when you need to use it.
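
A minimal Scala sketch of that workaround. The connection details are placeholders; the table and column names are taken from the stack trace quoted below.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("interval-as-string").getOrCreate()

// The interval column arrives in Spark as StringType.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db")   // placeholder connection string
  .option("dbtable", "test_table")
  .option("user", "user").option("password", "password")
  .load()

// Write it to a table whose "duration" column is declared as text/varchar,
// and cast to interval only when querying in PostgreSQL:
//   SELECT duration::interval FROM test_table_copy;
df.write
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db")
  .option("dbtable", "test_table_copy")
  .option("user", "user").option("password", "password")
  .mode("append")
  .save()
{code}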

> Postgres Column Interval converts to string and cant be written back to 
> postgres
> 
>
> Key: SPARK-30608
> URL: https://issues.apache.org/jira/browse/SPARK-30608
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Sumit
>Priority: Major
>
> If we read an "Interval" type column from Postgres and try to save it back to 
> Postgres, an exception occurs: during the read operation the Postgres column is 
> converted to String, and while saving it back the following error is raised:
>  
> java.sql.BatchUpdateException: Batch entry 0 INSERT INTO test_table 
> ("dob","dob_time","dob_time_zone","duration") VALUES ('2019-05-29 
> -04','2016-08-12 10:22:31.10-04','2016-08-12 13:22:31.10-04','3 days 
> 10:00:00') was aborted: ERROR: column "duration" is of type interval but 
> expression is of type character varying
>  Hint: You will need to rewrite or cast the expression.
>  Position: 86 Call getNextException to see other errors in the batch.
>  at 
> org.postgresql.jdbc.BatchResultHandler.handleError(BatchResultHandler.java:151)
>  at 
> org.postgresql.core.ResultHandlerDelegate.handleError(ResultHandlerDelegate.java:45)
>  at 
> org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2159)
>  at 
> org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:463)
>  at org.postgresql.jdbc.PgStatement.executeBatch(PgStatement.java:794)
>  at 
> org.postgresql.jdbc.PgPreparedStatement.executeBatch(PgPreparedStatement.java:1662)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:672)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:834)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:834)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
>  at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
>  at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>  at org.apache.spark.scheduler.Task.run(Task.scala:123)
>  at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30602) SPIP: Support push-based shuffle to improve shuffle efficiency

2020-01-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30602:
-
Summary: SPIP: Support push-based shuffle to improve shuffle efficiency  
(was: Support push-based shuffle to improve shuffle efficiency)

> SPIP: Support push-based shuffle to improve shuffle efficiency
> --
>
> Key: SPARK-30602
> URL: https://issues.apache.org/jira/browse/SPARK-30602
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: Min Shen
>Priority: Major
>
> In a large deployment of a Spark compute infrastructure, Spark shuffle is 
> becoming a potential scaling bottleneck and a source of inefficiency in the 
> cluster. When doing Spark on YARN for a large-scale deployment, people 
> usually enable Spark external shuffle service and store the intermediate 
> shuffle files on HDD. Because the number of blocks generated for a particular 
> shuffle grows quadratically compared to the size of shuffled data (# mappers 
> and reducers grows linearly with the size of shuffled data, but # blocks is # 
> mappers * # reducers), one general trend we have observed is that the more 
> data a Spark application processes, the smaller the block size becomes. In a 
> few production clusters we have seen, the average shuffle block size is only 
> 10s of KBs. Because of the inefficiency of performing random reads on HDD for 
> small amount of data, the overall efficiency of the Spark external shuffle 
> services serving the shuffle blocks degrades as we see an increasing # of 
> Spark applications processing an increasing amount of data. In addition, 
> because Spark external shuffle service is a shared service in a multi-tenancy 
> cluster, the inefficiency with one Spark application could propagate to other 
> applications as well.
> In this ticket, we propose a solution to improve Spark shuffle efficiency in 
> above mentioned environments with push-based shuffle. With push-based 
> shuffle, shuffle is performed at the end of mappers and blocks get pre-merged 
> and move towards reducers. In our prototype implementation, we have seen 
> significant efficiency improvements when performing large shuffles. We take a 
> Spark-native approach to achieve this, i.e., extending Spark’s existing 
> shuffle netty protocol, and the behaviors of Spark mappers, reducers and 
> drivers. This way, we can bring the benefits of more efficient shuffle in 
> Spark without incurring the dependency or overhead of either specialized 
> storage layer or external infrastructure pieces.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30608) Postgres Column Interval converts to string and cant be written back to postgres

2020-01-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30608.
--
Resolution: Invalid

> Postgres Column Interval converts to string and cant be written back to 
> postgres
> 
>
> Key: SPARK-30608
> URL: https://issues.apache.org/jira/browse/SPARK-30608
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Sumit
>Priority: Major
>
> If we read an "Interval" type column from Postgres and try to save it back to 
> Postgres, an exception occurs: during the read operation the Postgres column is 
> converted to String, and while saving it back the following error is raised:
>  
> java.sql.BatchUpdateException: Batch entry 0 INSERT INTO test_table 
> ("dob","dob_time","dob_time_zone","duration") VALUES ('2019-05-29 
> -04','2016-08-12 10:22:31.10-04','2016-08-12 13:22:31.10-04','3 days 
> 10:00:00') was aborted: ERROR: column "duration" is of type interval but 
> expression is of type character varying
>  Hint: You will need to rewrite or cast the expression.
>  Position: 86 Call getNextException to see other errors in the batch.
>  at 
> org.postgresql.jdbc.BatchResultHandler.handleError(BatchResultHandler.java:151)
>  at 
> org.postgresql.core.ResultHandlerDelegate.handleError(ResultHandlerDelegate.java:45)
>  at 
> org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2159)
>  at 
> org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:463)
>  at org.postgresql.jdbc.PgStatement.executeBatch(PgStatement.java:794)
>  at 
> org.postgresql.jdbc.PgPreparedStatement.executeBatch(PgPreparedStatement.java:1662)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:672)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:834)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:834)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
>  at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
>  at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>  at org.apache.spark.scheduler.Task.run(Task.scala:123)
>  at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30463) Move test cases for 'pandas' sub-package

2020-01-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30463.
--
Resolution: Later

Let me take a look at this later. Fortunately, the tests are grouped in 
separate files for now.

> Move test cases for 'pandas' sub-package
> 
>
> Key: SPARK-30463
> URL: https://issues.apache.org/jira/browse/SPARK-30463
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> We should also move the test cases into the pandas sub-directories.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30531) Duplicate query plan on Spark UI SQL page

2020-01-22 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-30531:
-
Issue Type: Improvement  (was: Bug)
  Priority: Minor  (was: Major)

> Duplicate query plan on Spark UI SQL page
> -
>
> Key: SPARK-30531
> URL: https://issues.apache.org/jira/browse/SPARK-30531
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Enrico Minack
>Assignee: Enrico Minack
>Priority: Minor
> Fix For: 3.0.0
>
>
> When you save a Spark UI SQL query page to disk and then display the html 
> file with your browser, the query plan will be rendered a second time. This 
> change avoids rendering the plan visualization when it exists already.
>  
> !https://user-images.githubusercontent.com/44700269/72543429-fcb8d980-3885-11ea-82aa-c0b3638847e5.png!
> The fix does not call {{renderPlanViz()}} when the plan exists already:
> !https://user-images.githubusercontent.com/44700269/72543641-57523580-3886-11ea-8cdf-5fb0cdffa983.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30531) Duplicate query plan on Spark UI SQL page

2020-01-22 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-30531.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27238
[https://github.com/apache/spark/pull/27238]

> Duplicate query plan on Spark UI SQL page
> -
>
> Key: SPARK-30531
> URL: https://issues.apache.org/jira/browse/SPARK-30531
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Enrico Minack
>Assignee: Enrico Minack
>Priority: Major
> Fix For: 3.0.0
>
>
> When you save a Spark UI SQL query page to disk and then display the html 
> file with your browser, the query plan will be rendered a second time. This 
> change avoids rendering the plan visualization when it exists already.
>  
> !https://user-images.githubusercontent.com/44700269/72543429-fcb8d980-3885-11ea-82aa-c0b3638847e5.png!
> The fix does not call {{renderPlanViz()}} when the plan exists already:
> !https://user-images.githubusercontent.com/44700269/72543641-57523580-3886-11ea-8cdf-5fb0cdffa983.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30531) Duplicate query plan on Spark UI SQL page

2020-01-22 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-30531:


Assignee: Enrico Minack

> Duplicate query plan on Spark UI SQL page
> -
>
> Key: SPARK-30531
> URL: https://issues.apache.org/jira/browse/SPARK-30531
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Enrico Minack
>Assignee: Enrico Minack
>Priority: Major
>
> When you save a Spark UI SQL query page to disk and then display the html 
> file with your browser, the query plan will be rendered a second time. This 
> change avoids rendering the plan visualization when it exists already.
>  
> !https://user-images.githubusercontent.com/44700269/72543429-fcb8d980-3885-11ea-82aa-c0b3638847e5.png!
> The fix does not call {{renderPlanViz()}} when the plan exists already:
> !https://user-images.githubusercontent.com/44700269/72543641-57523580-3886-11ea-8cdf-5fb0cdffa983.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29701) Different answers when empty input given in GROUPING SETS

2020-01-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29701:
--
Affects Version/s: 2.4.0
   2.4.1
   2.4.2
   2.4.3

> Different answers when empty input given in GROUPING SETS
> -
>
> Key: SPARK-29701
> URL: https://issues.apache.org/jira/browse/SPARK-29701
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 3.0.0
>Reporter: Takeshi Yamamuro
>Priority: Blocker
>  Labels: correctness
>
> A query below with an empty input seems to have different answers between 
> PgSQL and Spark;
> {code:java}
> postgres=# create table gstest_empty (a integer, b integer, v integer);
> CREATE TABLE
> postgres=# select a, b, sum(v), count(*) from gstest_empty group by grouping 
> sets ((a,b),());
>  a | b | sum | count 
> ---+---+-----+-------
>    |   |     |     0
> (1 row)
> {code}
> {code:java}
> scala> sql("""select a, b, sum(v), count(*) from gstest_empty group by 
> grouping sets ((a,b),())""").show
> +---+---+------+--------+
> |  a|  b|sum(v)|count(1)|
> +---+---+------+--------+
> +---+---+------+--------+
>  {code}
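
One way to reproduce the Spark side of this, sketched under the assumption that gstest_empty is registered as an empty temp view with the same schema as the PostgreSQL table (the ticket does not show this setup step):

{code:scala}
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().appName("grouping-sets-empty").getOrCreate()

// Empty table with the same columns as gstest_empty (a, b, v all int).
val schema = StructType(Seq(
  StructField("a", IntegerType),
  StructField("b", IntegerType),
  StructField("v", IntegerType)))
spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
  .createOrReplaceTempView("gstest_empty")

// PostgreSQL returns a single all-NULL row with count 0 for the empty grouping set;
// Spark, per this ticket, returns no rows at all.
spark.sql(
  "select a, b, sum(v), count(*) from gstest_empty group by grouping sets ((a,b),())"
).show()
{code}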



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29701) Different answers when empty input given in GROUPING SETS

2020-01-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29701:
--
Affects Version/s: 2.4.4

> Different answers when empty input given in GROUPING SETS
> -
>
> Key: SPARK-29701
> URL: https://issues.apache.org/jira/browse/SPARK-29701
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Takeshi Yamamuro
>Priority: Blocker
>  Labels: correctness
>
> A query below with an empty input seems to have different answers between 
> PgSQL and Spark;
> {code:java}
> postgres=# create table gstest_empty (a integer, b integer, v integer);
> CREATE TABLE
> postgres=# select a, b, sum(v), count(*) from gstest_empty group by grouping 
> sets ((a,b),());
>  a | b | sum | count 
> ---+---+-----+-------
>    |   |     |     0
> (1 row)
> {code}
> {code:java}
> scala> sql("""select a, b, sum(v), count(*) from gstest_empty group by 
> grouping sets ((a,b),())""").show
> +---+---+------+--------+
> |  a|  b|sum(v)|count(1)|
> +---+---+------+--------+
> +---+---+------+--------+
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30611) Update testthat dependency

2020-01-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30611.
--
Resolution: Duplicate

> Update testthat dependency
> --
>
> Key: SPARK-30611
> URL: https://issues.apache.org/jira/browse/SPARK-30611
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR, Tests
>Affects Versions: 3.0.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> Currently SparkR pins the {{testthat}} version to 1.0.2.
>  
> This was introduced by SPARK-22817 to address the removal of 
> {{testthat}}:::run_tests in {{testthat}} 2.0.0.
>  
> As of 2.0.0 the replacement for {{run_tests}} is {{test_package_dir}}. Like 
> {{run_tests}} it is considered internal, but it will allow us to upgrade 
> to the latest {{testthat}} release.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28801) Document SELECT statement in SQL Reference.

2020-01-22 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-28801:
-
Priority: Minor  (was: Major)

> Document SELECT statement in SQL Reference.
> ---
>
> Key: SPARK-28801
> URL: https://issues.apache.org/jira/browse/SPARK-28801
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.3
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28801) Document SELECT statement in SQL Reference.

2020-01-22 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-28801.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27216
[https://github.com/apache/spark/pull/27216]

> Document SELECT statement in SQL Reference.
> ---
>
> Key: SPARK-28801
> URL: https://issues.apache.org/jira/browse/SPARK-28801
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.3
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28801) Document SELECT statement in SQL Reference.

2020-01-22 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-28801:


Assignee: Dilip Biswal

> Document SELECT statement in SQL Reference.
> ---
>
> Key: SPARK-28801
> URL: https://issues.apache.org/jira/browse/SPARK-28801
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.3
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30574) Document GROUP BY Clause of SELECT statement in SQL Reference.

2020-01-22 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-30574.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27283
[https://github.com/apache/spark/pull/27283]

> Document GROUP BY Clause of SELECT statement in SQL Reference.
> --
>
> Key: SPARK-30574
> URL: https://issues.apache.org/jira/browse/SPARK-30574
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.3
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30574) Document GROUP BY Clause of SELECT statement in SQL Reference.

2020-01-22 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-30574:
-
Priority: Minor  (was: Major)

> Document GROUP BY Clause of SELECT statement in SQL Reference.
> --
>
> Key: SPARK-30574
> URL: https://issues.apache.org/jira/browse/SPARK-30574
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.3
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30574) Document GROUP BY Clause of SELECT statement in SQL Reference.

2020-01-22 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-30574:


Assignee: Dilip Biswal

> Document GROUP BY Clause of SELECT statement in SQL Reference.
> --
>
> Key: SPARK-30574
> URL: https://issues.apache.org/jira/browse/SPARK-30574
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.3
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26132) Remove support for Scala 2.11 in Spark 3.0.0

2020-01-22 Thread Shixiong Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-26132:
-
Docs Text: Scala 2.11 support is removed in Apache Spark 3.0.0.

> Remove support for Scala 2.11 in Spark 3.0.0
> 
>
> Key: SPARK-26132
> URL: https://issues.apache.org/jira/browse/SPARK-26132
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Affects Versions: 3.0.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> Per some discussion on the mailing list, we are _considering_ formally not 
> supporting Scala 2.11 in Spark 3.0. This JIRA tracks that discussion.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26132) Remove support for Scala 2.11 in Spark 3.0.0

2020-01-22 Thread Shixiong Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-26132:
-
Labels: release-notes  (was: )

> Remove support for Scala 2.11 in Spark 3.0.0
> 
>
> Key: SPARK-26132
> URL: https://issues.apache.org/jira/browse/SPARK-26132
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Affects Versions: 3.0.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> Per some discussion on the mailing list, we are _considering_ formally not 
> supporting Scala 2.11 in Spark 3.0. This JIRA tracks that discussion.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26154) Stream-stream joins - left outer join gives inconsistent output

2020-01-22 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021627#comment-17021627
 ] 

Jungtaek Lim commented on SPARK-26154:
--

Leaving a note on why this issue cannot easily be backported to the 2.x line even 
though it fixes a correctness bug.

 

We fixed the issue in a "destructive" way for existing queries, since the state of 
an existing query cannot be corrected. In the 3.0 migration guide we added the 
following:
{quote}Spark 3.0 fixes the correctness issue on Stream-stream outer join, which 
changes the schema of state. (SPARK-26154 for more details) Spark 3.0 will fail 
the query if you start your query from checkpoint constructed from Spark 2.x 
which uses stream-stream outer join. Please discard the checkpoint and replay 
previous inputs to recalculate outputs.
{quote}
End users are less likely to object when a "major" version requires them to lose 
something; they can still stay on the Spark 2.x line to opt out of the change and 
accept the risk (the issue only occurs in an edge case). If we put this into the 
2.x line, they would have no way to opt out.

Please note that, unlike other states whose format versions can co-exist, 
stream-stream outer join has only a "valid" state format (version 2) and an 
"invalid" one (version 1), and the two cannot co-exist.

We still cannot simply drop state format version 1, because stream-stream inner 
join is not affected by this bug and we do not want to force those queries to 
discard their state as well. That would affect too many existing queries, and I'd 
like to reduce the impact.

> Stream-stream joins - left outer join gives inconsistent output
> ---
>
> Key: SPARK-26154
> URL: https://issues.apache.org/jira/browse/SPARK-26154
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.2, 3.0.0
> Environment: Spark version - Spark 2.3.2
> OS- Suse 11
>Reporter: Haripriya
>Assignee: Jungtaek Lim
>Priority: Blocker
>  Labels: correctness
> Fix For: 3.0.0
>
>
> Stream-stream joins using left outer join give inconsistent output.
> Data that has already been processed once is processed again and yields a null 
> value. In Batch 2, the input data "3" is processed, but again in batch 6 a null 
> value is produced for the same data.
> Steps
> In spark-shell
> {code:java}
> scala> import org.apache.spark.sql.functions.{col, expr}
> import org.apache.spark.sql.functions.{col, expr}
> scala> import org.apache.spark.sql.streaming.Trigger
> import org.apache.spark.sql.streaming.Trigger
> scala> val lines_stream1 = spark.readStream.
>  |   format("kafka").
>  |   option("kafka.bootstrap.servers", "ip:9092").
>  |   option("subscribe", "topic1").
>  |   option("includeTimestamp", true).
>  |   load().
>  |   selectExpr("CAST (value AS String)","CAST(timestamp AS 
> TIMESTAMP)").as[(String,Timestamp)].
>  |   select(col("value") as("data"),col("timestamp") 
> as("recordTime")).
>  |   select("data","recordTime").
>  |   withWatermark("recordTime", "5 seconds ")
> lines_stream1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = 
> [data: string, recordTime: timestamp]
> scala> val lines_stream2 = spark.readStream.
>  |   format("kafka").
>  |   option("kafka.bootstrap.servers", "ip:9092").
>  |   option("subscribe", "topic2").
>  |   option("includeTimestamp", value = true).
>  |   load().
>  |   selectExpr("CAST (value AS String)","CAST(timestamp AS 
> TIMESTAMP)").as[(String,Timestamp)].
>  |   select(col("value") as("data1"),col("timestamp") 
> as("recordTime1")).
>  |   select("data1","recordTime1").
>  |   withWatermark("recordTime1", "10 seconds ")
> lines_stream2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = 
> [data1: string, recordTime1: timestamp]
> scala> val query = lines_stream1.join(lines_stream2, expr (
>  |   """
>  | | data == data1 and
>  | | recordTime1 >= recordTime and
>  | | recordTime1 <= recordTime + interval 5 seconds
>  |   """.stripMargin),"left").
>  |   writeStream.
>  |   option("truncate","false").
>  |   outputMode("append").
>  |   format("console").option("checkpointLocation", 
> "/tmp/leftouter/").
>  |   trigger(Trigger.ProcessingTime ("5 seconds")).
>  |   start()
> query: org.apache.spark.sql.streaming.StreamingQuery = 
> org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@1a48f55b
> {code}
> Step2 : Start producing data
> kafka-console-producer.sh --broker-list ip:9092 --topic topic1
>  >1
>  >2
>  >3
>  >4
>  >5
>  >aa
>  >bb
>  >cc
> 

[jira] [Assigned] (SPARK-30606) Applying the `like` function with 2 parameters fails

2020-01-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30606:
-

Assignee: Maxim Gekk

> Applying the `like` function with 2 parameters fails
> 
>
> Key: SPARK-30606
> URL: https://issues.apache.org/jira/browse/SPARK-30606
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> The `Like` expression is registered as the `like` function. Applying the 
> function with 2 parameters fails with the exception:
> {code}
> spark-sql> select like('Spark', 'S%');
> Invalid arguments for function like; line 1 pos 7
> org.apache.spark.sql.AnalysisException: Invalid arguments for function like; 
> line 1 pos 7
>   at 
> org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.$anonfun$expression$7(FunctionRegistry.scala:618)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.$anonfun$expression$4(FunctionRegistry.scala:602)
>   at 
> org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:121)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction(SessionCatalog.scala:1412)
> {code}
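
For comparison, the operator form of LIKE is unaffected; a quick check in spark-shell (where {{spark}} is predefined) could look like the sketch below. This is only an illustration, not part of the ticket's reproduction.

{code:scala}
// The LIKE operator parses and evaluates fine; only the 2-argument `like(...)`
// function call shown in the description above hits the AnalysisException.
spark.sql("SELECT 'Spark' LIKE 'S%'").show()   // single row, value true
{code}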



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30606) Applying the `like` function with 2 parameters fails

2020-01-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30606.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27323
[https://github.com/apache/spark/pull/27323]

> Applying the `like` function with 2 parameters fails
> 
>
> Key: SPARK-30606
> URL: https://issues.apache.org/jira/browse/SPARK-30606
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> The `Like` expression is registered as the `like` function. Applying the 
> function with 2 parameters fails with the exception:
> {code}
> spark-sql> select like('Spark', 'S%');
> Invalid arguments for function like; line 1 pos 7
> org.apache.spark.sql.AnalysisException: Invalid arguments for function like; 
> line 1 pos 7
>   at 
> org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.$anonfun$expression$7(FunctionRegistry.scala:618)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.$anonfun$expression$4(FunctionRegistry.scala:602)
>   at 
> org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:121)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction(SessionCatalog.scala:1412)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30611) Update testthat dependency

2020-01-22 Thread Maciej Szymkiewicz (Jira)
Maciej Szymkiewicz created SPARK-30611:
--

 Summary: Update testthat dependency
 Key: SPARK-30611
 URL: https://issues.apache.org/jira/browse/SPARK-30611
 Project: Spark
  Issue Type: Improvement
  Components: SparkR, Tests
Affects Versions: 3.0.0
Reporter: Maciej Szymkiewicz


Currently SparkR pins the {{testthat}} version to 1.0.2.

 

This was introduced by SPARK-22817 to address the removal of 
{{testthat}}:::run_tests in {{testthat}} 2.0.0.

 

As of 2.0.0 the replacement for {{run_tests}} is {{test_package_dir}}. Like 
{{run_tests}} it is considered internal, but it will allow us to upgrade to the 
latest {{testthat}} release.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30610) spark worker graceful shutdown

2020-01-22 Thread t oo (Jira)
t oo created SPARK-30610:


 Summary: spark worker graceful shutdown
 Key: SPARK-30610
 URL: https://issues.apache.org/jira/browse/SPARK-30610
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 2.4.4
Reporter: t oo


I am not talking about Spark Streaming, just regular batch jobs submitted with 
spark-submit that may read a large CSV (100+ GB) and then write it out as 
Parquet. In an autoscaling cluster it would be nice to be able to scale down 
(i.e. terminate) EC2 instances without slowing down active Spark applications.

For example:
1. start a Spark cluster with 8 EC2 instances
2. submit 6 Spark apps
3. 1 Spark app completes, so 5 apps are still running
4. the cluster can scale down 1 EC2 instance (to save $), but we don't want the 
existing apps running on the (soon to be terminated) instance to have to restart 
their CSV reads, RDD processing steps, etc. from the beginning on other 
instances' executors. Instead we want a 'graceful shutdown' command so that the 
8th instance does not accept new spark-submit apps (i.e. does not start new 
executors on it), finishes the ones that have already launched on it, and then 
exits the worker pid. After that the instance can be terminated.


I thought stop-slave.sh could do this, but it looks like it just kills the pid.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29175) Make maven central repository in IsolatedClientLoader configurable

2020-01-22 Thread Reynold Xin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17021586#comment-17021586
 ] 

Reynold Xin commented on SPARK-29175:
-

I think the config name should be clearer, e.g. 
"spark.sql.maven.additionalRemoteRepositories".

> Make maven central repository in IsolatedClientLoader configurable
> --
>
> Key: SPARK-29175
> URL: https://issues.apache.org/jira/browse/SPARK-29175
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 3.0.0
>
>
> We need to connect to a central repository in IsolatedClientLoader for 
> downloading Hive jars. Here we added a new config, 
> `spark.sql.additionalRemoteRepositories`, a comma-delimited string of 
> optional additional remote Maven mirror repositories; it can be used to 
> supply additional remote repositories on top of the default Maven central repo.
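
A hypothetical usage sketch: the key name follows this ticket's description, the mirror URL is a placeholder, and the rename proposed in the comment above would only change the key.

{code:scala}
import org.apache.spark.sql.SparkSession

// Supply an extra Maven mirror for IsolatedClientLoader to use when downloading
// Hive jars, on top of the default Maven Central repository.
val spark = SparkSession.builder()
  .appName("extra-maven-repos")
  .config("spark.sql.additionalRemoteRepositories", "https://repo.example.com/maven2")
  .enableHiveSupport()
  .getOrCreate()
{code}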



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


