[jira] [Commented] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures

2020-01-14 Thread Reynold Xin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015705#comment-17015705
 ] 

Reynold Xin commented on SPARK-22231:
-

Hey sorry. Been pretty busy. I will take a look this week.

> Support of map, filter, withColumn, dropColumn in nested list of structures
> ---
>
> Key: SPARK-22231
> URL: https://issues.apache.org/jira/browse/SPARK-22231
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: DB Tsai
>Assignee: Jeremy Smith
>Priority: Major
>
> At Netflix's algorithm team, we work on ranking problems to find great 
> content that fulfills the unique tastes of our members. Before building a 
> recommendation algorithm, we need to prepare the training, testing, and 
> validation datasets in Apache Spark. Due to the nature of ranking problems, 
> we have a nested list of items to be ranked in one column, while the top-level 
> columns describe the context in which a model is to be used (e.g. profiles, 
> country, time, device, etc.). Here is a blog post describing the 
> details, [Distributed Time Travel for Feature 
> Generation|https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907].
>  
> To be more concrete, for ranking videos for a given profile_id in a given 
> country, our data schema can look like this:
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  ||-- element: struct (containsNull = false)
>  |||-- title_id: integer (nullable = true)
>  |||-- scores: double (nullable = true)
> ...
> {code}
> We often need to work on the nested list of structs by applying functions to 
> them. Sometimes we drop or add new columns in the nested list of structs. 
> Currently, there is no easy solution in open source Apache Spark to perform 
> those operations using SQL primitives; many people just convert the data into 
> an RDD to work on the nested level of data and then reconstruct a new 
> dataframe as a workaround. This is extremely inefficient because optimizations 
> like predicate pushdown in SQL cannot be performed, we cannot leverage the 
> columnar format, and the serialization and deserialization cost becomes huge 
> even if we just want to add a new column at the nested level.
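> As a partial workaround on more recent Spark versions (2.4+), the SQL 
> higher-order function transform can rewrite array elements without an RDD 
> round-trip, though it is far less convenient than a first-class API for adding 
> or dropping nested columns. A rough sketch, assuming a DataFrame {{df}} with 
> the schema above:
> {code:java}
> // Sketch: scale the scores inside each struct of the items array using the
> // SQL higher-order function transform (Spark 2.4+). Field names follow the
> // schema above; df is assumed to hold that schema.
> val scaled = df.selectExpr(
>   "profile_id",
>   "country_iso_code",
>   """transform(items, item ->
>        named_struct('title_id', item.title_id, 'scores', item.scores * 2.0)) AS items""")
> scaled.printSchema()
> {code}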
> We built a solution internally at Netflix which we're very happy with, and we 
> plan to open source it in Spark upstream. We would like to socialize the API 
> design to see if we missed any use cases.  
> The first API we added is *mapItems* on dataframe, which takes a function from 
> *Column* to *Column* and applies it to the nested dataframe. Here is an 
> example:
> {code:java}
> case class Data(foo: Int, bar: Double, items: Seq[Double])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(10.1, 10.2, 10.3, 10.4)),
>   Data(20, 20.0, Seq(20.1, 20.2, 20.3, 20.4))
> ))
> val result = df.mapItems("items") {
>   item => item * 2.0
> }
> result.printSchema()
> // root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // ||-- element: double (containsNull = true)
> result.show()
> // +---+----+--------------------+
> // |foo| bar|               items|
> // +---+----+--------------------+
> // | 10|10.0|[20.2, 20.4, 20.6...|
> // | 20|20.0|[40.2, 40.4, 40.6...|
> // +---+----+--------------------+
> {code}
> Now, with the ability to apply a function to the nested dataframe, we can add 
> a new function, *withColumn* on *Column*, to add or replace an existing column 
> of the same name in the nested list of structs. Here are two examples 
> demonstrating the API together with *mapItems*; the first one replaces an 
> existing column:
> {code:java}
> case class Item(a: Int, b: Double)
> case class Data(foo: Int, bar: Double, items: Seq[Item])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(Item(10, 10.0), Item(11, 11.0))),
>   Data(20, 20.0, Seq(Item(20, 20.0), Item(21, 21.0)))
> ))
> val result = df.mapItems("items") {
>   item => item.withColumn(item("b") + 1 as "b")
> }
> result.printSchema
> // root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // ||-- element: struct (containsNull = true)
> // |||-- a: integer (nullable = true)
> // |||-- b: double (nullable = true)
> result.show(false)
> // +---+----+----------------------+
> // |foo|bar |items                 |
> // +---+----+----------------------+
> // |10 |10.0|[[10,11.0], [11,12.0]]|
> // |20 

[jira] [Created] (SPARK-30517) Support SHOW TABLES EXTENDED

2020-01-14 Thread Ajith S (Jira)
Ajith S created SPARK-30517:
---

 Summary: Support SHOW TABLES EXTENDED
 Key: SPARK-30517
 URL: https://issues.apache.org/jira/browse/SPARK-30517
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ajith S


The intention is to support SHOW TABLES with an additional column 'type', where 
type can be MANAGED, EXTERNAL, or VIEW, so that users can query only tables of 
the required types, such as listing only views or only external tables (using a 
'where' clause over the 'type' column).

Use case example:
Currently it's not possible to list all VIEWs. Other technologies support this: 
Hive via 'SHOW VIEWS', and MySQL via the more complex query 
'SHOW FULL TABLES WHERE table_type = 'VIEW';'.

We decided to take the MySQL approach as it provides more flexibility for 
querying.
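For illustration, a rough sketch of the intended usage (the syntax below is the 
proposal, not an existing command):
{code:java}
// Proposed usage sketch: SHOW TABLES EXTENDED returns an extra 'type' column
// (MANAGED / EXTERNAL / VIEW) that can be filtered like any other column.
val tables = spark.sql("SHOW TABLES EXTENDED")   // proposed syntax, not yet implemented
tables.where("type = 'VIEW'").show()             // list only views

// MySQL equivalent mentioned above, for comparison:
//   SHOW FULL TABLES WHERE table_type = 'VIEW';
{code}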



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30505) Deprecate Avro option `ignoreExtension` in a doc

2020-01-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30505.
--
Fix Version/s: 3.0.0
 Assignee: Maxim Gekk
   Resolution: Fixed

Fixed in [https://github.com/apache/spark/pull/27194]

> Deprecate Avro option `ignoreExtension` in a doc
> 
>
> Key: SPARK-30505
> URL: https://issues.apache.org/jira/browse/SPARK-30505
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> Update docs/sql-data-sources-avro.md and add a sentence about the deprecation 
> of the Avro option ignoreExtension.
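> A sketch of what the updated doc could point users to instead, assuming the 
> generic pathGlobFilter data source option is the replacement:
> {code:java}
> // Deprecated: ignoreExtension controls whether files without the .avro
> // extension are loaded (false = only *.avro files).
> val legacy = spark.read.format("avro")
>   .option("ignoreExtension", false)
>   .load("/path/to/dir")            // path is a placeholder
>
> // Replacement: the general data source option pathGlobFilter.
> val replacement = spark.read.format("avro")
>   .option("pathGlobFilter", "*.avro")
>   .load("/path/to/dir")
> {code}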



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22184) GraphX fails in case of insufficient memory and checkpoints enabled

2020-01-14 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015693#comment-17015693
 ] 

Takeshi Yamamuro commented on SPARK-22184:
--

See https://github.com/apache/spark/pull/19410#issuecomment-574531717

> GraphX fails in case of insufficient memory and checkpoints enabled
> ---
>
> Key: SPARK-22184
> URL: https://issues.apache.org/jira/browse/SPARK-22184
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.2.0
> Environment: spark 2.2.0
> scala 2.11
>Reporter: Sergey Zhemzhitsky
>Priority: Major
>
> GraphX fails with FileNotFoundException in case of insufficient memory when 
> checkpoints are enabled.
> Here is the stacktrace 
> {code}
> Job aborted due to stage failure: Task creation failed: 
> java.io.FileNotFoundException: File 
> file:/tmp/spark-90119695-a126-47b5-b047-d656fee10c17/9b16e2a9-6c80-45eb-8736-bbb6eb840146/rdd-28/part-0
>  does not exist
> java.io.FileNotFoundException: File 
> file:/tmp/spark-90119695-a126-47b5-b047-d656fee10c17/9b16e2a9-6c80-45eb-8736-bbb6eb840146/rdd-28/part-0
>  does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:539)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:752)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:529)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
>   at 
> org.apache.spark.rdd.ReliableCheckpointRDD.getPreferredLocations(ReliableCheckpointRDD.scala:89)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$preferredLocations$1.apply(RDD.scala:274)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$preferredLocations$1.apply(RDD.scala:274)
>   at scala.Option.map(Option.scala:146)
>   at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:274)
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1697)
> ...
> {code}
> As GraphX uses cached RDDs intensively, the issue is only reproducible when 
> previously cached and checkpointed Vertex and Edge RDDs are evicted from 
> memory and forced to be read from disk. 
> For testing purposes the following parameters may be set to emulate a 
> low-memory environment:
> {code}
> val sparkConf = new SparkConf()
>   .set("spark.graphx.pregel.checkpointInterval", "2")
>   // set testing memory to evict cached RDDs from it and force
>   // reading checkpointed RDDs from disk
>   .set("spark.testing.reservedMemory", "128")
>   .set("spark.testing.memory", "256")
> {code}
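> A rough sketch (paths, master, and input file are placeholders) of wiring 
> these settings into a GraphX job; note that a checkpoint directory must be set 
> for Pregel checkpointing to take effect:
> {code}
> import org.apache.spark.SparkContext
> import org.apache.spark.graphx.GraphLoader
>
> // Reuse the low-memory sparkConf defined above; master/app name are placeholders.
> val sc = new SparkContext(sparkConf.setMaster("local[2]").setAppName("pregel-checkpoint-repro"))
> sc.setCheckpointDir("/tmp/graphx-checkpoints")   // required for checkpointing
>
> // connectedComponents iterates via Pregel and, with the config above,
> // checkpoints every 2 supersteps.
> val graph = GraphLoader.edgeListFile(sc, "/path/to/edges.txt")
> graph.connectedComponents().vertices.count()
> {code}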
> This issue also includes SPARK-22150 and cannot be fixed until SPARK-22150 is 
> fixed too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22184) GraphX fails in case of insufficient memory and checkpoints enabled

2020-01-14 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-22184.
--
Resolution: Won't Fix

> GraphX fails in case of insufficient memory and checkpoints enabled
> ---
>
> Key: SPARK-22184
> URL: https://issues.apache.org/jira/browse/SPARK-22184
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.2.0
> Environment: spark 2.2.0
> scala 2.11
>Reporter: Sergey Zhemzhitsky
>Priority: Major
>
> GraphX fails with FileNotFoundException in case of insufficient memory when 
> checkpoints are enabled.
> Here is the stacktrace 
> {code}
> Job aborted due to stage failure: Task creation failed: 
> java.io.FileNotFoundException: File 
> file:/tmp/spark-90119695-a126-47b5-b047-d656fee10c17/9b16e2a9-6c80-45eb-8736-bbb6eb840146/rdd-28/part-0
>  does not exist
> java.io.FileNotFoundException: File 
> file:/tmp/spark-90119695-a126-47b5-b047-d656fee10c17/9b16e2a9-6c80-45eb-8736-bbb6eb840146/rdd-28/part-0
>  does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:539)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:752)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:529)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
>   at 
> org.apache.spark.rdd.ReliableCheckpointRDD.getPreferredLocations(ReliableCheckpointRDD.scala:89)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$preferredLocations$1.apply(RDD.scala:274)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$preferredLocations$1.apply(RDD.scala:274)
>   at scala.Option.map(Option.scala:146)
>   at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:274)
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1697)
> ...
> {code}
> As GraphX uses cached RDDs intensively, the issue is only reproducible when 
> previously cached and checkpointed Vertex and Edge RDDs are evicted from 
> memory and forced to be read from disk. 
> For testing purposes the following parameters may be set to emulate a 
> low-memory environment:
> {code}
> val sparkConf = new SparkConf()
>   .set("spark.graphx.pregel.checkpointInterval", "2")
>   // set testing memory to evict cached RDDs from it and force
>   // reading checkpointed RDDs from disk
>   .set("spark.testing.reservedMemory", "128")
>   .set("spark.testing.memory", "256")
> {code}
> This issue also includes SPARK-22150 and cannot be fixed until SPARK-22150 is 
> fixed too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30516) statistic estimation of FileScan should take partitionFilters and partition number into account

2020-01-14 Thread Hu Fuwang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hu Fuwang updated SPARK-30516:
--
Summary: statistic estimation of FileScan should take partitionFilters and 
partition number into account  (was: FileScan.estimateStatistics does not take 
partitionFilters and partition number into account)

> statistic estimation of FileScan should take partitionFilters and partition 
> number into account
> ---
>
> Key: SPARK-30516
> URL: https://issues.apache.org/jira/browse/SPARK-30516
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Hu Fuwang
>Priority: Major
>
> Currently, FileScan.estimateStatistics does not take partitionFilters and the 
> partition number into account, which may lead to an overestimated sizeInBytes. 
> It seems reasonable to take partitionFilters and the partition number into 
> account when estimating the statistics.
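> For illustration, the kind of adjustment intended (plain arithmetic only, not 
> the actual FileScan internals): scale the size estimate by the fraction of 
> partitions that survive the partition filters.
> {code:java}
> // Illustrative sketch, not real Spark code: estimate sizeInBytes after
> // partition pruning by scaling the total size by the selected/total ratio.
> def estimatedSizeAfterPruning(
>     totalSizeInBytes: Long,
>     totalPartitions: Int,
>     selectedPartitions: Int): Long = {
>   if (totalPartitions <= 0) {
>     totalSizeInBytes
>   } else {
>     // e.g. 10 GB over 100 partitions with 5 selected -> ~0.5 GB
>     (totalSizeInBytes * (selectedPartitions.toDouble / totalPartitions)).toLong
>   }
> }
> {code}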



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30516) FileScan.estimateStatistics does not take partitionFilters and partition number into account

2020-01-14 Thread Hu Fuwang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hu Fuwang updated SPARK-30516:
--
Description: Currently, FileScan.estimateStatistics does not take 
partitionFilters and partition number into account, which may lead to bigger 
sizeInBytes. It should be reasonable to change it to involve partitionFilters 
and partition number when estimating the statistics.  (was: Currently, 
FileScan.estimateStatistics will not take partitionFilters into account, which 
may lead to bigger sizeInBytes. It should be reasonable to change it to involve 
partitionFilters and partition numbers when estimating the statistics.)

> FileScan.estimateStatistics does not take partitionFilters and partition 
> number into account
> 
>
> Key: SPARK-30516
> URL: https://issues.apache.org/jira/browse/SPARK-30516
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Hu Fuwang
>Priority: Major
>
> Currently, FileScan.estimateStatistics does not take partitionFilters and 
> partition number into account, which may lead to bigger sizeInBytes. It 
> should be reasonable to change it to involve partitionFilters and partition 
> number when estimating the statistics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30516) FileScan.estimateStatistics does not take partitionFilters and partition number into account

2020-01-14 Thread Hu Fuwang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hu Fuwang updated SPARK-30516:
--
Description: Currently, FileScan.estimateStatistics will not take 
partitionFilters into account, which may lead to bigger sizeInBytes. It should 
be reasonable to change it to involve partitionFilters and partition numbers 
when estimating the statistics.  (was: Currently, FileScan.estimateStatistics 
will not take partitionFilters into account, which may lead to bigger 
sizeInBytes.

It should be reasonable to change it to involve partitionFilters and partition 
numbers when estimating the statistics.)

> FileScan.estimateStatistics does not take partitionFilters and partition 
> number into account
> 
>
> Key: SPARK-30516
> URL: https://issues.apache.org/jira/browse/SPARK-30516
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Hu Fuwang
>Priority: Major
>
> Currently, FileScan.estimateStatistics will not take partitionFilters into 
> account, which may lead to bigger sizeInBytes. It should be reasonable to 
> change it to involve partitionFilters and partition numbers when estimating 
> the statistics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30516) FileScan.estimateStatistics does not take partitionFilters and partition number into account

2020-01-14 Thread Hu Fuwang (Jira)
Hu Fuwang created SPARK-30516:
-

 Summary: FileScan.estimateStatistics does not take 
partitionFilters and partition number into account
 Key: SPARK-30516
 URL: https://issues.apache.org/jira/browse/SPARK-30516
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Hu Fuwang


Currently, FileScan.estimateStatistics will not take partitionFilters into 
account, which may lead to bigger sizeInBytes.

It should be reasonable to change it to involve partitionFilters and partition 
numbers when estimating the statistics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30020) ownerName and ownerType support as properties to tables

2020-01-14 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-30020.
--
   Fix Version/s: 3.0.0
Target Version/s: 3.0.0
  Resolution: Not A Problem

> ownerName and ownerType support as properties to tables
> ---
>
> Key: SPARK-30020
> URL: https://issues.apache.org/jira/browse/SPARK-30020
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> Store the ownerName and ownerType in table properties rather than as field 
> members, to keep the catalog APIs future-proof. Those properties should be 
> reserved as internal ones for security.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30515) Refactor SimplifyBinaryComparison to reduce the time complexity

2020-01-14 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-30515:
---
Summary: Refactor SimplifyBinaryComparison to reduce the time complexity  
(was: Refactor SimplifyBinaryComparison to reduce time complexity)

> Refactor SimplifyBinaryComparison to reduce the time complexity
> ---
>
> Key: SPARK-30515
> URL: https://issues.apache.org/jira/browse/SPARK-30515
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
>
> The improvement of the rule SimplifyBinaryComparison in PR 
> https://github.com/apache/spark/pull/27008 could bring a performance 
> regression in the optimizer. 
> We need to improve the implementation and reduce the time complexity.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30515) Refactor SimplifyBinaryComparison to reduce time complexity

2020-01-14 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-30515:
--

 Summary: Refactor SimplifyBinaryComparison to reduce time 
complexity
 Key: SPARK-30515
 URL: https://issues.apache.org/jira/browse/SPARK-30515
 Project: Spark
  Issue Type: Documentation
  Components: SQL
Affects Versions: 3.0.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang


The improvement of the rule SimplifyBinaryComparison in PR 
https://github.com/apache/spark/pull/27008 could bring a performance regression 
in the optimizer. 
We need to improve the implementation and reduce the time complexity.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22783) event log directory(spark-history) filled by large .inprogress files for spark streaming applications

2020-01-14 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-22783.
--
Resolution: Duplicate

I'll mark this as a duplicate, as SPARK-28594 has made over half of its progress.
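For reference, the rolling event log support being added there is configured 
roughly like this (a sketch; option names come from the in-progress SPARK-28594 
work and may still change):
{code}
import org.apache.spark.SparkConf

// Sketch: rotate event logs instead of growing one huge .inprogress file.
val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs:///spark-history")          // placeholder dir
  .set("spark.eventLog.rolling.enabled", "true")
  .set("spark.eventLog.rolling.maxFileSize", "100m")
{code}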

> event log directory(spark-history) filled by large .inprogress files for 
> spark streaming applications
> -
>
> Key: SPARK-22783
> URL: https://issues.apache.org/jira/browse/SPARK-22783
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0, 2.1.0
> Environment: Linux(Generic)
>Reporter: omkar kankalapati
>Priority: Major
>
> When running long-running streaming applications, HDFS storage gets filled 
> up with large *.inprogress files in the hdfs://spark-history/ directory
> For example:
>  hadoop fs -du -h /spark-history
> 234 /spark-history/.inprogress
> 46.6 G  /spark-history/.inprogress
> Instead of continuing to write to a very large (multi-GB) .inprogress file, 
> Spark should rotate the current log file when it reaches a size threshold (for 
> example, 100 MB) or a time interval, and perhaps expose a configuration 
> parameter for the size/interval.
> This is also mentioned in SPARK-12140 as a concern.
> It is very important and useful to support rotating the log files because 
> users may have a limited HDFS quota and these large files consume the 
> available quota.
> Also, the users do not have a viable workaround:
> 1) The files cannot be moved to another location, because moving the file 
> causes the event logging to stop.
> 2) Trying to copy the .inprogress file to another location and truncate the 
> .inprogress file fails because the file is still opened by 
> EventLoggingListener for writing:
> hdfs dfs -truncate -w 0 /spark-history/.inprogress
> truncate: Failed to TRUNCATE_FILE /spark-history/.inprogress 
> for DFSClient_NONMAPREDUCE_<#ID>on  because this file lease is currently 
> owned by DFSClient_NONMAPREDUCE_<#ID> on 
> The only workaround available is to disable the event logging for streaming 
> applications by setting "spark.eventLog.enabled" to false



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30514) add ENV_PYSPARK_MAJOR_PYTHON_VERSION support for JavaMainAppResource

2020-01-14 Thread Jackey Lee (Jira)
Jackey Lee created SPARK-30514:
--

 Summary: add ENV_PYSPARK_MAJOR_PYTHON_VERSION support for 
JavaMainAppResource
 Key: SPARK-30514
 URL: https://issues.apache.org/jira/browse/SPARK-30514
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.0.0, 3.1.0
Reporter: Jackey Lee


In Apache Livy, the program is first started with JavaMainAppResource, which then 
starts the Python worker. At this point, the program needs to be able to pass 
Python environment variables.

In Spark on YARN, we support this through spark.yarn.isPython. In Kubernetes, we 
can support it in a better way.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30513) Question about spark on k8s

2020-01-14 Thread Jackey Lee (Jira)
Jackey Lee created SPARK-30513:
--

 Summary: Question about spark on k8s
 Key: SPARK-30513
 URL: https://issues.apache.org/jira/browse/SPARK-30513
 Project: Spark
  Issue Type: Question
  Components: Kubernetes
Affects Versions: 3.0.0
Reporter: Jackey Lee


My question is: why is the Kube-DNS domain name hard-coded in the code? Isn't
it better to read the domain name from the service, or just use the hostname?

In our scenario, we run Spark on Kata-like containers and found that the code
hard-codes the Kube-DNS domain. If Kube-DNS is not configured in the
environment, tasks fail.

Besides, kube-dns is just a plugin for k8s, not a required component of k8s. 
We can use better DNS services without necessarily relying on it.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30510) Document spark.sql.sources.partitionOverwriteMode

2020-01-14 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015501#comment-17015501
 ] 

Hyukjin Kwon commented on SPARK-30510:
--

I think currently only some of the important configurations are documented in the 
SQL guide, e.g. [https://spark.apache.org/docs/latest/sql-performance-tuning.html].
I was thinking about automatically generating a configuration page from SQLConf, 
like the Spark SQL built-in function docs; however, it might need some discussion 
and investigation.

> Document spark.sql.sources.partitionOverwriteMode
> -
>
> Key: SPARK-30510
> URL: https://issues.apache.org/jira/browse/SPARK-30510
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 2.4.4
>Reporter: Nicholas Chammas
>Priority: Minor
>
> SPARK-20236 added a new option, {{spark.sql.sources.partitionOverwriteMode}}, 
> but it doesn't appear to be documented in [the expected 
> place|http://spark.apache.org/docs/2.4.4/configuration.html].
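> For context, a minimal sketch of the option being documented (its values are 
> "static", the default, and "dynamic"):
> {code:java}
> // With "dynamic", an overwrite only replaces the partitions present in the
> // written data instead of truncating every existing partition.
> spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
>
> // Assumes a partitioned table named "events" already exists; df and the
> // table name are placeholders.
> df.write.mode("overwrite").insertInto("events")
> {code}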



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30510) Document spark.sql.sources.partitionOverwriteMode

2020-01-14 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015473#comment-17015473
 ] 

Nicholas Chammas commented on SPARK-30510:
--

[~hyukjin.kwon] I think I'm missing something here because it seems that none 
of the {{spark.sql.*}} options in 
[SQLConf.scala|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala]
 are on the configurations page. Are they published somewhere else?

> Document spark.sql.sources.partitionOverwriteMode
> -
>
> Key: SPARK-30510
> URL: https://issues.apache.org/jira/browse/SPARK-30510
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 2.4.4
>Reporter: Nicholas Chammas
>Priority: Minor
>
> SPARK-20236 added a new option, {{spark.sql.sources.partitionOverwriteMode}}, 
> but it doesn't appear to be documented in [the expected 
> place|http://spark.apache.org/docs/2.4.4/configuration.html].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29721) Spark SQL reads unnecessary nested fields from Parquet after using explode

2020-01-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29721:
--
Affects Version/s: 3.0.0

> Spark SQL reads unnecessary nested fields from Parquet after using explode
> --
>
> Key: SPARK-29721
> URL: https://issues.apache.org/jira/browse/SPARK-29721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 3.0.0
>Reporter: Kai Kang
>Priority: Major
>
> This is a follow up for SPARK-4502. SPARK-4502 correctly addressed column 
> pruning for nested structures. However, when explode() is called on a nested 
> field, all columns of that nested structure are still fetched from the data 
> source.
> We are working on a project to create a parquet store for a big pre-joined 
> table between two tables that have a one-to-many relationship, and this is a 
> blocking issue for us.
>  
> The following code illustrates the issue. 
> Part 1: loading some nested data
> {noformat}
> val jsonStr = """{
>  "items": [
>{"itemId": 1, "itemData": "a"},
>{"itemId": 2, "itemData": "b"}
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
> {noformat}
>  
> Part 2: reading it back and explaining the queries
> {noformat}
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> // pruned, only loading itemId
> // ReadSchema: struct<items:array<struct<itemId:bigint>>>
> read.select($"items.itemId").explain(true) 
> // not pruned, loading both itemId and itemData
> // ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
> read.select(explode($"items.itemId")).explain(true)
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29721) Spark SQL reads unnecessary nested fields from Parquet after using explode

2020-01-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29721:
--
Affects Version/s: 2.4.0
   2.4.1
   2.4.2
   2.4.3

> Spark SQL reads unnecessary nested fields from Parquet after using explode
> --
>
> Key: SPARK-29721
> URL: https://issues.apache.org/jira/browse/SPARK-29721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4
>Reporter: Kai Kang
>Priority: Major
>
> This is a follow up for SPARK-4502. SPARK-4502 correctly addressed column 
> pruning for nested structures. However, when explode() is called on a nested 
> field, all columns of that nested structure are still fetched from the data 
> source.
> We are working on a project to create a parquet store for a big pre-joined 
> table between two tables that have a one-to-many relationship, and this is a 
> blocking issue for us.
>  
> The following code illustrates the issue. 
> Part 1: loading some nested data
> {noformat}
> val jsonStr = """{
>  "items": [
>{"itemId": 1, "itemData": "a"},
>{"itemId": 2, "itemData": "b"}
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
> {noformat}
>  
> Part 2: reading it back and explaining the queries
> {noformat}
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> // pruned, only loading itemId
> // ReadSchema: struct<items:array<struct<itemId:bigint>>>
> read.select($"items.itemId").explain(true) 
> // not pruned, loading both itemId and itemData
> // ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
> read.select(explode($"items.itemId")).explain(true)
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30512) Use a dedicated boss event group loop in the netty pipeline for external shuffle service

2020-01-14 Thread Chandni Singh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015362#comment-17015362
 ] 

Chandni Singh commented on SPARK-30512:
---

Please assign the issue to me so I can open up a PR.

> Use a dedicated boss event group loop in the netty pipeline for external 
> shuffle service
> 
>
> Key: SPARK-30512
> URL: https://issues.apache.org/jira/browse/SPARK-30512
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: Chandni Singh
>Priority: Major
>
> We have been seeing a large number of SASL authentication (RPC requests) 
> timing out with the external shuffle service.
>  The issue and all the analysis we did is described here:
>  [https://github.com/netty/netty/issues/9890]
> I added a {{LoggingHandler}} to the netty pipeline and realized that even the 
> channel registration is delayed by 30 seconds. 
>  In the Spark External Shuffle service, the boss event group and the worker 
> event group are the same, which is causing this delay.
> {code:java}
> EventLoopGroup bossGroup =
>   NettyUtils.createEventLoop(ioMode, conf.serverThreads(), 
> conf.getModuleName() + "-server");
> EventLoopGroup workerGroup = bossGroup;
> bootstrap = new ServerBootstrap()
>   .group(bossGroup, workerGroup)
>   .channel(NettyUtils.getServerChannelClass(ioMode))
>   .option(ChannelOption.ALLOCATOR, allocator)
>   .childOption(ChannelOption.ALLOCATOR, allocator);
> {code}
> When the load at the shuffle service increases, since the worker threads are 
> busy with existing channels, registering new channels gets delayed.
> The fix is simple. I created a dedicated boss thread event loop group with 1 
> thread.
> {code:java}
> EventLoopGroup bossGroup = NettyUtils.createEventLoop(ioMode, 1,
>   conf.getModuleName() + "-boss");
> EventLoopGroup workerGroup =  NettyUtils.createEventLoop(ioMode, 
> conf.serverThreads(),
> conf.getModuleName() + "-server");
> bootstrap = new ServerBootstrap()
>   .group(bossGroup, workerGroup)
>   .channel(NettyUtils.getServerChannelClass(ioMode))
>   .option(ChannelOption.ALLOCATOR, allocator)
> {code}
> This fixed the issue.
>  We just need 1 thread in the boss group because there is only a single 
> server bootstrap.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30512) Use a dedicated boss event group loop in the netty pipeline for external shuffle service

2020-01-14 Thread Chandni Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chandni Singh updated SPARK-30512:
--
Description: 
We have been seeing a large number of SASL authentication (RPC requests) timing 
out with the external shuffle service.
 The issue and all the analysis we did is described here:
 [https://github.com/netty/netty/issues/9890]

I added a {{LoggingHandler}} to netty pipeline and realized that even the 
channel registration is delayed by 30 seconds. 
 In the Spark External Shuffle service, the boss event group and the worker 
event group are same which is causing this delay.
{code:java}
EventLoopGroup bossGroup =
  NettyUtils.createEventLoop(ioMode, conf.serverThreads(), 
conf.getModuleName() + "-server");
EventLoopGroup workerGroup = bossGroup;

bootstrap = new ServerBootstrap()
  .group(bossGroup, workerGroup)
  .channel(NettyUtils.getServerChannelClass(ioMode))
  .option(ChannelOption.ALLOCATOR, allocator)
  .childOption(ChannelOption.ALLOCATOR, allocator);
{code}
When the load at the shuffle service increases, since the worker threads are 
busy with existing channels, registering new channels gets delayed.

The fix is simple. I created a dedicated boss thread event loop group with 1 
thread.
{code:java}
EventLoopGroup bossGroup = NettyUtils.createEventLoop(ioMode, 1,
  conf.getModuleName() + "-boss");
EventLoopGroup workerGroup =  NettyUtils.createEventLoop(ioMode, 
conf.serverThreads(),
conf.getModuleName() + "-server");

bootstrap = new ServerBootstrap()
  .group(bossGroup, workerGroup)
  .channel(NettyUtils.getServerChannelClass(ioMode))
  .option(ChannelOption.ALLOCATOR, allocator)
{code}
This fixed the issue.
 We just need 1 thread in the boss group because there is only a single server 
bootstrap.

 

  was:
We have been seeing a large number of SASL authentication (RPC requests) timing 
out with the external shuffle service.
 The issue and all the analysis we did is described here:
 [https://github.com/netty/netty/issues/9890]

I added a {{LoggingHandler}} to netty pipeline and realized that even the 
channel registration is delayed by 30 seconds. 
 In the Spark External Shuffle service, the boss event group and the worker 
event group are same which is causing this delay.
{code:java}
EventLoopGroup bossGroup =
  NettyUtils.createEventLoop(ioMode, conf.serverThreads(), 
conf.getModuleName() + "-server");
EventLoopGroup workerGroup = bossGroup;

bootstrap = new ServerBootstrap()
  .group(bossGroup, workerGroup)
  .channel(NettyUtils.getServerChannelClass(ioMode))
  .option(ChannelOption.ALLOCATOR, allocator)
  .childOption(ChannelOption.ALLOCATOR, allocator);
{code}
When the load at the shuffle service increases, since the worker threads are 
busy with existing channels, registering new channels gets delayed.

The fix is simple. I created a dedicated boss thread event loop group with 1 
thread.
{code:java}
EventLoopGroup bossGroup = NettyUtils.createEventLoop(ioMode, 1,
  conf.getModuleName() + "-boss");
EventLoopGroup workerGroup =  NettyUtils.createEventLoop(ioMode, 
conf.serverThreads(),
conf.getModuleName() + "-server");

bootstrap = new ServerBootstrap()
  .group(bossGroup, workerGroup)
  .channel(NettyUtils.getServerChannelClass(ioMode))
  .option(ChannelOption.ALLOCATOR, allocator)
{code}
This fixed the issue.
 We just need 1 thread in the boss group because there is only a single server 
bootstrap.

Please assign the issue to me so I can open up a PR.


> Use a dedicated boss event group loop in the netty pipeline for external 
> shuffle service
> 
>
> Key: SPARK-30512
> URL: https://issues.apache.org/jira/browse/SPARK-30512
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: Chandni Singh
>Priority: Major
>
> We have been seeing a large number of SASL authentication (RPC requests) 
> timing out with the external shuffle service.
>  The issue and all the analysis we did is described here:
>  [https://github.com/netty/netty/issues/9890]
> I added a {{LoggingHandler}} to netty pipeline and realized that even the 
> channel registration is delayed by 30 seconds. 
>  In the Spark External Shuffle service, the boss event group and the worker 
> event group are same which is causing this delay.
> {code:java}
> EventLoopGroup bossGroup =
>   NettyUtils.createEventLoop(ioMode, conf.serverThreads(), 
> conf.getModuleName() + "-server");
> EventLoopGroup workerGroup = bossGroup;
> bootstrap = new ServerBootstrap()
>   .group(bossGroup, workerGroup)
>   .channel(NettyUtils.getServerChannelClass(ioMode))
>   

[jira] [Created] (SPARK-30512) Use a dedicated boss event group loop in the netty pipeline for external shuffle service

2020-01-14 Thread Chandni Singh (Jira)
Chandni Singh created SPARK-30512:
-

 Summary: Use a dedicated boss event group loop in the netty 
pipeline for external shuffle service
 Key: SPARK-30512
 URL: https://issues.apache.org/jira/browse/SPARK-30512
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 3.0.0
Reporter: Chandni Singh


We have been seeing a large number of SASL authentication (RPC requests) timing 
out with the external shuffle service.
 The issue and all the analysis we did is described here:
 [https://github.com/netty/netty/issues/9890]

I added a {{LoggingHandler}} to netty pipeline and realized that even the 
channel registration is delayed by 30 seconds. 
 In the Spark External Shuffle service, the boss event group and the worker 
event group are same which is causing this delay.
{code:java}
EventLoopGroup bossGroup =
  NettyUtils.createEventLoop(ioMode, conf.serverThreads(), 
conf.getModuleName() + "-server");
EventLoopGroup workerGroup = bossGroup;

bootstrap = new ServerBootstrap()
  .group(bossGroup, workerGroup)
  .channel(NettyUtils.getServerChannelClass(ioMode))
  .option(ChannelOption.ALLOCATOR, allocator)
  .childOption(ChannelOption.ALLOCATOR, allocator);
{code}
When the load at the shuffle service increases, since the worker threads are 
busy with existing channels, registering new channels gets delayed.

The fix is simple. I created a dedicated boss thread event loop group with 1 
thread.
{code:java}
EventLoopGroup bossGroup = NettyUtils.createEventLoop(ioMode, 1,
  conf.getModuleName() + "-boss");
EventLoopGroup workerGroup =  NettyUtils.createEventLoop(ioMode, 
conf.serverThreads(),
conf.getModuleName() + "-server");

bootstrap = new ServerBootstrap()
  .group(bossGroup, workerGroup)
  .channel(NettyUtils.getServerChannelClass(ioMode))
  .option(ChannelOption.ALLOCATOR, allocator)
{code}
This fixed the issue.
 We just need 1 thread in the boss group because there is only a single server 
bootstrap.

Please assign the issue to me so I can open up a PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30509) Deprecation log warning is not printed in Avro schema inferring

2020-01-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30509.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27200
[https://github.com/apache/spark/pull/27200]

> Deprecation log warning is not printed in Avro schema inferring
> ---
>
> Key: SPARK-30509
> URL: https://issues.apache.org/jira/browse/SPARK-30509
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> The bug can be reproduced by the test:
> {code}
>   test("log a warning of ignoreExtension deprecation") {
> val logAppender = new LogAppender
> withTempPath { dir =>
>   Seq(("a", 1, 2), ("b", 1, 2), ("c", 2, 1), ("d", 2, 1))
> .toDF("value", "p1", "p2")
> .repartition(2)
> .write
> .format("avro")
> .option("header", true)
> .save(dir.getCanonicalPath)
>   withLogAppender(logAppender) {
> spark
>   .read
>   .format("avro")
>   .option(AvroOptions.ignoreExtensionKey, false)
>   .option("header", true)
>   .load(dir.getCanonicalPath)
>   .count()
>   }
>   val deprecatedEvents = logAppender.loggingEvents
> .filter(_.getRenderedMessage.contains(
>   s"Option ${AvroOptions.ignoreExtensionKey} is deprecated"))
>   assert(deprecatedEvents.size === 1)
> }
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30509) Deprecation log warning is not printed in Avro schema inferring

2020-01-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30509:
-

Assignee: Maxim Gekk

> Deprecation log warning is not printed in Avro schema inferring
> ---
>
> Key: SPARK-30509
> URL: https://issues.apache.org/jira/browse/SPARK-30509
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> The bug can be reproduced by the test:
> {code}
>   test("log a warning of ignoreExtension deprecation") {
> val logAppender = new LogAppender
> withTempPath { dir =>
>   Seq(("a", 1, 2), ("b", 1, 2), ("c", 2, 1), ("d", 2, 1))
> .toDF("value", "p1", "p2")
> .repartition(2)
> .write
> .format("avro")
> .option("header", true)
> .save(dir.getCanonicalPath)
>   withLogAppender(logAppender) {
> spark
>   .read
>   .format("avro")
>   .option(AvroOptions.ignoreExtensionKey, false)
>   .option("header", true)
>   .load(dir.getCanonicalPath)
>   .count()
>   }
>   val deprecatedEvents = logAppender.loggingEvents
> .filter(_.getRenderedMessage.contains(
>   s"Option ${AvroOptions.ignoreExtensionKey} is deprecated"))
>   assert(deprecatedEvents.size === 1)
> }
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30511) Spark marks ended speculative tasks as pending leads to holding idle executors

2020-01-14 Thread Zebing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zebing Lin updated SPARK-30511:
---
External issue ID:   (was: SPARK-2840)

> Spark marks ended speculative tasks as pending leads to holding idle executors
> --
>
> Key: SPARK-30511
> URL: https://issues.apache.org/jira/browse/SPARK-30511
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.0
>Reporter: Zebing Lin
>Priority: Major
>
> *TL;DR*
>  When speculative tasks finish, fail, or get killed, they are still considered 
> pending and count towards the calculation of the number of needed executors.
> h3. Symptom
> In one of our production jobs (running 4 tasks per executor), we found that it 
> was holding 6 executors at the end with only 2 tasks running (1 speculative). 
> With more logging enabled, we found the job printed:
> {code:java}
> pendingTasks is 0 pendingSpeculativeTasks is 17 totalRunningTasks is 2
> {code}
>  while the job only had 1 speculative task running and 16 speculative tasks 
> intentionally killed because the corresponding original tasks had finished.
> h3. The Bug
> Upon examining the code of _pendingSpeculativeTasks_: 
> {code:java}
> stageAttemptToNumSpeculativeTasks.map { case (stageAttempt, numTasks) =>
>   numTasks - 
> stageAttemptToSpeculativeTaskIndices.get(stageAttempt).map(_.size).getOrElse(0)
> }.sum
> {code}
> where _stageAttemptToNumSpeculativeTasks(stageAttempt)_ is incremented on 
> _onSpeculativeTaskSubmitted_ but never decremented; 
> _stageAttemptToNumSpeculativeTasks -= stageAttempt_ is only performed on stage 
> completion. *This means Spark is marking ended speculative tasks as pending, 
> which leads Spark to hold more executors than it actually needs!*
> I will have a PR ready to fix this issue, along with SPARK-2840 too
>  
>  
>  
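> A simplified sketch of the direction of the fix (hypothetical names, not the 
> actual ExecutorAllocationManager code): also decrement the count when a 
> speculative task ends.
> {code:java}
> import scala.collection.mutable
>
> object SpeculativeTaskTracking {
>   type StageAttempt = (Int, Int) // (stageId, stageAttemptId)
>
>   private val stageAttemptToNumSpeculativeTasks = mutable.HashMap[StageAttempt, Int]()
>
>   def onSpeculativeTaskSubmitted(sa: StageAttempt): Unit = synchronized {
>     stageAttemptToNumSpeculativeTasks(sa) =
>       stageAttemptToNumSpeculativeTasks.getOrElse(sa, 0) + 1
>   }
>
>   // The missing piece: decrement when a speculative task ends (finished,
>   // failed, or killed), not only when the whole stage completes.
>   def onSpeculativeTaskEnded(sa: StageAttempt): Unit = synchronized {
>     stageAttemptToNumSpeculativeTasks.get(sa).foreach { n =>
>       stageAttemptToNumSpeculativeTasks(sa) = math.max(n - 1, 0)
>     }
>   }
>
>   def pendingSpeculativeTasks: Int = synchronized {
>     stageAttemptToNumSpeculativeTasks.values.sum
>   }
> }
> {code}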



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30511) Spark marks ended speculative tasks as pending leads to holding idle executors

2020-01-14 Thread Zebing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zebing Lin updated SPARK-30511:
---
Description: 
*TL;DR*
 When speculative tasks finished/failed/got killed, they are still considered 
as pending and count towards the calculation of number of needed executors.
h3. Symptom

In one of our production job (where it's running 4 tasks per executor), we 
found that it was holding 6 executors at the end with only 2 tasks running (1 
speculative). With more logging enabled, we found the job printed:
{code:java}
pendingTasks is 0 pendingSpeculativeTasks is 17 totalRunningTasks is 2
{code}
 while the job only had 1 speculative task running and 16 speculative tasks 
intentionally killed because of corresponding original tasks had finished.
h3. The Bug

Upon examining the code of _pendingSpeculativeTasks_: 
{code:java}
stageAttemptToNumSpeculativeTasks.map { case (stageAttempt, numTasks) =>
  numTasks - 
stageAttemptToSpeculativeTaskIndices.get(stageAttempt).map(_.size).getOrElse(0)
}.sum
{code}
where _stageAttemptToNumSpeculativeTasks(stageAttempt)_ is incremented on 
_onSpeculativeTaskSubmitted_, but never decremented.  
_stageAttemptToNumSpeculativeTasks -= stageAttempt_ is performed on stage 
completion. *This means Spark is marking ended speculative tasks as pending, 
which leads to Spark to hold more executors that it actually needs!*

I will have a PR ready to fix this issue, along with SPARK-2840 too

 

 

 

  was:
*TL;DR*
When speculative tasks finished/failed/got killed, they are still considered as 
pending and count towards the calculation of number of needed executors.
h3. Symptom

In one of our production job (where it's running 4 tasks per executor), we 
found that it was holding 6 executors at the end with only 2 tasks running (1 
speculative). With more logging enabled, we found the job printed:

 
{code:java}
pendingTasks is 0 pendingSpeculativeTasks is 17 totalRunningTasks is 2

{code}
 

while the job only had 1 speculative task running and 16 speculative tasks 
intentionally killed because of corresponding original tasks had finished.
h3. The Bug

Upon examining the code of _pendingSpeculativeTasks_:

 
{code:java}
stageAttemptToNumSpeculativeTasks.map { case (stageAttempt, numTasks) =>
  numTasks - 
stageAttemptToSpeculativeTaskIndices.get(stageAttempt).map(_.size).getOrElse(0)
}.sum
{code}
where _stageAttemptToNumSpeculativeTasks(stageAttempt)_ is incremented on 
_onSpeculativeTaskSubmitted_, but never decremented.  
_stageAttemptToNumSpeculativeTasks -= stageAttempt_ is performed on stage 
completion. *This means Spark is marking ended speculative tasks as pending, 
which leads to Spark to hold more executors that it actually needs!*

 

I will have a PR ready to fix this issue, along with SPARK-2840 too

 

 

 


> Spark marks ended speculative tasks as pending leads to holding idle executors
> --
>
> Key: SPARK-30511
> URL: https://issues.apache.org/jira/browse/SPARK-30511
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.0
>Reporter: Zebing Lin
>Priority: Major
>
> *TL;DR*
>  When speculative tasks finished/failed/got killed, they are still considered 
> as pending and count towards the calculation of number of needed executors.
> h3. Symptom
> In one of our production job (where it's running 4 tasks per executor), we 
> found that it was holding 6 executors at the end with only 2 tasks running (1 
> speculative). With more logging enabled, we found the job printed:
> {code:java}
> pendingTasks is 0 pendingSpeculativeTasks is 17 totalRunningTasks is 2
> {code}
>  while the job only had 1 speculative task running and 16 speculative tasks 
> intentionally killed because of corresponding original tasks had finished.
> h3. The Bug
> Upon examining the code of _pendingSpeculativeTasks_: 
> {code:java}
> stageAttemptToNumSpeculativeTasks.map { case (stageAttempt, numTasks) =>
>   numTasks - 
> stageAttemptToSpeculativeTaskIndices.get(stageAttempt).map(_.size).getOrElse(0)
> }.sum
> {code}
> where _stageAttemptToNumSpeculativeTasks(stageAttempt)_ is incremented on 
> _onSpeculativeTaskSubmitted_, but never decremented.  
> _stageAttemptToNumSpeculativeTasks -= stageAttempt_ is performed on stage 
> completion. *This means Spark is marking ended speculative tasks as pending, 
> which leads to Spark to hold more executors that it actually needs!*
> I will have a PR ready to fix this issue, along with SPARK-2840 too
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30511) Spark marks ended speculative tasks as pending leads to holding idle executors

2020-01-14 Thread Zebing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zebing Lin updated SPARK-30511:
---
External issue ID: SPARK-2840

> Spark marks ended speculative tasks as pending leads to holding idle executors
> --
>
> Key: SPARK-30511
> URL: https://issues.apache.org/jira/browse/SPARK-30511
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.0
>Reporter: Zebing Lin
>Priority: Major
>
> *TL;DR*
>  When speculative tasks finished/failed/got killed, they are still considered 
> as pending and count towards the calculation of number of needed executors.
> h3. Symptom
> In one of our production job (where it's running 4 tasks per executor), we 
> found that it was holding 6 executors at the end with only 2 tasks running (1 
> speculative). With more logging enabled, we found the job printed:
> {code:java}
> pendingTasks is 0 pendingSpeculativeTasks is 17 totalRunningTasks is 2
> {code}
>  while the job only had 1 speculative task running and 16 speculative tasks 
> intentionally killed because of corresponding original tasks had finished.
> h3. The Bug
> Upon examining the code of _pendingSpeculativeTasks_: 
> {code:java}
> stageAttemptToNumSpeculativeTasks.map { case (stageAttempt, numTasks) =>
>   numTasks - 
> stageAttemptToSpeculativeTaskIndices.get(stageAttempt).map(_.size).getOrElse(0)
> }.sum
> {code}
> where _stageAttemptToNumSpeculativeTasks(stageAttempt)_ is incremented on 
> _onSpeculativeTaskSubmitted_, but never decremented.  
> _stageAttemptToNumSpeculativeTasks -= stageAttempt_ is performed on stage 
> completion. *This means Spark is marking ended speculative tasks as pending, 
> which leads to Spark to hold more executors that it actually needs!*
> I will have a PR ready to fix this issue, along with SPARK-2840 too
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30511) Spark marks ended speculative tasks as pending leads to holding idle executors

2020-01-14 Thread Zebing Lin (Jira)
Zebing Lin created SPARK-30511:
--

 Summary: Spark marks ended speculative tasks as pending leads to 
holding idle executors
 Key: SPARK-30511
 URL: https://issues.apache.org/jira/browse/SPARK-30511
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 2.3.0
Reporter: Zebing Lin


*TL;DR*
When speculative tasks finished/failed/got killed, they are still considered as 
pending and count towards the calculation of number of needed executors.
h3. Symptom

In one of our production job (where it's running 4 tasks per executor), we 
found that it was holding 6 executors at the end with only 2 tasks running (1 
speculative). With more logging enabled, we found the job printed:

 
{code:java}
pendingTasks is 0 pendingSpeculativeTasks is 17 totalRunningTasks is 2

{code}
 

while the job only had 1 speculative task running and 16 speculative tasks 
intentionally killed because of corresponding original tasks had finished.
h3. The Bug

Upon examining the code of _pendingSpeculativeTasks_:

 
{code:java}
stageAttemptToNumSpeculativeTasks.map { case (stageAttempt, numTasks) =>
  numTasks - 
stageAttemptToSpeculativeTaskIndices.get(stageAttempt).map(_.size).getOrElse(0)
}.sum
{code}
where _stageAttemptToNumSpeculativeTasks(stageAttempt)_ is incremented on 
_onSpeculativeTaskSubmitted_, but never decremented.  
_stageAttemptToNumSpeculativeTasks -= stageAttempt_ is performed on stage 
completion. *This means Spark is marking ended speculative tasks as pending, 
which leads to Spark to hold more executors that it actually needs!*

 

I will have a PR ready to fix this issue, along with SPARK-2840 too

 

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30495) How to disable 'spark.security.credentials.${service}.enabled' in Structured streaming while connecting to a kafka cluster

2020-01-14 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015335#comment-17015335
 ] 

Jungtaek Lim commented on SPARK-30495:
--

Adjusted priority as it's a regression. It may warrant a higher priority, but 
given we have a PR close to merging, Major seems OK.

> How to disable 'spark.security.credentials.${service}.enabled' in Structured 
> streaming while connecting to a kafka cluster
> --
>
> Key: SPARK-30495
> URL: https://issues.apache.org/jira/browse/SPARK-30495
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: act_coder
>Priority: Major
>
> Trying to read data from a secured Kafka cluster using Spark Structured
>  Streaming, using the library below to read the data -
>  +*"spark-sql-kafka-0-10_2.12":"3.0.0-preview"*+ - since it has the feature to
>  specify our own consumer group id (instead of Spark setting its own group
>  id).
> +*Dependency used in code:*+
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-sql-kafka-0-10_2.12</artifactId>
>   <version>3.0.0-preview</version>
> </dependency>
>  
> +*Logs:*+
> Getting the below error - even after specifying the required JAAS
>  configuration in spark options.
> Caused by: java.lang.IllegalArgumentException: requirement failed:
>  *Delegation token must exist for this connector*. at
>  scala.Predef$.require(Predef.scala:281) at
> org.apache.spark.kafka010.KafkaTokenUtil$.isConnectorUsingCurrentToken(KafkaTokenUtil.scala:299)
>  at
>  
> org.apache.spark.sql.kafka010.KafkaDataConsumer.getOrRetrieveConsumer(KafkaDataConsumer.scala:533)
>  at
>  
> org.apache.spark.sql.kafka010.KafkaDataConsumer.$anonfun$get$1(KafkaDataConsumer.scala:275)
>  
> +*Spark configuration used to read from Kafka:*+
> val kafkaDF = sparkSession.readStream
>   .format("kafka")
>   .option("kafka.bootstrap.servers", bootStrapServer)
>   .option("subscribe", kafkaTopic)
>   // Setting JAAS Configuration
>   .option("kafka.sasl.jaas.config", KAFKA_JAAS_SASL)
>   .option("kafka.sasl.mechanism", "PLAIN")
>   .option("kafka.security.protocol", "SASL_SSL")
>   // Setting custom consumer group id
>   .option("kafka.group.id", "test_cg")
>   .load()
>  
> The following document specifies that we can disable the feature of obtaining a
>  delegation token -
>  
> [https://spark.apache.org/docs/3.0.0-preview/structured-streaming-kafka-integration.html]
> Tried setting the property *spark.security.credentials.kafka.enabled* to
>  *false* in the Spark config, but it is still failing with the same error.
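> For reference, here is a minimal sketch of setting that switch programmatically. It 
> only illustrates where the flag goes; the builder, app name, and the idea that this 
> works around the error are assumptions, not a confirmed fix for this ticket:
> {code:java}
> import org.apache.spark.sql.SparkSession
> 
> // Illustrative only: disable Kafka delegation token retrieval via config.
> val sparkSession = SparkSession.builder()
>   .appName("kafka-no-delegation-token")
>   .config("spark.security.credentials.kafka.enabled", "false")
>   .getOrCreate()
> {code}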



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30495) How to disable 'spark.security.credentials.${service}.enabled' in Structured streaming while connecting to a kafka cluster

2020-01-14 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-30495:
-
Priority: Major  (was: Minor)

> How to disable 'spark.security.credentials.${service}.enabled' in Structured 
> streaming while connecting to a kafka cluster
> --
>
> Key: SPARK-30495
> URL: https://issues.apache.org/jira/browse/SPARK-30495
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: act_coder
>Priority: Major
>
> Trying to read data from a secured Kafka cluster using Spark Structured
>  Streaming, using the library below to read the data -
>  +*"spark-sql-kafka-0-10_2.12":"3.0.0-preview"*+ - since it has the feature to
>  specify our own consumer group id (instead of Spark setting its own group
>  id).
> +*Dependency used in code:*+
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-sql-kafka-0-10_2.12</artifactId>
>   <version>3.0.0-preview</version>
> </dependency>
>  
> +*Logs:*+
> Getting the below error - even after specifying the required JAAS
>  configuration in spark options.
> Caused by: java.lang.IllegalArgumentException: requirement failed:
>  *Delegation token must exist for this connector*. at
>  scala.Predef$.require(Predef.scala:281) at
> org.apache.spark.kafka010.KafkaTokenUtil$.isConnectorUsingCurrentToken(KafkaTokenUtil.scala:299)
>  at
>  
> org.apache.spark.sql.kafka010.KafkaDataConsumer.getOrRetrieveConsumer(KafkaDataConsumer.scala:533)
>  at
>  
> org.apache.spark.sql.kafka010.KafkaDataConsumer.$anonfun$get$1(KafkaDataConsumer.scala:275)
>  
> +*Spark configuration used to read from Kafka:*+
> val kafkaDF = sparkSession.readStream
>   .format("kafka")
>   .option("kafka.bootstrap.servers", bootStrapServer)
>   .option("subscribe", kafkaTopic)
>   // Setting JAAS Configuration
>   .option("kafka.sasl.jaas.config", KAFKA_JAAS_SASL)
>   .option("kafka.sasl.mechanism", "PLAIN")
>   .option("kafka.security.protocol", "SASL_SSL")
>   // Setting custom consumer group id
>   .option("kafka.group.id", "test_cg")
>   .load()
>  
> The following document specifies that we can disable the feature of obtaining a
>  delegation token -
>  
> [https://spark.apache.org/docs/3.0.0-preview/structured-streaming-kafka-integration.html]
> Tried setting the property *spark.security.credentials.kafka.enabled* to
>  *false* in the Spark config, but it is still failing with the same error.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30488) Deadlock between block-manager-slave-async-thread-pool and spark context cleaner

2020-01-14 Thread Rohit Agrawal (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015308#comment-17015308
 ] 

Rohit Agrawal commented on SPARK-30488:
---

[~ajithshetty]

We use the following to create the Spark context:

 

SparkSession.builder().config(finalSparkConf).getOrCreate()

> Deadlock between block-manager-slave-async-thread-pool and spark context 
> cleaner
> 
>
> Key: SPARK-30488
> URL: https://issues.apache.org/jira/browse/SPARK-30488
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Rohit Agrawal
>Priority: Major
>
> A deadlock happens while cleaning up the Spark context. Here is the full thread 
> dump:
>  
>   
>   2020-01-10T20:13:16.2884057Z Full thread dump Java HotSpot(TM) 64-Bit 
> Server VM (25.221-b11 mixed mode):
> 2020-01-10T20:13:16.2884392Z 
> 2020-01-10T20:13:16.2884660Z "SIGINT handler" #488 daemon prio=9 os_prio=2 
> tid=0x111fa000 nid=0x4794 waiting for monitor entry 
> [0x1c86e000]
> 2020-01-10T20:13:16.2884807Z java.lang.Thread.State: BLOCKED (on object 
> monitor)
> 2020-01-10T20:13:16.2884879Z at java.lang.Shutdown.exit(Shutdown.java:212)
> 2020-01-10T20:13:16.2885693Z - waiting to lock <0xc0155de0> (a 
> java.lang.Class for java.lang.Shutdown)
> 2020-01-10T20:13:16.2885840Z at 
> java.lang.Terminator$1.handle(Terminator.java:52)
> 2020-01-10T20:13:16.2885965Z at sun.misc.Signal$1.run(Signal.java:212)
> 2020-01-10T20:13:16.2886329Z at java.lang.Thread.run(Thread.java:748)
> 2020-01-10T20:13:16.2886430Z 
> 2020-01-10T20:13:16.2886752Z "Thread-3" #108 prio=5 os_prio=0 
> tid=0x111f7800 nid=0x48cc waiting for monitor entry 
> [0x2c33f000]
> 2020-01-10T20:13:16.2886881Z java.lang.Thread.State: BLOCKED (on object 
> monitor)
> 2020-01-10T20:13:16.2886999Z at 
> org.apache.hadoop.util.ShutdownHookManager.getShutdownHooksInOrder(ShutdownHookManager.java:273)
> 2020-01-10T20:13:16.2887107Z at 
> org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:121)
> 2020-01-10T20:13:16.2887212Z at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95)
> 2020-01-10T20:13:16.2887421Z 
> 2020-01-10T20:13:16.2887798Z "block-manager-slave-async-thread-pool-81" #486 
> daemon prio=5 os_prio=0 tid=0x111fe800 nid=0x2e34 waiting for monitor 
> entry [0x2bf3d000]
> 2020-01-10T20:13:16.2889192Z java.lang.Thread.State: BLOCKED (on object 
> monitor)
> 2020-01-10T20:13:16.2889305Z at 
> java.lang.ClassLoader.loadClass(ClassLoader.java:404)
> 2020-01-10T20:13:16.2889405Z - waiting to lock <0xc1f359f0> (a 
> sbt.internal.LayeredClassLoader)
> 2020-01-10T20:13:16.2889482Z at 
> java.lang.ClassLoader.loadClass(ClassLoader.java:411)
> 2020-01-10T20:13:16.2889582Z - locked <0xca33e4c8> (a 
> sbt.internal.ManagedClassLoader$ZombieClassLoader)
> 2020-01-10T20:13:16.2889659Z at 
> java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> 2020-01-10T20:13:16.2890881Z at 
> org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply$mcZ$sp(BlockManagerSlaveEndpoint.scala:58)
> 2020-01-10T20:13:16.2891006Z at 
> org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(BlockManagerSlaveEndpoint.scala:57)
> 2020-01-10T20:13:16.2891142Z at 
> org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(BlockManagerSlaveEndpoint.scala:57)
> 2020-01-10T20:13:16.2891260Z at 
> org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$1.apply(BlockManagerSlaveEndpoint.scala:86)
> 2020-01-10T20:13:16.2891375Z at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
> 2020-01-10T20:13:16.2891624Z at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
> 2020-01-10T20:13:16.2891737Z at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 2020-01-10T20:13:16.2891833Z at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 2020-01-10T20:13:16.2891925Z at java.lang.Thread.run(Thread.java:748)
> 2020-01-10T20:13:16.2891967Z 
> 2020-01-10T20:13:16.2892066Z "pool-31-thread-16" #335 prio=5 os_prio=0 
> tid=0x153b2000 nid=0x1aac waiting on condition [0x4b2ff000]
> 2020-01-10T20:13:16.2892147Z java.lang.Thread.State: WAITING (parking)
> 2020-01-10T20:13:16.2892241Z at sun.misc.Unsafe.park(Native Method)
> 2020-01-10T20:13:16.2892328Z - parking to wait for <0xc9cad078> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> 2020-01-10T20:13:16.2892437Z at 
> 

[jira] [Created] (SPARK-30510) Document spark.sql.sources.partitionOverwriteMode

2020-01-14 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-30510:


 Summary: Document spark.sql.sources.partitionOverwriteMode
 Key: SPARK-30510
 URL: https://issues.apache.org/jira/browse/SPARK-30510
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, SQL
Affects Versions: 2.4.4
Reporter: Nicholas Chammas


SPARK-20236 added a new option, {{spark.sql.sources.partitionOverwriteMode}}, 
but it doesn't appear to be documented in [the expected 
place|http://spark.apache.org/docs/2.4.4/configuration.html].
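For context, a short illustrative snippet of the option this ticket asks to document. 
The DataFrame and table names below are placeholders; "dynamic" overwrites only the 
partitions present in the written data, versus the default "static":
{code:java}
// Illustrative only: names below are placeholders.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
df.write
  .mode("overwrite")
  .insertInto("my_partitioned_table")
{code}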



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27142) Provide REST API for SQL level information

2020-01-14 Thread Marcelo Masiero Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Masiero Vanzin reassigned SPARK-27142:
--

Assignee: Ajith S

> Provide REST API for SQL level information
> --
>
> Key: SPARK-27142
> URL: https://issues.apache.org/jira/browse/SPARK-27142
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ajith S
>Assignee: Ajith S
>Priority: Minor
> Attachments: image-2019-03-13-19-29-26-896.png
>
>
> Currently, for monitoring a Spark application, SQL information is not available 
> from the REST API but only via the UI. REST provides only 
> applications, jobs, stages, and environment. This Jira is targeted at providing a 
> REST API so that SQL-level information can be found.
>  
> Details: 
> https://issues.apache.org/jira/browse/SPARK-27142?focusedCommentId=16791728=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16791728
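> As a quick illustration, the new endpoint could be queried like any other entry in 
> the REST API; the path and application id below are assumptions based on the 
> existing /api/v1 layout, not the final API:
> {code:java}
> // Illustrative only: fetch SQL-level info from a running application's UI.
> import scala.io.Source
> 
> val appId = "app-20200114120000-0001"  // placeholder application id
> val url = s"http://localhost:4040/api/v1/applications/$appId/sql"
> println(Source.fromURL(url).mkString)
> {code}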



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27142) Provide REST API for SQL level information

2020-01-14 Thread Marcelo Masiero Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Masiero Vanzin resolved SPARK-27142.

Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 24076
[https://github.com/apache/spark/pull/24076]

> Provide REST API for SQL level information
> --
>
> Key: SPARK-27142
> URL: https://issues.apache.org/jira/browse/SPARK-27142
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ajith S
>Assignee: Ajith S
>Priority: Minor
> Fix For: 3.0.0
>
> Attachments: image-2019-03-13-19-29-26-896.png
>
>
> Currently, for monitoring a Spark application, SQL information is not available 
> from the REST API but only via the UI. REST provides only 
> applications, jobs, stages, and environment. This Jira is targeted at providing a 
> REST API so that SQL-level information can be found.
>  
> Details: 
> https://issues.apache.org/jira/browse/SPARK-27142?focusedCommentId=16791728=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16791728



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30424) Change ExpressionEncoder toRow method to return UnsafeRow

2020-01-14 Thread Erik Erlandson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015183#comment-17015183
 ] 

Erik Erlandson commented on SPARK-30424:


The main place this change causes a compile failure is in SparkSession:

 
{code:java}
def createDataFrame(rowRDD: RDD[Row], schema: StructType): DataFrame{code}
And the key RDD impacted is LogicalRDD.

What I'm wondering is whether it is appropriate to change the signature of the 
RDD in LogicalRDD from InternalRow to the more specific UnsafeRow. My intuition 
is no; however, it's also true that this is what's actually happening under the 
hood currently, so I'm curious what the Catalyst maintainers think about it.
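For what it's worth, a tiny illustration of why narrowing the declared return type is 
source-compatible for callers: UnsafeRow is a subtype of InternalRow, so code that 
expects an InternalRow still compiles. This is just the subtyping relationship, not 
the proposed patch:
{code:java}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.UnsafeRow

// UnsafeRow <: InternalRow, so a narrowed return type is still usable
// wherever an InternalRow is expected.
val narrowed: UnsafeRow = new UnsafeRow(1) // a row with one field
val widened: InternalRow = narrowed
{code}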

 

 

> Change ExpressionEncoder toRow method to return UnsafeRow
> -
>
> Key: SPARK-30424
> URL: https://issues.apache.org/jira/browse/SPARK-30424
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Erik Erlandson
>Priority: Minor
>
> [~wenchen] observed that the toRow() method on ExpressionEncoder can have its 
> return type specified as UnsafeRow. See discussion on 
> [https://github.com/apache/spark/pull/25024] 
>  
> Not a high priority, but it could be done for 3.0.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9478) Add sample weights to Random Forest

2020-01-14 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-9478.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

> Add sample weights to Random Forest
> ---
>
> Key: SPARK-9478
> URL: https://issues.apache.org/jira/browse/SPARK-9478
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.1
>Reporter: Patrick Crenshaw
>Assignee: zhengruifeng
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, this implementation of random forest does not support sample 
> (instance) weights. Weights are important when there is imbalanced training 
> data or the evaluation metric of a classifier is imbalanced (e.g. true 
> positive rate at some false positive threshold).  Sample weights generalize 
> class weights, so this could be used to add class weights later on.
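> For illustration, a sketch of how per-instance weights would be supplied once 
> supported, following the usual ML weight-column convention. The data and column 
> names are made up, and the setter name is assumed from other weighted estimators:
> {code:java}
> import org.apache.spark.ml.classification.RandomForestClassifier
> import org.apache.spark.ml.linalg.Vectors
> import org.apache.spark.sql.SparkSession
> 
> val spark = SparkSession.builder().master("local[*]").appName("rf-weights").getOrCreate()
> import spark.implicits._
> 
> // Toy training data with a per-instance "weight" column.
> val training = Seq(
>   (Vectors.dense(0.0, 1.0), 0.0, 1.0),
>   (Vectors.dense(1.0, 0.0), 1.0, 5.0)  // up-weighted example
> ).toDF("features", "label", "weight")
> 
> val rf = new RandomForestClassifier()
>   .setLabelCol("label")
>   .setFeaturesCol("features")
>   .setWeightCol("weight")  // sample (instance) weights
> val model = rf.fit(training)
> {code}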



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-9478) Add sample weights to Random Forest

2020-01-14 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reopened SPARK-9478:
-

> Add sample weights to Random Forest
> ---
>
> Key: SPARK-9478
> URL: https://issues.apache.org/jira/browse/SPARK-9478
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.1
>Reporter: Patrick Crenshaw
>Assignee: zhengruifeng
>Priority: Major
>
> Currently, this implementation of random forest does not support sample 
> (instance) weights. Weights are important when there is imbalanced training 
> data or the evaluation metric of a classifier is imbalanced (e.g. true 
> positive rate at some false positive threshold).  Sample weights generalize 
> class weights, so this could be used to add class weights later on.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9478) Add sample weights to Random Forest

2020-01-14 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-9478:
---

Assignee: zhengruifeng

> Add sample weights to Random Forest
> ---
>
> Key: SPARK-9478
> URL: https://issues.apache.org/jira/browse/SPARK-9478
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.1
>Reporter: Patrick Crenshaw
>Assignee: zhengruifeng
>Priority: Major
>
> Currently, this implementation of random forest does not support sample 
> (instance) weights. Weights are important when there is imbalanced training 
> data or the evaluation metric of a classifier is imbalanced (e.g. true 
> positive rate at some false positive threshold).  Sample weights generalize 
> class weights, so this could be used to add class weights later on.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30423) Deprecate UserDefinedAggregateFunction

2020-01-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30423.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27193
[https://github.com/apache/spark/pull/27193]

> Deprecate UserDefinedAggregateFunction
> --
>
> Key: SPARK-30423
> URL: https://issues.apache.org/jira/browse/SPARK-30423
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Erik Erlandson
>Assignee: Erik Erlandson
>Priority: Major
> Fix For: 3.0.0
>
>
> Anticipating the merging of SPARK-27296, the legacy methodology for 
> implementing custom user defined aggregators over untyped DataFrame based on 
> UserDefinedAggregateFunction will be made obsolete. This class should be 
> annotated as deprecated once the new capability is officially merged.
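> For context, a sketch of the Aggregator-based replacement that SPARK-27296 enables 
> over untyped DataFrames. The registration via functions.udaf is assumed from that 
> proposal, the aggregator itself is a trivial placeholder, and an active SparkSession 
> named spark is assumed:
> {code:java}
> import org.apache.spark.sql.{Encoder, Encoders}
> import org.apache.spark.sql.expressions.Aggregator
> import org.apache.spark.sql.functions.udaf
> 
> // A trivial typed aggregator (sum of doubles) used as a placeholder.
> object MySum extends Aggregator[Double, Double, Double] {
>   def zero: Double = 0.0
>   def reduce(b: Double, a: Double): Double = b + a
>   def merge(b1: Double, b2: Double): Double = b1 + b2
>   def finish(r: Double): Double = r
>   def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
>   def outputEncoder: Encoder[Double] = Encoders.scalaDouble
> }
> 
> // Register it for use from untyped DataFrames and SQL.
> spark.udf.register("my_sum", udaf(MySum))
> {code}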



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29544) Optimize skewed join at runtime with new Adaptive Execution

2020-01-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29544.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26434
[https://github.com/apache/spark/pull/26434]

> Optimize skewed join at runtime with new Adaptive Execution
> ---
>
> Key: SPARK-29544
> URL: https://issues.apache.org/jira/browse/SPARK-29544
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Assignee: Ke Jia
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: Skewed Join Optimization Design Doc.docx
>
>
> Implement a rule in the new adaptive execution framework introduced in 
> [SPARK-23128|https://issues.apache.org/jira/browse/SPARK-23128]. This rule is 
> used to handle the skew join optimization based on the runtime statistics 
> (data size and row count).
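> For reference, a minimal sketch of turning this on once merged. The configuration 
> keys are assumed from the adaptive execution settings targeted at 3.0 and are 
> illustrative only (assumes an active SparkSession named spark):
> {code:java}
> // Illustrative only: enable adaptive execution and its skew-join handling.
> spark.conf.set("spark.sql.adaptive.enabled", "true")
> spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
> {code}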



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29544) Optimize skewed join at runtime with new Adaptive Execution

2020-01-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-29544:
---

Assignee: Ke Jia

> Optimize skewed join at runtime with new Adaptive Execution
> ---
>
> Key: SPARK-29544
> URL: https://issues.apache.org/jira/browse/SPARK-29544
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Assignee: Ke Jia
>Priority: Major
> Attachments: Skewed Join Optimization Design Doc.docx
>
>
> Implement a rule in the new adaptive execution framework introduced in 
> [SPARK-23128|https://issues.apache.org/jira/browse/SPARK-23128]. This rule is 
> used to handle the skew join optimization based on the runtime statistics 
> (data size and row count).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30325) markPartitionCompleted cause task status inconsistent

2020-01-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30325:
---

Assignee: haiyangyu

> markPartitionCompleted cause task status inconsistent
> -
>
> Key: SPARK-30325
> URL: https://issues.apache.org/jira/browse/SPARK-30325
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.4
>Reporter: haiyangyu
>Assignee: haiyangyu
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: image-2019-12-21-17-11-38-565.png, 
> image-2019-12-21-17-15-51-512.png, image-2019-12-21-17-16-40-998.png, 
> image-2019-12-21-17-17-42-244.png
>
>
> h3. Corner case
> The bug occurs in the corner case as follows:
>  # A stage hits fetchFailed while some of its tasks haven't finished, so the scheduler 
> resubmits a new stage attempt as a retry with those unfinished tasks.
>  # An unfinished task in the original stage attempt finishes while the same task on the new 
> retry attempt hasn't finished yet; the task's partition is then marked as successful on the new 
> retry attempt.  !image-2019-12-21-17-11-38-565.png|width=427,height=154!
>  # The executor running those 'successful' tasks crashes, which causes the 
> TaskSetManager to run executorLost to reschedule the tasks on that executor. This 
> decrements copiesRunning twice, because those 'successful' tasks are 
> not actually finished, so copiesRunning ends up at -1. 
> !image-2019-12-21-17-15-51-512.png|width=437,height=340!!image-2019-12-21-17-16-40-998.png|width=398,height=139!
>  # 'dequeueTaskFromList' uses copiesRunning == 0 as the basis for rescheduling 
> tasks; since it is now -1, the task can't be rescheduled and the app 
> hangs forever. !image-2019-12-21-17-17-42-244.png|width=366,height=282!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30325) markPartitionCompleted cause task status inconsistent

2020-01-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30325.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26975
[https://github.com/apache/spark/pull/26975]

> markPartitionCompleted cause task status inconsistent
> -
>
> Key: SPARK-30325
> URL: https://issues.apache.org/jira/browse/SPARK-30325
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.4
>Reporter: haiyangyu
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: image-2019-12-21-17-11-38-565.png, 
> image-2019-12-21-17-15-51-512.png, image-2019-12-21-17-16-40-998.png, 
> image-2019-12-21-17-17-42-244.png
>
>
> h3. Corner case
> The bug occurs in the corner case as follows:
>  # A stage hits fetchFailed while some of its tasks haven't finished, so the scheduler 
> resubmits a new stage attempt as a retry with those unfinished tasks.
>  # An unfinished task in the original stage attempt finishes while the same task on the new 
> retry attempt hasn't finished yet; the task's partition is then marked as successful on the new 
> retry attempt.  !image-2019-12-21-17-11-38-565.png|width=427,height=154!
>  # The executor running those 'successful' tasks crashes, which causes the 
> TaskSetManager to run executorLost to reschedule the tasks on that executor. This 
> decrements copiesRunning twice, because those 'successful' tasks are 
> not actually finished, so copiesRunning ends up at -1. 
> !image-2019-12-21-17-15-51-512.png|width=437,height=340!!image-2019-12-21-17-16-40-998.png|width=398,height=139!
>  # 'dequeueTaskFromList' uses copiesRunning == 0 as the basis for rescheduling 
> tasks; since it is now -1, the task can't be rescheduled and the app 
> hangs forever. !image-2019-12-21-17-17-42-244.png|width=366,height=282!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30295) Remove Hive dependencies from SparkSQLCLI

2020-01-14 Thread Javier Fuentes (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015028#comment-17015028
 ] 

Javier Fuentes commented on SPARK-30295:


Yes, this is purely to try to remove more Hive dependencies and reduce the 
footprint, replacing existing Java code with Scala implementations.

> Remove Hive dependencies from SparkSQLCLI
> -
>
> Key: SPARK-30295
> URL: https://issues.apache.org/jira/browse/SPARK-30295
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Javier Fuentes
>Priority: Major
>
> Removal of unnecessary Hive dependencies from the Spark SQL CLI, replacing them 
> with a native Scala implementation of the client driver, argument 
> parser, and SparkSqlCliDriver.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30509) Deprecation log warning is not printed in Avro schema inferring

2020-01-14 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-30509:
--

 Summary: Deprecation log warning is not printed in Avro schema 
inferring
 Key: SPARK-30509
 URL: https://issues.apache.org/jira/browse/SPARK-30509
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


The bug can be reproduced by the test:
{code}
  test("log a warning of ignoreExtension deprecation") {
val logAppender = new LogAppender
withTempPath { dir =>
  Seq(("a", 1, 2), ("b", 1, 2), ("c", 2, 1), ("d", 2, 1))
.toDF("value", "p1", "p2")
.repartition(2)
.write
.format("avro")
.option("header", true)
.save(dir.getCanonicalPath)
  withLogAppender(logAppender) {
spark
  .read
  .format("avro")
  .option(AvroOptions.ignoreExtensionKey, false)
  .option("header", true)
  .load(dir.getCanonicalPath)
  .count()
  }
  val deprecatedEvents = logAppender.loggingEvents
.filter(_.getRenderedMessage.contains(
  s"Option ${AvroOptions.ignoreExtensionKey} is deprecated"))
  assert(deprecatedEvents.size === 1)
}
  }
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30498) Fix some ml parity issues between python and scala

2020-01-14 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reassigned SPARK-30498:


Assignee: Huaxin Gao

> Fix some ml parity issues between python and scala
> --
>
> Key: SPARK-30498
> URL: https://issues.apache.org/jira/browse/SPARK-30498
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
>
> Remove the setters in CrossValidatorModel and TrainValidationSplitModel in 
> tuning.py and use _set to set the params. 
> Add setInputCol/setOutputCol in the Python ImputerModel. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30498) Fix some ml parity issues between python and scala

2020-01-14 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-30498.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27196
[https://github.com/apache/spark/pull/27196]

> Fix some ml parity issues between python and scala
> --
>
> Key: SPARK-30498
> URL: https://issues.apache.org/jira/browse/SPARK-30498
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.0.0
>
>
> Remove the setters in CrossValidatorModel and TrainValidationSplitModel in 
> tuning.py and use _set to set the params. 
> Add setInputCol/setOutputCol in the Python ImputerModel. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30292) Throw Exception when invalid string is cast to decimal in ANSI mode

2020-01-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30292.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26933
[https://github.com/apache/spark/pull/26933]

> Throw Exception when invalid string is cast to decimal in ANSI mode
> ---
>
> Key: SPARK-30292
> URL: https://issues.apache.org/jira/browse/SPARK-30292
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Rakesh Raushan
>Assignee: Rakesh Raushan
>Priority: Minor
> Fix For: 3.0.0
>
>
> When spark.sql.ansi.enabled is set and we run select cast('str' as decimal), 
> spark-sql outputs NULL. 
> The ANSI SQL standard requires throwing an exception when invalid strings are 
> cast to numbers.
>  
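> To make the expected behavior concrete, a small illustrative snippet (assumes an 
> active SparkSession named spark; the exact error type to be thrown is up to the fix):
> {code:java}
> spark.conf.set("spark.sql.ansi.enabled", "true")
> // Today this returns NULL; under ANSI semantics it should raise an error instead.
> spark.sql("SELECT CAST('str' AS DECIMAL)").show()
> {code}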



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30292) Throw Exception when invalid string is cast to decimal in ANSI mode

2020-01-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30292:
---

Assignee: Rakesh Raushan

> Throw Exception when invalid string is cast to decimal in ANSI mode
> ---
>
> Key: SPARK-30292
> URL: https://issues.apache.org/jira/browse/SPARK-30292
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Rakesh Raushan
>Assignee: Rakesh Raushan
>Priority: Minor
>
> When spark.sql.ansi.enabled is set and we run select cast('str' as decimal), 
> spark-sql outputs NULL. 
> The ANSI SQL standard requires throwing an exception when invalid strings are 
> cast to numbers.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30508) Add DataFrameReader.executeCommand API for external datasource

2020-01-14 Thread wuyi (Jira)
wuyi created SPARK-30508:


 Summary: Add DataFrameReader.executeCommand API for external 
datasource
 Key: SPARK-30508
 URL: https://issues.apache.org/jira/browse/SPARK-30508
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: wuyi


Add a DataFrameReader.executeCommand API for external data sources, in order to 
enable external data sources to execute custom DDL/DML commands.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28242) DataStreamer keeps logging errors even after fixing writeStream output sink

2020-01-14 Thread Hyokun Park (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014919#comment-17014919
 ] 

Hyokun Park commented on SPARK-28242:
-

Hi [~mcanes]

In my case, I resolved the problem by adding a configuration.

Please add the configuration "--conf 
spark.hadoop.dfs.client.block.write.replace-datanode-on-failure.enable=false" 
to your spark-submit command.
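The same setting can also be expressed programmatically when building the session; 
this is just a sketch of where the property goes (whether it fully resolves the 
logging loop described below is not confirmed here):
{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.hadoop.dfs.client.block.write.replace-datanode-on-failure.enable", "false")
  .getOrCreate()
{code}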

> DataStreamer keeps logging errors even after fixing writeStream output sink
> ---
>
> Key: SPARK-28242
> URL: https://issues.apache.org/jira/browse/SPARK-28242
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
> Environment: Hadoop 2.8.4
>  
>Reporter: Miquel Canes
>Priority: Minor
>
> I have been testing what happens to a running structured streaming query that is 
> writing to HDFS when all datanodes are down/stopped or the whole cluster is down 
> (including the namenode).
> So I created a structured stream from Kafka to a file output sink on HDFS and 
> tested some scenarios.
> We used a very simple streaming query:
> {code:java}
> spark.readStream()
> .format("kafka")
> .option("kafka.bootstrap.servers", "kafka.server:9092...")
> .option("subscribe", "test_topic")
> .load()
> .select(col("value").cast(DataTypes.StringType))
> .writeStream()
> .format("text")
> .option("path", "HDFS/PATH")
> .option("checkpointLocation", "checkpointPath")
> .start()
> .awaitTermination();{code}
>  
> After stopping all the datanodes, the process starts logging the error that the 
> datanodes are bad.
> That's expected...
> {code:java}
> 2019-07-03 15:55:00 [spark-listener-group-eventLog] ERROR 
> org.apache.spark.scheduler.AsyncEventQueue:91 - Listener EventLoggingListener 
> threw an exception java.io.IOException: All datanodes 
> [DatanodeInfoWithStorage[10.2.12.202:50010,DS-d2fba01b-28eb-4fe4-baaa-4072102a2172,DISK]]
>  are bad. Aborting... at 
> org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1530) 
> at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1465)
>  at 
> org.apache.hadoop.hdfs.DataStreamer.processDatanodeError(DataStreamer.java:1237)
>  at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:657)
> {code}
> The problem is that even after starting the datanodes again, the process keeps 
> logging the same error all the time.
> We checked, and the writeStream to HDFS recovered successfully after starting 
> the datanodes; the output sink worked again without problems.
> I have been trying some different HDFS configurations to be sure it's not a 
> client-config-related problem, but have no clue about how to fix it.
> It seems that something is stuck indefinitely in an error loop.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org