[jira] [Commented] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures
[ https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015705#comment-17015705 ] Reynold Xin commented on SPARK-22231: - Hey sorry. Been pretty busy. I will take a look this week. > Support of map, filter, withColumn, dropColumn in nested list of structures > --- > > Key: SPARK-22231 > URL: https://issues.apache.org/jira/browse/SPARK-22231 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.0 >Reporter: DB Tsai >Assignee: Jeremy Smith >Priority: Major > > At Netflix's algorithm team, we work on ranking problems to find the great > content to fulfill the unique tastes of our members. Before building a > recommendation algorithms, we need to prepare the training, testing, and > validation datasets in Apache Spark. Due to the nature of ranking problems, > we have a nested list of items to be ranked in one column, and the top level > is the contexts describing the setting for where a model is to be used (e.g. > profiles, country, time, device, etc.) Here is a blog post describing the > details, [Distributed Time Travel for Feature > Generation|https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907]. > > To be more concrete, for the ranks of videos for a given profile_id at a > given country, our data schema can be looked like this, > {code:java} > root > |-- profile_id: long (nullable = true) > |-- country_iso_code: string (nullable = true) > |-- items: array (nullable = false) > ||-- element: struct (containsNull = false) > |||-- title_id: integer (nullable = true) > |||-- scores: double (nullable = true) > ... > {code} > We oftentimes need to work on the nested list of structs by applying some > functions on them. Sometimes, we're dropping or adding new columns in the > nested list of structs. Currently, there is no easy solution in open source > Apache Spark to perform those operations using SQL primitives; many people > just convert the data into RDD to work on the nested level of data, and then > reconstruct the new dataframe as workaround. This is extremely inefficient > because all the optimizations like predicate pushdown in SQL can not be > performed, we can not leverage on the columnar format, and the serialization > and deserialization cost becomes really huge even we just want to add a new > column in the nested level. > We built a solution internally at Netflix which we're very happy with. We > plan to make it open source in Spark upstream. We would like to socialize the > API design to see if we miss any use-case. > The first API we added is *mapItems* on dataframe which take a function from > *Column* to *Column*, and then apply the function on nested dataframe. Here > is an example, > {code:java} > case class Data(foo: Int, bar: Double, items: Seq[Double]) > val df: Dataset[Data] = spark.createDataset(Seq( > Data(10, 10.0, Seq(10.1, 10.2, 10.3, 10.4)), > Data(20, 20.0, Seq(20.1, 20.2, 20.3, 20.4)) > )) > val result = df.mapItems("items") { > item => item * 2.0 > } > result.printSchema() > // root > // |-- foo: integer (nullable = false) > // |-- bar: double (nullable = false) > // |-- items: array (nullable = true) > // ||-- element: double (containsNull = true) > result.show() > // +---+++ > // |foo| bar| items| > // +---+++ > // | 10|10.0|[20.2, 20.4, 20.6...| > // | 20|20.0|[40.2, 40.4, 40.6...| > // +---+++ > {code} > Now, with the ability of applying a function in the nested dataframe, we can > add a new function, *withColumn* in *Column* to add or replace the existing > column that has the same name in the nested list of struct. Here is two > examples demonstrating the API together with *mapItems*; the first one > replaces the existing column, > {code:java} > case class Item(a: Int, b: Double) > case class Data(foo: Int, bar: Double, items: Seq[Item]) > val df: Dataset[Data] = spark.createDataset(Seq( > Data(10, 10.0, Seq(Item(10, 10.0), Item(11, 11.0))), > Data(20, 20.0, Seq(Item(20, 20.0), Item(21, 21.0))) > )) > val result = df.mapItems("items") { > item => item.withColumn(item("b") + 1 as "b") > } > result.printSchema > root > // |-- foo: integer (nullable = false) > // |-- bar: double (nullable = false) > // |-- items: array (nullable = true) > // ||-- element: struct (containsNull = true) > // |||-- a: integer (nullable = true) > // |||-- b: double (nullable = true) > result.show(false) > // +---++--+ > // |foo|bar |items | > // +---++--+ > // |10 |10.0|[[10,11.0], [11,12.0]]| > // |20
[jira] [Created] (SPARK-30517) Support SHOW TABLES EXTENDED
Ajith S created SPARK-30517: --- Summary: Support SHOW TABLES EXTENDED Key: SPARK-30517 URL: https://issues.apache.org/jira/browse/SPARK-30517 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.0.0 Reporter: Ajith S {{Intention is to support show tables with a additional column 'type' where type can be MANAGED,EXTERNAL,VIEW using which user can query only tables of required types, like listing only views or only external tables (using a 'where' clause over 'type' column).}} {{Usecase example:}} {{Currently its not possible to list all the VIEWS, but other technologies like hive support it using 'SHOW VIEWS', mysql supports it using a more complex query 'SHOW FULL TABLES WHERE table_type = 'VIEW';'}} Decide to take mysql approach as it provides more flexibility for querying. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30505) Deprecate Avro option `ignoreExtension` in a doc
[ https://issues.apache.org/jira/browse/SPARK-30505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30505. -- Fix Version/s: 3.0.0 Assignee: Maxim Gekk Resolution: Fixed Fixed in [https://github.com/apache/spark/pull/27194] > Deprecate Avro option `ignoreExtension` in a doc > > > Key: SPARK-30505 > URL: https://issues.apache.org/jira/browse/SPARK-30505 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > Update docs/sql-data-sources-avro.md and a sentence about deprecation of the > Avro option: ignoreExtension -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22184) GraphX fails in case of insufficient memory and checkpoints enabled
[ https://issues.apache.org/jira/browse/SPARK-22184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015693#comment-17015693 ] Takeshi Yamamuro commented on SPARK-22184: -- See https://github.com/apache/spark/pull/19410#issuecomment-574531717 > GraphX fails in case of insufficient memory and checkpoints enabled > --- > > Key: SPARK-22184 > URL: https://issues.apache.org/jira/browse/SPARK-22184 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 2.2.0 > Environment: spark 2.2.0 > scala 2.11 >Reporter: Sergey Zhemzhitsky >Priority: Major > > GraphX fails with FileNotFoundException in case of insufficient memory when > checkpoints are enabled. > Here is the stacktrace > {code} > Job aborted due to stage failure: Task creation failed: > java.io.FileNotFoundException: File > file:/tmp/spark-90119695-a126-47b5-b047-d656fee10c17/9b16e2a9-6c80-45eb-8736-bbb6eb840146/rdd-28/part-0 > does not exist > java.io.FileNotFoundException: File > file:/tmp/spark-90119695-a126-47b5-b047-d656fee10c17/9b16e2a9-6c80-45eb-8736-bbb6eb840146/rdd-28/part-0 > does not exist > at > org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:539) > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:752) > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:529) > at > org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409) > at > org.apache.spark.rdd.ReliableCheckpointRDD.getPreferredLocations(ReliableCheckpointRDD.scala:89) > at > org.apache.spark.rdd.RDD$$anonfun$preferredLocations$1.apply(RDD.scala:274) > at > org.apache.spark.rdd.RDD$$anonfun$preferredLocations$1.apply(RDD.scala:274) > at scala.Option.map(Option.scala:146) > at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:274) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1697) > ... > {code} > As GraphX uses cached RDDs intensively, the issue is only reproducible when > previously cached and checkpointed Vertex and Edge RDDs are evicted from > memory and forced to be read from disk. > For testing purposes the following parameters may be set to emulate low > memory environment > {code} > val sparkConf = new SparkConf() > .set("spark.graphx.pregel.checkpointInterval", "2") > // set testing memory to evict cached RDDs from it and force > // reading checkpointed RDDs from disk > .set("spark.testing.reservedMemory", "128") > .set("spark.testing.memory", "256") > {code} > This issue also includes SPARK-22150 and cannot be fixed until SPARK-22150 is > fixed too. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22184) GraphX fails in case of insufficient memory and checkpoints enabled
[ https://issues.apache.org/jira/browse/SPARK-22184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-22184. -- Resolution: Won't Fix > GraphX fails in case of insufficient memory and checkpoints enabled > --- > > Key: SPARK-22184 > URL: https://issues.apache.org/jira/browse/SPARK-22184 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 2.2.0 > Environment: spark 2.2.0 > scala 2.11 >Reporter: Sergey Zhemzhitsky >Priority: Major > > GraphX fails with FileNotFoundException in case of insufficient memory when > checkpoints are enabled. > Here is the stacktrace > {code} > Job aborted due to stage failure: Task creation failed: > java.io.FileNotFoundException: File > file:/tmp/spark-90119695-a126-47b5-b047-d656fee10c17/9b16e2a9-6c80-45eb-8736-bbb6eb840146/rdd-28/part-0 > does not exist > java.io.FileNotFoundException: File > file:/tmp/spark-90119695-a126-47b5-b047-d656fee10c17/9b16e2a9-6c80-45eb-8736-bbb6eb840146/rdd-28/part-0 > does not exist > at > org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:539) > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:752) > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:529) > at > org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409) > at > org.apache.spark.rdd.ReliableCheckpointRDD.getPreferredLocations(ReliableCheckpointRDD.scala:89) > at > org.apache.spark.rdd.RDD$$anonfun$preferredLocations$1.apply(RDD.scala:274) > at > org.apache.spark.rdd.RDD$$anonfun$preferredLocations$1.apply(RDD.scala:274) > at scala.Option.map(Option.scala:146) > at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:274) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1697) > ... > {code} > As GraphX uses cached RDDs intensively, the issue is only reproducible when > previously cached and checkpointed Vertex and Edge RDDs are evicted from > memory and forced to be read from disk. > For testing purposes the following parameters may be set to emulate low > memory environment > {code} > val sparkConf = new SparkConf() > .set("spark.graphx.pregel.checkpointInterval", "2") > // set testing memory to evict cached RDDs from it and force > // reading checkpointed RDDs from disk > .set("spark.testing.reservedMemory", "128") > .set("spark.testing.memory", "256") > {code} > This issue also includes SPARK-22150 and cannot be fixed until SPARK-22150 is > fixed too. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30516) statistic estimation of FileScan should take partitionFilters and partition number into account
[ https://issues.apache.org/jira/browse/SPARK-30516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hu Fuwang updated SPARK-30516: -- Summary: statistic estimation of FileScan should take partitionFilters and partition number into account (was: FileScan.estimateStatistics does not take partitionFilters and partition number into account) > statistic estimation of FileScan should take partitionFilters and partition > number into account > --- > > Key: SPARK-30516 > URL: https://issues.apache.org/jira/browse/SPARK-30516 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Hu Fuwang >Priority: Major > > Currently, FileScan.estimateStatistics does not take partitionFilters and > partition number into account, which may lead to bigger sizeInBytes. It > should be reasonable to change it to involve partitionFilters and partition > number when estimating the statistics. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30516) FileScan.estimateStatistics does not take partitionFilters and partition number into account
[ https://issues.apache.org/jira/browse/SPARK-30516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hu Fuwang updated SPARK-30516: -- Description: Currently, FileScan.estimateStatistics does not take partitionFilters and partition number into account, which may lead to bigger sizeInBytes. It should be reasonable to change it to involve partitionFilters and partition number when estimating the statistics. (was: Currently, FileScan.estimateStatistics will not take partitionFilters into account, which may lead to bigger sizeInBytes. It should be reasonable to change it to involve partitionFilters and partition numbers when estimating the statistics.) > FileScan.estimateStatistics does not take partitionFilters and partition > number into account > > > Key: SPARK-30516 > URL: https://issues.apache.org/jira/browse/SPARK-30516 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Hu Fuwang >Priority: Major > > Currently, FileScan.estimateStatistics does not take partitionFilters and > partition number into account, which may lead to bigger sizeInBytes. It > should be reasonable to change it to involve partitionFilters and partition > number when estimating the statistics. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30516) FileScan.estimateStatistics does not take partitionFilters and partition number into account
[ https://issues.apache.org/jira/browse/SPARK-30516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hu Fuwang updated SPARK-30516: -- Description: Currently, FileScan.estimateStatistics will not take partitionFilters into account, which may lead to bigger sizeInBytes. It should be reasonable to change it to involve partitionFilters and partition numbers when estimating the statistics. (was: Currently, FileScan.estimateStatistics will not take partitionFilters into account, which may lead to bigger sizeInBytes. It should be reasonable to change it to involve partitionFilters and partition numbers when estimating the statistics.) > FileScan.estimateStatistics does not take partitionFilters and partition > number into account > > > Key: SPARK-30516 > URL: https://issues.apache.org/jira/browse/SPARK-30516 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Hu Fuwang >Priority: Major > > Currently, FileScan.estimateStatistics will not take partitionFilters into > account, which may lead to bigger sizeInBytes. It should be reasonable to > change it to involve partitionFilters and partition numbers when estimating > the statistics. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30516) FileScan.estimateStatistics does not take partitionFilters and partition number into account
Hu Fuwang created SPARK-30516: - Summary: FileScan.estimateStatistics does not take partitionFilters and partition number into account Key: SPARK-30516 URL: https://issues.apache.org/jira/browse/SPARK-30516 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Hu Fuwang Currently, FileScan.estimateStatistics will not take partitionFilters into account, which may lead to bigger sizeInBytes. It should be reasonable to change it to involve partitionFilters and partition numbers when estimating the statistics. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30020) ownerName and ownerType support as properties to tables
[ https://issues.apache.org/jira/browse/SPARK-30020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-30020. -- Fix Version/s: 3.0.0 Target Version/s: 3.0.0 Resolution: Not A Problem > ownerName and ownerType support as properties to tables > --- > > Key: SPARK-30020 > URL: https://issues.apache.org/jira/browse/SPARK-30020 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > Store the ownerName and ownerType in properties of tables not as field > members to achieve future-proof for catalog APIs. And those properties should > be reversed as internal ones for secure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30515) Refactor SimplifyBinaryComparison to reduce the time complexity
[ https://issues.apache.org/jira/browse/SPARK-30515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-30515: --- Summary: Refactor SimplifyBinaryComparison to reduce the time complexity (was: Refactor SimplifyBinaryComparison to reduce time complexity) > Refactor SimplifyBinaryComparison to reduce the time complexity > --- > > Key: SPARK-30515 > URL: https://issues.apache.org/jira/browse/SPARK-30515 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Minor > > The improvement of the rule SimplifyBinaryComparison in PR > https://github.com/apache/spark/pull/27008 could bring performance regression > in the optimizer. > We need to improve the implementation and reduce the time complexity. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30515) Refactor SimplifyBinaryComparison to reduce time complexity
Gengliang Wang created SPARK-30515: -- Summary: Refactor SimplifyBinaryComparison to reduce time complexity Key: SPARK-30515 URL: https://issues.apache.org/jira/browse/SPARK-30515 Project: Spark Issue Type: Documentation Components: SQL Affects Versions: 3.0.0 Reporter: Gengliang Wang Assignee: Gengliang Wang The improvement of the rule SimplifyBinaryComparison in PR https://github.com/apache/spark/pull/27008 could bring performance regression in the optimizer. We need to improve the implementation and reduce the time complexity. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22783) event log directory(spark-history) filled by large .inprogress files for spark streaming applications
[ https://issues.apache.org/jira/browse/SPARK-22783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-22783. -- Resolution: Duplicate I'll mark this as "duplicated" as SPARK-28594 is making over half of progress. > event log directory(spark-history) filled by large .inprogress files for > spark streaming applications > - > > Key: SPARK-22783 > URL: https://issues.apache.org/jira/browse/SPARK-22783 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0, 2.1.0 > Environment: Linux(Generic) >Reporter: omkar kankalapati >Priority: Major > > When running long running streaming applications, the HDFS storage gets > filled up with large *.inprogress files in hdfs://spark-history/ directory > For example: > hadoop fs -du -h /spark-history > 234 /spark-history/.inprogress > 46.6 G /spark-history/.inprogress > Instead of continuing to write to a very large (multi GB) .inprogress file, > Spark should instead rotate the current log file when it reaches a size (for > example: 100 MB) or interval > and perhaps expose a configuration parameter for the size/interval. > This is also mentioned in SPARK-12140 as a concern. > It is very important and useful to support rotating the log files because > users may have limited HDFS quota and these large files consume the available > limited quota. > Also the users do not have a viable workaround > 1) Can not move the files to an another location because the moving the file > causes the event logging to stop > 2) Trying to copy the .inprogress file to another location and truncate the > .inprogress file fails because the file is still opened by > EventLoggingListener for writing > hdfs dfs -truncate -w 0 /spark-history/.inprogress > truncate: Failed to TRUNCATE_FILE /spark-history/.inprogress > for DFSClient_NONMAPREDUCE_<#ID>on because this file lease is currently > owned by DFSClient_NONMAPREDUCE_<#ID> on > The only workaround available is to disable the event logging for streaming > applications by setting "spark.eventLog.enabled" to false -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30514) add ENV_PYSPARK_MAJOR_PYTHON_VERSION support for JavaMainAppResource
Jackey Lee created SPARK-30514: -- Summary: add ENV_PYSPARK_MAJOR_PYTHON_VERSION support for JavaMainAppResource Key: SPARK-30514 URL: https://issues.apache.org/jira/browse/SPARK-30514 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.0.0, 3.1.0 Reporter: Jackey Lee In apache Livy, the program is first started with JavaMainAppResource, and then start the Python worker. At this time, the program needs to be able to pass Python environment variables. In spark on yarn, we support it through spark.yarn.isPython. In k8s, we can support this in a better way. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30513) Question about spark on k8s
Jackey Lee created SPARK-30513: -- Summary: Question about spark on k8s Key: SPARK-30513 URL: https://issues.apache.org/jira/browse/SPARK-30513 Project: Spark Issue Type: Question Components: Kubernetes Affects Versions: 3.0.0 Reporter: Jackey Lee My question is, why we wrote the domain name of Kube-DNS in the code? Isn't it better to read domain name from the service, or just use the hostname? In our scenario, we run spark on Kata-like containers, and found the code had written the Kube-DNS domain. If Kube-DNS is not configured in environment, tasks would run failed. Besides, kube-dns is just a plugin for k8s, not a required component for k8s. We can use better DNS services without necessarily using this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30510) Document spark.sql.sources.partitionOverwriteMode
[ https://issues.apache.org/jira/browse/SPARK-30510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015501#comment-17015501 ] Hyukjin Kwon commented on SPARK-30510: -- I think currently only some of important configurations are documented in SQL guide, e.g. [https://spark.apache.org/docs/latest/sql-performance-tuning.html]. I was thinking about automatically generating a configuration page from SQLConf like Spark SQL built-in function docs; however, it might need some discussions and investigations. > Document spark.sql.sources.partitionOverwriteMode > - > > Key: SPARK-30510 > URL: https://issues.apache.org/jira/browse/SPARK-30510 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 2.4.4 >Reporter: Nicholas Chammas >Priority: Minor > > SPARK-20236 added a new option, {{spark.sql.sources.partitionOverwriteMode}}, > but it doesn't appear to be documented in [the expected > place|http://spark.apache.org/docs/2.4.4/configuration.html]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30510) Document spark.sql.sources.partitionOverwriteMode
[ https://issues.apache.org/jira/browse/SPARK-30510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015473#comment-17015473 ] Nicholas Chammas commented on SPARK-30510: -- [~hyukjin.kwon] I think I'm missing something here because it seems that none of the {{spark.sql.*}} options in [SQLConf.scala|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala] are on the configurations page. Are they published somewhere else? > Document spark.sql.sources.partitionOverwriteMode > - > > Key: SPARK-30510 > URL: https://issues.apache.org/jira/browse/SPARK-30510 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 2.4.4 >Reporter: Nicholas Chammas >Priority: Minor > > SPARK-20236 added a new option, {{spark.sql.sources.partitionOverwriteMode}}, > but it doesn't appear to be documented in [the expected > place|http://spark.apache.org/docs/2.4.4/configuration.html]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29721) Spark SQL reads unnecessary nested fields from Parquet after using explode
[ https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29721: -- Affects Version/s: 3.0.0 > Spark SQL reads unnecessary nested fields from Parquet after using explode > -- > > Key: SPARK-29721 > URL: https://issues.apache.org/jira/browse/SPARK-29721 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 3.0.0 >Reporter: Kai Kang >Priority: Major > > This is a follow up for SPARK-4502. SPARK-4502 correctly addressed column > pruning for nested structures. However, when explode() is called on a nested > field, all columns for that nested structure is still fetched from data > source. > We are working on a project to create a parquet store for a big pre-joined > table between two tables that has one-to-many relationship, and this is a > blocking issue for us. > > The following code illustrates the issue. > Part 1: loading some nested data > {noformat} > val jsonStr = """{ > "items": [ >{"itemId": 1, "itemData": "a"}, >{"itemId": 2, "itemData": "b"} > ] > }""" > val df = spark.read.json(Seq(jsonStr).toDS) > df.write.format("parquet").mode("overwrite").saveAsTable("persisted") > {noformat} > > Part 2: reading it back and explaining the queries > {noformat} > val read = spark.table("persisted") > spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true) > // pruned, only loading itemId > // ReadSchema: struct>> > read.select($"items.itemId").explain(true) > // not pruned, loading both itemId > // ReadSchema: struct>> > read.select(explode($"items.itemId")).explain(true) and itemData > {noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29721) Spark SQL reads unnecessary nested fields from Parquet after using explode
[ https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29721: -- Affects Version/s: 2.4.0 2.4.1 2.4.2 2.4.3 > Spark SQL reads unnecessary nested fields from Parquet after using explode > -- > > Key: SPARK-29721 > URL: https://issues.apache.org/jira/browse/SPARK-29721 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4 >Reporter: Kai Kang >Priority: Major > > This is a follow up for SPARK-4502. SPARK-4502 correctly addressed column > pruning for nested structures. However, when explode() is called on a nested > field, all columns for that nested structure is still fetched from data > source. > We are working on a project to create a parquet store for a big pre-joined > table between two tables that has one-to-many relationship, and this is a > blocking issue for us. > > The following code illustrates the issue. > Part 1: loading some nested data > {noformat} > val jsonStr = """{ > "items": [ >{"itemId": 1, "itemData": "a"}, >{"itemId": 2, "itemData": "b"} > ] > }""" > val df = spark.read.json(Seq(jsonStr).toDS) > df.write.format("parquet").mode("overwrite").saveAsTable("persisted") > {noformat} > > Part 2: reading it back and explaining the queries > {noformat} > val read = spark.table("persisted") > spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true) > // pruned, only loading itemId > // ReadSchema: struct>> > read.select($"items.itemId").explain(true) > // not pruned, loading both itemId > // ReadSchema: struct>> > read.select(explode($"items.itemId")).explain(true) and itemData > {noformat} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30512) Use a dedicated boss event group loop in the netty pipeline for external shuffle service
[ https://issues.apache.org/jira/browse/SPARK-30512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015362#comment-17015362 ] Chandni Singh commented on SPARK-30512: --- Please assign the issue to me so I can open up a PR. > Use a dedicated boss event group loop in the netty pipeline for external > shuffle service > > > Key: SPARK-30512 > URL: https://issues.apache.org/jira/browse/SPARK-30512 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.0.0 >Reporter: Chandni Singh >Priority: Major > > We have been seeing a large number of SASL authentication (RPC requests) > timing out with the external shuffle service. > The issue and all the analysis we did is described here: > [https://github.com/netty/netty/issues/9890] > I added a {{LoggingHandler}} to netty pipeline and realized that even the > channel registration is delayed by 30 seconds. > In the Spark External Shuffle service, the boss event group and the worker > event group are same which is causing this delay. > {code:java} > EventLoopGroup bossGroup = > NettyUtils.createEventLoop(ioMode, conf.serverThreads(), > conf.getModuleName() + "-server"); > EventLoopGroup workerGroup = bossGroup; > bootstrap = new ServerBootstrap() > .group(bossGroup, workerGroup) > .channel(NettyUtils.getServerChannelClass(ioMode)) > .option(ChannelOption.ALLOCATOR, allocator) > .childOption(ChannelOption.ALLOCATOR, allocator); > {code} > When the load at the shuffle service increases, since the worker threads are > busy with existing channels, registering new channels gets delayed. > The fix is simple. I created a dedicated boss thread event loop group with 1 > thread. > {code:java} > EventLoopGroup bossGroup = NettyUtils.createEventLoop(ioMode, 1, > conf.getModuleName() + "-boss"); > EventLoopGroup workerGroup = NettyUtils.createEventLoop(ioMode, > conf.serverThreads(), > conf.getModuleName() + "-server"); > bootstrap = new ServerBootstrap() > .group(bossGroup, workerGroup) > .channel(NettyUtils.getServerChannelClass(ioMode)) > .option(ChannelOption.ALLOCATOR, allocator) > {code} > This fixed the issue. > We just need 1 thread in the boss group because there is only a single > server bootstrap. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30512) Use a dedicated boss event group loop in the netty pipeline for external shuffle service
[ https://issues.apache.org/jira/browse/SPARK-30512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh updated SPARK-30512: -- Description: We have been seeing a large number of SASL authentication (RPC requests) timing out with the external shuffle service. The issue and all the analysis we did is described here: [https://github.com/netty/netty/issues/9890] I added a {{LoggingHandler}} to netty pipeline and realized that even the channel registration is delayed by 30 seconds. In the Spark External Shuffle service, the boss event group and the worker event group are same which is causing this delay. {code:java} EventLoopGroup bossGroup = NettyUtils.createEventLoop(ioMode, conf.serverThreads(), conf.getModuleName() + "-server"); EventLoopGroup workerGroup = bossGroup; bootstrap = new ServerBootstrap() .group(bossGroup, workerGroup) .channel(NettyUtils.getServerChannelClass(ioMode)) .option(ChannelOption.ALLOCATOR, allocator) .childOption(ChannelOption.ALLOCATOR, allocator); {code} When the load at the shuffle service increases, since the worker threads are busy with existing channels, registering new channels gets delayed. The fix is simple. I created a dedicated boss thread event loop group with 1 thread. {code:java} EventLoopGroup bossGroup = NettyUtils.createEventLoop(ioMode, 1, conf.getModuleName() + "-boss"); EventLoopGroup workerGroup = NettyUtils.createEventLoop(ioMode, conf.serverThreads(), conf.getModuleName() + "-server"); bootstrap = new ServerBootstrap() .group(bossGroup, workerGroup) .channel(NettyUtils.getServerChannelClass(ioMode)) .option(ChannelOption.ALLOCATOR, allocator) {code} This fixed the issue. We just need 1 thread in the boss group because there is only a single server bootstrap. was: We have been seeing a large number of SASL authentication (RPC requests) timing out with the external shuffle service. The issue and all the analysis we did is described here: [https://github.com/netty/netty/issues/9890] I added a {{LoggingHandler}} to netty pipeline and realized that even the channel registration is delayed by 30 seconds. In the Spark External Shuffle service, the boss event group and the worker event group are same which is causing this delay. {code:java} EventLoopGroup bossGroup = NettyUtils.createEventLoop(ioMode, conf.serverThreads(), conf.getModuleName() + "-server"); EventLoopGroup workerGroup = bossGroup; bootstrap = new ServerBootstrap() .group(bossGroup, workerGroup) .channel(NettyUtils.getServerChannelClass(ioMode)) .option(ChannelOption.ALLOCATOR, allocator) .childOption(ChannelOption.ALLOCATOR, allocator); {code} When the load at the shuffle service increases, since the worker threads are busy with existing channels, registering new channels gets delayed. The fix is simple. I created a dedicated boss thread event loop group with 1 thread. {code:java} EventLoopGroup bossGroup = NettyUtils.createEventLoop(ioMode, 1, conf.getModuleName() + "-boss"); EventLoopGroup workerGroup = NettyUtils.createEventLoop(ioMode, conf.serverThreads(), conf.getModuleName() + "-server"); bootstrap = new ServerBootstrap() .group(bossGroup, workerGroup) .channel(NettyUtils.getServerChannelClass(ioMode)) .option(ChannelOption.ALLOCATOR, allocator) {code} This fixed the issue. We just need 1 thread in the boss group because there is only a single server bootstrap. Please assign the issue to me so I can open up a PR. > Use a dedicated boss event group loop in the netty pipeline for external > shuffle service > > > Key: SPARK-30512 > URL: https://issues.apache.org/jira/browse/SPARK-30512 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.0.0 >Reporter: Chandni Singh >Priority: Major > > We have been seeing a large number of SASL authentication (RPC requests) > timing out with the external shuffle service. > The issue and all the analysis we did is described here: > [https://github.com/netty/netty/issues/9890] > I added a {{LoggingHandler}} to netty pipeline and realized that even the > channel registration is delayed by 30 seconds. > In the Spark External Shuffle service, the boss event group and the worker > event group are same which is causing this delay. > {code:java} > EventLoopGroup bossGroup = > NettyUtils.createEventLoop(ioMode, conf.serverThreads(), > conf.getModuleName() + "-server"); > EventLoopGroup workerGroup = bossGroup; > bootstrap = new ServerBootstrap() > .group(bossGroup, workerGroup) > .channel(NettyUtils.getServerChannelClass(ioMode)) >
[jira] [Created] (SPARK-30512) Use a dedicated boss event group loop in the netty pipeline for external shuffle service
Chandni Singh created SPARK-30512: - Summary: Use a dedicated boss event group loop in the netty pipeline for external shuffle service Key: SPARK-30512 URL: https://issues.apache.org/jira/browse/SPARK-30512 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 3.0.0 Reporter: Chandni Singh We have been seeing a large number of SASL authentication (RPC requests) timing out with the external shuffle service. The issue and all the analysis we did is described here: [https://github.com/netty/netty/issues/9890] I added a {{LoggingHandler}} to netty pipeline and realized that even the channel registration is delayed by 30 seconds. In the Spark External Shuffle service, the boss event group and the worker event group are same which is causing this delay. {code:java} EventLoopGroup bossGroup = NettyUtils.createEventLoop(ioMode, conf.serverThreads(), conf.getModuleName() + "-server"); EventLoopGroup workerGroup = bossGroup; bootstrap = new ServerBootstrap() .group(bossGroup, workerGroup) .channel(NettyUtils.getServerChannelClass(ioMode)) .option(ChannelOption.ALLOCATOR, allocator) .childOption(ChannelOption.ALLOCATOR, allocator); {code} When the load at the shuffle service increases, since the worker threads are busy with existing channels, registering new channels gets delayed. The fix is simple. I created a dedicated boss thread event loop group with 1 thread. {code:java} EventLoopGroup bossGroup = NettyUtils.createEventLoop(ioMode, 1, conf.getModuleName() + "-boss"); EventLoopGroup workerGroup = NettyUtils.createEventLoop(ioMode, conf.serverThreads(), conf.getModuleName() + "-server"); bootstrap = new ServerBootstrap() .group(bossGroup, workerGroup) .channel(NettyUtils.getServerChannelClass(ioMode)) .option(ChannelOption.ALLOCATOR, allocator) {code} This fixed the issue. We just need 1 thread in the boss group because there is only a single server bootstrap. Please assign the issue to me so I can open up a PR. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30509) Deprecation log warning is not printed in Avro schema inferring
[ https://issues.apache.org/jira/browse/SPARK-30509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30509. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27200 [https://github.com/apache/spark/pull/27200] > Deprecation log warning is not printed in Avro schema inferring > --- > > Key: SPARK-30509 > URL: https://issues.apache.org/jira/browse/SPARK-30509 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > The bug can be reproduced by the test: > {code} > test("log a warning of ignoreExtension deprecation") { > val logAppender = new LogAppender > withTempPath { dir => > Seq(("a", 1, 2), ("b", 1, 2), ("c", 2, 1), ("d", 2, 1)) > .toDF("value", "p1", "p2") > .repartition(2) > .write > .format("avro") > .option("header", true) > .save(dir.getCanonicalPath) > withLogAppender(logAppender) { > spark > .read > .format("avro") > .option(AvroOptions.ignoreExtensionKey, false) > .option("header", true) > .load(dir.getCanonicalPath) > .count() > } > val deprecatedEvents = logAppender.loggingEvents > .filter(_.getRenderedMessage.contains( > s"Option ${AvroOptions.ignoreExtensionKey} is deprecated")) > assert(deprecatedEvents.size === 1) > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30509) Deprecation log warning is not printed in Avro schema inferring
[ https://issues.apache.org/jira/browse/SPARK-30509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-30509: - Assignee: Maxim Gekk > Deprecation log warning is not printed in Avro schema inferring > --- > > Key: SPARK-30509 > URL: https://issues.apache.org/jira/browse/SPARK-30509 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > > The bug can be reproduced by the test: > {code} > test("log a warning of ignoreExtension deprecation") { > val logAppender = new LogAppender > withTempPath { dir => > Seq(("a", 1, 2), ("b", 1, 2), ("c", 2, 1), ("d", 2, 1)) > .toDF("value", "p1", "p2") > .repartition(2) > .write > .format("avro") > .option("header", true) > .save(dir.getCanonicalPath) > withLogAppender(logAppender) { > spark > .read > .format("avro") > .option(AvroOptions.ignoreExtensionKey, false) > .option("header", true) > .load(dir.getCanonicalPath) > .count() > } > val deprecatedEvents = logAppender.loggingEvents > .filter(_.getRenderedMessage.contains( > s"Option ${AvroOptions.ignoreExtensionKey} is deprecated")) > assert(deprecatedEvents.size === 1) > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30511) Spark marks ended speculative tasks as pending leads to holding idle executors
[ https://issues.apache.org/jira/browse/SPARK-30511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zebing Lin updated SPARK-30511: --- External issue ID: (was: SPARK-2840) > Spark marks ended speculative tasks as pending leads to holding idle executors > -- > > Key: SPARK-30511 > URL: https://issues.apache.org/jira/browse/SPARK-30511 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.3.0 >Reporter: Zebing Lin >Priority: Major > > *TL;DR* > When speculative tasks finished/failed/got killed, they are still considered > as pending and count towards the calculation of number of needed executors. > h3. Symptom > In one of our production job (where it's running 4 tasks per executor), we > found that it was holding 6 executors at the end with only 2 tasks running (1 > speculative). With more logging enabled, we found the job printed: > {code:java} > pendingTasks is 0 pendingSpeculativeTasks is 17 totalRunningTasks is 2 > {code} > while the job only had 1 speculative task running and 16 speculative tasks > intentionally killed because of corresponding original tasks had finished. > h3. The Bug > Upon examining the code of _pendingSpeculativeTasks_: > {code:java} > stageAttemptToNumSpeculativeTasks.map { case (stageAttempt, numTasks) => > numTasks - > stageAttemptToSpeculativeTaskIndices.get(stageAttempt).map(_.size).getOrElse(0) > }.sum > {code} > where _stageAttemptToNumSpeculativeTasks(stageAttempt)_ is incremented on > _onSpeculativeTaskSubmitted_, but never decremented. > _stageAttemptToNumSpeculativeTasks -= stageAttempt_ is performed on stage > completion. *This means Spark is marking ended speculative tasks as pending, > which leads to Spark to hold more executors that it actually needs!* > I will have a PR ready to fix this issue, along with SPARK-2840 too > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30511) Spark marks ended speculative tasks as pending leads to holding idle executors
[ https://issues.apache.org/jira/browse/SPARK-30511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zebing Lin updated SPARK-30511: --- Description: *TL;DR* When speculative tasks finished/failed/got killed, they are still considered as pending and count towards the calculation of number of needed executors. h3. Symptom In one of our production job (where it's running 4 tasks per executor), we found that it was holding 6 executors at the end with only 2 tasks running (1 speculative). With more logging enabled, we found the job printed: {code:java} pendingTasks is 0 pendingSpeculativeTasks is 17 totalRunningTasks is 2 {code} while the job only had 1 speculative task running and 16 speculative tasks intentionally killed because of corresponding original tasks had finished. h3. The Bug Upon examining the code of _pendingSpeculativeTasks_: {code:java} stageAttemptToNumSpeculativeTasks.map { case (stageAttempt, numTasks) => numTasks - stageAttemptToSpeculativeTaskIndices.get(stageAttempt).map(_.size).getOrElse(0) }.sum {code} where _stageAttemptToNumSpeculativeTasks(stageAttempt)_ is incremented on _onSpeculativeTaskSubmitted_, but never decremented. _stageAttemptToNumSpeculativeTasks -= stageAttempt_ is performed on stage completion. *This means Spark is marking ended speculative tasks as pending, which leads to Spark to hold more executors that it actually needs!* I will have a PR ready to fix this issue, along with SPARK-2840 too was: *TL;DR* When speculative tasks finished/failed/got killed, they are still considered as pending and count towards the calculation of number of needed executors. h3. Symptom In one of our production job (where it's running 4 tasks per executor), we found that it was holding 6 executors at the end with only 2 tasks running (1 speculative). With more logging enabled, we found the job printed: {code:java} pendingTasks is 0 pendingSpeculativeTasks is 17 totalRunningTasks is 2 {code} while the job only had 1 speculative task running and 16 speculative tasks intentionally killed because of corresponding original tasks had finished. h3. The Bug Upon examining the code of _pendingSpeculativeTasks_: {code:java} stageAttemptToNumSpeculativeTasks.map { case (stageAttempt, numTasks) => numTasks - stageAttemptToSpeculativeTaskIndices.get(stageAttempt).map(_.size).getOrElse(0) }.sum {code} where _stageAttemptToNumSpeculativeTasks(stageAttempt)_ is incremented on _onSpeculativeTaskSubmitted_, but never decremented. _stageAttemptToNumSpeculativeTasks -= stageAttempt_ is performed on stage completion. *This means Spark is marking ended speculative tasks as pending, which leads to Spark to hold more executors that it actually needs!* I will have a PR ready to fix this issue, along with SPARK-2840 too > Spark marks ended speculative tasks as pending leads to holding idle executors > -- > > Key: SPARK-30511 > URL: https://issues.apache.org/jira/browse/SPARK-30511 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.3.0 >Reporter: Zebing Lin >Priority: Major > > *TL;DR* > When speculative tasks finished/failed/got killed, they are still considered > as pending and count towards the calculation of number of needed executors. > h3. Symptom > In one of our production job (where it's running 4 tasks per executor), we > found that it was holding 6 executors at the end with only 2 tasks running (1 > speculative). With more logging enabled, we found the job printed: > {code:java} > pendingTasks is 0 pendingSpeculativeTasks is 17 totalRunningTasks is 2 > {code} > while the job only had 1 speculative task running and 16 speculative tasks > intentionally killed because of corresponding original tasks had finished. > h3. The Bug > Upon examining the code of _pendingSpeculativeTasks_: > {code:java} > stageAttemptToNumSpeculativeTasks.map { case (stageAttempt, numTasks) => > numTasks - > stageAttemptToSpeculativeTaskIndices.get(stageAttempt).map(_.size).getOrElse(0) > }.sum > {code} > where _stageAttemptToNumSpeculativeTasks(stageAttempt)_ is incremented on > _onSpeculativeTaskSubmitted_, but never decremented. > _stageAttemptToNumSpeculativeTasks -= stageAttempt_ is performed on stage > completion. *This means Spark is marking ended speculative tasks as pending, > which leads to Spark to hold more executors that it actually needs!* > I will have a PR ready to fix this issue, along with SPARK-2840 too > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30511) Spark marks ended speculative tasks as pending leads to holding idle executors
[ https://issues.apache.org/jira/browse/SPARK-30511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zebing Lin updated SPARK-30511: --- External issue ID: SPARK-2840 > Spark marks ended speculative tasks as pending leads to holding idle executors > -- > > Key: SPARK-30511 > URL: https://issues.apache.org/jira/browse/SPARK-30511 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.3.0 >Reporter: Zebing Lin >Priority: Major > > *TL;DR* > When speculative tasks finished/failed/got killed, they are still considered > as pending and count towards the calculation of number of needed executors. > h3. Symptom > In one of our production job (where it's running 4 tasks per executor), we > found that it was holding 6 executors at the end with only 2 tasks running (1 > speculative). With more logging enabled, we found the job printed: > {code:java} > pendingTasks is 0 pendingSpeculativeTasks is 17 totalRunningTasks is 2 > {code} > while the job only had 1 speculative task running and 16 speculative tasks > intentionally killed because of corresponding original tasks had finished. > h3. The Bug > Upon examining the code of _pendingSpeculativeTasks_: > {code:java} > stageAttemptToNumSpeculativeTasks.map { case (stageAttempt, numTasks) => > numTasks - > stageAttemptToSpeculativeTaskIndices.get(stageAttempt).map(_.size).getOrElse(0) > }.sum > {code} > where _stageAttemptToNumSpeculativeTasks(stageAttempt)_ is incremented on > _onSpeculativeTaskSubmitted_, but never decremented. > _stageAttemptToNumSpeculativeTasks -= stageAttempt_ is performed on stage > completion. *This means Spark is marking ended speculative tasks as pending, > which leads to Spark to hold more executors that it actually needs!* > I will have a PR ready to fix this issue, along with SPARK-2840 too > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30511) Spark marks ended speculative tasks as pending leads to holding idle executors
Zebing Lin created SPARK-30511: -- Summary: Spark marks ended speculative tasks as pending leads to holding idle executors Key: SPARK-30511 URL: https://issues.apache.org/jira/browse/SPARK-30511 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 2.3.0 Reporter: Zebing Lin *TL;DR* When speculative tasks finished/failed/got killed, they are still considered as pending and count towards the calculation of number of needed executors. h3. Symptom In one of our production job (where it's running 4 tasks per executor), we found that it was holding 6 executors at the end with only 2 tasks running (1 speculative). With more logging enabled, we found the job printed: {code:java} pendingTasks is 0 pendingSpeculativeTasks is 17 totalRunningTasks is 2 {code} while the job only had 1 speculative task running and 16 speculative tasks intentionally killed because of corresponding original tasks had finished. h3. The Bug Upon examining the code of _pendingSpeculativeTasks_: {code:java} stageAttemptToNumSpeculativeTasks.map { case (stageAttempt, numTasks) => numTasks - stageAttemptToSpeculativeTaskIndices.get(stageAttempt).map(_.size).getOrElse(0) }.sum {code} where _stageAttemptToNumSpeculativeTasks(stageAttempt)_ is incremented on _onSpeculativeTaskSubmitted_, but never decremented. _stageAttemptToNumSpeculativeTasks -= stageAttempt_ is performed on stage completion. *This means Spark is marking ended speculative tasks as pending, which leads to Spark to hold more executors that it actually needs!* I will have a PR ready to fix this issue, along with SPARK-2840 too -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30495) How to disable 'spark.security.credentials.${service}.enabled' in Structured streaming while connecting to a kafka cluster
[ https://issues.apache.org/jira/browse/SPARK-30495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015335#comment-17015335 ] Jungtaek Lim commented on SPARK-30495: -- Adjusted priority as it's a regression. May need higher priority though, but given we have a PR closer to merge, major seems OK. > How to disable 'spark.security.credentials.${service}.enabled' in Structured > streaming while connecting to a kafka cluster > -- > > Key: SPARK-30495 > URL: https://issues.apache.org/jira/browse/SPARK-30495 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: act_coder >Priority: Major > > Trying to read data from a secured Kafka cluster using spark structured > streaming. Also, using the below library to read the data - > +*"spark-sql-kafka-0-10_2.12":"3.0.0-preview"*+ since it has the feature to > specify our custom group id (instead of spark setting its own custom group > id) > +*Dependency used in code:*+ > org.apache.spark > spark-sql-kafka-0-10_2.12 > 3.0.0-preview > > +*Logs:*+ > Getting the below error - even after specifying the required JAAS > configuration in spark options. > Caused by: java.lang.IllegalArgumentException: requirement failed: > *Delegation token must exist for this connector*. at > scala.Predef$.require(Predef.scala:281) at > org.apache.spark.kafka010.KafkaTokenUtil$.isConnectorUsingCurrentToken(KafkaTokenUtil.scala:299) > at > > org.apache.spark.sql.kafka010.KafkaDataConsumer.getOrRetrieveConsumer(KafkaDataConsumer.scala:533) > at > > org.apache.spark.sql.kafka010.KafkaDataConsumer.$anonfun$get$1(KafkaDataConsumer.scala:275) > > +*Spark configuration used to read from Kafka:*+ > val kafkaDF = sparkSession.readStream > .format("kafka") > .option("kafka.bootstrap.servers", bootStrapServer) > .option("subscribe", kafkaTopic ) > > //Setting JAAS Configuration > .option("kafka.sasl.jaas.config", KAFKA_JAAS_SASL) > .option("kafka.sasl.mechanism", "PLAIN") > .option("kafka.security.protocol", "SASL_SSL") > // Setting custom consumer group id > .option("kafka.group.id", "test_cg") > .load() > > Following document specifies that we can disable the feature of obtaining > delegation token - > > [https://spark.apache.org/docs/3.0.0-preview/structured-streaming-kafka-integration.html] > Tried setting this property *spark.security.credentials.kafka.enabled to* > *false in spark config,* but it is still failing with the same error. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30495) How to disable 'spark.security.credentials.${service}.enabled' in Structured streaming while connecting to a kafka cluster
[ https://issues.apache.org/jira/browse/SPARK-30495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-30495: - Priority: Major (was: Minor) > How to disable 'spark.security.credentials.${service}.enabled' in Structured > streaming while connecting to a kafka cluster > -- > > Key: SPARK-30495 > URL: https://issues.apache.org/jira/browse/SPARK-30495 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: act_coder >Priority: Major > > Trying to read data from a secured Kafka cluster using spark structured > streaming. Also, using the below library to read the data - > +*"spark-sql-kafka-0-10_2.12":"3.0.0-preview"*+ since it has the feature to > specify our custom group id (instead of spark setting its own custom group > id) > +*Dependency used in code:*+ > org.apache.spark > spark-sql-kafka-0-10_2.12 > 3.0.0-preview > > +*Logs:*+ > Getting the below error - even after specifying the required JAAS > configuration in spark options. > Caused by: java.lang.IllegalArgumentException: requirement failed: > *Delegation token must exist for this connector*. at > scala.Predef$.require(Predef.scala:281) at > org.apache.spark.kafka010.KafkaTokenUtil$.isConnectorUsingCurrentToken(KafkaTokenUtil.scala:299) > at > > org.apache.spark.sql.kafka010.KafkaDataConsumer.getOrRetrieveConsumer(KafkaDataConsumer.scala:533) > at > > org.apache.spark.sql.kafka010.KafkaDataConsumer.$anonfun$get$1(KafkaDataConsumer.scala:275) > > +*Spark configuration used to read from Kafka:*+ > val kafkaDF = sparkSession.readStream > .format("kafka") > .option("kafka.bootstrap.servers", bootStrapServer) > .option("subscribe", kafkaTopic ) > > //Setting JAAS Configuration > .option("kafka.sasl.jaas.config", KAFKA_JAAS_SASL) > .option("kafka.sasl.mechanism", "PLAIN") > .option("kafka.security.protocol", "SASL_SSL") > // Setting custom consumer group id > .option("kafka.group.id", "test_cg") > .load() > > Following document specifies that we can disable the feature of obtaining > delegation token - > > [https://spark.apache.org/docs/3.0.0-preview/structured-streaming-kafka-integration.html] > Tried setting this property *spark.security.credentials.kafka.enabled to* > *false in spark config,* but it is still failing with the same error. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30488) Deadlock between block-manager-slave-async-thread-pool and spark context cleaner
[ https://issues.apache.org/jira/browse/SPARK-30488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015308#comment-17015308 ] Rohit Agrawal commented on SPARK-30488: --- [~ajithshetty] We use the following to create spark context: SparkSession.builder().config(finalSparkConf).getOrCreate() > Deadlock between block-manager-slave-async-thread-pool and spark context > cleaner > > > Key: SPARK-30488 > URL: https://issues.apache.org/jira/browse/SPARK-30488 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.3 >Reporter: Rohit Agrawal >Priority: Major > > Deadlock happens while cleaning up the spark context. Here is the full thread > dump: > > > 2020-01-10T20:13:16.2884057Z Full thread dump Java HotSpot(TM) 64-Bit > Server VM (25.221-b11 mixed mode): > 2020-01-10T20:13:16.2884392Z > 2020-01-10T20:13:16.2884660Z "SIGINT handler" #488 daemon prio=9 os_prio=2 > tid=0x111fa000 nid=0x4794 waiting for monitor entry > [0x1c86e000] > 2020-01-10T20:13:16.2884807Z java.lang.Thread.State: BLOCKED (on object > monitor) > 2020-01-10T20:13:16.2884879Z at java.lang.Shutdown.exit(Shutdown.java:212) > 2020-01-10T20:13:16.2885693Z - waiting to lock <0xc0155de0> (a > java.lang.Class for java.lang.Shutdown) > 2020-01-10T20:13:16.2885840Z at > java.lang.Terminator$1.handle(Terminator.java:52) > 2020-01-10T20:13:16.2885965Z at sun.misc.Signal$1.run(Signal.java:212) > 2020-01-10T20:13:16.2886329Z at java.lang.Thread.run(Thread.java:748) > 2020-01-10T20:13:16.2886430Z > 2020-01-10T20:13:16.2886752Z "Thread-3" #108 prio=5 os_prio=0 > tid=0x111f7800 nid=0x48cc waiting for monitor entry > [0x2c33f000] > 2020-01-10T20:13:16.2886881Z java.lang.Thread.State: BLOCKED (on object > monitor) > 2020-01-10T20:13:16.2886999Z at > org.apache.hadoop.util.ShutdownHookManager.getShutdownHooksInOrder(ShutdownHookManager.java:273) > 2020-01-10T20:13:16.2887107Z at > org.apache.hadoop.util.ShutdownHookManager.executeShutdown(ShutdownHookManager.java:121) > 2020-01-10T20:13:16.2887212Z at > org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:95) > 2020-01-10T20:13:16.2887421Z > 2020-01-10T20:13:16.2887798Z "block-manager-slave-async-thread-pool-81" #486 > daemon prio=5 os_prio=0 tid=0x111fe800 nid=0x2e34 waiting for monitor > entry [0x2bf3d000] > 2020-01-10T20:13:16.2889192Z java.lang.Thread.State: BLOCKED (on object > monitor) > 2020-01-10T20:13:16.2889305Z at > java.lang.ClassLoader.loadClass(ClassLoader.java:404) > 2020-01-10T20:13:16.2889405Z - waiting to lock <0xc1f359f0> (a > sbt.internal.LayeredClassLoader) > 2020-01-10T20:13:16.2889482Z at > java.lang.ClassLoader.loadClass(ClassLoader.java:411) > 2020-01-10T20:13:16.2889582Z - locked <0xca33e4c8> (a > sbt.internal.ManagedClassLoader$ZombieClassLoader) > 2020-01-10T20:13:16.2889659Z at > java.lang.ClassLoader.loadClass(ClassLoader.java:357) > 2020-01-10T20:13:16.2890881Z at > org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply$mcZ$sp(BlockManagerSlaveEndpoint.scala:58) > 2020-01-10T20:13:16.2891006Z at > org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(BlockManagerSlaveEndpoint.scala:57) > 2020-01-10T20:13:16.2891142Z at > org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$3.apply(BlockManagerSlaveEndpoint.scala:57) > 2020-01-10T20:13:16.2891260Z at > org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$1.apply(BlockManagerSlaveEndpoint.scala:86) > 2020-01-10T20:13:16.2891375Z at > scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) > 2020-01-10T20:13:16.2891624Z at > scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) > 2020-01-10T20:13:16.2891737Z at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > 2020-01-10T20:13:16.2891833Z at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > 2020-01-10T20:13:16.2891925Z at java.lang.Thread.run(Thread.java:748) > 2020-01-10T20:13:16.2891967Z > 2020-01-10T20:13:16.2892066Z "pool-31-thread-16" #335 prio=5 os_prio=0 > tid=0x153b2000 nid=0x1aac waiting on condition [0x4b2ff000] > 2020-01-10T20:13:16.2892147Z java.lang.Thread.State: WAITING (parking) > 2020-01-10T20:13:16.2892241Z at sun.misc.Unsafe.park(Native Method) > 2020-01-10T20:13:16.2892328Z - parking to wait for <0xc9cad078> (a > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) > 2020-01-10T20:13:16.2892437Z at >
[jira] [Created] (SPARK-30510) Document spark.sql.sources.partitionOverwriteMode
Nicholas Chammas created SPARK-30510: Summary: Document spark.sql.sources.partitionOverwriteMode Key: SPARK-30510 URL: https://issues.apache.org/jira/browse/SPARK-30510 Project: Spark Issue Type: Improvement Components: Documentation, SQL Affects Versions: 2.4.4 Reporter: Nicholas Chammas SPARK-20236 added a new option, {{spark.sql.sources.partitionOverwriteMode}}, but it doesn't appear to be documented in [the expected place|http://spark.apache.org/docs/2.4.4/configuration.html]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27142) Provide REST API for SQL level information
[ https://issues.apache.org/jira/browse/SPARK-27142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin reassigned SPARK-27142: -- Assignee: Ajith S > Provide REST API for SQL level information > -- > > Key: SPARK-27142 > URL: https://issues.apache.org/jira/browse/SPARK-27142 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ajith S >Assignee: Ajith S >Priority: Minor > Attachments: image-2019-03-13-19-29-26-896.png > > > Currently for Monitoring Spark application SQL information is not available > from REST but only via UI. REST provides only > applications,jobs,stages,environment. This Jira is targeted to provide a REST > API so that SQL level information can be found > > Details: > https://issues.apache.org/jira/browse/SPARK-27142?focusedCommentId=16791728=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16791728 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27142) Provide REST API for SQL level information
[ https://issues.apache.org/jira/browse/SPARK-27142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin resolved SPARK-27142. Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 24076 [https://github.com/apache/spark/pull/24076] > Provide REST API for SQL level information > -- > > Key: SPARK-27142 > URL: https://issues.apache.org/jira/browse/SPARK-27142 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ajith S >Assignee: Ajith S >Priority: Minor > Fix For: 3.0.0 > > Attachments: image-2019-03-13-19-29-26-896.png > > > Currently for Monitoring Spark application SQL information is not available > from REST but only via UI. REST provides only > applications,jobs,stages,environment. This Jira is targeted to provide a REST > API so that SQL level information can be found > > Details: > https://issues.apache.org/jira/browse/SPARK-27142?focusedCommentId=16791728=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16791728 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30424) Change ExpressionEncoder toRow method to return UnsafeRow
[ https://issues.apache.org/jira/browse/SPARK-30424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015183#comment-17015183 ] Erik Erlandson commented on SPARK-30424: The main place this change causes a compile fail on is in SparkSession: {code:java} def createDataFrame(rowRDD: RDD[Row], schema: StructType): DataFrame{code} And the key RDD impacted is LogicalRDD. What I'm wondering is whether it is appropriate to change the signature of the RDD in LogicalRDD from InternalRow to the more specific UnsafeRow. My intuition is no, however it's also true that this is what's actually occurring under the hood currently, so I'm curious what the catalyst maintainers think about it. > Change ExpressionEncoder toRow method to return UnsafeRow > - > > Key: SPARK-30424 > URL: https://issues.apache.org/jira/browse/SPARK-30424 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Erik Erlandson >Priority: Minor > > [~wenchen] observed that the toRow() method on ExpressionEncoder can have its > return type specified as UnsafeRow. See discussion on > [https://github.com/apache/spark/pull/25024] > > Not a high priority but could be done for 3.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9478) Add sample weights to Random Forest
[ https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-9478. - Fix Version/s: 3.0.0 Resolution: Fixed > Add sample weights to Random Forest > --- > > Key: SPARK-9478 > URL: https://issues.apache.org/jira/browse/SPARK-9478 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.1 >Reporter: Patrick Crenshaw >Assignee: zhengruifeng >Priority: Major > Fix For: 3.0.0 > > > Currently, this implementation of random forest does not support sample > (instance) weights. Weights are important when there is imbalanced training > data or the evaluation metric of a classifier is imbalanced (e.g. true > positive rate at some false positive threshold). Sample weights generalize > class weights, so this could be used to add class weights later on. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-9478) Add sample weights to Random Forest
[ https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reopened SPARK-9478: - > Add sample weights to Random Forest > --- > > Key: SPARK-9478 > URL: https://issues.apache.org/jira/browse/SPARK-9478 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.1 >Reporter: Patrick Crenshaw >Assignee: zhengruifeng >Priority: Major > > Currently, this implementation of random forest does not support sample > (instance) weights. Weights are important when there is imbalanced training > data or the evaluation metric of a classifier is imbalanced (e.g. true > positive rate at some false positive threshold). Sample weights generalize > class weights, so this could be used to add class weights later on. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9478) Add sample weights to Random Forest
[ https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-9478: --- Assignee: zhengruifeng > Add sample weights to Random Forest > --- > > Key: SPARK-9478 > URL: https://issues.apache.org/jira/browse/SPARK-9478 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.1 >Reporter: Patrick Crenshaw >Assignee: zhengruifeng >Priority: Major > > Currently, this implementation of random forest does not support sample > (instance) weights. Weights are important when there is imbalanced training > data or the evaluation metric of a classifier is imbalanced (e.g. true > positive rate at some false positive threshold). Sample weights generalize > class weights, so this could be used to add class weights later on. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30423) Deprecate UserDefinedAggregateFunction
[ https://issues.apache.org/jira/browse/SPARK-30423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30423. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27193 [https://github.com/apache/spark/pull/27193] > Deprecate UserDefinedAggregateFunction > -- > > Key: SPARK-30423 > URL: https://issues.apache.org/jira/browse/SPARK-30423 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Erik Erlandson >Assignee: Erik Erlandson >Priority: Major > Fix For: 3.0.0 > > > Anticipating the merging of SPARK-27296, the legacy methodology for > implementing custom user defined aggregators over untyped DataFrame based on > UserDefinedAggregateFunction will be made obsolete. This class should be > annotated as deprecated once the new capability is officially merged. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29544) Optimize skewed join at runtime with new Adaptive Execution
[ https://issues.apache.org/jira/browse/SPARK-29544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29544. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26434 [https://github.com/apache/spark/pull/26434] > Optimize skewed join at runtime with new Adaptive Execution > --- > > Key: SPARK-29544 > URL: https://issues.apache.org/jira/browse/SPARK-29544 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ke Jia >Assignee: Ke Jia >Priority: Major > Fix For: 3.0.0 > > Attachments: Skewed Join Optimization Design Doc.docx > > > Implement a rule in the new adaptive execution framework introduced in > [SPARK-23128|https://issues.apache.org/jira/browse/SPARK-23128]. This rule is > used to handle the skew join optimization based on the runtime statistics > (data size and row count). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29544) Optimize skewed join at runtime with new Adaptive Execution
[ https://issues.apache.org/jira/browse/SPARK-29544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29544: --- Assignee: Ke Jia > Optimize skewed join at runtime with new Adaptive Execution > --- > > Key: SPARK-29544 > URL: https://issues.apache.org/jira/browse/SPARK-29544 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ke Jia >Assignee: Ke Jia >Priority: Major > Attachments: Skewed Join Optimization Design Doc.docx > > > Implement a rule in the new adaptive execution framework introduced in > [SPARK-23128|https://issues.apache.org/jira/browse/SPARK-23128]. This rule is > used to handle the skew join optimization based on the runtime statistics > (data size and row count). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30325) markPartitionCompleted cause task status inconsistent
[ https://issues.apache.org/jira/browse/SPARK-30325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-30325: --- Assignee: haiyangyu > markPartitionCompleted cause task status inconsistent > - > > Key: SPARK-30325 > URL: https://issues.apache.org/jira/browse/SPARK-30325 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 2.4.4 >Reporter: haiyangyu >Assignee: haiyangyu >Priority: Major > Fix For: 3.0.0 > > Attachments: image-2019-12-21-17-11-38-565.png, > image-2019-12-21-17-15-51-512.png, image-2019-12-21-17-16-40-998.png, > image-2019-12-21-17-17-42-244.png > > > h3. Corner case > The bugs occurs in the coren case as follows: > # The stage occurs for fetchFailed and some task hasn't finished, scheduler > will resubmit a new stage as retry with those unfinished tasks. > # The unfinished task in origin stage finished and the same task on the new > retry stage hasn't finished, it will mark the task partition on the new retry > stage as succesuful. !image-2019-12-21-17-11-38-565.png|width=427,height=154! > # The executor running those 'successful task' crashed, it cause > taskSetManager run executorLost to rescheduler the task on the executor, here > will cause copiesRunning decreate 1 twice, beause those 'successful task' are > not finished, the variable copiesRunning will decreate to -1 as result. > !image-2019-12-21-17-15-51-512.png|width=437,height=340!!image-2019-12-21-17-16-40-998.png|width=398,height=139! > # 'dequeueTaskFromList' will use copiesRunning equal 0 as reschedule basis > when rescheduler tasks, and now it is -1, can't to reschedule, and the app > will hung forever. !image-2019-12-21-17-17-42-244.png|width=366,height=282! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30325) markPartitionCompleted cause task status inconsistent
[ https://issues.apache.org/jira/browse/SPARK-30325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30325. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26975 [https://github.com/apache/spark/pull/26975] > markPartitionCompleted cause task status inconsistent > - > > Key: SPARK-30325 > URL: https://issues.apache.org/jira/browse/SPARK-30325 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 2.4.4 >Reporter: haiyangyu >Priority: Major > Fix For: 3.0.0 > > Attachments: image-2019-12-21-17-11-38-565.png, > image-2019-12-21-17-15-51-512.png, image-2019-12-21-17-16-40-998.png, > image-2019-12-21-17-17-42-244.png > > > h3. Corner case > The bugs occurs in the coren case as follows: > # The stage occurs for fetchFailed and some task hasn't finished, scheduler > will resubmit a new stage as retry with those unfinished tasks. > # The unfinished task in origin stage finished and the same task on the new > retry stage hasn't finished, it will mark the task partition on the new retry > stage as succesuful. !image-2019-12-21-17-11-38-565.png|width=427,height=154! > # The executor running those 'successful task' crashed, it cause > taskSetManager run executorLost to rescheduler the task on the executor, here > will cause copiesRunning decreate 1 twice, beause those 'successful task' are > not finished, the variable copiesRunning will decreate to -1 as result. > !image-2019-12-21-17-15-51-512.png|width=437,height=340!!image-2019-12-21-17-16-40-998.png|width=398,height=139! > # 'dequeueTaskFromList' will use copiesRunning equal 0 as reschedule basis > when rescheduler tasks, and now it is -1, can't to reschedule, and the app > will hung forever. !image-2019-12-21-17-17-42-244.png|width=366,height=282! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30295) Remove Hive dependencies from SparkSQLCLI
[ https://issues.apache.org/jira/browse/SPARK-30295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015028#comment-17015028 ] Javier Fuentes commented on SPARK-30295: Yes, this purely to try to remove more hive dependencies and reduce the footprint replacing exisisting java code with scala implementations. > Remove Hive dependencies from SparkSQLCLI > - > > Key: SPARK-30295 > URL: https://issues.apache.org/jira/browse/SPARK-30295 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Javier Fuentes >Priority: Major > > Removal of unnecessary hive dependencies for the Spark SQL Client. Replacing > that with a native Scala implementation. For the client driver, argument > parser and SparkSqlCliDriver. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30509) Deprecation log warning is not printed in Avro schema inferring
Maxim Gekk created SPARK-30509: -- Summary: Deprecation log warning is not printed in Avro schema inferring Key: SPARK-30509 URL: https://issues.apache.org/jira/browse/SPARK-30509 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk The bug can be reproduced by the test: {code} test("log a warning of ignoreExtension deprecation") { val logAppender = new LogAppender withTempPath { dir => Seq(("a", 1, 2), ("b", 1, 2), ("c", 2, 1), ("d", 2, 1)) .toDF("value", "p1", "p2") .repartition(2) .write .format("avro") .option("header", true) .save(dir.getCanonicalPath) withLogAppender(logAppender) { spark .read .format("avro") .option(AvroOptions.ignoreExtensionKey, false) .option("header", true) .load(dir.getCanonicalPath) .count() } val deprecatedEvents = logAppender.loggingEvents .filter(_.getRenderedMessage.contains( s"Option ${AvroOptions.ignoreExtensionKey} is deprecated")) assert(deprecatedEvents.size === 1) } } {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30498) Fix some ml parity issues between python and scala
[ https://issues.apache.org/jira/browse/SPARK-30498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng reassigned SPARK-30498: Assignee: Huaxin Gao > Fix some ml parity issues between python and scala > -- > > Key: SPARK-30498 > URL: https://issues.apache.org/jira/browse/SPARK-30498 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > > remove setters in CrossValidatorModel and TrainValidationSplitModel in > tuning.py. use _set to set the params. > add setInputCol/setOutputCol in Python ImputerModel. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30498) Fix some ml parity issues between python and scala
[ https://issues.apache.org/jira/browse/SPARK-30498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-30498. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27196 [https://github.com/apache/spark/pull/27196] > Fix some ml parity issues between python and scala > -- > > Key: SPARK-30498 > URL: https://issues.apache.org/jira/browse/SPARK-30498 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.0.0 > > > remove setters in CrossValidatorModel and TrainValidationSplitModel in > tuning.py. use _set to set the params. > add setInputCol/setOutputCol in Python ImputerModel. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30292) Throw Exception when invalid string is cast to decimal in ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-30292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30292. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26933 [https://github.com/apache/spark/pull/26933] > Throw Exception when invalid string is cast to decimal in ANSI mode > --- > > Key: SPARK-30292 > URL: https://issues.apache.org/jira/browse/SPARK-30292 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Rakesh Raushan >Assignee: Rakesh Raushan >Priority: Minor > Fix For: 3.0.0 > > > When spark.sql.ansi.enabled is set, > If we run select cast('str' as decimal), spark-sql outputs NULL. > The ANSI SQL standard requires to throw exception when invalid strings are > cast to numbers. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30292) Throw Exception when invalid string is cast to decimal in ANSI mode
[ https://issues.apache.org/jira/browse/SPARK-30292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-30292: --- Assignee: Rakesh Raushan > Throw Exception when invalid string is cast to decimal in ANSI mode > --- > > Key: SPARK-30292 > URL: https://issues.apache.org/jira/browse/SPARK-30292 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Rakesh Raushan >Assignee: Rakesh Raushan >Priority: Minor > > When spark.sql.ansi.enabled is set, > If we run select cast('str' as decimal), spark-sql outputs NULL. > The ANSI SQL standard requires to throw exception when invalid strings are > cast to numbers. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30508) Add DataFrameReader.executeCommand API for external datasource
wuyi created SPARK-30508: Summary: Add DataFrameReader.executeCommand API for external datasource Key: SPARK-30508 URL: https://issues.apache.org/jira/browse/SPARK-30508 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: wuyi Add DataFrameReader.executeCommand API for external datasource in order to make external datasources be able to execute some custom DDL/DML commands. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28242) DataStreamer keeps logging errors even after fixing writeStream output sink
[ https://issues.apache.org/jira/browse/SPARK-28242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014919#comment-17014919 ] Hyokun Park commented on SPARK-28242: - Hi [~mcanes] In my case, I resolved the problem by adding a configuration. Please add this configuration "--conf spark.hadoop.dfs.client.block.write.replace-datanode-on-failure.enable=false" in your spark-submit command. > DataStreamer keeps logging errors even after fixing writeStream output sink > --- > > Key: SPARK-28242 > URL: https://issues.apache.org/jira/browse/SPARK-28242 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.3 > Environment: Hadoop 2.8.4 > >Reporter: Miquel Canes >Priority: Minor > > I have been testing what happens to a running structured streaming that is > writing to HDFS when all datanodes are down/stopped or all cluster is down > (including namenode) > So I created a structured stream from kafka to a File output sink to HDFS and > tested some scenarios. > We used a very simple streamings: > {code:java} > spark.readStream() > .format("kafka") > .option("kafka.bootstrap.servers", "kafka.server:9092...") > .option("subscribe", "test_topic") > .load() > .select(col("value").cast(DataTypes.StringType)) > .writeStream() > .format("text") > .option("path", "HDFS/PATH") > .option("checkpointLocation", "checkpointPath") > .start() > .awaitTermination();{code} > > After stopping all the datanodes the process starts logging the error that > datanodes are bad. > That's correct... > {code:java} > 2019-07-03 15:55:00 [spark-listener-group-eventLog] ERROR > org.apache.spark.scheduler.AsyncEventQueue:91 - Listener EventLoggingListener > threw an exception java.io.IOException: All datanodes > [DatanodeInfoWithStorage[10.2.12.202:50010,DS-d2fba01b-28eb-4fe4-baaa-4072102a2172,DISK]] > are bad. Aborting... at > org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1530) > at > org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1465) > at > org.apache.hadoop.hdfs.DataStreamer.processDatanodeError(DataStreamer.java:1237) > at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:657) > {code} > The problem is that even after starting again the datanodes the process keeps > logging the same error all the time. > We checked and the WriteStream to HDFS recovered successfully after starting > the datanodes and the output sink worked again without problems. > I have been trying some different HDFS configurations to be sure it's not a > client config related problem but with no clue about how to fix it. > It seams that something is stuck indefinitely in an error loop. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org