[jira] [Updated] (SPARK-29450) [SS] In streaming aggregation, metric for output rows is not measured in append mode
[ https://issues.apache.org/jira/browse/SPARK-29450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-29450: Fix Version/s: 2.4.5 > [SS] In streaming aggregation, metric for output rows is not measured in > append mode > > > Key: SPARK-29450 > URL: https://issues.apache.org/jira/browse/SPARK-29450 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.4, 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Minor > Fix For: 2.4.5, 3.0.0 > > > The bug is reported in dev. mailing list. Credit to [~jlaskowski]. Quoting > here: > > {quote} > I've just noticed that the number of output rows metric of StateStoreSaveExec > physical operator does not seem to be measured for Append output mode. In > other words, whatever happens before or after StateStoreSaveExec operator the > metric is always 0. > > It is measured for the other modes - Complete and Update. > {quote} > https://lists.apache.org/thread.html/eec318f7ff84c700bffc8bdb7c95c0dcaffb2494ac7ccd7a7c7c6588@%3Cdev.spark.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
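The symptom above can be modeled with a toy operator whose "number of output rows" metric is bumped on the Complete/Update code paths but forgotten on the Append path, so the metric always reads 0 there. This is an illustrative sketch only; the class and attribute names are invented and are not Spark's `StateStoreSaveExec` API.

```python
# Hypothetical model of the reported bug: the metric is only measured on
# the Complete/Update paths, never on Append.

class StateStoreSave:
    def __init__(self, mode):
        self.mode = mode
        self.num_output_rows = 0  # the SQL metric in question

    def process(self, rows, watermark):
        if self.mode in ("complete", "update"):
            out = list(rows)
            self.num_output_rows += len(out)  # metric measured here
            return out
        elif self.mode == "append":
            # emit only rows older than the watermark, but this buggy
            # path forgets to bump the metric, so it stays at 0
            return [r for r in rows if r < watermark]

for mode in ("complete", "update", "append"):
    op = StateStoreSave(mode)
    emitted = op.process([1, 5, 9], watermark=7)
    print(mode, len(emitted), op.num_output_rows)
```

Even though the append-mode operator emits two rows, its metric stays at 0, matching the observed behavior.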
[jira] [Comment Edited] (SPARK-27750) Standalone scheduler - ability to prioritize applications over drivers, many drivers act like Denial of Service
[ https://issues.apache.org/jira/browse/SPARK-27750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989249#comment-16989249 ] t oo edited comment on SPARK-27750 at 1/16/20 6:33 AM: --- WDYT [~Ngone51] [~squito] [~vanzin] [~mgaido] [~jiangxb1987] [~jiangxb] [~zsxwing] [~jlaskowski] [~cloud_fan] [~srowen] [~dongjoon] [~hyukjin.kwon] was (Author: toopt4): WDYT [~Ngone51] [~squito] [~vanzin] [~mgaido] [~jiangxb1987] [~jiangxb] [~zsxwing] [~jlaskowski] [~cloud_fan] > Standalone scheduler - ability to prioritize applications over drivers, many > drivers act like Denial of Service > --- > > Key: SPARK-27750 > URL: https://issues.apache.org/jira/browse/SPARK-27750 > Project: Spark > Issue Type: New Feature > Components: Scheduler >Affects Versions: 3.0.0 >Reporter: t oo >Priority: Minor > > If I submit 1000 spark submit drivers then they consume all the cores on my > cluster (essentially it acts like a Denial of Service) and no spark > 'application' gets to run since the cores are all consumed by the 'drivers'. > This feature is about having the ability to prioritize applications over > drivers so that at least some 'applications' can start running. I guess it > would be like: If (driver.state = 'submitted' and (exists some app.state = > 'submitted')) then set app.state = 'running' > if all apps have app.state = 'running' then set driver.state = 'submitted' > > Secondary to this, why must a driver consume a minimum of 1 entire core? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
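The starvation and the proposed fix can be sketched with a toy core-accounting model (assumptions: each driver needs 1 core, each application a fixed number of cores; this is not Spark's standalone scheduler code, just an illustration of the rule the reporter describes).

```python
# Toy model: without prioritization, 1000 drivers eat all cores and no
# application ever starts; with the proposed rule, a waiting application
# is started before yet another driver is launched.

def schedule(cores, num_drivers, app_cores_each, prioritize_apps):
    running_drivers = running_apps = pending_apps = 0
    for _ in range(num_drivers):
        if prioritize_apps and pending_apps > 0 and cores >= app_cores_each:
            # proposed rule: serve a submitted application first
            cores -= app_cores_each
            pending_apps -= 1
            running_apps += 1
        if cores >= 1:
            cores -= 1
            running_drivers += 1
            pending_apps += 1  # each running driver submits one application
    # leftover cores go to any still-pending applications
    while pending_apps > 0 and cores >= app_cores_each:
        cores -= app_cores_each
        pending_apps -= 1
        running_apps += 1
    return running_drivers, running_apps

print(schedule(100, 1000, 2, False))  # drivers consume every core, 0 apps run
print(schedule(100, 1000, 2, True))   # some applications get to run
```

The unprioritized run reproduces the denial-of-service effect described in the issue; the prioritized run trades some driver slots for running applications.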
[jira] [Resolved] (SPARK-30521) Eliminate deprecation warnings for ExpressionInfo
[ https://issues.apache.org/jira/browse/SPARK-30521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30521. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27221 [https://github.com/apache/spark/pull/27221] > Eliminate deprecation warnings for ExpressionInfo > - > > Key: SPARK-30521 > URL: https://issues.apache.org/jira/browse/SPARK-30521 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > {code} > Warning:(335, 5) constructor ExpressionInfo in class ExpressionInfo is > deprecated: see corresponding Javadoc for more information. > new ExpressionInfo("noClass", "myDb", "myFunction", "usage", "extended > usage"), > Warning:(732, 5) constructor ExpressionInfo in class ExpressionInfo is > deprecated: see corresponding Javadoc for more information. > new ExpressionInfo("noClass", "myDb", "myFunction2", "usage", "extended > usage"), > Warning:(751, 5) constructor ExpressionInfo in class ExpressionInfo is > deprecated: see corresponding Javadoc for more information. > new ExpressionInfo("noClass", "myDb", "myFunction2", "usage", "extended > usage"), > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30521) Eliminate deprecation warnings for ExpressionInfo
[ https://issues.apache.org/jira/browse/SPARK-30521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-30521: Assignee: Maxim Gekk > Eliminate deprecation warnings for ExpressionInfo > - > > Key: SPARK-30521 > URL: https://issues.apache.org/jira/browse/SPARK-30521 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > > {code} > Warning:(335, 5) constructor ExpressionInfo in class ExpressionInfo is > deprecated: see corresponding Javadoc for more information. > new ExpressionInfo("noClass", "myDb", "myFunction", "usage", "extended > usage"), > Warning:(732, 5) constructor ExpressionInfo in class ExpressionInfo is > deprecated: see corresponding Javadoc for more information. > new ExpressionInfo("noClass", "myDb", "myFunction2", "usage", "extended > usage"), > Warning:(751, 5) constructor ExpressionInfo in class ExpressionInfo is > deprecated: see corresponding Javadoc for more information. > new ExpressionInfo("noClass", "myDb", "myFunction2", "usage", "extended > usage"), > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27868) Better document shuffle / RPC listen backlog
[ https://issues.apache.org/jira/browse/SPARK-27868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17016541#comment-17016541 ] Dongjoon Hyun commented on SPARK-27868: --- Hi, [~vanzin]. Since there is no proper way to set `Fixed Version` for reverting, I removed the `2.4.4` tag and added `release-note` for 2.4.5. Please advise me if there is a better way~ Thanks. > Better document shuffle / RPC listen backlog > > > Key: SPARK-27868 > URL: https://issues.apache.org/jira/browse/SPARK-27868 > Project: Spark > Issue Type: Bug > Components: Documentation, Spark Core >Affects Versions: 2.4.3 >Reporter: Marcelo Masiero Vanzin >Assignee: Marcelo Masiero Vanzin >Priority: Minor > Labels: release-notes > Fix For: 3.0.0 > > > The option to control the listen socket backlog for RPC and shuffle servers > is not documented in our public docs. > The only piece of documentation is in a Java class, and even that > documentation is incorrect: > {code} > /** Requested maximum length of the queue of incoming connections. Default > -1 for no backlog. */ > public int backLog() { return conf.getInt(SPARK_NETWORK_IO_BACKLOG_KEY, > -1); } > {code} > The default value actually causes the default value from the JRE to be used, > which is 50 according to the docs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
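The mismatch between the Javadoc ("no backlog") and reality can be shown with a small sketch of the lookup semantics (assumptions for illustration: a dict-backed conf and the key name `spark.shuffle.io.backLog`; the constant 50 comes from the `java.net.ServerSocket` documentation, as the comment thread notes).

```python
# Sketch: the config lookup defaults to -1, and a non-positive backlog
# does NOT mean "no backlog" -- the JRE substitutes its own default.

JRE_DEFAULT_BACKLOG = 50  # per the java.net.ServerSocket docs

def back_log(conf):
    # mirrors conf.getInt(SPARK_NETWORK_IO_BACKLOG_KEY, -1)
    return int(conf.get("spark.shuffle.io.backLog", -1))

def effective_backlog(configured):
    # what the listen socket actually ends up using
    return configured if configured > 0 else JRE_DEFAULT_BACKLOG

print(effective_backlog(back_log({})))  # unset -> 50, not unlimited
print(effective_backlog(back_log({"spark.shuffle.io.backLog": 1024})))
```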
[jira] [Updated] (SPARK-27868) Better document shuffle / RPC listen backlog
[ https://issues.apache.org/jira/browse/SPARK-27868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27868: -- Labels: release-notes (was: ) > Better document shuffle / RPC listen backlog > > > Key: SPARK-27868 > URL: https://issues.apache.org/jira/browse/SPARK-27868 > Project: Spark > Issue Type: Bug > Components: Documentation, Spark Core >Affects Versions: 2.4.3 >Reporter: Marcelo Masiero Vanzin >Assignee: Marcelo Masiero Vanzin >Priority: Minor > Labels: release-notes > Fix For: 3.0.0 > > > The option to control the listen socket backlog for RPC and shuffle servers > is not documented in our public docs. > The only piece of documentation is in a Java class, and even that > documentation is incorrect: > {code} > /** Requested maximum length of the queue of incoming connections. Default > -1 for no backlog. */ > public int backLog() { return conf.getInt(SPARK_NETWORK_IO_BACKLOG_KEY, > -1); } > {code} > The default value actually causes the default value from the JRE to be used, > which is 50 according to the docs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27868) Better document shuffle / RPC listen backlog
[ https://issues.apache.org/jira/browse/SPARK-27868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27868: -- Fix Version/s: (was: 2.4.4) > Better document shuffle / RPC listen backlog > > > Key: SPARK-27868 > URL: https://issues.apache.org/jira/browse/SPARK-27868 > Project: Spark > Issue Type: Bug > Components: Documentation, Spark Core >Affects Versions: 2.4.3 >Reporter: Marcelo Masiero Vanzin >Assignee: Marcelo Masiero Vanzin >Priority: Minor > Fix For: 3.0.0 > > > The option to control the listen socket backlog for RPC and shuffle servers > is not documented in our public docs. > The only piece of documentation is in a Java class, and even that > documentation is incorrect: > {code} > /** Requested maximum length of the queue of incoming connections. Default > -1 for no backlog. */ > public int backLog() { return conf.getInt(SPARK_NETWORK_IO_BACKLOG_KEY, > -1); } > {code} > The default value actually causes the default value from the JRE to be used, > which is 50 according to the docs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27868) Better document shuffle / RPC listen backlog
[ https://issues.apache.org/jira/browse/SPARK-27868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17016540#comment-17016540 ] Dongjoon Hyun commented on SPARK-27868: --- This is reverted from `branch-2.4` via https://github.com/apache/spark/commit/94b5d3fd64ea5498b19f9ea7aacb484a18496018 > Better document shuffle / RPC listen backlog > > > Key: SPARK-27868 > URL: https://issues.apache.org/jira/browse/SPARK-27868 > Project: Spark > Issue Type: Bug > Components: Documentation, Spark Core >Affects Versions: 2.4.3 >Reporter: Marcelo Masiero Vanzin >Assignee: Marcelo Masiero Vanzin >Priority: Minor > Fix For: 2.4.4, 3.0.0 > > > The option to control the listen socket backlog for RPC and shuffle servers > is not documented in our public docs. > The only piece of documentation is in a Java class, and even that > documentation is incorrect: > {code} > /** Requested maximum length of the queue of incoming connections. Default > -1 for no backlog. */ > public int backLog() { return conf.getInt(SPARK_NETWORK_IO_BACKLOG_KEY, > -1); } > {code} > The default value actually causes the default value from the JRE to be used, > which is 50 according to the docs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30491) Enable dependency audit files to tell dependency classifier
[ https://issues.apache.org/jira/browse/SPARK-30491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-30491: - Assignee: Xinrong Meng > Enable dependency audit files to tell dependency classifier > --- > > Key: SPARK-30491 > URL: https://issues.apache.org/jira/browse/SPARK-30491 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.4, 3.0.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > > Dependency audit files under `dev/deps` only show jar names. Given that, it > is not trivial to figure out the dependency classifiers. > For example, `avro-mapred-1.8.2-hadoop2.jar` is made up of artifact id > `avro-mapred`, version `1.8.2`, and classifier `hadoop2`. In contrast, > `htrace-core-3.1.0-incubating.jar` is made up of artifact id `htrace-core` > and version `3.1.0-incubating`. > All in all, the classifier cannot be determined from its position in the jar > name; however, as part of the dependency's identifier, it should be stated > explicitly. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30491) Enable dependency audit files to tell dependency classifier
[ https://issues.apache.org/jira/browse/SPARK-30491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30491. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27177 [https://github.com/apache/spark/pull/27177] > Enable dependency audit files to tell dependency classifier > --- > > Key: SPARK-30491 > URL: https://issues.apache.org/jira/browse/SPARK-30491 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.4, 3.0.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.0.0 > > > Dependency audit files under `dev/deps` only show jar names. Given that, it > is not trivial to figure out the dependency classifiers. > For example, `avro-mapred-1.8.2-hadoop2.jar` is made up of artifact id > `avro-mapred`, version `1.8.2`, and classifier `hadoop2`. In contrast, > `htrace-core-3.1.0-incubating.jar` is made up of artifact id `htrace-core` > and version `3.1.0-incubating`. > All in all, the classifier cannot be determined from its position in the jar > name; however, as part of the dependency's identifier, it should be stated > explicitly. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
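The ambiguity the issue describes can be demonstrated with a naive, position-based parser (a hypothetical sketch, not the script Spark actually uses): it parses `avro-mapred-1.8.2-hadoop2.jar` correctly but misreads `incubating` as a classifier, because the classifier genuinely cannot be recovered from the jar name alone.

```python
# Naive rule: the first dash-delimited segment starting with a digit
# begins the version; anything after it is treated as the classifier.
# This works for avro-mapred but misparses htrace-core, which is
# exactly why the audit files should record the classifier explicitly.

def naive_parse(jar_name):
    base = jar_name[:-len(".jar")]
    parts = base.split("-")
    for i, p in enumerate(parts):
        if p[0].isdigit():
            artifact = "-".join(parts[:i])
            version = parts[i]
            classifier = "-".join(parts[i + 1:]) or None
            return artifact, version, classifier
    return base, None, None

print(naive_parse("avro-mapred-1.8.2-hadoop2.jar"))
# correct: ('avro-mapred', '1.8.2', 'hadoop2')
print(naive_parse("htrace-core-3.1.0-incubating.jar"))
# wrong: 'incubating' is part of the version, not a classifier
```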
[jira] [Assigned] (SPARK-30323) Support filters pushdown in CSV datasource
[ https://issues.apache.org/jira/browse/SPARK-30323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-30323: Assignee: Maxim Gekk > Support filters pushdown in CSV datasource > -- > > Key: SPARK-30323 > URL: https://issues.apache.org/jira/browse/SPARK-30323 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > > - Implement the `SupportsPushDownFilters` interface in `CSVScanBuilder` > - Apply filters in UnivocityParser > - Change API UnivocityParser - return Seq[InternalRow] from `convert()` > - Update CSVBenchmark -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30323) Support filters pushdown in CSV datasource
[ https://issues.apache.org/jira/browse/SPARK-30323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30323. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26973 [https://github.com/apache/spark/pull/26973] > Support filters pushdown in CSV datasource > -- > > Key: SPARK-30323 > URL: https://issues.apache.org/jira/browse/SPARK-30323 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > - Implement the `SupportsPushDownFilters` interface in `CSVScanBuilder` > - Apply filters in UnivocityParser > - Change API UnivocityParser - return Seq[InternalRow] from `convert()` > - Update CSVBenchmark -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
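The idea behind the pushdown can be sketched in miniature: apply the pushed predicates to each row as it is parsed, so non-matching rows are dropped inside the reader rather than in a later filter node. This is a minimal illustration, not `UnivocityParser`'s or `CSVScanBuilder`'s real API.

```python
# Minimal sketch of filter pushdown in a CSV reader: pushed predicates
# run against each parsed row, and only matching rows are yielded.
import csv
import io

def parse_with_filters(text, schema, filters):
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip header row
    for raw in reader:
        row = {name: conv(v) for (name, conv), v in zip(schema, raw)}
        if all(f(row) for f in filters):  # pushed-down predicates
            yield row

data = "id,amount\n1,10\n2,99\n3,7\n"
schema = [("id", int), ("amount", int)]
rows = list(parse_with_filters(data, schema, [lambda r: r["amount"] > 9]))
print(rows)  # row 3 never leaves the parser
```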
[jira] [Updated] (SPARK-30522) Spark Streaming dynamic executors override or take default kafka parameters in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-30522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] phanikumar updated SPARK-30522: --- Description: I have written a Spark Streaming consumer to consume the data from Kafka. I found a weird behavior in my logs. The Kafka topic has 3 partitions, and for each partition an executor is launched by the Spark Streaming job. The first executor always takes the parameters I have provided while creating the streaming context, but the executors with IDs 2 and 3 always override the Kafka parameters. {code:java} 20/01/14 12:15:05 WARN StreamingContext: Dynamic Allocation is enabled for this application. Enabling Dynamic allocation for Spark Streaming applications can cause data loss if Write Ahead Log is not enabled for non-replayable sources like Flume. See the programming guide for details on how to enable the Write Ahead Log. 
20/01/14 12:15:05 INFO FileBasedWriteAheadLog_ReceivedBlockTracker: Recovered 2 write ahead log files from hdfs://tlabnamenode/checkpoint/receivedBlockMetadata 20/01/14 12:15:05 INFO DirectKafkaInputDStream: Slide time = 5000 ms 20/01/14 12:15:05 INFO DirectKafkaInputDStream: Storage level = Serialized 1x Replicated 20/01/14 12:15:05 INFO DirectKafkaInputDStream: Checkpoint interval = null 20/01/14 12:15:05 INFO DirectKafkaInputDStream: Remember interval = 5000 ms 20/01/14 12:15:05 INFO DirectKafkaInputDStream: Initialized and validated org.apache.spark.streaming.kafka010.DirectKafkaInputDStream@12665f3f 20/01/14 12:15:05 INFO ForEachDStream: Slide time = 5000 ms 20/01/14 12:15:05 INFO ForEachDStream: Storage level = Serialized 1x Replicated 20/01/14 12:15:05 INFO ForEachDStream: Checkpoint interval = null 20/01/14 12:15:05 INFO ForEachDStream: Remember interval = 5000 ms 20/01/14 12:15:05 INFO ForEachDStream: Initialized and validated org.apache.spark.streaming.dstream.ForEachDStream@a4d83ac 20/01/14 12:15:05 INFO ConsumerConfig: ConsumerConfig values: auto.commit.interval.ms = 5000 auto.offset.reset = latest bootstrap.servers = [1,2,3] check.crcs = true client.id = client-0 connections.max.idle.ms = 54 default.api.timeout.ms = 6 enable.auto.commit = false exclude.internal.topics = true fetch.max.bytes = 52428800 fetch.max.wait.ms = 500 fetch.min.bytes = 1 group.id = telemetry-streaming-service heartbeat.interval.ms = 3000 interceptor.classes = [] internal.leave.group.on.close = true isolation.level = read_uncommitted key.deserializer = class org.apache.kafka.common.serialization.StringDeserializer {code} Here is the log for other executors. {code:java} 20/01/14 12:15:04 INFO Executor: Starting executor ID 2 on host 1 20/01/14 12:15:04 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 40324. 
20/01/14 12:15:04 INFO NettyBlockTransferService: Server created on 1 20/01/14 12:15:04 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy 20/01/14 12:15:04 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(2, matrix-hwork-data-05, 40324, None) 20/01/14 12:15:04 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(2, matrix-hwork-data-05, 40324, None) 20/01/14 12:15:04 INFO BlockManager: external shuffle service port = 7447 20/01/14 12:15:04 INFO BlockManager: Registering executor with local external shuffle service. 20/01/14 12:15:04 INFO TransportClientFactory: Successfully created connection to matrix-hwork-data-05/10.83.34.25:7447 after 1 ms (0 ms spent in bootstraps) 20/01/14 12:15:04 INFO BlockManager: Initialized BlockManager: BlockManagerId(2, matrix-hwork-data-05, 40324, None) 20/01/14 12:15:19 INFO CoarseGrainedExecutorBackend: Got assigned task 1 20/01/14 12:15:19 INFO Executor: Running task 1.0 in stage 0.0 (TID 1) 20/01/14 12:15:19 INFO TorrentBroadcast: Started reading broadcast variable 0 20/01/14 12:15:19 INFO TransportClientFactory: Successfully created connection to matrix-hwork-data-05/10.83.34.25:38759 after 2 ms (0 ms spent in bootstraps) 20/01/14 12:15:20 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 8.1 KB, free 6.2 GB) 20/01/14 12:15:20 INFO TorrentBroadcast: Reading broadcast variable 0 took 163 ms 20/01/14 12:15:20 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 17.9 KB, free
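The "override" seen on executors 2 and 3 is consistent with Spark's kafka010 integration deliberately rewriting some consumer parameters on the executor side (a `fixKafkaParams`-style step). The sketch below is an assumption-laden illustration of that kind of rewrite; the exact keys and values are for demonstration only, not a transcript of Spark's code.

```python
# Hypothetical executor-side rewrite: auto-commit is forced off, offset
# resets are delegated to the driver, and the group.id is prefixed so
# executors do not join the driver's consumer group.

def fix_kafka_params(params):
    fixed = dict(params)
    fixed["enable.auto.commit"] = False
    fixed["auto.offset.reset"] = "none"
    fixed["group.id"] = "spark-executor-" + params["group.id"]
    return fixed

driver_params = {
    "group.id": "telemetry-streaming-service",
    "enable.auto.commit": False,
    "auto.offset.reset": "latest",
}
print(fix_kafka_params(driver_params))
```

If this is what is happening, the differing ConsumerConfig dumps between executor 1 and executors 2/3 are expected behavior rather than lost configuration.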
[jira] [Updated] (SPARK-30325) markPartitionCompleted cause task status inconsistent
[ https://issues.apache.org/jira/browse/SPARK-30325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-30325: Fix Version/s: 2.4.5 > markPartitionCompleted cause task status inconsistent > - > > Key: SPARK-30325 > URL: https://issues.apache.org/jira/browse/SPARK-30325 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 2.4.4 >Reporter: haiyangyu >Assignee: haiyangyu >Priority: Major > Fix For: 2.4.5, 3.0.0 > > Attachments: image-2019-12-21-17-11-38-565.png, > image-2019-12-21-17-15-51-512.png, image-2019-12-21-17-16-40-998.png, > image-2019-12-21-17-17-42-244.png > > > h3. Corner case > The bug occurs in the corner case as follows: > # The stage hits a fetch failure while some tasks have not yet finished; the > scheduler resubmits a new stage as a retry with those unfinished tasks. > # An unfinished task from the original stage finishes while the same task in > the retry stage has not; this marks the task's partition in the retry stage > as successful. !image-2019-12-21-17-11-38-565.png|width=427,height=154! > # The executor running those 'successful' tasks crashes, which causes the > TaskSetManager to run executorLost to reschedule the tasks on that executor. > Because those 'successful' tasks never actually finished, copiesRunning is > decremented twice and ends up at -1 as a result. > !image-2019-12-21-17-15-51-512.png|width=437,height=340!!image-2019-12-21-17-16-40-998.png|width=398,height=139! > # 'dequeueTaskFromList' uses copiesRunning equal to 0 as the criterion when > rescheduling tasks; since it is now -1, the tasks can never be rescheduled > and the app hangs forever. !image-2019-12-21-17-17-42-244.png|width=366,height=282! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
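The accounting bug in the corner case above can be reproduced with a toy model: a task marked successful from another stage attempt is still counted as running, so an executor loss decrements `copiesRunning` twice, and the `== 0` reschedule check can never fire. Class and method names mimic the description but this is an illustrative model, not Spark's `TaskSetManager`.

```python
# Toy reproduction: copiesRunning is decremented once for the running
# copy and once more for the "successful but unfinished" task, ending
# at -1, which the == 0 reschedule check never matches.

class TaskSetManager:
    def __init__(self):
        self.copies_running = 0
        self.marked_successful = False

    def launch(self):
        self.copies_running += 1

    def mark_partition_completed(self):
        # the partition finished in the *original* stage attempt while
        # this retry attempt's copy is still running
        self.marked_successful = True

    def executor_lost(self):
        # buggy double decrement described in the issue
        self.copies_running -= 1
        if self.marked_successful:
            self.copies_running -= 1

    def can_reschedule(self):
        return self.copies_running == 0  # -1 never satisfies this

tsm = TaskSetManager()
tsm.launch()
tsm.mark_partition_completed()
tsm.executor_lost()
print(tsm.copies_running, tsm.can_reschedule())  # -> -1 False
```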
[jira] [Updated] (SPARK-30522) Spark Streaming dynamic executors override or take default kafka parameters in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-30522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] phanikumar updated SPARK-30522: --- Description: I have written a Spark Streaming consumer to consume the data from Kafka, and I found a weird behavior in my logs. The Kafka topic has 3 partitions and for each partition, an executor is launched by the Spark Streaming job. The first executor id always takes the parameters I have provided while creating the streaming context, but the executors with IDs 2 and 3 always override the kafka parameters. {code:java} 20/01/14 12:15:05 WARN StreamingContext: Dynamic Allocation is enabled for this application. Enabling Dynamic allocation for Spark Streaming applications can cause data loss if Write Ahead Log is not enabled for non-replayable sources like Flume. See the programming guide for details on how to enable the Write Ahead Log. 
20/01/14 12:15:05 INFO FileBasedWriteAheadLog_ReceivedBlockTracker: Recovered 2 write ahead log files from hdfs://tlabnamenode/checkpoint/receivedBlockMetadata 20/01/14 12:15:05 INFO DirectKafkaInputDStream: Slide time = 5000 ms 20/01/14 12:15:05 INFO DirectKafkaInputDStream: Storage level = Serialized 1x Replicated 20/01/14 12:15:05 INFO DirectKafkaInputDStream: Checkpoint interval = null 20/01/14 12:15:05 INFO DirectKafkaInputDStream: Remember interval = 5000 ms 20/01/14 12:15:05 INFO DirectKafkaInputDStream: Initialized and validated org.apache.spark.streaming.kafka010.DirectKafkaInputDStream@12665f3f 20/01/14 12:15:05 INFO ForEachDStream: Slide time = 5000 ms 20/01/14 12:15:05 INFO ForEachDStream: Storage level = Serialized 1x Replicated 20/01/14 12:15:05 INFO ForEachDStream: Checkpoint interval = null 20/01/14 12:15:05 INFO ForEachDStream: Remember interval = 5000 ms 20/01/14 12:15:05 INFO ForEachDStream: Initialized and validated org.apache.spark.streaming.dstream.ForEachDStream@a4d83ac 20/01/14 12:15:05 INFO ConsumerConfig: ConsumerConfig values: auto.commit.interval.ms = 5000 auto.offset.reset = latest bootstrap.servers = [1,2,3] check.crcs = true client.id = client-0 connections.max.idle.ms = 54 default.api.timeout.ms = 6 enable.auto.commit = false exclude.internal.topics = true fetch.max.bytes = 52428800 fetch.max.wait.ms = 500 fetch.min.bytes = 1 group.id = telemetry-streaming-service heartbeat.interval.ms = 3000 interceptor.classes = [] internal.leave.group.on.close = true isolation.level = read_uncommitted key.deserializer = class org.apache.kafka.common.serialization.StringDeserializer {code} Here is the log for other executors. {code:java} 20/01/14 12:15:04 INFO Executor: Starting executor ID 2 on host 1 20/01/14 12:15:04 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 40324. 
20/01/14 12:15:04 INFO NettyBlockTransferService: Server created on 1 20/01/14 12:15:04 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy 20/01/14 12:15:04 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(2, matrix-hwork-data-05, 40324, None) 20/01/14 12:15:04 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(2, matrix-hwork-data-05, 40324, None) 20/01/14 12:15:04 INFO BlockManager: external shuffle service port = 7447 20/01/14 12:15:04 INFO BlockManager: Registering executor with local external shuffle service. 20/01/14 12:15:04 INFO TransportClientFactory: Successfully created connection to matrix-hwork-data-05/10.83.34.25:7447 after 1 ms (0 ms spent in bootstraps) 20/01/14 12:15:04 INFO BlockManager: Initialized BlockManager: BlockManagerId(2, matrix-hwork-data-05, 40324, None) 20/01/14 12:15:19 INFO CoarseGrainedExecutorBackend: Got assigned task 1 20/01/14 12:15:19 INFO Executor: Running task 1.0 in stage 0.0 (TID 1) 20/01/14 12:15:19 INFO TorrentBroadcast: Started reading broadcast variable 0 20/01/14 12:15:19 INFO TransportClientFactory: Successfully created connection to matrix-hwork-data-05/10.83.34.25:38759 after 2 ms (0 ms spent in bootstraps) 20/01/14 12:15:20 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 8.1 KB, free 6.2 GB) 20/01/14 12:15:20 INFO TorrentBroadcast: Reading broadcast variable 0 took 163 ms 20/01/14 12:15:20 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 17.9 KB, free
[jira] [Assigned] (SPARK-30502) PeriodicRDDCheckpointer supports storageLevel
[ https://issues.apache.org/jira/browse/SPARK-30502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng reassigned SPARK-30502: Assignee: zhengruifeng > PeriodicRDDCheckpointer supports storageLevel > - > > Key: SPARK-30502 > URL: https://issues.apache.org/jira/browse/SPARK-30502 > Project: Spark > Issue Type: Improvement > Components: ML, Spark Core >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > > Intermediate RDDs in ML are cached with storageLevel=StorageLevel.MEMORY_AND_DISK, while PeriodicRDDCheckpointer stores RDDs with storageLevel=StorageLevel.MEMORY_ONLY; it may be nice to make the storageLevel of the checkpointer configurable.
[jira] [Resolved] (SPARK-30502) PeriodicRDDCheckpointer supports storageLevel
[ https://issues.apache.org/jira/browse/SPARK-30502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-30502. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27189 [https://github.com/apache/spark/pull/27189] > PeriodicRDDCheckpointer supports storageLevel > - > > Key: SPARK-30502 > URL: https://issues.apache.org/jira/browse/SPARK-30502 > Project: Spark > Issue Type: Improvement > Components: ML, Spark Core >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > Fix For: 3.0.0 > > > Intermediate RDDs in ML are cached with storageLevel=StorageLevel.MEMORY_AND_DISK, while PeriodicRDDCheckpointer stores RDDs with storageLevel=StorageLevel.MEMORY_ONLY; it may be nice to make the storageLevel of the checkpointer configurable.
[jira] [Resolved] (SPARK-30518) Precision and scale should be the same for values between -1.0 and 1.0 in Decimal
[ https://issues.apache.org/jira/browse/SPARK-30518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-30518. -- Fix Version/s: 3.0.0 Assignee: wuyi Resolution: Fixed Resolved by https://github.com/apache/spark/pull/27217 > Precision and scale should be the same for values between -1.0 and 1.0 in Decimal > - > > Key: SPARK-30518 > URL: https://issues.apache.org/jira/browse/SPARK-30518 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Fix For: 3.0.0 > > > Currently, for values between -1.0 and 1.0, precision and scale are inconsistent between Decimal and DecimalType. For example, a number like 0.3 has (precision, scale) of (2, 1) in Decimal, but (1, 1) in DecimalType. We should make Decimal consistent with DecimalType.
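The expected (1, 1) result can be checked with java.math.BigDecimal (illustrative only, not Spark's Decimal class): for "0.3" the unscaled value is 3, so precision is 1 and scale is 1, matching what the issue says DecimalType reports.

```scala
// Compute (precision, scale) for a decimal literal via java.math.BigDecimal.
// This mirrors the (1, 1) expectation in the issue; Spark's Decimal (per the
// report) computed precision 2 here before the fix.
def precisionAndScale(literal: String): (Int, Int) = {
  val d = new java.math.BigDecimal(literal)
  (d.precision, d.scale)
}
```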
[jira] [Updated] (SPARK-30524) Disable OptimizeSkewJoin rule if introducing additional shuffle.
[ https://issues.apache.org/jira/browse/SPARK-30524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ke Jia updated SPARK-30524: --- Description: The OptimizeSkewedJoin rule breaks the outputPartitioning of the original SMJ and may introduce an additional shuffle after it is applied. This PR disables the "OptimizeSkewedJoin" rule if it would introduce an additional shuffle. > Disable OptimizeSkewJoin rule if introducing additional shuffle. > > > Key: SPARK-30524 > URL: https://issues.apache.org/jira/browse/SPARK-30524 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ke Jia >Priority: Major > > The OptimizeSkewedJoin rule breaks the outputPartitioning of the original SMJ and may introduce an additional shuffle after it is applied. This PR disables the "OptimizeSkewedJoin" rule if it would introduce an additional shuffle.
[jira] [Created] (SPARK-30524) Disable OptimizeSkewJoin rule if introducing additional shuffle.
Ke Jia created SPARK-30524: -- Summary: Disable OptimizeSkewJoin rule if introducing additional shuffle. Key: SPARK-30524 URL: https://issues.apache.org/jira/browse/SPARK-30524 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Ke Jia
[jira] [Resolved] (SPARK-26736) if filter condition `And` has a non-deterministic sub-function it does not do partition pruning
[ https://issues.apache.org/jira/browse/SPARK-26736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-26736. -- Fix Version/s: 3.0.0 Assignee: Takeshi Yamamuro Resolution: Fixed Resolved by https://github.com/apache/spark/pull/27219 > if filter condition `And` has a non-deterministic sub-function it does not do > partition pruning > --- > > Key: SPARK-26736 > URL: https://issues.apache.org/jira/browse/SPARK-26736 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: roncenzhao >Assignee: Takeshi Yamamuro >Priority: Minor > Fix For: 3.0.0 > > > Example: > A partitioned table definition: > _create table test(id int) partitioned by (dt string);_ > The following sql does not do partition pruning: > _select * from test where dt='20190101' and rand() < 0.5;_ > > I think it should do partition pruning in this case.
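The idea behind the fix can be sketched as splitting a conjunction into its deterministic conjuncts, which are safe to use for partition pruning, and the rest, which stay as a post-scan filter. The type names below are made up for illustration; this is not Spark's Catalyst API:

```scala
// Hypothetical predicate model: only deterministic conjuncts (e.g. dt='20190101')
// may prune partitions; non-deterministic ones (e.g. rand() < 0.5) may not.
sealed trait Pred { def deterministic: Boolean }
final case class PartitionPred(sql: String) extends Pred { val deterministic = true }
final case class NonDetPred(sql: String) extends Pred { val deterministic = false }

// Split a conjunctive filter into (usable-for-pruning, residual) parts.
def splitForPruning(conjuncts: Seq[Pred]): (Seq[Pred], Seq[Pred]) =
  conjuncts.partition(_.deterministic)
```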
[jira] [Comment Edited] (SPARK-27750) Standalone scheduler - ability to prioritize applications over drivers, many drivers act like Denial of Service
[ https://issues.apache.org/jira/browse/SPARK-27750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989249#comment-16989249 ] t oo edited comment on SPARK-27750 at 1/15/20 11:36 PM: WDYT [~Ngone51] [~squito] [~vanzin] [~mgaido] [~jiangxb1987] [~jiangxb] [~zsxwing] [~jlaskowski] [~cloud_fan] was (Author: toopt4): WDYT [~Ngone51] [~squito] [~vanzin] [~mgaido] [~jiangxb1987] [~jiangxb] [~zsxwing] [~jlaskowski] > Standalone scheduler - ability to prioritize applications over drivers, many > drivers act like Denial of Service > --- > > Key: SPARK-27750 > URL: https://issues.apache.org/jira/browse/SPARK-27750 > Project: Spark > Issue Type: New Feature > Components: Scheduler >Affects Versions: 3.0.0 >Reporter: t oo >Priority: Minor > > If I submit 1000 spark submit drivers then they consume all the cores on my > cluster (essentially it acts like a Denial of Service) and no spark > 'application' gets to run since the cores are all consumed by the 'drivers'. > This feature is about having the ability to prioritize applications over > drivers so that at least some 'applications' can start running. I guess it > would be like: If (driver.state = 'submitted' and (exists some app.state = > 'submitted')) then set app.state = 'running' > if all apps have app.state = 'running' then set driver.state = 'submitted' > > Secondary to this, why must a driver consume a minimum of 1 entire core?
[jira] [Created] (SPARK-30523) Collapse back to back aggregations into a single aggregate to reduce the number of shuffles
Jason Altekruse created SPARK-30523: --- Summary: Collapse back to back aggregations into a single aggregate to reduce the number of shuffles Key: SPARK-30523 URL: https://issues.apache.org/jira/browse/SPARK-30523 Project: Spark Issue Type: Improvement Components: Optimizer Affects Versions: 3.0.0 Reporter: Jason Altekruse Queries containing nested aggregate operations can in some cases be computed with a single phase of aggregation. This Jira seeks to introduce a new optimizer rule to identify some of those cases and rewrite plans to avoid needlessly re-shuffling and generating the aggregation hash table data twice. Some examples of collapsible aggregates: {code:java} SELECT sum(sumAgg) as a, year from ( select sum(1) as sumAgg, course, year FROM courseSales GROUP BY course, year ) group by year // can be collapsed to SELECT sum(1) as `a`, year from courseSales group by year {code} {code} SELECT sum(agg), min(a), b from ( select sum(1) as agg, a, b FROM testData2 GROUP BY a, b ) group by b // can be collapsed to SELECT sum(1) as `sum(agg)`, min(a) as `min(a)`, b from testData2 group by b {code}
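The soundness of the first rewrite above can be sanity-checked on plain Scala collections (made-up data, not Spark code): re-aggregating per-(course, year) counts by year equals grouping by year directly.

```scala
// Made-up rows of (course, year); not related to Spark's actual test data.
val courseSales = Seq(("scala", 2019), ("scala", 2019), ("java", 2019), ("java", 2020))

// nested: sum(1) per (course, year), then sum those partial counts per year
def nested(rows: Seq[(String, Int)]): Map[Int, Int] =
  rows.groupBy(identity).toSeq
    .map { case ((_, year), group) => (year, group.size) }
    .groupBy(_._1)
    .map { case (year, partials) => (year, partials.map(_._2).sum) }

// collapsed: sum(1) per year in a single pass
def collapsed(rows: Seq[(String, Int)]): Map[Int, Int] =
  rows.groupBy(_._2).map { case (year, group) => (year, group.size) }
```

Because count is decomposable (a sum of partial counts is the total count), collapsing the two aggregations preserves the result while removing one shuffle.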
[jira] [Closed] (SPARK-23273) Spark Dataset withColumn - schema column order isn't the same as case class parameter order
[ https://issues.apache.org/jira/browse/SPARK-23273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Henrique dos Santos Goulart closed SPARK-23273. --- > Spark Dataset withColumn - schema column order isn't the same as case class > parameter order > > > Key: SPARK-23273 > URL: https://issues.apache.org/jira/browse/SPARK-23273 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Henrique dos Santos Goulart >Priority: Major > > {code:java} > case class OnlyAge(age: Int) > case class NameAge(name: String, age: Int) > val ds1 = spark.emptyDataset[NameAge] > val ds2 = spark > .createDataset(Seq(OnlyAge(1))) > .withColumn("name", lit("henriquedsg89")) > .as[NameAge] > ds1.show() > ds2.show() > ds1.union(ds2) {code} > > It's going to raise this error: > {noformat} > Cannot up cast `age` from string to int as it may truncate > The type path of the target object is: > - field (class: "scala.Int", name: "age") > - root class: "dw.NameAge"{noformat} > It seems that .as[CaseClass] doesn't keep the parameter order declared on the > case class. > If I change the case class parameter order, it works, like: > {code:java} > case class NameAge(age: Int, name: String){code}
[jira] [Updated] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak
[ https://issues.apache.org/jira/browse/SPARK-30246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin updated SPARK-30246: --- Fix Version/s: (was: 2.4.5) 2.4.6 > Spark on Yarn External Shuffle Service Memory Leak > -- > > Key: SPARK-30246 > URL: https://issues.apache.org/jira/browse/SPARK-30246 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 2.4.3 > Environment: hadoop 2.7.3 > spark 2.4.3 > jdk 1.8.0_60 >Reporter: huangweiyi >Assignee: Henrique dos Santos Goulart >Priority: Major > Fix For: 3.0.0, 2.4.6 > > > In our large, busy YARN cluster, which deploys the Spark external shuffle service as > part of the YARN NM aux services, we encountered OOMs in some NMs. > After dumping the heap memory, I found some StreamState objects still on the heap, even though the apps the StreamStates belong to had already > finished. > Here are some related figures: > !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_oom.png|width=100%! > The heap dump below shows that the memory consumption mainly consists of two > parts: > *(1) OneForOneStreamManager (4,429,796,424 (77.11%) bytes)* > *(2) PoolChunk (occupying 1,059,201,712 (18.44%) bytes)* > !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_heap_overview.png|width=100%! > Digging into the OneForOneStreamManager, some StreamStates still > remain: > !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/streamState.png|width=100%! > Incoming references to StreamState::associatedChannel: > !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/associatedChannel_incomming_reference.png|width=100%!
[jira] [Assigned] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak
[ https://issues.apache.org/jira/browse/SPARK-30246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin reassigned SPARK-30246: -- Assignee: Henrique dos Santos Goulart > Spark on Yarn External Shuffle Service Memory Leak > -- > > Key: SPARK-30246 > URL: https://issues.apache.org/jira/browse/SPARK-30246 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 2.4.3 > Environment: hadoop 2.7.3 > spark 2.4.3 > jdk 1.8.0_60 >Reporter: huangweiyi >Assignee: Henrique dos Santos Goulart >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > In our large, busy YARN cluster, which deploys the Spark external shuffle service as > part of the YARN NM aux services, we encountered OOMs in some NMs. > After dumping the heap memory, I found some StreamState objects still on the heap, even though the apps the StreamStates belong to had already > finished. > Here are some related figures: > !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_oom.png|width=100%! > The heap dump below shows that the memory consumption mainly consists of two > parts: > *(1) OneForOneStreamManager (4,429,796,424 (77.11%) bytes)* > *(2) PoolChunk (occupying 1,059,201,712 (18.44%) bytes)* > !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_heap_overview.png|width=100%! > Digging into the OneForOneStreamManager, some StreamStates still > remain: > !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/streamState.png|width=100%! > Incoming references to StreamState::associatedChannel: > !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/associatedChannel_incomming_reference.png|width=100%!
[jira] [Resolved] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak
[ https://issues.apache.org/jira/browse/SPARK-30246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin resolved SPARK-30246. Fix Version/s: 3.0.0 2.4.5 Resolution: Fixed Issue resolved by pull request 27064 [https://github.com/apache/spark/pull/27064] > Spark on Yarn External Shuffle Service Memory Leak > -- > > Key: SPARK-30246 > URL: https://issues.apache.org/jira/browse/SPARK-30246 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 2.4.3 > Environment: hadoop 2.7.3 > spark 2.4.3 > jdk 1.8.0_60 >Reporter: huangweiyi >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > In our large, busy YARN cluster, which deploys the Spark external shuffle service as > part of the YARN NM aux services, we encountered OOMs in some NMs. > After dumping the heap memory, I found some StreamState objects still on the heap, even though the apps the StreamStates belong to had already > finished. > Here are some related figures: > !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_oom.png|width=100%! > The heap dump below shows that the memory consumption mainly consists of two > parts: > *(1) OneForOneStreamManager (4,429,796,424 (77.11%) bytes)* > *(2) PoolChunk (occupying 1,059,201,712 (18.44%) bytes)* > !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_heap_overview.png|width=100%! > Digging into the OneForOneStreamManager, some StreamStates still > remain: > !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/streamState.png|width=100%! > Incoming references to StreamState::associatedChannel: > !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/associatedChannel_incomming_reference.png|width=100%!
[jira] [Comment Edited] (SPARK-28403) Executor Allocation Manager can add an extra executor when there are speculative tasks
[ https://issues.apache.org/jira/browse/SPARK-28403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17016316#comment-17016316 ] Zebing Lin edited comment on SPARK-28403 at 1/15/20 8:53 PM: - In our production, this just caused a fluctuation of requested executors: {code:java} Total executors: Running = 6, Needed = 6, Requested = 6 Lowering target number of executors to 5 (previously 6) because not all requested executors are actually needed Total executors: Running = 6, Needed = 5, Requested = 6 Lowering target number of executors to 5 (previously 6) because not all requested executors are actually needed Total executors: Running = 6, Needed = 5, Requested = 6 {code} I think this logic can be deleted. was (Author: zebingl): In our production, this just caused a fluctuation of requested executors: {code:java} Total executors: Running = 6, Needed = 6, Requested = 6 Lowering target number of executors to 5 (previously 6) because not all requested executors are actually needed Total executors: Running = 6, Needed = 5, Requested = 6 Total executors: Running = 6, Needed = 6, Requested = 6 Lowering target number of executors to 5 (previously 6) because not all requested executors are actually needed Total executors: Running = 6, Needed = 5, Requested = 6 {code} I think this logic can be deleted. > Executor Allocation Manager can add an extra executor when there are speculative tasks > > > Key: SPARK-28403 > URL: https://issues.apache.org/jira/browse/SPARK-28403 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Priority: Major > > It looks like SPARK-19326 added a bug in the executor allocation manager > where it adds an extra executor when it shouldn't, when we have pending > speculative tasks but the target number didn't change. 
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L377] > It doesn't look like this is necessary, since the pending speculative tasks are > already added in. > See the questioning of this on the PR at: > https://github.com/apache/spark/pull/18492/files#diff-b096353602813e47074ace09a3890d56R379
[jira] [Commented] (SPARK-28403) Executor Allocation Manager can add an extra executor when there are speculative tasks
[ https://issues.apache.org/jira/browse/SPARK-28403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17016316#comment-17016316 ] Zebing Lin commented on SPARK-28403: In our production, this just caused a fluctuation of requested executors: {code:java} Total executors: Running = 6, Needed = 6, Requested = 6 Lowering target number of executors to 5 (previously 6) because not all requested executors are actually needed Total executors: Running = 6, Needed = 5, Requested = 6 Total executors: Running = 6, Needed = 6, Requested = 6 Lowering target number of executors to 5 (previously 6) because not all requested executors are actually needed Total executors: Running = 6, Needed = 5, Requested = 6 {code} I think this logic can be deleted. > Executor Allocation Manager can add an extra executor when there are speculative tasks > > > Key: SPARK-28403 > URL: https://issues.apache.org/jira/browse/SPARK-28403 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Priority: Major > > It looks like SPARK-19326 added a bug in the executor allocation manager > where it adds an extra executor when it shouldn't, when we have pending > speculative tasks but the target number didn't change. > [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L377] > It doesn't look like this is necessary, since the pending speculative tasks are > already added in. > See the questioning of this on the PR at: > https://github.com/apache/spark/pull/18492/files#diff-b096353602813e47074ace09a3890d56R379
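The fluctuation in the log above can be modeled as a toy calculation (numbers and names are made up; this is not Spark's ExecutorAllocationManager code): when the bookkeeping alternately reports 5 and 6 needed executors while 6 are running, the manager keeps lowering and re-raising its target.

```scala
// needed executors ~ ceil(total outstanding tasks / tasks per executor)
def maxNeeded(pending: Int, pendingSpeculative: Int, running: Int, tasksPerExecutor: Int): Int =
  math.ceil((pending + pendingSpeculative + running).toDouble / tasksPerExecutor).toInt

// the manager lowers its target when fewer executors are needed,
// and raises it again when the (mis)counted demand comes back
def nextTarget(current: Int, needed: Int): Int =
  if (needed < current) needed // "Lowering target number of executors ..."
  else math.max(current, needed)
```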
[jira] [Resolved] (SPARK-29448) Support the `INTERVAL` type by Parquet datasource
[ https://issues.apache.org/jira/browse/SPARK-29448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk resolved SPARK-29448. Resolution: Won't Fix > Support the `INTERVAL` type by Parquet datasource > - > > Key: SPARK-29448 > URL: https://issues.apache.org/jira/browse/SPARK-29448 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.4.4 >Reporter: Maxim Gekk >Priority: Major > > The Parquet format allows storing intervals as a triple of (months, days, > milliseconds), see > https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#interval > . The `INTERVAL` logical type is used for an interval of time. _It must > annotate a fixed_len_byte_array of length 12. This array stores three > little-endian unsigned integers that represent durations at different > granularities of time. The first stores a number in months, the second stores > a number in days, and the third stores a number in milliseconds. This > representation is independent of any particular timezone or date._ > Need to support writing and reading values of Catalyst's CalendarIntervalType.
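The 12-byte physical layout quoted above can be sketched with java.nio.ByteBuffer: three little-endian 32-bit integers for months, days, and milliseconds. The helper names are ours, not Parquet's or Spark's API:

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Encode an interval into the 12-byte fixed_len_byte_array layout the
// INTERVAL logical type annotates (months, days, millis; little-endian).
def encodeInterval(months: Int, days: Int, millis: Int): Array[Byte] = {
  val buf = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN)
  buf.putInt(months).putInt(days).putInt(millis)
  buf.array()
}

// Decode the same layout back into its three components.
def decodeInterval(bytes: Array[Byte]): (Int, Int, Int) = {
  require(bytes.length == 12, "INTERVAL must annotate a 12-byte array")
  val buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
  (buf.getInt(), buf.getInt(), buf.getInt())
}
```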
[jira] [Updated] (SPARK-30511) Spark marks ended speculative tasks as pending, which leads to holding idle executors
[ https://issues.apache.org/jira/browse/SPARK-30511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zebing Lin updated SPARK-30511: --- Description: *TL;DR* When speculative tasks finish/fail/get killed, they are still considered pending and count towards the calculation of the number of needed executors. h3. Symptom In one of our production jobs (running 4 tasks per executor), we found that it was holding 6 executors at the end with only 2 tasks running (1 speculative). With more logging enabled, we found the job printed: {code:java} pendingTasks is 0 pendingSpeculativeTasks is 17 totalRunningTasks is 2 {code} while the job only had 1 speculative task running and 16 speculative tasks intentionally killed because the corresponding original tasks had finished. An easy repro of the issue (`--conf spark.speculation=true --conf spark.executor.cores=4 --conf spark.dynamicAllocation.maxExecutors=1000` in cluster mode): {code:java} val n = 4000 val someRDD = sc.parallelize(1 to n, n) someRDD.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => { if (index < 300 && index >= 150) { Thread.sleep(index * 1000) // Fake running tasks } else if (index == 300) { Thread.sleep(1000 * 1000) // Fake long running tasks } it.toList.map(x => index + ", " + x).iterator }).collect {code} You will see that when running the last task, we would be holding 38 executors (see attachment), which is exactly (152 + 3) / 4 = 38. h3. The Bug Upon examining the code of _pendingSpeculativeTasks_: {code:java} stageAttemptToNumSpeculativeTasks.map { case (stageAttempt, numTasks) => numTasks - stageAttemptToSpeculativeTaskIndices.get(stageAttempt).map(_.size).getOrElse(0) }.sum {code} where _stageAttemptToNumSpeculativeTasks(stageAttempt)_ is incremented on _onSpeculativeTaskSubmitted_, but never decremented. _stageAttemptToNumSpeculativeTasks -= stageAttempt_ is performed on stage completion. 
*This means Spark is marking ended speculative tasks as pending, which leads Spark to hold more executors than it actually needs!* I will have a PR ready to fix this issue, along with SPARK-28403 too was: *TL;DR* When speculative tasks finished/failed/got killed, they are still considered as pending and count towards the calculation of number of needed executors. h3. Symptom In one of our production job (where it's running 4 tasks per executor), we found that it was holding 6 executors at the end with only 2 tasks running (1 speculative). With more logging enabled, we found the job printed: {code:java} pendingTasks is 0 pendingSpeculativeTasks is 17 totalRunningTasks is 2 {code} while the job only had 1 speculative task running and 16 speculative tasks intentionally killed because of corresponding original tasks had finished. An easy repro of the issue (`--conf spark.speculation=true --conf spark.executor.cores=4 --conf spark.dynamicAllocation.maxExecutors=1000` in cluster mode): {code:java} val n = 4000 val someRDD = sc.parallelize(1 to n, n) someRDD.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => { if (index < 300 && index >= 150) { Thread.sleep(index * 1000) // Fake running tasks } else if (index == 300) { Thread.sleep(1000 * 1000) // Fake long running tasks } it.toList.map(x => index + ", " + x).iterator }).collect {code} You will see when running the last task, we would be hold 38 executors (see attachment), which is exactly (152 + 3) / 4 = 38. h3. The Bug Upon examining the code of _pendingSpeculativeTasks_: {code:java} stageAttemptToNumSpeculativeTasks.map { case (stageAttempt, numTasks) => numTasks - stageAttemptToSpeculativeTaskIndices.get(stageAttempt).map(_.size).getOrElse(0) }.sum {code} where _stageAttemptToNumSpeculativeTasks(stageAttempt)_ is incremented on _onSpeculativeTaskSubmitted_, but never decremented. _stageAttemptToNumSpeculativeTasks -= stageAttempt_ is performed on stage completion. 
*This means Spark is marking ended speculative tasks as pending, which leads to Spark to hold more executors that it actually needs!* I will have a PR ready to fix this issue, along with SPARK-2840 too > Spark marks ended speculative tasks as pending leads to holding idle executors > -- > > Key: SPARK-30511 > URL: https://issues.apache.org/jira/browse/SPARK-30511 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.3.0 >Reporter: Zebing Lin >Priority: Major > Attachments: Screen Shot 2020-01-15 at 11.13.17.png > > > *TL;DR* > When speculative tasks finished/failed/got killed, they are still considered > as pending and count towards the calculation of number of needed executors. > h3. Symptom > In one of our production job (where it's running 4 tasks per executor), we > found that it
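The bookkeeping bug described above can be sketched in a few lines (this mirrors the quoted snippet in spirit only and is not Spark's code): the counter is bumped on submission, and without an "ended" hook it would never be decremented, so pendingSpeculativeTasks would stay positive after all speculative tasks finish or are killed.

```scala
// Illustrative speculative-task bookkeeping with the fix this issue implies.
object SpeculativeBookkeeping {
  var numSpeculativeTasks = 0 // incremented on submission (as in the snippet)
  private val startedIndices = scala.collection.mutable.Set.empty[Int]

  def onSpeculativeTaskSubmitted(): Unit = numSpeculativeTasks += 1
  def onSpeculativeTaskStart(index: Int): Unit = startedIndices += index

  // the missing piece: also account for speculative tasks that finished or
  // were killed, instead of leaving them counted as pending forever
  def onSpeculativeTaskEnd(index: Int): Unit = {
    numSpeculativeTasks -= 1
    startedIndices -= index
  }

  // pending = submitted but not yet started, as in the quoted computation
  def pendingSpeculativeTasks: Int = numSpeculativeTasks - startedIndices.size
}
```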
[jira] [Resolved] (SPARK-30495) How to disable 'spark.security.credentials.${service}.enabled' in Structured streaming while connecting to a kafka cluster
[ https://issues.apache.org/jira/browse/SPARK-30495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin resolved SPARK-30495. Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27191 [https://github.com/apache/spark/pull/27191] > How to disable 'spark.security.credentials.${service}.enabled' in Structured > streaming while connecting to a kafka cluster > -- > > Key: SPARK-30495 > URL: https://issues.apache.org/jira/browse/SPARK-30495 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: act_coder >Assignee: Gabor Somogyi >Priority: Major > Fix For: 3.0.0 > > > Trying to read data from a secured Kafka cluster using spark structured > streaming. Also, using the below library to read the data - > +*"spark-sql-kafka-0-10_2.12":"3.0.0-preview"*+ since it has the feature to > specify our custom group id (instead of spark setting its own custom group > id) > +*Dependency used in code:*+ > org.apache.spark > spark-sql-kafka-0-10_2.12 > 3.0.0-preview > > +*Logs:*+ > Getting the below error - even after specifying the required JAAS > configuration in spark options. > Caused by: java.lang.IllegalArgumentException: requirement failed: > *Delegation token must exist for this connector*. 
at > scala.Predef$.require(Predef.scala:281) at > org.apache.spark.kafka010.KafkaTokenUtil$.isConnectorUsingCurrentToken(KafkaTokenUtil.scala:299) > at > > org.apache.spark.sql.kafka010.KafkaDataConsumer.getOrRetrieveConsumer(KafkaDataConsumer.scala:533) > at > > org.apache.spark.sql.kafka010.KafkaDataConsumer.$anonfun$get$1(KafkaDataConsumer.scala:275) > > +*Spark configuration used to read from Kafka:*+ > val kafkaDF = sparkSession.readStream > .format("kafka") > .option("kafka.bootstrap.servers", bootStrapServer) > .option("subscribe", kafkaTopic ) > > //Setting JAAS Configuration > .option("kafka.sasl.jaas.config", KAFKA_JAAS_SASL) > .option("kafka.sasl.mechanism", "PLAIN") > .option("kafka.security.protocol", "SASL_SSL") > // Setting custom consumer group id > .option("kafka.group.id", "test_cg") > .load() > > The following document specifies that we can disable the feature of obtaining a > delegation token - > > [https://spark.apache.org/docs/3.0.0-preview/structured-streaming-kafka-integration.html] > Tried setting the property *spark.security.credentials.kafka.enabled* to > *false* in the Spark config, but it is still failing with the same error. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
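For reference, the switch discussed in this issue is the per-service credential-provider flag from the Spark security docs; a minimal sketch of how it is passed is below. The application jar name is a placeholder, and note that this report is precisely about the flag not taking effect until the fix merged in PR 27191.

```shell
# Per-service credential-provider switch (spark.security.credentials.${service}.enabled).
# The connector artifact is the one named in this issue; the jar is a placeholder.
spark-submit \
  --conf spark.security.credentials.kafka.enabled=false \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0-preview \
  your-streaming-app.jar
```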
[jira] [Commented] (SPARK-30190) HistoryServerDiskManager will fail on appStoreDir in s3
[ https://issues.apache.org/jira/browse/SPARK-30190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17016263#comment-17016263 ] Steve Loughran commented on SPARK-30190: S3A creates a dir marker and deletes it. But I'd rather you do the mkdir() call and only if that fails look at the dest (getFileStatus) and raise an exception if it isn't a directory. > HistoryServerDiskManager will fail on appStoreDir in s3 > --- > > Key: SPARK-30190 > URL: https://issues.apache.org/jira/browse/SPARK-30190 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: thierry accart >Priority: Major > > Hi > While setting spark.eventLog.dir to s3a://... I realized that it *requires > the destination directory to preexist on S3* > This is explained I think in HistoryServerDiskManager's appStoreDir: it tries > to check whether the directory exists or can be created > {{if (!appStoreDir.isDirectory() && !appStoreDir.mkdir()) \{throw new > IllegalArgumentException(s"Failed to create app directory ($appStoreDir).")}}} > But on S3, a directory does not exist and cannot be created: directories > don't exist by themselves; they are only materialized by the existence of > objects. > Before proposing a patch, I wanted to know what the preferred options are: > should we have a Spark option to skip the appStoreDir test, skip it only > when a particular scheme is set, have a custom implementation of > HistoryServerDiskManager ...? > > _Note for people facing the {{IllegalArgumentException:}} {{Failed to create > app directory}} *you just have to put an empty file in bucket destination > 'path'*._ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
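The probe order suggested in the comment above — attempt the mkdir first, and only if that fails inspect the destination — can be sketched as follows. This is an illustration of the control flow only: `java.io.File` stands in for Hadoop's `FileSystem` API (`mkdirs` / `getFileStatus`), and this is not Spark's actual HistoryServerDiskManager code.

```scala
import java.io.File

// Sketch of the mkdir-first probe order, with java.io.File as a local
// stand-in for the Hadoop FileSystem API.
def ensureAppStoreDir(appStoreDir: File): Unit = {
  // Attempt the mkdir first; against an object store this succeeds even
  // when only a directory marker can be created.
  if (!appStoreDir.mkdirs()) {
    // mkdirs "fails" when the path already exists; only then inspect the
    // destination and reject it if it is not a directory.
    if (!appStoreDir.isDirectory) {
      throw new IllegalArgumentException(
        s"Failed to create app directory ($appStoreDir).")
    }
  }
}
```

The advantage of this ordering over the existing `isDirectory && mkdir` check is that the existence probe only runs on the failure path, which matters on stores where directory status is synthesized from object listings.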
[jira] [Updated] (SPARK-30511) Spark marks ended speculative tasks as pending leads to holding idle executors
[ https://issues.apache.org/jira/browse/SPARK-30511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zebing Lin updated SPARK-30511: --- Attachment: Screen Shot 2020-01-15 at 11.13.17.png > Spark marks ended speculative tasks as pending leads to holding idle executors > -- > > Key: SPARK-30511 > URL: https://issues.apache.org/jira/browse/SPARK-30511 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.3.0 >Reporter: Zebing Lin >Priority: Major > Attachments: Screen Shot 2020-01-15 at 11.13.17.png > > > *TL;DR* > When speculative tasks finished/failed/got killed, they are still considered > pending and count towards the calculation of the number of needed executors. > h3. Symptom > In one of our production jobs (running 4 tasks per executor), we > found that it was holding 6 executors at the end with only 2 tasks running (1 > speculative). With more logging enabled, we found the job printed: > {code:java} > pendingTasks is 0 pendingSpeculativeTasks is 17 totalRunningTasks is 2 > {code} > while the job only had 1 speculative task running and 16 speculative tasks > intentionally killed because the corresponding original tasks had finished. > An easy repro of the issue (`--conf spark.speculation=true --conf > spark.executor.cores=4 --conf spark.dynamicAllocation.maxExecutors=1000` in > cluster mode): > {code:java} > val n = 4000 > val someRDD = sc.parallelize(1 to n, n) > someRDD.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => { > if (index < 300 && index >= 150) { > Thread.sleep(index * 1000) // Fake running tasks > } else if (index == 300) { > Thread.sleep(1000 * 1000) // Fake long running tasks > } > it.toList.map(x => index + ", " + x).iterator > }).collect > {code} > You will see that while the last task is running, we would be holding 39 executors (see > attachment). > h3. 
The Bug > Upon examining the code of _pendingSpeculativeTasks_: > {code:java} > stageAttemptToNumSpeculativeTasks.map { case (stageAttempt, numTasks) => > numTasks - > stageAttemptToSpeculativeTaskIndices.get(stageAttempt).map(_.size).getOrElse(0) > }.sum > {code} > where _stageAttemptToNumSpeculativeTasks(stageAttempt)_ is incremented on > _onSpeculativeTaskSubmitted_, but never decremented. > _stageAttemptToNumSpeculativeTasks -= stageAttempt_ is performed on stage > completion. *This means Spark is marking ended speculative tasks as pending, > which leads Spark to hold more executors than it actually needs!* > I will have a PR ready to fix this issue, along with SPARK-28403 too > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30479) Apply compaction of event log to SQL events
[ https://issues.apache.org/jira/browse/SPARK-30479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin resolved SPARK-30479. Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27164 [https://github.com/apache/spark/pull/27164] > Apply compaction of event log to SQL events > --- > > Key: SPARK-30479 > URL: https://issues.apache.org/jira/browse/SPARK-30479 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.0.0 > > > This issue is to track the effort of compacting old event logs (and cleaning > up after compaction) without breaking the compatibility guarantee. > This issue depends on SPARK-29779 and focuses on dealing with SQL events. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18165) Kinesis support in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-18165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17016213#comment-17016213 ] Sowmya commented on SPARK-18165: Hi [~itsvikramagr], can this library be used in production with Spark 2.4 code rather than from the command line? I tried to use it from Spark code (sbt in IntelliJ) and got the error "Cannot find data source: kinesis". val spark = SparkSession.builder .master("local") .appName("Spark Kinesis Connector") .getOrCreate() println(spark) val kinesis = spark .readStream .format("kinesis") .option("streamName", "stream name") .option("endpointUrl", "kinesis endpoint") .option("startingposition", "TRIM_HORIZON") .load > Kinesis support in Structured Streaming > --- > > Key: SPARK-18165 > URL: https://issues.apache.org/jira/browse/SPARK-18165 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Reporter: Lauren Moos >Priority: Major > > Implement Kinesis based sources and sinks for Structured Streaming -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
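A "Cannot find data source" error generally means the connector jar is not on the application's classpath; when the job is built with sbt rather than launched with `--packages` on the command line, the connector must be declared as a dependency or shipped with the job. A hedged sketch follows — the artifact coordinates are an assumption (Qubole's third-party connector is one commonly used implementation) and should be verified against the connector's own documentation.

```shell
# Assumed coordinates for a third-party Structured Streaming Kinesis
# connector; verify the exact groupId/artifactId/version before use.
spark-submit \
  --packages com.qubole.spark:spark-sql-kinesis_2.11:1.1.3-spark_2.4 \
  your-app.jar
```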
[jira] [Commented] (SPARK-26570) Out of memory when InMemoryFileIndex bulkListLeafFiles
[ https://issues.apache.org/jira/browse/SPARK-26570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17016200#comment-17016200 ] ShivaKumar SS commented on SPARK-26570: --- Unfortunately I have hit the same issue. [https://stackoverflow.com/questions/59757268/spark-read-multiple-column-partitioned-csv-files-out-of-memory] Is there any workaround for this? > Out of memory when InMemoryFileIndex bulkListLeafFiles > -- > > Key: SPARK-26570 > URL: https://issues.apache.org/jira/browse/SPARK-26570 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 >Reporter: deshanxiao >Priority: Major > Attachments: image-2019-10-13-18-41-22-090.png, > image-2019-10-13-18-45-33-770.png, image-2019-10-14-10-00-27-361.png, > image-2019-10-14-10-32-17-949.png, image-2019-10-14-10-47-47-684.png, > image-2019-10-14-10-50-47-567.png, image-2019-10-14-10-51-28-374.png, > screenshot-1.png > > > *bulkListLeafFiles* collects every FileStatus in memory for each query, > which can cause an OOM on the driver. I hit the problem with Spark 2.3.2; > the latest version may have it as well.
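The failure mode described above can be sketched without Spark: materializing every file status on the driver scales linearly with the file count, while a streaming pass keeps memory flat. Illustrative plain Python (the tuple is a stand-in for Hadoop's FileStatus, not the real class):

```python
# Illustrative only (plain Python): why collecting every file status on the
# driver grows with file count, while streaming aggregation stays O(1).

def list_leaf_files(n_partitions, files_per_partition):
    # Lazily yields (path, size) pairs, one per leaf file.
    for p in range(n_partitions):
        for f in range(files_per_partition):
            yield (f"part={p}/file-{f}.csv", 1024)

# What bulkListLeafFiles effectively does: materialize everything at once.
all_statuses = list(list_leaf_files(1000, 100))   # 100k entries held in memory
assert len(all_statuses) == 100_000

# A streaming consumer keeps only the running aggregate in memory.
total_bytes = sum(size for _path, size in list_leaf_files(1000, 100))
assert total_bytes == 100_000 * 1024
```

With millions of partitioned files, the materialized list (plus per-entry overhead) is what exhausts driver heap; the common mitigations are raising driver memory or reducing the number of listed paths per query.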
[jira] [Updated] (SPARK-30522) Spark Streaming dynamic executors override or take default kafka parameters in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-30522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] phanikumar updated SPARK-30522: --- Summary: Spark Streaming dynamic executors override or take default kafka parameters in cluster mode (was: Spark Streaming dynamic executors overried kafka parameters in cluster mode) > Spark Streaming dynamic executors override or take default kafka parameters > in cluster mode > --- > > Key: SPARK-30522 > URL: https://issues.apache.org/jira/browse/SPARK-30522 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.3.2 >Reporter: phanikumar >Priority: Major > > I have written a Spark Streaming consumer to consume data from Kafka and > found a weird behavior in my logs. The Kafka topic has 3 partitions, and for > each partition an executor is launched by the Spark Streaming job. > The first executor always takes the parameters I provided while > creating the streaming context, but the executors with IDs 2 and 3 always > override the Kafka parameters. > > {code:java} > 20/01/14 12:15:05 WARN StreamingContext: Dynamic Allocation is enabled for > this application. Enabling Dynamic allocation for Spark Streaming > applications can cause data loss if Write Ahead Log is not enabled for > non-replayable sources like Flume. See the programming guide for details > on how to enable the Write Ahead Log. 
20/01/14 12:15:05 INFO > FileBasedWriteAheadLog_ReceivedBlockTracker: Recovered 2 write ahead log > files from hdfs://tlabnamenode/checkpoint/receivedBlockMetadata 20/01/14 > 12:15:05 INFO DirectKafkaInputDStream: Slide time = 5000 ms 20/01/14 > 12:15:05 INFO DirectKafkaInputDStream: Storage level = Serialized 1x > Replicated 20/01/14 12:15:05 INFO DirectKafkaInputDStream: Checkpoint > interval = null 20/01/14 12:15:05 INFO DirectKafkaInputDStream: Remember > interval = 5000 ms 20/01/14 12:15:05 INFO DirectKafkaInputDStream: > Initialized and validated > org.apache.spark.streaming.kafka010.DirectKafkaInputDStream@12665f3f > 20/01/14 12:15:05 INFO ForEachDStream: Slide time = 5000 ms 20/01/14 > 12:15:05 INFO ForEachDStream: Storage level = Serialized 1x Replicated > 20/01/14 12:15:05 INFO ForEachDStream: Checkpoint interval = null 20/01/14 > 12:15:05 INFO ForEachDStream: Remember interval = 5000 ms 20/01/14 > 12:15:05 INFO ForEachDStream: Initialized and validated > org.apache.spark.streaming.dstream.ForEachDStream@a4d83ac 20/01/14 > 12:15:05 INFO ConsumerConfig: ConsumerConfig values: > auto.commit.interval.ms = 5000 auto.offset.reset = latest > bootstrap.servers = [1,2,3] check.crcs = true > client.id = client-0 connections.max.idle.ms = 54 > default.api.timeout.ms = 6 enable.auto.commit = false > exclude.internal.topics = true fetch.max.bytes = 52428800 > fetch.max.wait.ms = 500 fetch.min.bytes = 1 > group.id = telemetry-streaming-service heartbeat.interval.ms = > 3000 interceptor.classes = [] > internal.leave.group.on.close = true isolation.level = > read_uncommitted key.deserializer = class > org.apache.kafka.common.serialization.StringDeserializer > > {code} > Here is the log for other executors. > > {code:java} > 20/01/14 12:15:04 INFO Executor: Starting executor ID 2 on host 1 > 20/01/14 12:15:04 INFO Utils: Successfully started service > 'org.apache.spark.network.netty.NettyBlockTransferService' on port 40324. 
> 20/01/14 12:15:04 INFO NettyBlockTransferService: Server created on 1 > 20/01/14 12:15:04 INFO BlockManager: Using > org.apache.spark.storage.RandomBlockReplicationPolicy for block replication > policy 20/01/14 12:15:04 INFO BlockManagerMaster: Registering BlockManager > BlockManagerId(2, matrix-hwork-data-05, 40324, None) 20/01/14 12:15:04 > INFO BlockManagerMaster: Registered BlockManager BlockManagerId(2, > matrix-hwork-data-05, 40324, None) 20/01/14 12:15:04 INFO BlockManager: > external shuffle service port = 7447 20/01/14 12:15:04 INFO BlockManager: > Registering executor with local external shuffle service. 20/01/14 > 12:15:04 INFO TransportClientFactory: Successfully created connection to > matrix-hwork-data-05/10.83.34.25:7447 after 1 ms (0 ms spent in bootstraps) > 20/01/14 12:15:04 INFO BlockManager: Initialized BlockManager: >
[jira] [Created] (SPARK-30522) Spark Streaming dynamic executors overried kafka parameters in cluster mode
phanikumar created SPARK-30522: -- Summary: Spark Streaming dynamic executors overried kafka parameters in cluster mode Key: SPARK-30522 URL: https://issues.apache.org/jira/browse/SPARK-30522 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 2.3.2 Reporter: phanikumar I have written a Spark Streaming consumer to consume data from Kafka and found a weird behavior in my logs. The Kafka topic has 3 partitions, and for each partition an executor is launched by the Spark Streaming job. The first executor always takes the parameters I provided while creating the streaming context, but the executors with IDs 2 and 3 always override the Kafka parameters. {code:java} 20/01/14 12:15:05 WARN StreamingContext: Dynamic Allocation is enabled for this application. Enabling Dynamic allocation for Spark Streaming applications can cause data loss if Write Ahead Log is not enabled for non-replayable sources like Flume. See the programming guide for details on how to enable the Write Ahead Log. 
20/01/14 12:15:05 INFO FileBasedWriteAheadLog_ReceivedBlockTracker: Recovered 2 write ahead log files from hdfs://tlabnamenode/checkpoint/receivedBlockMetadata 20/01/14 12:15:05 INFO DirectKafkaInputDStream: Slide time = 5000 ms 20/01/14 12:15:05 INFO DirectKafkaInputDStream: Storage level = Serialized 1x Replicated 20/01/14 12:15:05 INFO DirectKafkaInputDStream: Checkpoint interval = null 20/01/14 12:15:05 INFO DirectKafkaInputDStream: Remember interval = 5000 ms 20/01/14 12:15:05 INFO DirectKafkaInputDStream: Initialized and validated org.apache.spark.streaming.kafka010.DirectKafkaInputDStream@12665f3f 20/01/14 12:15:05 INFO ForEachDStream: Slide time = 5000 ms 20/01/14 12:15:05 INFO ForEachDStream: Storage level = Serialized 1x Replicated 20/01/14 12:15:05 INFO ForEachDStream: Checkpoint interval = null 20/01/14 12:15:05 INFO ForEachDStream: Remember interval = 5000 ms 20/01/14 12:15:05 INFO ForEachDStream: Initialized and validated org.apache.spark.streaming.dstream.ForEachDStream@a4d83ac 20/01/14 12:15:05 INFO ConsumerConfig: ConsumerConfig values: auto.commit.interval.ms = 5000 auto.offset.reset = latest bootstrap.servers = [1,2,3] check.crcs = true client.id = client-0 connections.max.idle.ms = 54 default.api.timeout.ms = 6 enable.auto.commit = false exclude.internal.topics = true fetch.max.bytes = 52428800 fetch.max.wait.ms = 500 fetch.min.bytes = 1 group.id = telemetry-streaming-service heartbeat.interval.ms = 3000 interceptor.classes = [] internal.leave.group.on.close = true isolation.level = read_uncommitted key.deserializer = class org.apache.kafka.common.serialization.StringDeserializer {code} Here is the log for other executors. {code:java} 20/01/14 12:15:04 INFO Executor: Starting executor ID 2 on host 1 20/01/14 12:15:04 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 40324. 
20/01/14 12:15:04 INFO NettyBlockTransferService: Server created on 1 20/01/14 12:15:04 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy 20/01/14 12:15:04 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(2, matrix-hwork-data-05, 40324, None) 20/01/14 12:15:04 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(2, matrix-hwork-data-05, 40324, None) 20/01/14 12:15:04 INFO BlockManager: external shuffle service port = 7447 20/01/14 12:15:04 INFO BlockManager: Registering executor with local external shuffle service. 20/01/14 12:15:04 INFO TransportClientFactory: Successfully created connection to matrix-hwork-data-05/10.83.34.25:7447 after 1 ms (0 ms spent in bootstraps) 20/01/14 12:15:04 INFO BlockManager: Initialized BlockManager: BlockManagerId(2, matrix-hwork-data-05, 40324, None) 20/01/14 12:15:19 INFO CoarseGrainedExecutorBackend: Got assigned task 1 20/01/14 12:15:19 INFO Executor: Running task 1.0 in stage 0.0 (TID 1) 20/01/14 12:15:19 INFO TorrentBroadcast: Started reading broadcast variable 0 20/01/14 12:15:19 INFO TransportClientFactory: Successfully created connection to matrix-hwork-data-05/10.83.34.25:38759 after 2 ms (0 ms spent in bootstraps) 20/01/14 12:15:20 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 8.1 KB, free 6.2 GB) 20/01/14
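The behavior reported above (executors 2 and 3 falling back to default Kafka consumer settings) is an instance of a general pattern: the effective consumer config is client defaults overlaid with user-supplied params, so any executor that never receives the user params silently runs with the defaults. A toy sketch of that merge in plain Python (not Spark or Kafka internals; the config values are illustrative):

```python
# Toy illustration (not Spark/Kafka internals): consumer settings are client
# defaults overlaid with user-supplied params. An executor that receives the
# user params sees the overrides; one that does not silently uses defaults.

KAFKA_DEFAULTS = {
    "enable.auto.commit": "false",
    "auto.offset.reset": "latest",
    "fetch.min.bytes": "1",
}

def effective_config(user_params):
    cfg = dict(KAFKA_DEFAULTS)     # start from client defaults
    cfg.update(user_params or {})  # user params win when present
    return cfg

user = {"auto.offset.reset": "earliest",
        "group.id": "telemetry-streaming-service"}

exec_1 = effective_config(user)   # params shipped to this executor
exec_2 = effective_config(None)   # params lost -> back to defaults

assert exec_1["auto.offset.reset"] == "earliest"
assert exec_2["auto.offset.reset"] == "latest"   # the symptom in the report
```

Diagnosing a report like this usually means checking whether the user-supplied kafkaParams are actually serialized and shipped to every executor, rather than only applied on the driver.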
[jira] [Resolved] (SPARK-30504) OneVsRest and OneVsRestModel _from_java and _to_java should handle weightCol
[ https://issues.apache.org/jira/browse/SPARK-30504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-30504. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27190 [https://github.com/apache/spark/pull/27190] > OneVsRest and OneVsRestModel _from_java and _to_java should handle weightCol > > > Key: SPARK-30504 > URL: https://issues.apache.org/jira/browse/SPARK-30504 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.0.0 > > > Current behaviour: > {code:python} > from pyspark.ml.classification import LogisticRegression, OneVsRest, > OneVsRestModel > from pyspark.ml.linalg import DenseVector > df = spark.createDataFrame([(0, 1, DenseVector([1.0, 0.0])), (0, 1, > DenseVector([1.0, 0.0]))], ("label", "w", "features")) > ovr = OneVsRest(classifier=LogisticRegression()).setWeightCol("w") > ovrm = ovr.fit(df) > ovr.getWeightCol() > ## 'w' > ovrm.getWeightCol() > ## 'w' > ovr.write().overwrite().save("/tmp/ovr") > ovr_ = OneVsRest.load("/tmp/ovr") > ovr_.getWeightCol() > ## KeyError > ## ... > ## KeyError: Param(parent='OneVsRest_5145d56b6bd1', name='weightCol', > doc='weight column name. ...) > ovrm.write().overwrite().save("/tmp/ovrm") > ovrm_ = OneVsRestModel.load("/tmp/ovrm") > ovrm_.getWeightCol() > ## KeyError > ## ... > ## KeyError: Param(parent='OneVsRestModel_598c6d900fad', name='weightCol', > doc='weight column name ... > {code} > Expected behaviour: > {{OneVsRest}} and {{OneVsRestModel}} loaded from disk should have > {{weightCol}}.
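The KeyError above is characteristic of a save/load path that copies an explicit list of params and leaves weightCol out of that list. A minimal plain-Python sketch of that bug class (Params, TRANSFERRED, and save_load are hypothetical stand-ins, not PySpark's actual _to_java/_from_java code):

```python
# Minimal sketch (plain Python, hypothetical helpers) of the reported bug
# class: the round-trip copies an explicit list of params, "weightCol" is
# missing from that list, so the reloaded object raises KeyError on access.

class Params:
    def __init__(self):
        self._paramMap = {}

    def set(self, name, value):
        self._paramMap[name] = value
        return self

    def get(self, name):
        if name not in self._paramMap:
            raise KeyError(f"Param(name='{name}') is not set")
        return self._paramMap[name]

TRANSFERRED = ("classifier", "labelCol")   # buggy: no "weightCol"

def save_load(model, transferred=TRANSFERRED):
    # Round-trip keeps only the params the transfer list knows about.
    restored = Params()
    for name in transferred:
        if name in model._paramMap:
            restored.set(name, model._paramMap[name])
    return restored

ovr = Params().set("classifier", "LogisticRegression").set("weightCol", "w")
assert ovr.get("weightCol") == "w"        # fine before the round-trip

reloaded = save_load(ovr)
try:
    reloaded.get("weightCol")             # mirrors the KeyError in the report
except KeyError:
    pass
```

The fix in the linked PR corresponds to adding weightCol to the set of params carried across the Python/JVM boundary.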
[jira] [Updated] (SPARK-30504) OneVsRest and OneVsRestModel _from_java and _to_java should handle weightCol
[ https://issues.apache.org/jira/browse/SPARK-30504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-30504: - Priority: Minor (was: Major) > OneVsRest and OneVsRestModel _from_java and _to_java should handle weightCol > > > Key: SPARK-30504 > URL: https://issues.apache.org/jira/browse/SPARK-30504 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Minor > Fix For: 3.0.0 > > > Current behaviour: > {code:python} > from pyspark.ml.classification import LogisticRegression, OneVsRest, > OneVsRestModel > from pyspark.ml.linalg import DenseVector > df = spark.createDataFrame([(0, 1, DenseVector([1.0, 0.0])), (0, 1, > DenseVector([1.0, 0.0]))], ("label", "w", "features")) > ovr = OneVsRest(classifier=LogisticRegression()).setWeightCol("w") > ovrm = ovr.fit(df) > ovr.getWeightCol() > ## 'w' > ovrm.getWeightCol() > ## 'w' > ovr.write().overwrite().save("/tmp/ovr") > ovr_ = OneVsRest.load("/tmp/ovr") > ovr_.getWeightCol() > ## KeyError > ## ... > ## KeyError: Param(parent='OneVsRest_5145d56b6bd1', name='weightCol', > doc='weight column name. ...) > ovrm.write().overwrite().save("/tmp/ovrm") > ovrm_ = OneVsRestModel.load("/tmp/ovrm") > ovrm_.getWeightCol() > ## KeyError > ## ... > ## KeyError: Param(parent='OneVsRestModel_598c6d900fad', name='weightCol', > doc='weight column name ... > {code} > Expected behaviour: > {{OneVsRest}} and {{OneVsRestModel}} loaded from disk should have > {{weightCol}}.
[jira] [Assigned] (SPARK-30504) OneVsRest and OneVsRestModel _from_java and _to_java should handle weightCol
[ https://issues.apache.org/jira/browse/SPARK-30504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-30504: Assignee: Maciej Szymkiewicz > OneVsRest and OneVsRestModel _from_java and _to_java should handle weightCol > > > Key: SPARK-30504 > URL: https://issues.apache.org/jira/browse/SPARK-30504 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > > Current behaviour: > {code:python} > from pyspark.ml.classification import LogisticRegression, OneVsRest, > OneVsRestModel > from pyspark.ml.linalg import DenseVector > df = spark.createDataFrame([(0, 1, DenseVector([1.0, 0.0])), (0, 1, > DenseVector([1.0, 0.0]))], ("label", "w", "features")) > ovr = OneVsRest(classifier=LogisticRegression()).setWeightCol("w") > ovrm = ovr.fit(df) > ovr.getWeightCol() > ## 'w' > ovrm.getWeightCol() > ## 'w' > ovr.write().overwrite().save("/tmp/ovr") > ovr_ = OneVsRest.load("/tmp/ovr") > ovr_.getWeightCol() > ## KeyError > ## ... > ## KeyError: Param(parent='OneVsRest_5145d56b6bd1', name='weightCol', > doc='weight column name. ...) > ovrm.write().overwrite().save("/tmp/ovrm") > ovrm_ = OneVsRestModel.load("/tmp/ovrm") > ovrm_.getWeightCol() > ## KeyError > ## ... > ## KeyError: Param(parent='OneVsRestModel_598c6d900fad', name='weightCol', > doc='weight column name ... > {code} > Expected behaviour: > {{OneVsRest}} and {{OneVsRestModel}} loaded from disk should have > {{weightCol}}.
[jira] [Created] (SPARK-30521) Eliminate deprecation warnings for ExpressionInfo
Maxim Gekk created SPARK-30521: -- Summary: Eliminate deprecation warnings for ExpressionInfo Key: SPARK-30521 URL: https://issues.apache.org/jira/browse/SPARK-30521 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk {code} Warning:(335, 5) constructor ExpressionInfo in class ExpressionInfo is deprecated: see corresponding Javadoc for more information. new ExpressionInfo("noClass", "myDb", "myFunction", "usage", "extended usage"), Warning:(732, 5) constructor ExpressionInfo in class ExpressionInfo is deprecated: see corresponding Javadoc for more information. new ExpressionInfo("noClass", "myDb", "myFunction2", "usage", "extended usage"), Warning:(751, 5) constructor ExpressionInfo in class ExpressionInfo is deprecated: see corresponding Javadoc for more information. new ExpressionInfo("noClass", "myDb", "myFunction2", "usage", "extended usage"), {code}
[jira] [Created] (SPARK-30520) Eliminate deprecation warnings for UserDefinedAggregateFunction
Maxim Gekk created SPARK-30520: -- Summary: Eliminate deprecation warnings for UserDefinedAggregateFunction Key: SPARK-30520 URL: https://issues.apache.org/jira/browse/SPARK-30520 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk {code} /Users/maxim/proj/eliminate-expr-info-warnings/sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala Warning:Warning:line (718)class UserDefinedAggregateFunction in package expressions is deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now be registered as a UDF via the functions.udaf(agg) method. val udaf = clazz.getConstructor().newInstance().asInstanceOf[UserDefinedAggregateFunction] Warning:Warning:line (719)method register in class UDFRegistration is deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now be registered as a UDF via the functions.udaf(agg) method. register(name, udaf) /Users/maxim/proj/eliminate-expr-info-warnings/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/udaf.scala Warning:Warning:line (328)class UserDefinedAggregateFunction in package expressions is deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now be registered as a UDF via the functions.udaf(agg) method. udaf: UserDefinedAggregateFunction, Warning:Warning:line (326)class UserDefinedAggregateFunction in package expressions is deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now be registered as a UDF via the functions.udaf(agg) method. case class ScalaUDAF( /Users/maxim/proj/eliminate-expr-info-warnings/sql/core/src/test/scala/org/apache/spark/sql/DataFrameWindowFunctionsSuite.scala Warning:Warning:line (363)class UserDefinedAggregateFunction in package expressions is deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now be registered as a UDF via the functions.udaf(agg) method. 
val udaf = new UserDefinedAggregateFunction { /Users/maxim/proj/eliminate-expr-info-warnings/sql/core/src/test/java/test/org/apache/spark/sql/MyDoubleSum.java Warning:Warning:line (25)java: org.apache.spark.sql.expressions.UserDefinedAggregateFunction in org.apache.spark.sql.expressions has been deprecated Warning:Warning:line (35)java: org.apache.spark.sql.expressions.UserDefinedAggregateFunction in org.apache.spark.sql.expressions has been deprecated /Users/maxim/proj/eliminate-expr-info-warnings/sql/core/src/test/java/test/org/apache/spark/sql/MyDoubleAvg.java Warning:Warning:line (25)java: org.apache.spark.sql.expressions.UserDefinedAggregateFunction in org.apache.spark.sql.expressions has been deprecated Warning:Warning:line (36)java: org.apache.spark.sql.expressions.UserDefinedAggregateFunction in org.apache.spark.sql.expressions has been deprecated /Users/maxim/proj/eliminate-expr-info-warnings/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/AggregationQuerySuite.scala Warning:Warning:line (36)class UserDefinedAggregateFunction in package expressions is deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now be registered as a UDF via the functions.udaf(agg) method. class ScalaAggregateFunction(schema: StructType) extends UserDefinedAggregateFunction { Warning:Warning:line (73)class UserDefinedAggregateFunction in package expressions is deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now be registered as a UDF via the functions.udaf(agg) method. class ScalaAggregateFunctionWithoutInputSchema extends UserDefinedAggregateFunction { Warning:Warning:line (100)class UserDefinedAggregateFunction in package expressions is deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now be registered as a UDF via the functions.udaf(agg) method. 
class LongProductSum extends UserDefinedAggregateFunction { Warning:Warning:line (189)method register in class UDFRegistration is deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now be registered as a UDF via the functions.udaf(agg) method. spark.udf.register("mydoublesum", new MyDoubleSum) Warning:Warning:line (190)method register in class UDFRegistration is deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now be registered as a UDF via the functions.udaf(agg) method. spark.udf.register("mydoubleavg", new MyDoubleAvg) Warning:Warning:line (191)method register in class UDFRegistration is deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now be registered as a UDF via the functions.udaf(agg) method. spark.udf.register("longProductSum", new LongProductSum) Warning:Warning:line (943)method register in class UDFRegistration is deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now be registered as a UDF via the functions.udaf(agg) method. spark.udf.register("noInputSchema", new ScalaAggregateFunctionWithoutInputSchema)
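The deprecation messages above all point at one migration: replace UserDefinedAggregateFunction with an Aggregator registered via functions.udaf(agg). The Aggregator contract (zero/reduce/merge/finish) can be sketched in plain Python, here mirroring the LongProductSum example named in the warnings (a sketch of the contract's shape only, not Spark API):

```python
# Plain-Python sketch of the Aggregator contract that functions.udaf(agg)
# wraps in Spark 3.0: zero() builds the buffer, reduce() folds one row in,
# merge() combines partial buffers, finish() produces the final value.

class LongProductSumAggregator:
    """Sums a*b over (a, b) rows, like the LongProductSum UDAF above."""

    def zero(self):
        return 0                 # initial aggregation buffer

    def reduce(self, buf, row):
        a, b = row
        return buf + a * b       # fold one input row into the buffer

    def merge(self, b1, b2):
        return b1 + b2           # combine buffers from different partitions

    def finish(self, buf):
        return buf               # final value from the buffer

agg = LongProductSumAggregator()
partition1 = [(1, 2), (3, 4)]
partition2 = [(5, 6)]

def run(agg, partition):
    buf = agg.zero()
    for row in partition:
        buf = agg.reduce(buf, row)
    return buf

result = agg.finish(agg.merge(run(agg, partition1), run(agg, partition2)))
assert result == 1*2 + 3*4 + 5*6   # 44
```

The merge step is what lets Spark compute the aggregate per partition and then combine partial results, which is why the contract separates reduce from merge.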
[jira] [Updated] (SPARK-30165) Eliminate compilation warnings
[ https://issues.apache.org/jira/browse/SPARK-30165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-30165: --- Attachment: spark_warnings.txt > Eliminate compilation warnings > -- > > Key: SPARK-30165 > URL: https://issues.apache.org/jira/browse/SPARK-30165 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Minor > Attachments: spark_warnings.txt > > > This is an umbrella ticket for sub-tasks for eliminating compilation > warnings. I dumped all warnings to the spark_warnings.txt file attached to > the ticket.
[jira] [Updated] (SPARK-30165) Eliminate compilation warnings
[ https://issues.apache.org/jira/browse/SPARK-30165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-30165: --- Attachment: (was: spark_warnings.txt) > Eliminate compilation warnings > -- > > Key: SPARK-30165 > URL: https://issues.apache.org/jira/browse/SPARK-30165 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Minor > Attachments: spark_warnings.txt > > > This is an umbrella ticket for sub-tasks for eliminating compilation > warnings. I dumped all warnings to the spark_warnings.txt file attached to > the ticket.
[jira] [Resolved] (SPARK-29708) Different answers in aggregates of duplicate grouping sets
[ https://issues.apache.org/jira/browse/SPARK-29708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-29708. -- Fix Version/s: 3.0.0 Assignee: Takeshi Yamamuro Resolution: Fixed Resolved by https://github.com/apache/spark/pull/26961 > Different answers in aggregates of duplicate grouping sets > -- > > Key: SPARK-29708 > URL: https://issues.apache.org/jira/browse/SPARK-29708 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Labels: correctness > Fix For: 3.0.0 > > > A query below with multiple grouping sets seems to have different answers > between PgSQL and Spark; > {code:java} > postgres=# create table gstest4(id integer, v integer, unhashable_col bit(4), > unsortable_col xid); > postgres=# insert into gstest4 > postgres-# values (1,1,b'','1'), (2,2,b'0001','1'), > postgres-#(3,4,b'0010','2'), (4,8,b'0011','2'), > postgres-#(5,16,b'','2'), (6,32,b'0001','2'), > postgres-#(7,64,b'0010','1'), (8,128,b'0011','1'); > INSERT 0 8 > postgres=# select unsortable_col, count(*) > postgres-# from gstest4 group by grouping sets > ((unsortable_col),(unsortable_col)) > postgres-# order by text(unsortable_col); > unsortable_col | count > +--- > 1 | 8 > 1 | 8 > 2 | 8 > 2 | 8 > (4 rows) > {code} > {code:java} > scala> sql("""create table gstest4(id integer, v integer, unhashable_col /* > bit(4) */ byte, unsortable_col /* xid */ integer) using parquet""") > scala> sql(""" > | insert into gstest4 > | values (1,1,tinyint('0'),1), (2,2,tinyint('1'),1), > |(3,4,tinyint('2'),2), (4,8,tinyint('3'),2), > |(5,16,tinyint('0'),2), (6,32,tinyint('1'),2), > |(7,64,tinyint('2'),1), (8,128,tinyint('3'),1) > | """) > res21: org.apache.spark.sql.DataFrame = [] > scala> > scala> sql(""" > | select unsortable_col, count(*) > | from gstest4 group by grouping sets > ((unsortable_col),(unsortable_col)) > | order by string(unsortable_col) > | 
""").show > +--------------+--------+ > |unsortable_col|count(1)| > +--------------+--------+ > | 1| 8| > | 2| 8| > +--------------+--------+ > {code}
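The correctness point above can be modeled without SQL: duplicate grouping sets should each emit their own group rows (PostgreSQL returns 4 rows), rather than being collapsed into a single set (Spark returned 2 rows before the fix). A toy model in plain Python, using its own small dataset rather than the ticket's table:

```python
# Plain-Python model of the semantics at issue: with grouping sets
# ((c),(c)), each listed set should produce its own group rows, so every
# group appears twice; deduplicating the sets halves the row count.

from collections import Counter

rows = [1, 1, 1, 1, 2, 2, 2, 2]   # toy unsortable_col values, 8 rows

def group_by_sets(rows, grouping_sets):
    out = []
    for _grouping_set in grouping_sets:   # one grouping pass per listed set
        for key, cnt in sorted(Counter(rows).items()):
            out.append((key, cnt))
    return out

# Duplicate sets, as PostgreSQL evaluates them: every group emitted twice.
pg_like = group_by_sets(rows, [("unsortable_col",), ("unsortable_col",)])

# Pre-fix Spark behavior: duplicate sets collapsed into one.
spark_pre_fix = group_by_sets(rows, [("unsortable_col",)])

assert len(pg_like) == 2 * len(spark_pre_fix)
```

The fix aligns Spark with the reference behavior by keeping duplicate grouping sets as distinct aggregation passes instead of deduplicating them.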
[jira] [Resolved] (SPARK-30515) Refactor SimplifyBinaryComparison to reduce the time complexity
[ https://issues.apache.org/jira/browse/SPARK-30515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-30515. -- Fix Version/s: 3.0.0 Resolution: Fixed Resolved by https://github.com/apache/spark/pull/27212 > Refactor SimplifyBinaryComparison to reduce the time complexity > --- > > Key: SPARK-30515 > URL: https://issues.apache.org/jira/browse/SPARK-30515 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Minor > Fix For: 3.0.0 > > > The improvement of the rule SimplifyBinaryComparison in PR > https://github.com/apache/spark/pull/27008 could introduce a performance > regression in the optimizer. > We need to improve the implementation and reduce the time complexity.
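One general way such optimizer rules drop from quadratic to roughly linear cost is caching each subtree's canonical form, so repeated semantic-equality checks never re-canonicalize the same node. The sketch below illustrates that technique only (expression trees as tuples, caching via lru_cache); it is not the actual Spark patch:

```python
# Illustrative technique, not the Spark patch: memoize canonicalization so
# many semantic-equality checks over the same tree stay linear in tree size.

from functools import lru_cache

# Expression trees as nested tuples: ("+", left, right) or ("col", name).
def build_chain(n):
    expr = ("col", "a")
    for i in range(n):
        expr = ("+", expr, ("col", f"c{i}"))
    return expr

calls = {"count": 0}   # counts actual (uncached) canonicalizations

@lru_cache(maxsize=None)
def canonicalize(expr):
    calls["count"] += 1
    if expr[0] == "col":
        return expr
    op, l, r = expr
    # Normalize the commutative '+' by ordering its operands.
    cl, cr = sorted((canonicalize(l), canonicalize(r)))
    return (op, cl, cr)

def semantically_equal(e1, e2):
    return canonicalize(e1) == canonicalize(e2)

e = build_chain(50)   # 50 '+' nodes over 51 column leaves

# 100 equality checks: without caching this re-walks the tree every time;
# with caching, each distinct subtree is canonicalized exactly once.
for _ in range(100):
    assert semantically_equal(e, e)
assert calls["count"] <= 2 * 50 + 2
```

The same idea appears in Catalyst as lazily cached canonicalized expression trees, which is the kind of refactoring this ticket's fix performs on the rule's repeated comparisons.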
[jira] [Updated] (SPARK-30177) Eliminate warnings: part 7
[ https://issues.apache.org/jira/browse/SPARK-30177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-30177: --- Summary: Eliminate warnings: part 7 (was: Eliminate warnings: part7) > Eliminate warnings: part 7 > -- > > Key: SPARK-30177 > URL: https://issues.apache.org/jira/browse/SPARK-30177 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > /mllib/src/test/scala/org/apache/spark/ml/clustering/BisectingKMeansSuite.scala > Warning:Warning:line (108)method computeCost in class > BisectingKMeansModel is deprecated (since 3.0.0): This method is deprecated > and will be removed in future versions. Use ClusteringEvaluator instead. You > can also get the cost on the training dataset in the summary. > assert(model.computeCost(dataset) < 0.1) > Warning:Warning:line (135)method computeCost in class > BisectingKMeansModel is deprecated (since 3.0.0): This method is deprecated > and will be removed in future versions. Use ClusteringEvaluator instead. You > can also get the cost on the training dataset in the summary. > assert(model.computeCost(dataset) == summary.trainingCost) > Warning:Warning:line (195)method computeCost in class > BisectingKMeansModel is deprecated (since 3.0.0): This method is deprecated > and will be removed in future versions. Use ClusteringEvaluator instead. You > can also get the cost on the training dataset in the summary. > model.computeCost(dataset) > > /sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala > Warning:Warning:line (105)Java enum ALLOW_UNQUOTED_CONTROL_CHARS in Java > enum Feature is deprecated: see corresponding Javadoc for more information. 
> jsonFactory.enable(JsonParser.Feature.ALLOW_UNQUOTED_CONTROL_CHARS) > /sql/core/src/test/java/test/org/apache/spark/sql/Java8DatasetAggregatorSuite.java > Warning:Warning:line (28)java: > org.apache.spark.sql.expressions.javalang.typed in > org.apache.spark.sql.expressions.javalang has been deprecated > Warning:Warning:line (37)java: > org.apache.spark.sql.expressions.javalang.typed in > org.apache.spark.sql.expressions.javalang has been deprecated > Warning:Warning:line (46)java: > org.apache.spark.sql.expressions.javalang.typed in > org.apache.spark.sql.expressions.javalang has been deprecated > Warning:Warning:line (55)java: > org.apache.spark.sql.expressions.javalang.typed in > org.apache.spark.sql.expressions.javalang has been deprecated > Warning:Warning:line (64)java: > org.apache.spark.sql.expressions.javalang.typed in > org.apache.spark.sql.expressions.javalang has been deprecated > /sql/core/src/test/java/test/org/apache/spark/sql/JavaTestUtils.java > Information:Information:java: > /Users/maxim/proj/eliminate-warning/sql/core/src/test/java/test/org/apache/spark/sql/JavaTestUtils.java > uses unchecked or unsafe operations. > Information:Information:java: Recompile with -Xlint:unchecked for details. > /sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java > Warning:Warning:line (478)java: > json(org.apache.spark.api.java.JavaRDD) in > org.apache.spark.sql.DataFrameReader has been deprecated
[jira] [Updated] (SPARK-30172) Eliminate warnings: part 3
[ https://issues.apache.org/jira/browse/SPARK-30172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-30172: --- Summary: Eliminate warnings: part 3 (was: Eliminate warnings: part3) > Eliminate warnings: part 3 > -- > > Key: SPARK-30172 > URL: https://issues.apache.org/jira/browse/SPARK-30172 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > /sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala > Warning:Warning:line (422)method initialize in class AbstractSerDe is > deprecated: see corresponding Javadoc for more information. > serde.initialize(null, properties) > /sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUDFs.scala > Warning:Warning:line (216)method initialize in class GenericUDTF is > deprecated: see corresponding Javadoc for more information. > protected lazy val outputInspector = > function.initialize(inputInspectors.toArray) > Warning:Warning:line (342)class UDAF in package exec is deprecated: see > corresponding Javadoc for more information. > new GenericUDAFBridge(funcWrapper.createFunction[UDAF]()) > Warning:Warning:line (503)trait AggregationBuffer in class > GenericUDAFEvaluator is deprecated: see corresponding Javadoc for more > information. > def serialize(buffer: AggregationBuffer): Array[Byte] = { > Warning:Warning:line (523)trait AggregationBuffer in class > GenericUDAFEvaluator is deprecated: see corresponding Javadoc for more > information. > def deserialize(bytes: Array[Byte]): AggregationBuffer = { > Warning:Warning:line (538)trait AggregationBuffer in class > GenericUDAFEvaluator is deprecated: see corresponding Javadoc for more > information. > case class HiveUDAFBuffer(buf: AggregationBuffer, canDoMerge: Boolean) > Warning:Warning:line (538)trait AggregationBuffer in class > GenericUDAFEvaluator is deprecated: see corresponding Javadoc for more > information. 
> case class HiveUDAFBuffer(buf: AggregationBuffer, canDoMerge: Boolean) > /sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/orc/SparkOrcNewRecordReader.java > Warning:Warning:line (44)java: getTypes() in org.apache.orc.Reader has > been deprecated > Warning:Warning:line (47)java: getTypes() in org.apache.orc.Reader has > been deprecated > /sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala > Warning:Warning:line (2,368)method readFooter in class ParquetFileReader > is deprecated: see corresponding Javadoc for more information. > val footer = ParquetFileReader.readFooter( > /sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUDAFSuite.scala > Warning:Warning:line (202)trait AggregationBuffer in class > GenericUDAFEvaluator is deprecated: see corresponding Javadoc for more > information. > override def getNewAggregationBuffer: AggregationBuffer = new > MockUDAFBuffer(0L, 0L) > Warning:Warning:line (204)trait AggregationBuffer in class > GenericUDAFEvaluator is deprecated: see corresponding Javadoc for more > information. > override def reset(agg: AggregationBuffer): Unit = { > Warning:Warning:line (212)trait AggregationBuffer in class > GenericUDAFEvaluator is deprecated: see corresponding Javadoc for more > information. > override def iterate(agg: AggregationBuffer, parameters: Array[AnyRef]): > Unit = { > Warning:Warning:line (221)trait AggregationBuffer in class > GenericUDAFEvaluator is deprecated: see corresponding Javadoc for more > information. > override def merge(agg: AggregationBuffer, partial: Object): Unit = { > Warning:Warning:line (231)trait AggregationBuffer in class > GenericUDAFEvaluator is deprecated: see corresponding Javadoc for more > information. > override def terminatePartial(agg: AggregationBuffer): AnyRef = { > Warning:Warning:line (236)trait AggregationBuffer in class > GenericUDAFEvaluator is deprecated: see corresponding Javadoc for more > information. 
> override def terminate(agg: AggregationBuffer): AnyRef = > terminatePartial(agg) > Warning:Warning:line (257)trait AggregationBuffer in class > GenericUDAFEvaluator is deprecated: see corresponding Javadoc for more > information. > override def getNewAggregationBuffer: AggregationBuffer = { > Warning:Warning:line (266)trait AggregationBuffer in class > GenericUDAFEvaluator is deprecated: see corresponding Javadoc for more > information. > override def reset(agg: AggregationBuffer): Unit = { > Warning:Warning:line (277)trait AggregationBuffer in class > GenericUDAFEvaluator is deprecated: see corresponding Javadoc for more > information. > override def iterate(agg: AggregationBuffer, parameters: Array[AnyRef]): > Unit = {
[jira] [Updated] (SPARK-30174) Eliminate warnings: part 4
[ https://issues.apache.org/jira/browse/SPARK-30174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-30174: --- Summary: Eliminate warnings: part 4 (was: Eliminate warnings :part 4) > Eliminate warnings: part 4 > -- > > Key: SPARK-30174 > URL: https://issues.apache.org/jira/browse/SPARK-30174 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: jobit mathew >Priority: Minor > > sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala > {code:java} > Warning:Warning:line (127)value ENABLE_JOB_SUMMARY in class > ParquetOutputFormat is deprecated: see corresponding Javadoc for more > information. > && conf.get(ParquetOutputFormat.ENABLE_JOB_SUMMARY) == null) { > Warning:Warning:line (261)class ParquetInputSplit in package hadoop is > deprecated: see corresponding Javadoc for more information. > new org.apache.parquet.hadoop.ParquetInputSplit( > Warning:Warning:line (272)method readFooter in class ParquetFileReader is > deprecated: see corresponding Javadoc for more information. > ParquetFileReader.readFooter(sharedConf, filePath, > SKIP_ROW_GROUPS).getFileMetaData > Warning:Warning:line (442)method readFooter in class ParquetFileReader is > deprecated: see corresponding Javadoc for more information. > ParquetFileReader.readFooter( > {code} > sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetWriteBuilder.scala > {code:java} > Warning:Warning:line (91)value ENABLE_JOB_SUMMARY in class > ParquetOutputFormat is deprecated: see corresponding Javadoc for more > information. > && conf.get(ParquetOutputFormat.ENABLE_JOB_SUMMARY) == null) { > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
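The `ParquetFileReader.readFooter` deprecations above can typically be addressed with parquet-mr's `InputFile`-based reader API. A hedged sketch (the path and `Configuration` are placeholders, and the exact replacement chosen by the Spark PR may differ):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// Reads the footer via the non-deprecated InputFile-based API and returns
// the same FileMetaData the old ParquetFileReader.readFooter call produced.
def readFooterMetadata(conf: Configuration, file: Path) = {
  val reader = ParquetFileReader.open(HadoopInputFile.fromPath(file, conf))
  try reader.getFooter.getFileMetaData
  finally reader.close()
}
```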
[jira] [Updated] (SPARK-30171) Eliminate warnings: part 2
[ https://issues.apache.org/jira/browse/SPARK-30171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-30171: --- Summary: Eliminate warnings: part 2 (was: Eliminate warnings: part2) > Eliminate warnings: part 2 > -- > > Key: SPARK-30171 > URL: https://issues.apache.org/jira/browse/SPARK-30171 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > AvroFunctionsSuite.scala > Warning:Warning:line (41)method to_avro in package avro is deprecated (since > 3.0.0): Please use 'org.apache.spark.sql.avro.functions.to_avro' instead. > val avroDF = df.select(to_avro('id).as("a"), to_avro('str).as("b")) > Warning:Warning:line (41)method to_avro in package avro is deprecated > (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.to_avro' > instead. > val avroDF = df.select(to_avro('id).as("a"), to_avro('str).as("b")) > Warning:Warning:line (54)method from_avro in package avro is deprecated > (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.from_avro' > instead. > checkAnswer(avroDF.select(from_avro('a, avroTypeLong), from_avro('b, > avroTypeStr)), df) > Warning:Warning:line (54)method from_avro in package avro is deprecated > (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.from_avro' > instead. > checkAnswer(avroDF.select(from_avro('a, avroTypeLong), from_avro('b, > avroTypeStr)), df) > Warning:Warning:line (59)method to_avro in package avro is deprecated > (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.to_avro' > instead. > val avroStructDF = df.select(to_avro('struct).as("avro")) > Warning:Warning:line (70)method from_avro in package avro is deprecated > (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.from_avro' > instead. 
> checkAnswer(avroStructDF.select(from_avro('avro, avroTypeStruct)), df) > Warning:Warning:line (76)method to_avro in package avro is deprecated > (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.to_avro' > instead. > val avroStructDF = df.select(to_avro('struct).as("avro")) > Warning:Warning:line (118)method to_avro in package avro is deprecated > (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.to_avro' > instead. > val readBackOne = dfOne.select(to_avro($"array").as("avro")) > Warning:Warning:line (119)method from_avro in package avro is deprecated > (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.from_avro' > instead. > .select(from_avro($"avro", avroTypeArrStruct).as("array")) > AvroPartitionReaderFactory.scala > Warning:Warning:line (64)value ignoreExtension in class AvroOptions is > deprecated (since 3.0): Use the general data source option pathGlobFilter for > filtering file names > if (parsedOptions.ignoreExtension || > partitionedFile.filePath.endsWith(".avro")) { > AvroFileFormat.scala > Warning:Warning:line (98)value ignoreExtension in class AvroOptions is > deprecated (since 3.0): Use the general data source option pathGlobFilter for > filtering file names > if (parsedOptions.ignoreExtension || file.filePath.endsWith(".avro")) { > AvroUtils.scala > Warning:Warning:line (55)value ignoreExtension in class AvroOptions is > deprecated (since 3.0): Use the general data source option pathGlobFilter for > filtering file names > inferAvroSchemaFromFiles(files, conf, parsedOptions.ignoreExtension, -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
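The deprecation messages above name their own replacements; a hedged sketch of both migrations (`df`, `avroTypeLong`, `avroTypeStr`, and the path are placeholders taken from the suite's context):

```scala
// Non-deprecated function imports, as the warning suggests:
import org.apache.spark.sql.avro.functions.{from_avro, to_avro}
import org.apache.spark.sql.functions.col

val avroDF  = df.select(to_avro(col("id")).as("a"), to_avro(col("str")).as("b"))
val decoded = avroDF.select(from_avro(col("a"), avroTypeLong), from_avro(col("b"), avroTypeStr))

// Replacement for the deprecated AvroOptions.ignoreExtension: filter file
// names with the general pathGlobFilter data source option instead.
val avroOnly = spark.read.format("avro")
  .option("pathGlobFilter", "*.avro")
  .load("/path/to/data")
```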
[jira] [Resolved] (SPARK-30174) Eliminate warnings :part 4
[ https://issues.apache.org/jira/browse/SPARK-30174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk resolved SPARK-30174. Resolution: Duplicate > Eliminate warnings :part 4 > -- > > Key: SPARK-30174 > URL: https://issues.apache.org/jira/browse/SPARK-30174 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: jobit mathew >Priority: Minor > > sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala > {code:java} > Warning:Warning:line (127)value ENABLE_JOB_SUMMARY in class > ParquetOutputFormat is deprecated: see corresponding Javadoc for more > information. > && conf.get(ParquetOutputFormat.ENABLE_JOB_SUMMARY) == null) { > Warning:Warning:line (261)class ParquetInputSplit in package hadoop is > deprecated: see corresponding Javadoc for more information. > new org.apache.parquet.hadoop.ParquetInputSplit( > Warning:Warning:line (272)method readFooter in class ParquetFileReader is > deprecated: see corresponding Javadoc for more information. > ParquetFileReader.readFooter(sharedConf, filePath, > SKIP_ROW_GROUPS).getFileMetaData > Warning:Warning:line (442)method readFooter in class ParquetFileReader is > deprecated: see corresponding Javadoc for more information. > ParquetFileReader.readFooter( > {code} > sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetWriteBuilder.scala > {code:java} > Warning:Warning:line (91)value ENABLE_JOB_SUMMARY in class > ParquetOutputFormat is deprecated: see corresponding Javadoc for more > information. > && conf.get(ParquetOutputFormat.ENABLE_JOB_SUMMARY) == null) { > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30171) Eliminate warnings: part2
[ https://issues.apache.org/jira/browse/SPARK-30171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk resolved SPARK-30171. Resolution: Won't Fix As per discussion in [https://github.com/apache/spark/pull/27174#discussion_r365629991] , the warning exists till Spark 3.0 > Eliminate warnings: part2 > - > > Key: SPARK-30171 > URL: https://issues.apache.org/jira/browse/SPARK-30171 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > AvroFunctionsSuite.scala > Warning:Warning:line (41)method to_avro in package avro is deprecated (since > 3.0.0): Please use 'org.apache.spark.sql.avro.functions.to_avro' instead. > val avroDF = df.select(to_avro('id).as("a"), to_avro('str).as("b")) > Warning:Warning:line (41)method to_avro in package avro is deprecated > (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.to_avro' > instead. > val avroDF = df.select(to_avro('id).as("a"), to_avro('str).as("b")) > Warning:Warning:line (54)method from_avro in package avro is deprecated > (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.from_avro' > instead. > checkAnswer(avroDF.select(from_avro('a, avroTypeLong), from_avro('b, > avroTypeStr)), df) > Warning:Warning:line (54)method from_avro in package avro is deprecated > (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.from_avro' > instead. > checkAnswer(avroDF.select(from_avro('a, avroTypeLong), from_avro('b, > avroTypeStr)), df) > Warning:Warning:line (59)method to_avro in package avro is deprecated > (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.to_avro' > instead. > val avroStructDF = df.select(to_avro('struct).as("avro")) > Warning:Warning:line (70)method from_avro in package avro is deprecated > (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.from_avro' > instead. 
> checkAnswer(avroStructDF.select(from_avro('avro, avroTypeStruct)), df) > Warning:Warning:line (76)method to_avro in package avro is deprecated > (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.to_avro' > instead. > val avroStructDF = df.select(to_avro('struct).as("avro")) > Warning:Warning:line (118)method to_avro in package avro is deprecated > (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.to_avro' > instead. > val readBackOne = dfOne.select(to_avro($"array").as("avro")) > Warning:Warning:line (119)method from_avro in package avro is deprecated > (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.from_avro' > instead. > .select(from_avro($"avro", avroTypeArrStruct).as("array")) > AvroPartitionReaderFactory.scala > Warning:Warning:line (64)value ignoreExtension in class AvroOptions is > deprecated (since 3.0): Use the general data source option pathGlobFilter for > filtering file names > if (parsedOptions.ignoreExtension || > partitionedFile.filePath.endsWith(".avro")) { > AvroFileFormat.scala > Warning:Warning:line (98)value ignoreExtension in class AvroOptions is > deprecated (since 3.0): Use the general data source option pathGlobFilter for > filtering file names > if (parsedOptions.ignoreExtension || file.filePath.endsWith(".avro")) { > AvroUtils.scala > Warning:Warning:line (55)value ignoreExtension in class AvroOptions is > deprecated (since 3.0): Use the general data source option pathGlobFilter for > filtering file names > inferAvroSchemaFromFiles(files, conf, parsedOptions.ignoreExtension, -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30490) Eliminate warnings in Avro datasource
[ https://issues.apache.org/jira/browse/SPARK-30490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk resolved SPARK-30490. Resolution: Won't Fix See [https://github.com/apache/spark/pull/27174#discussion_r365629991] > Eliminate warnings in Avro datasource > - > > Key: SPARK-30490 > URL: https://issues.apache.org/jira/browse/SPARK-30490 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > Here is the list of deprecation warning in Avro: > {code} > /Users/maxim/proj/eliminate-warning/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala > Warning:Warning:line (98)value ignoreExtension in class AvroOptions is > deprecated (since 3.0): Use the general data source option pathGlobFilter for > filtering file names > if (parsedOptions.ignoreExtension || file.filePath.endsWith(".avro")) { > /Users/maxim/proj/eliminate-warning/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroUtils.scala > Warning:Warning:line (55)value ignoreExtension in class AvroOptions is > deprecated (since 3.0): Use the general data source option pathGlobFilter for > filtering file names > inferAvroSchemaFromFiles(files, conf, parsedOptions.ignoreExtension, > /Users/maxim/proj/eliminate-warning/external/avro/src/main/scala/org/apache/spark/sql/v2/avro/AvroPartitionReaderFactory.scala > Warning:Warning:line (64)value ignoreExtension in class AvroOptions is > deprecated (since 3.0): Use the general data source option pathGlobFilter for > filtering file names > if (parsedOptions.ignoreExtension || > partitionedFile.filePath.endsWith(".avro")) { > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30519) Executor can't use spark.executorEnv.HADOOP_USER_NAME to change the user accessing to hdfs
Xiaoming created SPARK-30519: Summary: Executor can't use spark.executorEnv.HADOOP_USER_NAME to change the user accessing to hdfs Key: SPARK-30519 URL: https://issues.apache.org/jira/browse/SPARK-30519 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.3 Reporter: Xiaoming Currently, we can specify the Hadoop user by setting HADOOP_USER_NAME on the driver when submitting a job. However, setting spark.executorEnv.HADOOP_USER_NAME does not change the user on the executors.
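A minimal sketch of the configuration point the report is about (the user name is a placeholder). Per the ticket, `HADOOP_USER_NAME` set in the driver's environment is honored, while the executor-side setting below is reported not to take effect:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hadoop-user-demo")
  // Intended to export HADOOP_USER_NAME into each executor's environment;
  // the ticket reports this has no effect on the user accessing HDFS.
  .config("spark.executorEnv.HADOOP_USER_NAME", "etl_user")
  .getOrCreate()
```

The same setting can be passed on the command line as `--conf spark.executorEnv.HADOOP_USER_NAME=etl_user`.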
[jira] [Created] (SPARK-30518) Precision and scale should be same for values between -1.0 and 1.0 in Decimal
wuyi created SPARK-30518: Summary: Precision and scale should be same for values between -1.0 and 1.0 in Decimal Key: SPARK-30518 URL: https://issues.apache.org/jira/browse/SPARK-30518 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: wuyi Currently, for values between -1.0 and 1.0, precision and scale are inconsistent between Decimal and DecimalType. For example, a number like 0.3 has (precision, scale) of (2, 1) in Decimal but (1, 1) in DecimalType. We should make Decimal consistent with DecimalType.
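The (precision, scale) convention the ticket wants can be illustrated with `java.math.BigDecimal`, which reports (1, 1) for 0.3 — the behavior the ticket attributes to DecimalType, whereas Spark's internal Decimal reportedly yields (2, 1) for the same value:

```scala
// BigDecimal's precision is the digit count of the unscaled value (3 -> 1 digit),
// and its scale is the number of fractional digits (1).
val bd = new java.math.BigDecimal("0.3")
println(s"precision=${bd.precision}, scale=${bd.scale}")  // precision=1, scale=1
```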
[jira] [Comment Edited] (SPARK-30517) Support SHOW TABLES EXTENDED
[ https://issues.apache.org/jira/browse/SPARK-30517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015706#comment-17015706 ] Ajith S edited comment on SPARK-30517 at 1/15/20 8:01 AM: -- [~srowen] [~dongjoon] [~vanzin] Please let me know your opinions about the proposal. I would like to work on it if it's acceptable was (Author: ajithshetty): [~srowen] [~dongjoon] [~vanzin] Please let me know about your opinion about proposal. I would like to work if its acceptable > Support SHOW TABLES EXTENDED > > > Key: SPARK-30517 > URL: https://issues.apache.org/jira/browse/SPARK-30517 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ajith S >Priority: Major > > {{The intention is to support SHOW TABLES with an additional column 'type', where type can be MANAGED, EXTERNAL, or VIEW, so that users can query only tables of the required types, such as listing only views or only external tables (using a 'where' clause over the 'type' column).}} > {{Use case example:}} > {{Currently it's not possible to list all the views, but other technologies support this: Hive has 'SHOW VIEWS', and MySQL supports it via the more complex query 'SHOW FULL TABLES WHERE table_type = 'VIEW';'}} > Decided to take the MySQL approach as it provides more flexibility for querying.
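A hedged sketch of how the proposed 'type' column might be used, modeled on the MySQL query quoted in the ticket. The syntax is hypothetical — the feature is a proposal, not implemented Spark SQL:

```scala
// Hypothetical usage of the proposed 'type' column (NOT implemented):
//   SHOW TABLES EXTENDED                      -- would include a 'type' column
//   ... WHERE type = 'VIEW'                   -- list only views
// For comparison, MySQL today: SHOW FULL TABLES WHERE table_type = 'VIEW';
val views = spark.sql("SHOW TABLES EXTENDED").where("type = 'VIEW'")
```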
[jira] [Commented] (SPARK-30517) Support SHOW TABLES EXTENDED
[ https://issues.apache.org/jira/browse/SPARK-30517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015706#comment-17015706 ] Ajith S commented on SPARK-30517: - [~srowen] [~dongjoon] [~vanzin] Please let me know your opinion about the proposal. I would like to work on it if it's acceptable > Support SHOW TABLES EXTENDED > > > Key: SPARK-30517 > URL: https://issues.apache.org/jira/browse/SPARK-30517 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ajith S >Priority: Major > > {{The intention is to support SHOW TABLES with an additional column 'type', where type can be MANAGED, EXTERNAL, or VIEW, so that users can query only tables of the required types, such as listing only views or only external tables (using a 'where' clause over the 'type' column).}} > {{Use case example:}} > {{Currently it's not possible to list all the views, but other technologies support this: Hive has 'SHOW VIEWS', and MySQL supports it via the more complex query 'SHOW FULL TABLES WHERE table_type = 'VIEW';'}} > Decided to take the MySQL approach as it provides more flexibility for querying.