[jira] [Updated] (SPARK-29450) [SS] In streaming aggregation, metric for output rows is not measured in append mode

2020-01-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-29450:

Fix Version/s: 2.4.5

> [SS] In streaming aggregation, metric for output rows is not measured in 
> append mode
> 
>
> Key: SPARK-29450
> URL: https://issues.apache.org/jira/browse/SPARK-29450
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Minor
> Fix For: 2.4.5, 3.0.0
>
>
> The bug was reported on the dev mailing list. Credit to [~jlaskowski]. Quoting 
> it here:
>  
> {quote}
> I've just noticed that the number-of-output-rows metric of the 
> StateStoreSaveExec physical operator does not seem to be measured for Append 
> output mode. In other words, whatever happens before or after the 
> StateStoreSaveExec operator, the metric is always 0.
>  
> It is measured for the other modes - Complete and Update.
> {quote}
> https://lists.apache.org/thread.html/eec318f7ff84c700bffc8bdb7c95c0dcaffb2494ac7ccd7a7c7c6588@%3Cdev.spark.apache.org%3E
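
For context, a minimal sketch of the kind of query where this metric stays at 0: a 
watermarked streaming aggregation written in Append output mode. The rate source, 
column names, and trigger below are illustrative assumptions, not taken from the report.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("append-mode-metric-repro").getOrCreate()
import spark.implicits._

// Illustrative source: the built-in rate source produces (timestamp, value) rows.
val events = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()

// Watermarked aggregation; in Append mode a window is only emitted once the
// watermark passes its end. StateStoreSaveExec's "number of output rows" metric
// is expected to count those emitted rows, but per this report it stays at 0 in
// Append mode (while Complete and Update measure it).
val counts = events
  .withWatermark("timestamp", "10 seconds")
  .groupBy(window($"timestamp", "5 seconds"))
  .count()

counts.writeStream
  .outputMode("append")
  .format("console")
  .start()
{code}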



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27750) Standalone scheduler - ability to prioritize applications over drivers, many drivers act like Denial of Service

2020-01-15 Thread t oo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989249#comment-16989249
 ] 

t oo edited comment on SPARK-27750 at 1/16/20 6:33 AM:
---

WDYT [~Ngone51] [~squito] [~vanzin] [~mgaido] [~jiangxb1987] [~jiangxb] 
[~zsxwing] [~jlaskowski] [~cloud_fan] [~srowen] [~dongjoon] [~hyukjin.kwon]


was (Author: toopt4):
WDYT [~Ngone51] [~squito] [~vanzin] [~mgaido] [~jiangxb1987] [~jiangxb] 
[~zsxwing] [~jlaskowski] [~cloud_fan]

> Standalone scheduler - ability to prioritize applications over drivers, many 
> drivers act like Denial of Service
> ---
>
> Key: SPARK-27750
> URL: https://issues.apache.org/jira/browse/SPARK-27750
> Project: Spark
>  Issue Type: New Feature
>  Components: Scheduler
>Affects Versions: 3.0.0
>Reporter: t oo
>Priority: Minor
>
> If I submit 1000 spark submit drivers then they consume all the cores on my 
> cluster (essentially it acts like a Denial of Service) and no spark 
> 'application' gets to run since the cores are all consumed by the 'drivers'. 
> This feature is about having the ability to prioritize applications over 
> drivers so that at least some 'applications' can start running. I guess it 
> would be like: If (driver.state = 'submitted' and (exists some app.state = 
> 'submitted')) then set app.state = 'running'
> if all apps have app.state = 'running' then set driver.state = 'submitted' 
>  
> Secondary to this, why must a driver consume a minimum of 1 entire core?
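
For illustration, a rough Scala sketch of the proposed policy under stated assumptions: 
the `DriverInfo`/`ApplicationInfo` shapes and the scheduling hook below are hypothetical, 
not the actual standalone Master internals.

{code:scala}
// Hypothetical model: never launch waiting drivers while any application is
// still waiting for cores, so a flood of drivers cannot starve applications.
case class DriverInfo(id: String, state: String)       // "SUBMITTED" or "RUNNING"
case class ApplicationInfo(id: String, state: String)  // "WAITING" or "RUNNING"

def scheduleWithAppPriority(
    waitingDrivers: Seq[DriverInfo],
    apps: Seq[ApplicationInfo],
    launchApp: ApplicationInfo => Unit,
    launchDriver: DriverInfo => Unit): Unit = {
  // 1. Give cores to waiting applications first.
  apps.filter(_.state == "WAITING").foreach(launchApp)
  // 2. Only start more drivers once every known application is running.
  if (apps.forall(_.state == "RUNNING")) {
    waitingDrivers.filter(_.state == "SUBMITTED").foreach(launchDriver)
  }
}
{code}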



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30521) Eliminate deprecation warnings for ExpressionInfo

2020-01-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30521.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27221
[https://github.com/apache/spark/pull/27221]

> Eliminate deprecation warnings for ExpressionInfo
> -
>
> Key: SPARK-30521
> URL: https://issues.apache.org/jira/browse/SPARK-30521
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> {code}
> Warning:(335, 5) constructor ExpressionInfo in class ExpressionInfo is 
> deprecated: see corresponding Javadoc for more information.
> new ExpressionInfo("noClass", "myDb", "myFunction", "usage", "extended 
> usage"),
> Warning:(732, 5) constructor ExpressionInfo in class ExpressionInfo is 
> deprecated: see corresponding Javadoc for more information.
> new ExpressionInfo("noClass", "myDb", "myFunction2", "usage", "extended 
> usage"),
> Warning:(751, 5) constructor ExpressionInfo in class ExpressionInfo is 
> deprecated: see corresponding Javadoc for more information.
> new ExpressionInfo("noClass", "myDb", "myFunction2", "usage", "extended 
> usage"),
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30521) Eliminate deprecation warnings for ExpressionInfo

2020-01-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30521:


Assignee: Maxim Gekk

> Eliminate deprecation warnings for ExpressionInfo
> -
>
> Key: SPARK-30521
> URL: https://issues.apache.org/jira/browse/SPARK-30521
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> {code}
> Warning:(335, 5) constructor ExpressionInfo in class ExpressionInfo is 
> deprecated: see corresponding Javadoc for more information.
> new ExpressionInfo("noClass", "myDb", "myFunction", "usage", "extended 
> usage"),
> Warning:(732, 5) constructor ExpressionInfo in class ExpressionInfo is 
> deprecated: see corresponding Javadoc for more information.
> new ExpressionInfo("noClass", "myDb", "myFunction2", "usage", "extended 
> usage"),
> Warning:(751, 5) constructor ExpressionInfo in class ExpressionInfo is 
> deprecated: see corresponding Javadoc for more information.
> new ExpressionInfo("noClass", "myDb", "myFunction2", "usage", "extended 
> usage"),
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27868) Better document shuffle / RPC listen backlog

2020-01-15 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17016541#comment-17016541
 ] 

Dongjoon Hyun commented on SPARK-27868:
---

Hi, [~vanzin]. Since there is no proper way to set `Fixed Version` for a revert, 
I removed the `2.4.4` tag and added the `release-notes` label for 2.4.5.
Please advise me if there is a better way~ Thanks.

> Better document shuffle / RPC listen backlog
> 
>
> Key: SPARK-27868
> URL: https://issues.apache.org/jira/browse/SPARK-27868
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Affects Versions: 2.4.3
>Reporter: Marcelo Masiero Vanzin
>Assignee: Marcelo Masiero Vanzin
>Priority: Minor
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> The option to control the listen socket backlog for RPC and shuffle servers 
> is not documented in our public docs.
> The only piece of documentation is in a Java class, and even that 
> documentation is incorrect:
> {code}
>   /** Requested maximum length of the queue of incoming connections. Default 
> -1 for no backlog. */
>   public int backLog() { return conf.getInt(SPARK_NETWORK_IO_BACKLOG_KEY, 
> -1); }
> {code}
> The default value actually causes the JRE's own default to be used, which is 
> 50 according to the Java docs.
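
For context, a small sketch of how a user could raise the backlog explicitly instead of 
relying on the JRE default of 50; the key follows the `spark.<module>.io.backLog` pattern 
read by TransportConf (shown here for the shuffle module — verify the exact key against 
the documentation this issue adds).

{code:scala}
import org.apache.spark.SparkConf

// -1 (the default) falls through to the JRE's own listen-backlog default,
// which is 50 for java.net.ServerSocket; set it explicitly for busy services.
val conf = new SparkConf()
  .set("spark.shuffle.io.backLog", "256")
{code}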



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27868) Better document shuffle / RPC listen backlog

2020-01-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27868:
--
Labels: release-notes  (was: )

> Better document shuffle / RPC listen backlog
> 
>
> Key: SPARK-27868
> URL: https://issues.apache.org/jira/browse/SPARK-27868
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Affects Versions: 2.4.3
>Reporter: Marcelo Masiero Vanzin
>Assignee: Marcelo Masiero Vanzin
>Priority: Minor
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> The option to control the listen socket backlog for RPC and shuffle servers 
> is not documented in our public docs.
> The only piece of documentation is in a Java class, and even that 
> documentation is incorrect:
> {code}
>   /** Requested maximum length of the queue of incoming connections. Default 
> -1 for no backlog. */
>   public int backLog() { return conf.getInt(SPARK_NETWORK_IO_BACKLOG_KEY, 
> -1); }
> {code}
> The default value actually causes the JRE's own default to be used, which is 
> 50 according to the Java docs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27868) Better document shuffle / RPC listen backlog

2020-01-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27868:
--
Fix Version/s: (was: 2.4.4)

> Better document shuffle / RPC listen backlog
> 
>
> Key: SPARK-27868
> URL: https://issues.apache.org/jira/browse/SPARK-27868
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Affects Versions: 2.4.3
>Reporter: Marcelo Masiero Vanzin
>Assignee: Marcelo Masiero Vanzin
>Priority: Minor
> Fix For: 3.0.0
>
>
> The option to control the listen socket backlog for RPC and shuffle servers 
> is not documented in our public docs.
> The only piece of documentation is in a Java class, and even that 
> documentation is incorrect:
> {code}
>   /** Requested maximum length of the queue of incoming connections. Default 
> -1 for no backlog. */
>   public int backLog() { return conf.getInt(SPARK_NETWORK_IO_BACKLOG_KEY, 
> -1); }
> {code}
> The default value actually causes the JRE's own default to be used, which is 
> 50 according to the Java docs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27868) Better document shuffle / RPC listen backlog

2020-01-15 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17016540#comment-17016540
 ] 

Dongjoon Hyun commented on SPARK-27868:
---

This is reverted from `branch-2.4` via 
https://github.com/apache/spark/commit/94b5d3fd64ea5498b19f9ea7aacb484a18496018

> Better document shuffle / RPC listen backlog
> 
>
> Key: SPARK-27868
> URL: https://issues.apache.org/jira/browse/SPARK-27868
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Affects Versions: 2.4.3
>Reporter: Marcelo Masiero Vanzin
>Assignee: Marcelo Masiero Vanzin
>Priority: Minor
> Fix For: 2.4.4, 3.0.0
>
>
> The option to control the listen socket backlog for RPC and shuffle servers 
> is not documented in our public docs.
> The only piece of documentation is in a Java class, and even that 
> documentation is incorrect:
> {code}
>   /** Requested maximum length of the queue of incoming connections. Default 
> -1 for no backlog. */
>   public int backLog() { return conf.getInt(SPARK_NETWORK_IO_BACKLOG_KEY, 
> -1); }
> {code}
> The default value actually causes the JRE's own default to be used, which is 
> 50 according to the Java docs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30491) Enable dependency audit files to tell dependency classifier

2020-01-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30491:
-

Assignee: Xinrong Meng

> Enable dependency audit files to tell dependency classifier
> ---
>
> Key: SPARK-30491
> URL: https://issues.apache.org/jira/browse/SPARK-30491
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>
> Dependency audit files under `dev/deps` only show jar names. Given that, it 
> is not trivial to figure out the dependency classifiers.
> For example, `avro-mapred-1.8.2-hadoop2.jar` is made up of artifact id 
> `avro-mapred`, version `1.8.2`, and classifier `hadoop2`. In contrast, 
> `htrace-core-3.1.0-incubating.jar` is made up of artifact id `htrace-core` and 
> version `3.1.0-incubating`, with no classifier.
> In short, the classifier cannot be inferred from its position in the jar name; 
> however, since it is part of a dependency's identifier, it should be stated 
> explicitly.
>  
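
To make the ambiguity concrete, a small illustrative Scala sketch (the regex is 
illustrative only): a trailing `-suffix` can be either a classifier or part of the 
version, so the jar name alone cannot distinguish the two cases.

{code:scala}
// Both names match "artifact-version(-suffix)?.jar", but the suffix means
// different things: "hadoop2" is a classifier, "incubating" is part of the version.
val jarName = raw"(.+?)-(\d[\w.]*)(?:-([\w.]+))?\.jar".r

Seq("avro-mapred-1.8.2-hadoop2.jar", "htrace-core-3.1.0-incubating.jar").foreach {
  case jarName(artifact, version, suffix) =>
    // Prints "hadoop2" and "incubating" in the same position, even though
    // only the former is actually a classifier.
    println(s"artifact=$artifact version=$version trailingSuffix=$suffix")
}
{code}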



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30491) Enable dependency audit files to tell dependency classifier

2020-01-15 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30491.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27177
[https://github.com/apache/spark/pull/27177]

> Enable dependency audit files to tell dependency classifier
> ---
>
> Key: SPARK-30491
> URL: https://issues.apache.org/jira/browse/SPARK-30491
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.0.0
>
>
> Dependency audit files under `dev/deps` only show jar names. Given that, it 
> is not trivial to figure out the dependency classifiers.
> For example, `avro-mapred-1.8.2-hadoop2.jar` is made up of artifact id 
> `avro-mapred`, version `1.8.2`, and classifier `hadoop2`. In contrast, 
> `htrace-core-3.1.0-incubating.jar` is made up of artifact id `htrace-core` and 
> version `3.1.0-incubating`, with no classifier.
> In short, the classifier cannot be inferred from its position in the jar name; 
> however, since it is part of a dependency's identifier, it should be stated 
> explicitly.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30323) Support filters pushdown in CSV datasource

2020-01-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30323:


Assignee: Maxim Gekk

> Support filters pushdown in CSV datasource
> --
>
> Key: SPARK-30323
> URL: https://issues.apache.org/jira/browse/SPARK-30323
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> - Implement the `SupportsPushDownFilters` interface in `CSVScanBuilder`
> - Apply filters in UnivocityParser
> - Change the UnivocityParser API so that `convert()` returns Seq[InternalRow]
> - Update CSVBenchmark
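
For context, a minimal usage sketch of a query that benefits once filters are pushed 
into the CSV scan (the path and schema below are placeholders):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-filter-pushdown").getOrCreate()
import spark.implicits._

// With SupportsPushDownFilters implemented in CSVScanBuilder, the predicate
// below can be evaluated inside UnivocityParser while rows are parsed,
// instead of materializing every row and filtering afterwards.
val df = spark.read
  .schema("id INT, name STRING")
  .csv("/tmp/people.csv")   // placeholder path
  .filter($"id" > 10)

// Inspect the physical plan to see whether the filter reached the scan node.
df.explain(true)
{code}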



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30323) Support filters pushdown in CSV datasource

2020-01-15 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30323.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26973
[https://github.com/apache/spark/pull/26973]

> Support filters pushdown in CSV datasource
> --
>
> Key: SPARK-30323
> URL: https://issues.apache.org/jira/browse/SPARK-30323
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> - Implement the `SupportsPushDownFilters` interface in `CSVScanBuilder`
> - Apply filters in UnivocityParser
> - Change the UnivocityParser API so that `convert()` returns Seq[InternalRow]
> - Update CSVBenchmark



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30522) Spark Streaming dynamic executors override or take default kafka parameters in cluster mode

2020-01-15 Thread phanikumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

phanikumar updated SPARK-30522:
---
Description: 
I have written a Spark Streaming consumer to consume data from Kafka. I found a 
weird behavior in my logs. The Kafka topic has 3 partitions and, for each 
partition, an executor is launched by the Spark Streaming job.
The first executor always takes the parameters I provided while creating the 
streaming context, but the executors with IDs 2 and 3 always override the 
Kafka parameters.
    
{code:java}
20/01/14 12:15:05 WARN StreamingContext: Dynamic Allocation is enabled for this 
application. Enabling Dynamic allocation for Spark Streaming applications can 
cause data loss if Write Ahead Log is not enabled for non-replayable sources 
like Flume. See the programming guide for details on how to enable the 
Write Ahead Log.    
20/01/14 12:15:05 INFO FileBasedWriteAheadLog_ReceivedBlockTracker: Recovered 2 
write ahead log files from hdfs://tlabnamenode/checkpoint/receivedBlockMetadata 
   
20/01/14 12:15:05 INFO DirectKafkaInputDStream: Slide time = 5000 ms    
20/01/14 12:15:05 INFO DirectKafkaInputDStream: Storage level = Serialized 1x 
Replicated    20/01/14 12:15:05 INFO DirectKafkaInputDStream: Checkpoint 
interval = null   
 20/01/14 12:15:05 INFO DirectKafkaInputDStream: Remember interval = 5000 ms    
20/01/14 12:15:05 INFO DirectKafkaInputDStream: Initialized and validated 
org.apache.spark.streaming.kafka010.DirectKafkaInputDStream@12665f3f    
20/01/14 12:15:05 INFO ForEachDStream: Slide time = 5000 ms    
20/01/14 12:15:05 INFO ForEachDStream: Storage level = Serialized 1x Replicated 
   20/01/14 12:15:05 INFO ForEachDStream: Checkpoint interval = null    
20/01/14 12:15:05 INFO ForEachDStream: Remember interval = 5000 ms    
20/01/14 12:15:05 INFO ForEachDStream: Initialized and validated 
org.apache.spark.streaming.dstream.ForEachDStream@a4d83ac    
20/01/14 12:15:05 INFO ConsumerConfig: ConsumerConfig values:             
auto.commit.interval.ms = 5000            
auto.offset.reset = latest            
bootstrap.servers = [1,2,3]            
check.crcs = true            
client.id = client-0            
connections.max.idle.ms = 54            
default.api.timeout.ms = 6            
enable.auto.commit = false            
exclude.internal.topics = true            
fetch.max.bytes = 52428800            
fetch.max.wait.ms = 500            
fetch.min.bytes = 1            
group.id = telemetry-streaming-service            
heartbeat.interval.ms = 3000            
interceptor.classes = []            
internal.leave.group.on.close = true            
isolation.level = read_uncommitted            
key.deserializer = class 
org.apache.kafka.common.serialization.StringDeserializer
 
{code}
Here is the log for other executors.
    
{code:java}
 20/01/14 12:15:04 INFO Executor: Starting executor ID 2 on host 1    
20/01/14 12:15:04 INFO Utils: Successfully started service 
'org.apache.spark.network.netty.NettyBlockTransferService' on port 40324.    
20/01/14 12:15:04 INFO NettyBlockTransferService: Server created on 1    
20/01/14 12:15:04 INFO BlockManager: Using 
org.apache.spark.storage.RandomBlockReplicationPolicy for block replication 
policy    20/01/14 12:15:04 INFO BlockManagerMaster: Registering BlockManager 
BlockManagerId(2, matrix-hwork-data-05, 40324, None)    
20/01/14 12:15:04 INFO BlockManagerMaster: Registered BlockManager 
BlockManagerId(2, matrix-hwork-data-05, 40324, None)    
20/01/14 12:15:04 INFO BlockManager: external shuffle service port = 7447    
20/01/14 12:15:04 INFO BlockManager: Registering executor with local external 
shuffle service.    
20/01/14 12:15:04 INFO TransportClientFactory: Successfully created connection 
to matrix-hwork-data-05/10.83.34.25:7447 after 1 ms (0 ms spent in bootstraps)  
  
20/01/14 12:15:04 INFO BlockManager: Initialized BlockManager: 
BlockManagerId(2, matrix-hwork-data-05, 40324, None)    
20/01/14 12:15:19 INFO CoarseGrainedExecutorBackend: Got assigned task 1    
20/01/14 12:15:19 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)    
20/01/14 12:15:19 INFO TorrentBroadcast: Started reading broadcast variable 0   
 
20/01/14 12:15:19 INFO TransportClientFactory: Successfully created connection 
to matrix-hwork-data-05/10.83.34.25:38759 after 2 ms (0 ms spent in bootstraps) 
   
20/01/14 12:15:20 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in 
memory (estimated size 8.1 KB, free 6.2 GB)    
20/01/14 12:15:20 INFO TorrentBroadcast: Reading broadcast variable 0 took 163 
ms    20/01/14 12:15:20 INFO MemoryStore: Block broadcast_0 stored as values in 
memory (estimated size 17.9 KB, free 

[jira] [Updated] (SPARK-30325) markPartitionCompleted cause task status inconsistent

2020-01-15 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-30325:

Fix Version/s: 2.4.5

> markPartitionCompleted cause task status inconsistent
> -
>
> Key: SPARK-30325
> URL: https://issues.apache.org/jira/browse/SPARK-30325
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.4
>Reporter: haiyangyu
>Assignee: haiyangyu
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
> Attachments: image-2019-12-21-17-11-38-565.png, 
> image-2019-12-21-17-15-51-512.png, image-2019-12-21-17-16-40-998.png, 
> image-2019-12-21-17-17-42-244.png
>
>
> h3. Corner case
> The bug occurs in the corner case as follows:
>  # The stage hits a fetchFailed while some tasks haven't finished; the 
> scheduler resubmits a new retry stage with those unfinished tasks.
>  # An unfinished task in the original stage finishes while the same task on 
> the new retry stage hasn't; the task's partition is then marked as successful 
> on the retry stage.  !image-2019-12-21-17-11-38-565.png|width=427,height=154!
>  # The executor running those 'successful' tasks crashes, causing the 
> TaskSetManager to run executorLost and reschedule the tasks on that executor. 
> Because those 'successful' tasks are not actually finished, copiesRunning is 
> decremented twice and ends up at -1. 
> !image-2019-12-21-17-15-51-512.png|width=437,height=340!!image-2019-12-21-17-16-40-998.png|width=398,height=139!
>  # 'dequeueTaskFromList' uses copiesRunning == 0 as the basis for rescheduling 
> tasks; since it is now -1, the tasks can never be rescheduled and the app 
> hangs forever. !image-2019-12-21-17-17-42-244.png|width=366,height=282!
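
A toy Scala sketch of the broken invariant described above. This is not the actual 
TaskSetManager code, only an illustration of how two decrements for a single running 
copy drive copiesRunning to -1, after which a "== 0" check never fires again.

{code:scala}
import scala.collection.mutable

// Toy model only: one task with one running copy.
val copiesRunning = mutable.Map(0 -> 1)

// First decrement: the partition is marked completed by the other stage attempt.
def markPartitionCompletedByOtherAttempt(task: Int): Unit = copiesRunning(task) -= 1

// Second decrement for the same copy: the executor hosting it is lost.
def executorLost(task: Int): Unit = copiesRunning(task) -= 1

markPartitionCompletedByOtherAttempt(0)
executorLost(0)

println(copiesRunning(0))       // -1
println(copiesRunning(0) == 0)  // false: a dequeue check based on "== 0" never reschedules it
{code}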



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30502) PeriodicRDDCheckpointer supports storageLevel

2020-01-15 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reassigned SPARK-30502:


Assignee: zhengruifeng

> PeriodicRDDCheckpointer supports storageLevel
> -
>
> Key: SPARK-30502
> URL: https://issues.apache.org/jira/browse/SPARK-30502
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>
> Intermediate RDDs in ML are cached with 
> storageLevel=StorageLevel.MEMORY_AND_DISK.
> PeriodicRDDCheckpointer stores RDDs with 
> storageLevel=StorageLevel.MEMORY_ONLY; it may be nice to make the 
> checkpointer's storage level configurable.
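
For context, a minimal sketch of the mismatch from the caller's side (the RDD contents 
are placeholders): ML code persists intermediate RDDs at MEMORY_AND_DISK, while a 
checkpointer hard-coded to MEMORY_ONLY can evict partitions and force recomputation 
under memory pressure, hence the request to make the level configurable.

{code:scala}
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

val sc = SparkContext.getOrCreate()

// Typical ML-style caching of an intermediate RDD:
val intermediate = sc.parallelize(1 to 1000000)
intermediate.persist(StorageLevel.MEMORY_AND_DISK)

// A checkpointer that re-persists the same data with MEMORY_ONLY loses the
// spill-to-disk guarantee; the proposal is to pass the desired StorageLevel
// to the checkpointer instead of hard-coding MEMORY_ONLY.
{code}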



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30502) PeriodicRDDCheckpointer supports storageLevel

2020-01-15 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-30502.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27189
[https://github.com/apache/spark/pull/27189]

> PeriodicRDDCheckpointer supports storageLevel
> -
>
> Key: SPARK-30502
> URL: https://issues.apache.org/jira/browse/SPARK-30502
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Spark Core
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 3.0.0
>
>
> Intermediate RDDs in ML are cached with 
> storageLevel=StorageLevel.MEMORY_AND_DISK.
> PeriodicRDDCheckpointer stores RDDs with 
> storageLevel=StorageLevel.MEMORY_ONLY; it may be nice to make the 
> checkpointer's storage level configurable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30518) Precision and scale should be same for values between -1.0 and 1.0 in Decimal

2020-01-15 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-30518.
--
Fix Version/s: 3.0.0
 Assignee: wuyi
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/27217

> Precision and scale should be same for values between -1.0 and 1.0 in Decimal
> -
>
> Key: SPARK-30518
> URL: https://issues.apache.org/jira/browse/SPARK-30518
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, for values between -1.0 and 1.0, precision and scale are 
> inconsistent between Decimal and DecimalType. For example, a number like 
> 0.3 has (precision, scale) = (2, 1) in Decimal, but (1, 1) in 
> DecimalType. We should make Decimal consistent with DecimalType.
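
A quick spark-shell sketch of the reported inconsistency (the numbers are the ones from 
the description; the DecimalType comparison in the comment is illustrative):

{code:scala}
import org.apache.spark.sql.types.Decimal

val d = Decimal(BigDecimal("0.3"))
// Before this change: (2, 1); after it, Decimal reports (1, 1),
// matching the DecimalType derived for the same literal per the description.
println((d.precision, d.scale))
{code}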



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30524) Disable OptimizeSkewJoin rule if introducing additional shuffle.

2020-01-15 Thread Ke Jia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-30524:
---
Description: The OptimizeSkewedJoin rule breaks the outputPartitioning of the 
original SMJ, and it may introduce an additional shuffle after applying 
OptimizeSkewedJoin. This PR will disable the "OptimizeSkewedJoin" rule if it 
would introduce an additional shuffle.

> Disable OptimizeSkewJoin rule if introducing additional shuffle.
> 
>
> Key: SPARK-30524
> URL: https://issues.apache.org/jira/browse/SPARK-30524
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Priority: Major
>
> The OptimizeSkewedJoin rule breaks the outputPartitioning of the original SMJ, 
> and it may introduce an additional shuffle after applying OptimizeSkewedJoin. 
> This PR will disable the "OptimizeSkewedJoin" rule if it would introduce an 
> additional shuffle.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30524) Disable OptimizeSkewJoin rule if introducing additional shuffle.

2020-01-15 Thread Ke Jia (Jira)
Ke Jia created SPARK-30524:
--

 Summary: Disable OptimizeSkewJoin rule if introducing additional 
shuffle.
 Key: SPARK-30524
 URL: https://issues.apache.org/jira/browse/SPARK-30524
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ke Jia






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26736) if filter condition `And` has non-determined sub function it does not do partition prunning

2020-01-15 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-26736.
--
Fix Version/s: 3.0.0
 Assignee: Takeshi Yamamuro
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/27219

> if filter condition `And` has non-determined sub function it does not do 
> partition prunning
> ---
>
> Key: SPARK-26736
> URL: https://issues.apache.org/jira/browse/SPARK-26736
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: roncenzhao
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 3.0.0
>
>
> Example:
> A partitioned table definition:
> _create table test(id int) partitioned by (dt string);_
> The following SQL does not do partition pruning:
> _select * from test where dt='20190101' and rand() < 0.5;_
>  
> I think it should do partition pruning in this case.
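
For reference, a small sketch of how to check the behavior from a spark-shell session 
where the example table above exists (illustrative only):

{code:scala}
// Mixed And(deterministic, non-deterministic) predicate from the example above.
val df = spark.sql("SELECT * FROM test WHERE dt = '20190101' AND rand() < 0.5")

// Inspect the plan: before this change the deterministic dt predicate was not
// used for partition pruning when AND-ed with rand(); after it, the scan node
// should show the dt partition filter.
df.explain(true)
{code}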



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27750) Standalone scheduler - ability to prioritize applications over drivers, many drivers act like Denial of Service

2020-01-15 Thread t oo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989249#comment-16989249
 ] 

t oo edited comment on SPARK-27750 at 1/15/20 11:36 PM:


WDYT [~Ngone51] [~squito] [~vanzin] [~mgaido] [~jiangxb1987] [~jiangxb] 
[~zsxwing] [~jlaskowski] [~cloud_fan]


was (Author: toopt4):
WDYT [~Ngone51] [~squito] [~vanzin] [~mgaido] [~jiangxb1987] [~jiangxb] 
[~zsxwing] [~jlaskowski]

> Standalone scheduler - ability to prioritize applications over drivers, many 
> drivers act like Denial of Service
> ---
>
> Key: SPARK-27750
> URL: https://issues.apache.org/jira/browse/SPARK-27750
> Project: Spark
>  Issue Type: New Feature
>  Components: Scheduler
>Affects Versions: 3.0.0
>Reporter: t oo
>Priority: Minor
>
> If I submit 1000 spark submit drivers then they consume all the cores on my 
> cluster (essentially it acts like a Denial of Service) and no spark 
> 'application' gets to run since the cores are all consumed by the 'drivers'. 
> This feature is about having the ability to prioritize applications over 
> drivers so that at least some 'applications' can start running. I guess it 
> would be like: If (driver.state = 'submitted' and (exists some app.state = 
> 'submitted')) then set app.state = 'running'
> if all apps have app.state = 'running' then set driver.state = 'submitted' 
>  
> Secondary to this, why must a driver consume a minimum of 1 entire core?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30523) Collapse back to back aggregations into a single aggregate to reduce the number of shuffles

2020-01-15 Thread Jason Altekruse (Jira)
Jason Altekruse created SPARK-30523:
---

 Summary: Collapse back to back aggregations into a single 
aggregate to reduce the number of shuffles
 Key: SPARK-30523
 URL: https://issues.apache.org/jira/browse/SPARK-30523
 Project: Spark
  Issue Type: Improvement
  Components: Optimizer
Affects Versions: 3.0.0
Reporter: Jason Altekruse


Queries containing nested aggregate operations can in some cases be computed 
with a single phase of aggregation. This Jira seeks to introduce a new 
optimizer rule to identify some of those cases and rewrite the plans to avoid 
needlessly re-shuffling and building the aggregation hash tables twice.

Some examples of collapsible aggregates:
{code:java}
SELECT sum(sumAgg) as a, year from (
  select sum(1) as sumAgg, course, year FROM courseSales GROUP BY course, 
year
) group by year

// can be collapsed to
SELECT sum(1) as `a`, year from courseSales group by year
{code}
{code}
SELECT sum(agg), min(a), b from (
 select sum(1) as agg, a, b FROM testData2 GROUP BY a, b
 ) group by b

// can be collapsed to
SELECT sum(1) as `sum(agg)`, min(a) as `min(a)`, b from testData2 group by b
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-23273) Spark Dataset withColumn - schema column order isn't the same as case class paramether order

2020-01-15 Thread Henrique dos Santos Goulart (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Henrique dos Santos Goulart closed SPARK-23273.
---

> Spark Dataset withColumn - schema column order isn't the same as case class 
> paramether order
> 
>
> Key: SPARK-23273
> URL: https://issues.apache.org/jira/browse/SPARK-23273
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Henrique dos Santos Goulart
>Priority: Major
>
> {code:java}
> case class OnlyAge(age: Int)
> case class NameAge(name: String, age: Int)
> val ds1 = spark.emptyDataset[NameAge]
> val ds2 = spark
>   .createDataset(Seq(OnlyAge(1)))
>   .withColumn("name", lit("henriquedsg89"))
>   .as[NameAge]
> ds1.show()
> ds2.show()
> ds1.union(ds2)
> {code}
>  
> It's going to raise this error:
> {noformat}
> Cannot up cast `age` from string to int as it may truncate
> The type path of the target object is:
> - field (class: "scala.Int", name: "age")
> - root class: "dw.NameAge"{noformat}
> It seems that .as[CaseClass] doesn't keep the order of parameters declared on 
> the case class.
> If I change the case class parameter order, it works, like: 
> {code:java}
> case class NameAge(age: Int, name: String){code}
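
As a hedged workaround sketch (not part of the ticket; it reuses `ds1`, `OnlyAge`, and 
`NameAge` from the snippet above and assumes `spark.implicits._` and 
`org.apache.spark.sql.functions.lit` are imported): make the column order match the 
case class before the positional union, or resolve the union by name.

{code:scala}
// Option 1: reorder columns so the positional union lines up with ds1 (name, age).
val ds2Fixed = spark
  .createDataset(Seq(OnlyAge(1)))
  .withColumn("name", lit("henriquedsg89"))
  .select($"name", $"age")
  .as[NameAge]

ds1.union(ds2Fixed).show()

// Option 2 (Spark 2.3+): resolve the union by column name instead of position.
ds1.unionByName(ds2Fixed).show()
{code}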



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak

2020-01-15 Thread Marcelo Masiero Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Masiero Vanzin updated SPARK-30246:
---
Fix Version/s: (was: 2.4.5)
   2.4.6

> Spark on Yarn External Shuffle Service Memory Leak
> --
>
> Key: SPARK-30246
> URL: https://issues.apache.org/jira/browse/SPARK-30246
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.3
> Environment: hadoop 2.7.3
> spark 2.4.3
> jdk 1.8.0_60
>Reporter: huangweiyi
>Assignee: Henrique dos Santos Goulart
>Priority: Major
> Fix For: 3.0.0, 2.4.6
>
>
> In our large, busy YARN cluster, which deploys the Spark external shuffle 
> service as part of the YARN NM aux service, we encountered OOMs in some NMs.
> After dumping the heap memory, I found some StreamState objects still in the 
> heap even though the apps they belong to had already finished.
> Here are some related figures:
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_oom.png|width=100%!
> The heap dump below shows that the memory consumption mainly consists of two 
> parts:
> *(1) OneForOneStreamManager (4,429,796,424 (77.11%) bytes)*
> *(2) PoolChunk(occupy 1,059,201,712 (18.44%) bytes. )*
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_heap_overview.png|width=100%!
> Digging into the OneForOneStreamManager, some StreamStates still remain:
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/streamState.png|width=100%!
> Incoming references to StreamState::associatedChannel:
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/associatedChannel_incomming_reference.png|width=100%!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak

2020-01-15 Thread Marcelo Masiero Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Masiero Vanzin reassigned SPARK-30246:
--

Assignee: Henrique dos Santos Goulart

> Spark on Yarn External Shuffle Service Memory Leak
> --
>
> Key: SPARK-30246
> URL: https://issues.apache.org/jira/browse/SPARK-30246
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.3
> Environment: hadoop 2.7.3
> spark 2.4.3
> jdk 1.8.0_60
>Reporter: huangweiyi
>Assignee: Henrique dos Santos Goulart
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> In our large, busy YARN cluster, which deploys the Spark external shuffle 
> service as part of the YARN NM aux service, we encountered OOMs in some NMs.
> After dumping the heap memory, I found some StreamState objects still in the 
> heap even though the apps they belong to had already finished.
> Here are some related figures:
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_oom.png|width=100%!
> The heap dump below shows that the memory consumption mainly consists of two 
> parts:
> *(1) OneForOneStreamManager (4,429,796,424 (77.11%) bytes)*
> *(2) PoolChunk(occupy 1,059,201,712 (18.44%) bytes. )*
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_heap_overview.png|width=100%!
> Digging into the OneForOneStreamManager, some StreamStates still remain:
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/streamState.png|width=100%!
> Incoming references to StreamState::associatedChannel:
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/associatedChannel_incomming_reference.png|width=100%!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30246) Spark on Yarn External Shuffle Service Memory Leak

2020-01-15 Thread Marcelo Masiero Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Masiero Vanzin resolved SPARK-30246.

Fix Version/s: 3.0.0
   2.4.5
   Resolution: Fixed

Issue resolved by pull request 27064
[https://github.com/apache/spark/pull/27064]

> Spark on Yarn External Shuffle Service Memory Leak
> --
>
> Key: SPARK-30246
> URL: https://issues.apache.org/jira/browse/SPARK-30246
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.3
> Environment: hadoop 2.7.3
> spark 2.4.3
> jdk 1.8.0_60
>Reporter: huangweiyi
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> In our large, busy YARN cluster, which deploys the Spark external shuffle 
> service as part of the YARN NM aux service, we encountered OOMs in some NMs.
> After dumping the heap memory, I found some StreamState objects still in the 
> heap even though the apps they belong to had already finished.
> Here are some related figures:
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_oom.png|width=100%!
> The heap dump below shows that the memory consumption mainly consists of two 
> parts:
> *(1) OneForOneStreamManager (4,429,796,424 (77.11%) bytes)*
> *(2) PoolChunk(occupy 1,059,201,712 (18.44%) bytes. )*
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/nm_heap_overview.png|width=100%!
> dig into the OneForOneStreamManager, there are some StreaStates still 
> remained :
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/streamState.png|width=100%!
> incomming references to StreamState::associatedChannel: 
> !https://raw.githubusercontent.com/012huang/public_source/master/SparkPRFigures/associatedChannel_incomming_reference.png|width=100%!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28403) Executor Allocation Manager can add an extra executor when speculative tasks

2020-01-15 Thread Zebing Lin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17016316#comment-17016316
 ] 

Zebing Lin edited comment on SPARK-28403 at 1/15/20 8:53 PM:
-

In our production, this just caused a fluctuation of requested executors:
{code:java}
Total executors: Running = 6, Needed = 6, Requested = 6
Lowering target number of executors to 5 (previously 6) because not all 
requested executors are actually needed
Total executors: Running = 6, Needed = 5, Requested = 6
Lowering target number of executors to 5 (previously 6) because not all 
requested executors are actually needed
Total executors: Running = 6, Needed = 5, Requested = 6
{code}
I think this logic can be deleted.


was (Author: zebingl):
In our production, this just caused a fluctuation of requested executors:
{code:java}
Total executors: Running = 6, Needed = 6, Requested = 6
Lowering target number of executors to 5 (previously 6) because not all 
requested executors are actually needed
Total executors: Running = 6, Needed = 5, Requested = 6
Total executors: Running = 6, Needed = 6, Requested = 6
Lowering target number of executors to 5 (previously 6) because not all 
requested executors are actually needed
Total executors: Running = 6, Needed = 5, Requested = 6
{code}
I think this logic can be deleted.

> Executor Allocation Manager can add an extra executor when speculative tasks
> 
>
> Key: SPARK-28403
> URL: https://issues.apache.org/jira/browse/SPARK-28403
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Major
>
> It looks like SPARK-19326 added a bug in the executor allocation manager 
> where it adds an extra executor when it shouldn't, when we have pending 
> speculative tasks but the target number didn't change. 
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L377]
> It doesn't look like this is necessary since the pending speculative tasks 
> are already added in.
> See the questioning of this on the PR at:
> https://github.com/apache/spark/pull/18492/files#diff-b096353602813e47074ace09a3890d56R379



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28403) Executor Allocation Manager can add an extra executor when speculative tasks

2020-01-15 Thread Zebing Lin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17016316#comment-17016316
 ] 

Zebing Lin commented on SPARK-28403:


In our production, this just caused a fluctuation of requested executors:
{code:java}
Total executors: Running = 6, Needed = 6, Requested = 6
Lowering target number of executors to 5 (previously 6) because not all 
requested executors are actually needed
Total executors: Running = 6, Needed = 5, Requested = 6
Total executors: Running = 6, Needed = 6, Requested = 6
Lowering target number of executors to 5 (previously 6) because not all 
requested executors are actually needed
Total executors: Running = 6, Needed = 5, Requested = 6
{code}
I think this logic can be deleted.
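For context, a hedged sketch of the kind of calculation behind the "Needed" number in those log lines - an assumed simplification, not the actual ExecutorAllocationManager source. The extra-executor branch questioned in this ticket adds one on top of a target like this when speculative tasks are pending, even though they are already counted here.

{code:java}
// Hedged, assumed simplification of the needed-executor calculation
// (not copied from ExecutorAllocationManager): pending, pending-speculative
// and running tasks are packed onto executors with `tasksPerExecutor` slots.
def maxExecutorsNeeded(pendingTasks: Int, pendingSpeculativeTasks: Int,
                       runningTasks: Int, tasksPerExecutor: Int): Int = {
  val totalTasks = pendingTasks + pendingSpeculativeTasks + runningTasks
  (totalTasks + tasksPerExecutor - 1) / tasksPerExecutor
}

// e.g. maxExecutorsNeeded(0, 18, 2, 4) == 5 but maxExecutorsNeeded(0, 19, 3, 4) == 6:
// small changes in the task counts flip "Needed" between 5 and 6 while 6
// executors remain requested, matching the fluctuation in the log above.
{code}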

> Executor Allocation Manager can add an extra executor when speculative tasks
> 
>
> Key: SPARK-28403
> URL: https://issues.apache.org/jira/browse/SPARK-28403
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Thomas Graves
>Priority: Major
>
> It looks like SPARK-19326 added a bug in the executor allocation manager 
> where it adds an extra executor when it shouldn't, when we have pending 
> speculative tasks but the target number didn't change. 
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L377]
> It doesn't look like this is necessary since the pending speculative tasks 
> are already added in.
> See the questioning of this on the PR at:
> https://github.com/apache/spark/pull/18492/files#diff-b096353602813e47074ace09a3890d56R379



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29448) Support the `INTERVAL` type by Parquet datasource

2020-01-15 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk resolved SPARK-29448.

Resolution: Won't Fix

> Support the `INTERVAL` type by Parquet datasource
> -
>
> Key: SPARK-29448
> URL: https://issues.apache.org/jira/browse/SPARK-29448
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Maxim Gekk
>Priority: Major
>
> The Parquet format allows storing intervals as a triple of (months, days, 
> milliseconds); see 
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#interval 
> . The `INTERVAL` logical type is used for an interval of time. _It must 
> annotate a fixed_len_byte_array of length 12. This array stores three 
> little-endian unsigned integers that represent durations at different 
> granularities of time. The first stores a number in months, the second stores 
> a number in days, and the third stores a number in milliseconds. This 
> representation is independent of any particular timezone or date._
> We need to support writing and reading values of Catalyst's CalendarIntervalType.
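Not part of the original ticket: a hedged illustration of the 12-byte layout quoted above, showing how a (months, days, milliseconds) triple would be packed.

{code:java}
// Hedged illustration only (not Spark/Parquet source code): three
// little-endian 32-bit values in the order months, days, milliseconds.
import java.nio.{ByteBuffer, ByteOrder}

def encodeParquetInterval(months: Int, days: Int, millis: Int): Array[Byte] =
  ByteBuffer.allocate(12)
    .order(ByteOrder.LITTLE_ENDIAN)
    .putInt(months)
    .putInt(days)
    .putInt(millis)
    .array()

// encodeParquetInterval(1, 2, 3000) packs "1 month, 2 days, 3 seconds".
{code}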



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30511) Spark marks ended speculative tasks as pending leads to holding idle executors

2020-01-15 Thread Zebing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zebing Lin updated SPARK-30511:
---
Description: 
*TL;DR*
 When speculative tasks finished/failed/got killed, they are still considered 
as pending and count towards the calculation of number of needed executors.
h3. Symptom

In one of our production jobs (running 4 tasks per executor), we 
found that it was holding 6 executors at the end with only 2 tasks running (1 
speculative). With more logging enabled, we found the job printed:
{code:java}
pendingTasks is 0 pendingSpeculativeTasks is 17 totalRunningTasks is 2
{code}
 while the job only had 1 speculative task running and 16 speculative tasks 
intentionally killed because their corresponding original tasks had finished.

An easy repro of the issue (`--conf spark.speculation=true --conf 
spark.executor.cores=4 --conf spark.dynamicAllocation.maxExecutors=1000` in 
cluster mode):
{code:java}
val n = 4000
val someRDD = sc.parallelize(1 to n, n)
someRDD.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => {
if (index < 300 && index >= 150) {
Thread.sleep(index * 1000) // Fake running tasks
} else if (index == 300) {
Thread.sleep(1000 * 1000) // Fake long running tasks
}
it.toList.map(x => index + ", " + x).iterator
}).collect
{code}
You will see that when running the last task, we would be holding 38 executors (see 
attachment), which is exactly (152 + 3) / 4 = 38.
h3. The Bug

Upon examining the code of _pendingSpeculativeTasks_: 
{code:java}
stageAttemptToNumSpeculativeTasks.map { case (stageAttempt, numTasks) =>
  numTasks - 
stageAttemptToSpeculativeTaskIndices.get(stageAttempt).map(_.size).getOrElse(0)
}.sum
{code}
where _stageAttemptToNumSpeculativeTasks(stageAttempt)_ is incremented on 
_onSpeculativeTaskSubmitted_, but never decremented.  
_stageAttemptToNumSpeculativeTasks -= stageAttempt_ is performed on stage 
completion. *This means Spark is marking ended speculative tasks as pending, 
which leads Spark to hold more executors than it actually needs!*

I will have a PR ready to fix this issue, along with SPARK-28403 too
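A hedged sketch of the kind of bookkeeping change the description implies - an assumed shape, not the actual PR: decrement the per-stage-attempt counter again when a speculative task ends, instead of only dropping the whole entry on stage completion.

{code:java}
// Hedged sketch, not the actual Spark code or PR.
import scala.collection.mutable

object SpeculativeTaskBookkeeping {
  // keyed by (stageId, stageAttemptId): speculative tasks submitted and not yet ended
  private val stageAttemptToNumSpeculativeTasks = mutable.Map[(Int, Int), Int]()

  def onSpeculativeTaskSubmitted(stage: (Int, Int)): Unit =
    stageAttemptToNumSpeculativeTasks(stage) =
      stageAttemptToNumSpeculativeTasks.getOrElse(stage, 0) + 1

  // The missing decrement described above: a finished/failed/killed speculative
  // task must stop counting towards pendingSpeculativeTasks.
  def onSpeculativeTaskEnd(stage: (Int, Int)): Unit =
    stageAttemptToNumSpeculativeTasks(stage) =
      math.max(0, stageAttemptToNumSpeculativeTasks.getOrElse(stage, 0) - 1)

  def onStageCompleted(stage: (Int, Int)): Unit =
    stageAttemptToNumSpeculativeTasks -= stage
}
{code}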

 

 

 

  was:
*TL;DR*
 When speculative tasks finished/failed/got killed, they are still considered 
as pending and count towards the calculation of number of needed executors.
h3. Symptom

In one of our production job (where it's running 4 tasks per executor), we 
found that it was holding 6 executors at the end with only 2 tasks running (1 
speculative). With more logging enabled, we found the job printed:
{code:java}
pendingTasks is 0 pendingSpeculativeTasks is 17 totalRunningTasks is 2
{code}
 while the job only had 1 speculative task running and 16 speculative tasks 
intentionally killed because of corresponding original tasks had finished.

An easy repro of the issue (`--conf spark.speculation=true --conf 
spark.executor.cores=4 --conf spark.dynamicAllocation.maxExecutors=1000` in 
cluster mode):
{code:java}
val n = 4000
val someRDD = sc.parallelize(1 to n, n)
someRDD.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => {
if (index < 300 && index >= 150) {
Thread.sleep(index * 1000) // Fake running tasks
} else if (index == 300) {
Thread.sleep(1000 * 1000) // Fake long running tasks
}
it.toList.map(x => index + ", " + x).iterator
}).collect
{code}
You will see when running the last task, we would be hold 38 executors (see 
attachment), which is exactly (152 + 3) / 4 = 38.
h3. The Bug

Upon examining the code of _pendingSpeculativeTasks_: 
{code:java}
stageAttemptToNumSpeculativeTasks.map { case (stageAttempt, numTasks) =>
  numTasks - 
stageAttemptToSpeculativeTaskIndices.get(stageAttempt).map(_.size).getOrElse(0)
}.sum
{code}
where _stageAttemptToNumSpeculativeTasks(stageAttempt)_ is incremented on 
_onSpeculativeTaskSubmitted_, but never decremented.  
_stageAttemptToNumSpeculativeTasks -= stageAttempt_ is performed on stage 
completion. *This means Spark is marking ended speculative tasks as pending, 
which leads to Spark to hold more executors that it actually needs!*

I will have a PR ready to fix this issue, along with SPARK-2840 too

 

 

 


> Spark marks ended speculative tasks as pending leads to holding idle executors
> --
>
> Key: SPARK-30511
> URL: https://issues.apache.org/jira/browse/SPARK-30511
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.0
>Reporter: Zebing Lin
>Priority: Major
> Attachments: Screen Shot 2020-01-15 at 11.13.17.png
>
>
> *TL;DR*
>  When speculative tasks finished/failed/got killed, they are still considered 
> as pending and count towards the calculation of number of needed executors.
> h3. Symptom
> In one of our production job (where it's running 4 tasks per executor), we 
> found that it 

[jira] [Resolved] (SPARK-30495) How to disable 'spark.security.credentials.${service}.enabled' in Structured streaming while connecting to a kafka cluster

2020-01-15 Thread Marcelo Masiero Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Masiero Vanzin resolved SPARK-30495.

Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27191
[https://github.com/apache/spark/pull/27191]

> How to disable 'spark.security.credentials.${service}.enabled' in Structured 
> streaming while connecting to a kafka cluster
> --
>
> Key: SPARK-30495
> URL: https://issues.apache.org/jira/browse/SPARK-30495
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: act_coder
>Assignee: Gabor Somogyi
>Priority: Major
> Fix For: 3.0.0
>
>
> Trying to read data from a secured Kafka cluster using Spark Structured
>  Streaming, using the library below to read the data -
>  +*"spark-sql-kafka-0-10_2.12":"3.0.0-preview"*+ - since it has the feature to
>  specify our own consumer group id (instead of Spark setting its own group
>  id).
> +*Dependency used in code:*+
>         org.apache.spark
>          spark-sql-kafka-0-10_2.12
>          3.0.0-preview
>  
> +*Logs:*+
> Getting the below error - even after specifying the required JAAS
>  configuration in spark options.
> Caused by: java.lang.IllegalArgumentException: requirement failed:
>  *Delegation token must exist for this connector*. at
>  scala.Predef$.require(Predef.scala:281) at
> org.apache.spark.kafka010.KafkaTokenUtil$.isConnectorUsingCurrentToken(KafkaTokenUtil.scala:299)
>  at
>  
> org.apache.spark.sql.kafka010.KafkaDataConsumer.getOrRetrieveConsumer(KafkaDataConsumer.scala:533)
>  at
>  
> org.apache.spark.sql.kafka010.KafkaDataConsumer.$anonfun$get$1(KafkaDataConsumer.scala:275)
>  
> +*Spark configuration used to read from Kafka:*+
> val kafkaDF = sparkSession.readStream
>  .format("kafka")
>  .option("kafka.bootstrap.servers", bootStrapServer)
>  .option("subscribe", kafkaTopic )
>  
> //Setting JAAS Configuration
> .option("kafka.sasl.jaas.config", KAFKA_JAAS_SASL)
>  .option("kafka.sasl.mechanism", "PLAIN")
>  .option("kafka.security.protocol", "SASL_SSL")
> // Setting custom consumer group id
> .option("kafka.group.id", "test_cg")
>  .load()
>  
> Following document specifies that we can disable the feature of obtaining
>  delegation token -
>  
> [https://spark.apache.org/docs/3.0.0-preview/structured-streaming-kafka-integration.html]
> Tried setting this property *spark.security.credentials.kafka.enabled to*
>  *false in spark config,* but it is still failing with the same error.
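For reference (not part of the original report), a hedged sketch of where that property would normally be supplied; the ticket reports that with 3.0.0-preview this alone was not enough, which is what the pull request referenced above resolves.

{code:java}
// Hedged sketch, not from the ticket: the flag is ordinary Spark configuration,
// e.g. `spark-submit --conf spark.security.credentials.kafka.enabled=false ...`
// or set programmatically before the session is created (app name is a placeholder).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kafka-sasl-reader")
  .config("spark.security.credentials.kafka.enabled", "false")
  .getOrCreate()
{code}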



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30495) How to disable 'spark.security.credentials.${service}.enabled' in Structured streaming while connecting to a kafka cluster

2020-01-15 Thread Marcelo Masiero Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Masiero Vanzin reassigned SPARK-30495:
--

Assignee: Gabor Somogyi

> How to disable 'spark.security.credentials.${service}.enabled' in Structured 
> streaming while connecting to a kafka cluster
> --
>
> Key: SPARK-30495
> URL: https://issues.apache.org/jira/browse/SPARK-30495
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: act_coder
>Assignee: Gabor Somogyi
>Priority: Major
>
> Trying to read data from a secured Kafka cluster using Spark Structured
>  Streaming, using the library below to read the data -
>  +*"spark-sql-kafka-0-10_2.12":"3.0.0-preview"*+ - since it has the feature to
>  specify our own consumer group id (instead of Spark setting its own group
>  id).
> +*Dependency used in code:*+
>         org.apache.spark
>          spark-sql-kafka-0-10_2.12
>          3.0.0-preview
>  
> +*Logs:*+
> Getting the below error - even after specifying the required JAAS
>  configuration in spark options.
> Caused by: java.lang.IllegalArgumentException: requirement failed:
>  *Delegation token must exist for this connector*. at
>  scala.Predef$.require(Predef.scala:281) at
> org.apache.spark.kafka010.KafkaTokenUtil$.isConnectorUsingCurrentToken(KafkaTokenUtil.scala:299)
>  at
>  
> org.apache.spark.sql.kafka010.KafkaDataConsumer.getOrRetrieveConsumer(KafkaDataConsumer.scala:533)
>  at
>  
> org.apache.spark.sql.kafka010.KafkaDataConsumer.$anonfun$get$1(KafkaDataConsumer.scala:275)
>  
> +*Spark configuration used to read from Kafka:*+
> val kafkaDF = sparkSession.readStream
>  .format("kafka")
>  .option("kafka.bootstrap.servers", bootStrapServer)
>  .option("subscribe", kafkaTopic )
>  
> //Setting JAAS Configuration
> .option("kafka.sasl.jaas.config", KAFKA_JAAS_SASL)
>  .option("kafka.sasl.mechanism", "PLAIN")
>  .option("kafka.security.protocol", "SASL_SSL")
> // Setting custom consumer group id
> .option("kafka.group.id", "test_cg")
>  .load()
>  
> Following document specifies that we can disable the feature of obtaining
>  delegation token -
>  
> [https://spark.apache.org/docs/3.0.0-preview/structured-streaming-kafka-integration.html]
> Tried setting this property *spark.security.credentials.kafka.enabled to*
>  *false in spark config,* but it is still failing with the same error.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30190) HistoryServerDiskManager will fail on appStoreDir in s3

2020-01-15 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17016263#comment-17016263
 ] 

Steve Loughran commented on SPARK-30190:


S3A creates a dir marker and deletes it

But I'd rather you do the mkdir() call and, only if that fails, look at the dest 
(getFileStatus) and raise an exception if it isn't a directory.
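A hedged sketch of the mkdirs-then-check ordering suggested here, written against the Hadoop FileSystem API; the actual HistoryServerDiskManager check quoted below uses local java.io.File calls, so this is only an illustration.

{code:java}
// Hedged illustration, not the actual Spark code; paths/scheme are placeholders.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def ensureAppStoreDir(dir: String): Unit = {
  val path = new Path(dir)
  val fs: FileSystem = path.getFileSystem(new Configuration())
  // Try to create first; on S3A this just writes (and may later delete) a
  // directory marker and typically succeeds even if the "directory" exists.
  if (!fs.mkdirs(path)) {
    // Only if that fails, look at the destination and complain when it is
    // present but not a directory.
    if (!fs.getFileStatus(path).isDirectory) {
      throw new IllegalArgumentException(s"Failed to create app directory ($dir).")
    }
  }
}
{code}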



> HistoryServerDiskManager will fail on appStoreDir in s3
> ---
>
> Key: SPARK-30190
> URL: https://issues.apache.org/jira/browse/SPARK-30190
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: thierry accart
>Priority: Major
>
> Hi
> While setting spark.eventLog.dir to s3a://... I realized that it *requires the 
> destination directory to preexist for S3*. 
> This is explained, I think, by HistoryServerDiskManager's appStoreDir: it tries to 
> check whether the directory exists or can be created:
> {{if (!appStoreDir.isDirectory() && !appStoreDir.mkdir()) \{throw new 
> IllegalArgumentException(s"Failed to create app directory ($appStoreDir).")}}}
> But in S3, a directory does not exist and cannot be created: directories 
> don't exist by themselves; they are only materialized by the existence of 
> objects.
> Before proposing a patch, I wanted to know what the preferred options are: 
> should we have a Spark option to skip the appStoreDir test, skip it only 
> when a particular scheme is set, or have a custom implementation of 
> HistoryServerDiskManager ...? 
>  
> _Note for people facing the {{IllegalArgumentException:}} {{Failed to create 
> app directory}}: *you just have to put an empty file at the bucket destination 
> 'path'*._



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30511) Spark marks ended speculative tasks as pending leads to holding idle executors

2020-01-15 Thread Zebing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zebing Lin updated SPARK-30511:
---
Description: 
*TL;DR*
 When speculative tasks finished/failed/got killed, they are still considered 
as pending and count towards the calculation of number of needed executors.
h3. Symptom

In one of our production jobs (running 4 tasks per executor), we 
found that it was holding 6 executors at the end with only 2 tasks running (1 
speculative). With more logging enabled, we found the job printed:
{code:java}
pendingTasks is 0 pendingSpeculativeTasks is 17 totalRunningTasks is 2
{code}
 while the job only had 1 speculative task running and 16 speculative tasks 
intentionally killed because their corresponding original tasks had finished.

An easy repro of the issue (`--conf spark.speculation=true --conf 
spark.executor.cores=4 --conf spark.dynamicAllocation.maxExecutors=1000` in 
cluster mode):
{code:java}
val n = 4000
val someRDD = sc.parallelize(1 to n, n)
someRDD.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => {
if (index < 300 && index >= 150) {
Thread.sleep(index * 1000) // Fake running tasks
} else if (index == 300) {
Thread.sleep(1000 * 1000) // Fake long running tasks
}
it.toList.map(x => index + ", " + x).iterator
}).collect
{code}
You will see that when running the last task, we would be holding 38 executors (see 
attachment), which is exactly (152 + 3) / 4 = 38.
h3. The Bug

Upon examining the code of _pendingSpeculativeTasks_: 
{code:java}
stageAttemptToNumSpeculativeTasks.map { case (stageAttempt, numTasks) =>
  numTasks - 
stageAttemptToSpeculativeTaskIndices.get(stageAttempt).map(_.size).getOrElse(0)
}.sum
{code}
where _stageAttemptToNumSpeculativeTasks(stageAttempt)_ is incremented on 
_onSpeculativeTaskSubmitted_, but never decremented.  
_stageAttemptToNumSpeculativeTasks -= stageAttempt_ is performed on stage 
completion. *This means Spark is marking ended speculative tasks as pending, 
which leads Spark to hold more executors than it actually needs!*

I will have a PR ready to fix this issue, along with SPARK-28403 too

 

 

 

  was:
*TL;DR*
 When speculative tasks finished/failed/got killed, they are still considered 
as pending and count towards the calculation of number of needed executors.
h3. Symptom

In one of our production job (where it's running 4 tasks per executor), we 
found that it was holding 6 executors at the end with only 2 tasks running (1 
speculative). With more logging enabled, we found the job printed:
{code:java}
pendingTasks is 0 pendingSpeculativeTasks is 17 totalRunningTasks is 2
{code}
 while the job only had 1 speculative task running and 16 speculative tasks 
intentionally killed because of corresponding original tasks had finished.

An easy repro of the issue (`--conf spark.speculation=true --conf 
spark.executor.cores=4 --conf spark.dynamicAllocation.maxExecutors=1000` in 
cluster mode):
{code:java}
val n = 4000
val someRDD = sc.parallelize(1 to n, n)
someRDD.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => {
if (index < 300 && index >= 150) {
Thread.sleep(index * 1000) // Fake running tasks
} else if (index == 300) {
Thread.sleep(1000 * 1000) // Fake long running tasks
}
it.toList.map(x => index + ", " + x).iterator
}).collect
{code}
You will see when running the last task, we would be hold 39 executors (see 
attachment).
h3. The Bug

Upon examining the code of _pendingSpeculativeTasks_: 
{code:java}
stageAttemptToNumSpeculativeTasks.map { case (stageAttempt, numTasks) =>
  numTasks - 
stageAttemptToSpeculativeTaskIndices.get(stageAttempt).map(_.size).getOrElse(0)
}.sum
{code}
where _stageAttemptToNumSpeculativeTasks(stageAttempt)_ is incremented on 
_onSpeculativeTaskSubmitted_, but never decremented.  
_stageAttemptToNumSpeculativeTasks -= stageAttempt_ is performed on stage 
completion. *This means Spark is marking ended speculative tasks as pending, 
which leads to Spark to hold more executors that it actually needs!*

I will have a PR ready to fix this issue, along with SPARK-2840 too

 

 

 


> Spark marks ended speculative tasks as pending leads to holding idle executors
> --
>
> Key: SPARK-30511
> URL: https://issues.apache.org/jira/browse/SPARK-30511
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.0
>Reporter: Zebing Lin
>Priority: Major
> Attachments: Screen Shot 2020-01-15 at 11.13.17.png
>
>
> *TL;DR*
>  When speculative tasks finished/failed/got killed, they are still considered 
> as pending and count towards the calculation of number of needed executors.
> h3. Symptom
> In one of our production job (where it's running 4 tasks per executor), we 
> found that it was holding 6 executors at the end with 

[jira] [Updated] (SPARK-30511) Spark marks ended speculative tasks as pending leads to holding idle executors

2020-01-15 Thread Zebing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zebing Lin updated SPARK-30511:
---
Attachment: Screen Shot 2020-01-15 at 11.13.17.png

> Spark marks ended speculative tasks as pending leads to holding idle executors
> --
>
> Key: SPARK-30511
> URL: https://issues.apache.org/jira/browse/SPARK-30511
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.0
>Reporter: Zebing Lin
>Priority: Major
> Attachments: Screen Shot 2020-01-15 at 11.13.17.png
>
>
> *TL;DR*
>  When speculative tasks finished/failed/got killed, they are still considered 
> as pending and count towards the calculation of number of needed executors.
> h3. Symptom
> In one of our production jobs (running 4 tasks per executor), we 
> found that it was holding 6 executors at the end with only 2 tasks running (1 
> speculative). With more logging enabled, we found the job printed:
> {code:java}
> pendingTasks is 0 pendingSpeculativeTasks is 17 totalRunningTasks is 2
> {code}
>  while the job only had 1 speculative task running and 16 speculative tasks 
> intentionally killed because their corresponding original tasks had finished.
> An easy repro of the issue (`--conf spark.speculation=true --conf 
> spark.executor.cores=4 --conf spark.dynamicAllocation.maxExecutors=1000` in 
> cluster mode):
> {code:java}
> val n = 4000
> val someRDD = sc.parallelize(1 to n, n)
> someRDD.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => {
> if (index < 300 && index >= 150) {
> Thread.sleep(index * 1000) // Fake running tasks
> } else if (index == 300) {
> Thread.sleep(1000 * 1000) // Fake long running tasks
> }
> it.toList.map(x => index + ", " + x).iterator
> }).collect
> {code}
> You will see that when running the last task, we would be holding 39 executors (see 
> attachment).
> h3. The Bug
> Upon examining the code of _pendingSpeculativeTasks_: 
> {code:java}
> stageAttemptToNumSpeculativeTasks.map { case (stageAttempt, numTasks) =>
>   numTasks - 
> stageAttemptToSpeculativeTaskIndices.get(stageAttempt).map(_.size).getOrElse(0)
> }.sum
> {code}
> where _stageAttemptToNumSpeculativeTasks(stageAttempt)_ is incremented on 
> _onSpeculativeTaskSubmitted_, but never decremented.  
> _stageAttemptToNumSpeculativeTasks -= stageAttempt_ is performed on stage 
> completion. *This means Spark is marking ended speculative tasks as pending, 
> which leads Spark to hold more executors than it actually needs!*
> I will have a PR ready to fix this issue, along with SPARK-28403 too
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30511) Spark marks ended speculative tasks as pending leads to holding idle executors

2020-01-15 Thread Zebing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zebing Lin updated SPARK-30511:
---
Description: 
*TL;DR*
 When speculative tasks finished/failed/got killed, they are still considered 
as pending and count towards the calculation of number of needed executors.
h3. Symptom

In one of our production jobs (running 4 tasks per executor), we 
found that it was holding 6 executors at the end with only 2 tasks running (1 
speculative). With more logging enabled, we found the job printed:
{code:java}
pendingTasks is 0 pendingSpeculativeTasks is 17 totalRunningTasks is 2
{code}
 while the job only had 1 speculative task running and 16 speculative tasks 
intentionally killed because their corresponding original tasks had finished.

An easy repro of the issue (`--conf spark.speculation=true --conf 
spark.executor.cores=4 --conf spark.dynamicAllocation.maxExecutors=1000` in 
cluster mode):
{code:java}
val n = 4000
val someRDD = sc.parallelize(1 to n, n)
someRDD.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => {
if (index < 300 && index >= 150) {
Thread.sleep(index * 1000) // Fake running tasks
} else if (index == 300) {
Thread.sleep(1000 * 1000) // Fake long running tasks
}
it.toList.map(x => index + ", " + x).iterator
}).collect
{code}
You will see that when running the last task, we would be holding 39 executors:
h3. The Bug

Upon examining the code of _pendingSpeculativeTasks_: 
{code:java}
stageAttemptToNumSpeculativeTasks.map { case (stageAttempt, numTasks) =>
  numTasks - 
stageAttemptToSpeculativeTaskIndices.get(stageAttempt).map(_.size).getOrElse(0)
}.sum
{code}
where _stageAttemptToNumSpeculativeTasks(stageAttempt)_ is incremented on 
_onSpeculativeTaskSubmitted_, but never decremented.  
_stageAttemptToNumSpeculativeTasks -= stageAttempt_ is performed on stage 
completion. *This means Spark is marking ended speculative tasks as pending, 
which leads Spark to hold more executors than it actually needs!*

I will have a PR ready to fix this issue, along with SPARK-28403 too

 

 

 

  was:
*TL;DR*
 When speculative tasks finished/failed/got killed, they are still considered 
as pending and count towards the calculation of number of needed executors.
h3. Symptom

In one of our production job (where it's running 4 tasks per executor), we 
found that it was holding 6 executors at the end with only 2 tasks running (1 
speculative). With more logging enabled, we found the job printed:
{code:java}
pendingTasks is 0 pendingSpeculativeTasks is 17 totalRunningTasks is 2
{code}
 while the job only had 1 speculative task running and 16 speculative tasks 
intentionally killed because of corresponding original tasks had finished.

An easy repro of the issue (`--conf spark.speculation=true --conf 
spark.executor.cores=4 --conf spark.dynamicAllocation.maxExecutors=1000` in 
cluster mode):
{code:java}
val n = 4000
val someRDD = sc.parallelize(1 to n, n)
someRDD.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => {
if (index < 300 && index >= 150) {
Thread.sleep(index * 1000) // Fake running tasks
} else if (index == 300) {
Thread.sleep(1000 * 1000) // Fake long running tasks
}
it.toList.map(x => index + ", " + x).iterator
}).collect
{code}
You will see when running the last task, we would be hold 39 executors:

!image-2020-01-15-11-09-29-215.png!
h3. The Bug

Upon examining the code of _pendingSpeculativeTasks_: 
{code:java}
stageAttemptToNumSpeculativeTasks.map { case (stageAttempt, numTasks) =>
  numTasks - 
stageAttemptToSpeculativeTaskIndices.get(stageAttempt).map(_.size).getOrElse(0)
}.sum
{code}
where _stageAttemptToNumSpeculativeTasks(stageAttempt)_ is incremented on 
_onSpeculativeTaskSubmitted_, but never decremented.  
_stageAttemptToNumSpeculativeTasks -= stageAttempt_ is performed on stage 
completion. *This means Spark is marking ended speculative tasks as pending, 
which leads to Spark to hold more executors that it actually needs!*

I will have a PR ready to fix this issue, along with SPARK-2840 too

 

 

 


> Spark marks ended speculative tasks as pending leads to holding idle executors
> --
>
> Key: SPARK-30511
> URL: https://issues.apache.org/jira/browse/SPARK-30511
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.0
>Reporter: Zebing Lin
>Priority: Major
>
> *TL;DR*
>  When speculative tasks finished/failed/got killed, they are still considered 
> as pending and count towards the calculation of number of needed executors.
> h3. Symptom
> In one of our production job (where it's running 4 tasks per executor), we 
> found that it was holding 6 executors at the end with only 2 tasks running (1 
> speculative). With more logging enabled, we found the job printed:
> 

[jira] [Updated] (SPARK-30511) Spark marks ended speculative tasks as pending leads to holding idle executors

2020-01-15 Thread Zebing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zebing Lin updated SPARK-30511:
---
Description: 
*TL;DR*
 When speculative tasks finished/failed/got killed, they are still considered 
as pending and count towards the calculation of number of needed executors.
h3. Symptom

In one of our production jobs (running 4 tasks per executor), we 
found that it was holding 6 executors at the end with only 2 tasks running (1 
speculative). With more logging enabled, we found the job printed:
{code:java}
pendingTasks is 0 pendingSpeculativeTasks is 17 totalRunningTasks is 2
{code}
 while the job only had 1 speculative task running and 16 speculative tasks 
intentionally killed because their corresponding original tasks had finished.

An easy repro of the issue (`--conf spark.speculation=true --conf 
spark.executor.cores=4 --conf spark.dynamicAllocation.maxExecutors=1000` in 
cluster mode):
{code:java}
val n = 4000
val someRDD = sc.parallelize(1 to n, n)
someRDD.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => {
if (index < 300 && index >= 150) {
Thread.sleep(index * 1000) // Fake running tasks
} else if (index == 300) {
Thread.sleep(1000 * 1000) // Fake long running tasks
}
it.toList.map(x => index + ", " + x).iterator
}).collect
{code}
You will see that when running the last task, we would be holding 39 executors (see 
attachment).
h3. The Bug

Upon examining the code of _pendingSpeculativeTasks_: 
{code:java}
stageAttemptToNumSpeculativeTasks.map { case (stageAttempt, numTasks) =>
  numTasks - 
stageAttemptToSpeculativeTaskIndices.get(stageAttempt).map(_.size).getOrElse(0)
}.sum
{code}
where _stageAttemptToNumSpeculativeTasks(stageAttempt)_ is incremented on 
_onSpeculativeTaskSubmitted_, but never decremented.  
_stageAttemptToNumSpeculativeTasks -= stageAttempt_ is performed on stage 
completion. *This means Spark is marking ended speculative tasks as pending, 
which leads Spark to hold more executors than it actually needs!*

I will have a PR ready to fix this issue, along with SPARK-28403 too

 

 

 

  was:
*TL;DR*
 When speculative tasks finished/failed/got killed, they are still considered 
as pending and count towards the calculation of number of needed executors.
h3. Symptom

In one of our production job (where it's running 4 tasks per executor), we 
found that it was holding 6 executors at the end with only 2 tasks running (1 
speculative). With more logging enabled, we found the job printed:
{code:java}
pendingTasks is 0 pendingSpeculativeTasks is 17 totalRunningTasks is 2
{code}
 while the job only had 1 speculative task running and 16 speculative tasks 
intentionally killed because of corresponding original tasks had finished.

An easy repro of the issue (`--conf spark.speculation=true --conf 
spark.executor.cores=4 --conf spark.dynamicAllocation.maxExecutors=1000` in 
cluster mode):
{code:java}
val n = 4000
val someRDD = sc.parallelize(1 to n, n)
someRDD.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => {
if (index < 300 && index >= 150) {
Thread.sleep(index * 1000) // Fake running tasks
} else if (index == 300) {
Thread.sleep(1000 * 1000) // Fake long running tasks
}
it.toList.map(x => index + ", " + x).iterator
}).collect
{code}
You will see when running the last task, we would be hold 39 executors:
h3. The Bug

Upon examining the code of _pendingSpeculativeTasks_: 
{code:java}
stageAttemptToNumSpeculativeTasks.map { case (stageAttempt, numTasks) =>
  numTasks - 
stageAttemptToSpeculativeTaskIndices.get(stageAttempt).map(_.size).getOrElse(0)
}.sum
{code}
where _stageAttemptToNumSpeculativeTasks(stageAttempt)_ is incremented on 
_onSpeculativeTaskSubmitted_, but never decremented.  
_stageAttemptToNumSpeculativeTasks -= stageAttempt_ is performed on stage 
completion. *This means Spark is marking ended speculative tasks as pending, 
which leads to Spark to hold more executors that it actually needs!*

I will have a PR ready to fix this issue, along with SPARK-2840 too

 

 

 


> Spark marks ended speculative tasks as pending leads to holding idle executors
> --
>
> Key: SPARK-30511
> URL: https://issues.apache.org/jira/browse/SPARK-30511
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.0
>Reporter: Zebing Lin
>Priority: Major
>
> *TL;DR*
>  When speculative tasks finished/failed/got killed, they are still considered 
> as pending and count towards the calculation of number of needed executors.
> h3. Symptom
> In one of our production job (where it's running 4 tasks per executor), we 
> found that it was holding 6 executors at the end with only 2 tasks running (1 
> speculative). With more logging enabled, we found the job printed:
> {code:java}
> 

[jira] [Updated] (SPARK-30511) Spark marks ended speculative tasks as pending leads to holding idle executors

2020-01-15 Thread Zebing Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zebing Lin updated SPARK-30511:
---
Description: 
*TL;DR*
 When speculative tasks finished/failed/got killed, they are still considered 
as pending and count towards the calculation of number of needed executors.
h3. Symptom

In one of our production jobs (running 4 tasks per executor), we 
found that it was holding 6 executors at the end with only 2 tasks running (1 
speculative). With more logging enabled, we found the job printed:
{code:java}
pendingTasks is 0 pendingSpeculativeTasks is 17 totalRunningTasks is 2
{code}
 while the job only had 1 speculative task running and 16 speculative tasks 
intentionally killed because their corresponding original tasks had finished.

An easy repro of the issue (`--conf spark.speculation=true --conf 
spark.executor.cores=4 --conf spark.dynamicAllocation.maxExecutors=1000` in 
cluster mode):
{code:java}
val n = 4000
val someRDD = sc.parallelize(1 to n, n)
someRDD.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => {
if (index < 300 && index >= 150) {
Thread.sleep(index * 1000) // Fake running tasks
} else if (index == 300) {
Thread.sleep(1000 * 1000) // Fake long running tasks
}
it.toList.map(x => index + ", " + x).iterator
}).collect
{code}
You will see that when running the last task, we would be holding 39 executors:

!image-2020-01-15-11-09-29-215.png!
h3. The Bug

Upon examining the code of _pendingSpeculativeTasks_: 
{code:java}
stageAttemptToNumSpeculativeTasks.map { case (stageAttempt, numTasks) =>
  numTasks - 
stageAttemptToSpeculativeTaskIndices.get(stageAttempt).map(_.size).getOrElse(0)
}.sum
{code}
where _stageAttemptToNumSpeculativeTasks(stageAttempt)_ is incremented on 
_onSpeculativeTaskSubmitted_, but never decremented.  
_stageAttemptToNumSpeculativeTasks -= stageAttempt_ is performed on stage 
completion. *This means Spark is marking ended speculative tasks as pending, 
which leads Spark to hold more executors than it actually needs!*

I will have a PR ready to fix this issue, along with SPARK-28403 too

 

 

 

  was:
*TL;DR*
 When speculative tasks finished/failed/got killed, they are still considered 
as pending and count towards the calculation of number of needed executors.
h3. Symptom

In one of our production job (where it's running 4 tasks per executor), we 
found that it was holding 6 executors at the end with only 2 tasks running (1 
speculative). With more logging enabled, we found the job printed:
{code:java}
pendingTasks is 0 pendingSpeculativeTasks is 17 totalRunningTasks is 2
{code}
 while the job only had 1 speculative task running and 16 speculative tasks 
intentionally killed because of corresponding original tasks had finished.
h3. The Bug

Upon examining the code of _pendingSpeculativeTasks_: 
{code:java}
stageAttemptToNumSpeculativeTasks.map { case (stageAttempt, numTasks) =>
  numTasks - 
stageAttemptToSpeculativeTaskIndices.get(stageAttempt).map(_.size).getOrElse(0)
}.sum
{code}
where _stageAttemptToNumSpeculativeTasks(stageAttempt)_ is incremented on 
_onSpeculativeTaskSubmitted_, but never decremented.  
_stageAttemptToNumSpeculativeTasks -= stageAttempt_ is performed on stage 
completion. *This means Spark is marking ended speculative tasks as pending, 
which leads to Spark to hold more executors that it actually needs!*

I will have a PR ready to fix this issue, along with SPARK-2840 too

 

 

 


> Spark marks ended speculative tasks as pending leads to holding idle executors
> --
>
> Key: SPARK-30511
> URL: https://issues.apache.org/jira/browse/SPARK-30511
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.0
>Reporter: Zebing Lin
>Priority: Major
>
> *TL;DR*
>  When speculative tasks finished/failed/got killed, they are still considered 
> as pending and count towards the calculation of number of needed executors.
> h3. Symptom
> In one of our production job (where it's running 4 tasks per executor), we 
> found that it was holding 6 executors at the end with only 2 tasks running (1 
> speculative). With more logging enabled, we found the job printed:
> {code:java}
> pendingTasks is 0 pendingSpeculativeTasks is 17 totalRunningTasks is 2
> {code}
>  while the job only had 1 speculative task running and 16 speculative tasks 
> intentionally killed because of corresponding original tasks had finished.
> An easy repro of the issue (`--conf spark.speculation=true --conf 
> spark.executor.cores=4 --conf spark.dynamicAllocation.maxExecutors=1000` in 
> cluster mode):
> {code:java}
> val n = 4000
> val someRDD = sc.parallelize(1 to n, n)
> someRDD.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => {
> if (index < 300 && index >= 150) {
> 

[jira] [Resolved] (SPARK-30479) Apply compaction of event log to SQL events

2020-01-15 Thread Marcelo Masiero Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Masiero Vanzin resolved SPARK-30479.

Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27164
[https://github.com/apache/spark/pull/27164]

> Apply compaction of event log to SQL events
> ---
>
> Key: SPARK-30479
> URL: https://issues.apache.org/jira/browse/SPARK-30479
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.0.0
>
>
> This issue is to track the effort of compacting old event logs (and cleaning 
> up after compaction) without breaking the compatibility guarantee.
> This issue depends on SPARK-29779 and focuses on dealing with SQL events.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30479) Apply compaction of event log to SQL events

2020-01-15 Thread Marcelo Masiero Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Masiero Vanzin reassigned SPARK-30479:
--

Assignee: Jungtaek Lim

> Apply compaction of event log to SQL events
> ---
>
> Key: SPARK-30479
> URL: https://issues.apache.org/jira/browse/SPARK-30479
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
>
> This issue is to track the effort of compacting old event logs (and cleaning 
> up after compaction) without breaking the compatibility guarantee.
> This issue depends on SPARK-29779 and focuses on dealing with SQL events.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18165) Kinesis support in Structured Streaming

2020-01-15 Thread Sowmya (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-18165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17016213#comment-17016213
 ] 

Sowmya commented on SPARK-18165:


Hi [~itsvikramagr], can this library be used in production with Spark 2.4 code 
rather than from the command line? I tried to use it from Spark code (sbt in 
IntelliJ) with the snippet below and I get the error "Cannot find data source 
kinesis".

{code:java}
val spark = SparkSession.builder
  .master("local")
  .appName("Spark Kinesis Connector")
  .getOrCreate()
println(spark)

val kinesis = spark
  .readStream
  .format("kinesis")
  .option("streamName", "stream name")
  .option("endpointUrl", "kinesis endpoint")
  .option("startingposition", "TRIM_HORIZON")
  .load
{code}

> Kinesis support in Structured Streaming
> ---
>
> Key: SPARK-18165
> URL: https://issues.apache.org/jira/browse/SPARK-18165
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Lauren Moos
>Priority: Major
>
> Implement Kinesis based sources and sinks for Structured Streaming



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26570) Out of memory when InMemoryFileIndex bulkListLeafFiles

2020-01-15 Thread ShivaKumar SS (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17016200#comment-17016200
 ] 

ShivaKumar SS commented on SPARK-26570:
---

Unfortunately I have hit the same issue. 

[https://stackoverflow.com/questions/59757268/spark-read-multiple-column-partitioned-csv-files-out-of-memory]

Is there any workaround for this?
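(Editorial aside, not from the thread.) A common mitigation is to avoid pointing the reader at the whole table root so InMemoryFileIndex has fewer files to collect; a hedged sketch, with placeholder paths and assuming an existing SparkSession named spark:

{code:java}
// Hedged sketch, not from the ticket: placeholder paths, existing SparkSession `spark`.
// Reading only the needed partition directories keeps the driver-side listing small,
// while `basePath` preserves the partition columns in the result.
val df = spark.read
  .option("basePath", "s3a://bucket/table")
  .csv("s3a://bucket/table/year=2020/month=01")
{code}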

 

> Out of memory when InMemoryFileIndex bulkListLeafFiles
> --
>
> Key: SPARK-26570
> URL: https://issues.apache.org/jira/browse/SPARK-26570
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: deshanxiao
>Priority: Major
> Attachments: image-2019-10-13-18-41-22-090.png, 
> image-2019-10-13-18-45-33-770.png, image-2019-10-14-10-00-27-361.png, 
> image-2019-10-14-10-32-17-949.png, image-2019-10-14-10-47-47-684.png, 
> image-2019-10-14-10-50-47-567.png, image-2019-10-14-10-51-28-374.png, 
> screenshot-1.png
>
>
> *bulkListLeafFiles* will collect all FileStatus objects in memory for every query, 
> which may cause an OOM on the driver. I hit the problem using Spark 2.3.2. 
> The latest version may also have the problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30522) Spark Streaming dynamic executors override or take default kafka parameters in cluster mode

2020-01-15 Thread phanikumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

phanikumar updated SPARK-30522:
---
Summary: Spark Streaming dynamic executors override or take default kafka 
parameters in cluster mode  (was: Spark Streaming dynamic executors overried 
kafka parameters in cluster mode)

> Spark Streaming dynamic executors override or take default kafka parameters 
> in cluster mode
> ---
>
> Key: SPARK-30522
> URL: https://issues.apache.org/jira/browse/SPARK-30522
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.3.2
>Reporter: phanikumar
>Priority: Major
>
> I have written a Spark Streaming consumer to consume data from Kafka, and I 
> found a weird behavior in my logs: the Kafka topic has 3 partitions, and for 
> each partition an executor is launched by the Spark Streaming job.
> The executor with the first ID always takes the parameters I provided while 
> creating the streaming context, but the executors with IDs 2 and 3 always 
> override the Kafka parameters.
>    
> {code:java}
> 20/01/14 12:15:05 WARN StreamingContext: Dynamic Allocation is enabled for 
> this application. Enabling Dynamic allocation for Spark Streaming 
> applications can cause data loss if Write Ahead Log is not enabled for 
> non-replayable sour    ces like Flume. See the programming guide for details 
> on how to enable the Write Ahead Log.    20/01/14 12:15:05 INFO 
> FileBasedWriteAheadLog_ReceivedBlockTracker: Recovered 2 write ahead log 
> files from hdfs://tlabnamenode/checkpoint/receivedBlockMetadata    20/01/14 
> 12:15:05 INFO DirectKafkaInputDStream: Slide time = 5000 ms    20/01/14 
> 12:15:05 INFO DirectKafkaInputDStream: Storage level = Serialized 1x 
> Replicated    20/01/14 12:15:05 INFO DirectKafkaInputDStream: Checkpoint 
> interval = null    20/01/14 12:15:05 INFO DirectKafkaInputDStream: Remember 
> interval = 5000 ms    20/01/14 12:15:05 INFO DirectKafkaInputDStream: 
> Initialized and validated 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream@12665f3f    
> 20/01/14 12:15:05 INFO ForEachDStream: Slide time = 5000 ms    20/01/14 
> 12:15:05 INFO ForEachDStream: Storage level = Serialized 1x Replicated    
> 20/01/14 12:15:05 INFO ForEachDStream: Checkpoint interval = null    20/01/14 
> 12:15:05 INFO ForEachDStream: Remember interval = 5000 ms    20/01/14 
> 12:15:05 INFO ForEachDStream: Initialized and validated 
> org.apache.spark.streaming.dstream.ForEachDStream@a4d83ac    20/01/14 
> 12:15:05 INFO ConsumerConfig: ConsumerConfig values:             
> auto.commit.interval.ms = 5000            auto.offset.reset = latest          
>   bootstrap.servers = [1,2,3]            check.crcs = true            
> client.id = client-0            connections.max.idle.ms = 54            
> default.api.timeout.ms = 6            enable.auto.commit = false          
>   exclude.internal.topics = true            fetch.max.bytes = 52428800        
>     fetch.max.wait.ms = 500            fetch.min.bytes = 1            
> group.id = telemetry-streaming-service            heartbeat.interval.ms = 
> 3000            interceptor.classes = []            
> internal.leave.group.on.close = true            isolation.level = 
> read_uncommitted            key.deserializer = class 
> org.apache.kafka.common.serialization.StringDeserializer
>  
> {code}
> Here is the log for other executors.
>    
> {code:java}
>  20/01/14 12:15:04 INFO Executor: Starting executor ID 2 on host 1    
> 20/01/14 12:15:04 INFO Utils: Successfully started service 
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 40324.    
> 20/01/14 12:15:04 INFO NettyBlockTransferService: Server created on 1    
> 20/01/14 12:15:04 INFO BlockManager: Using 
> org.apache.spark.storage.RandomBlockReplicationPolicy for block replication 
> policy    20/01/14 12:15:04 INFO BlockManagerMaster: Registering BlockManager 
> BlockManagerId(2, matrix-hwork-data-05, 40324, None)    20/01/14 12:15:04 
> INFO BlockManagerMaster: Registered BlockManager BlockManagerId(2, 
> matrix-hwork-data-05, 40324, None)    20/01/14 12:15:04 INFO BlockManager: 
> external shuffle service port = 7447    20/01/14 12:15:04 INFO BlockManager: 
> Registering executor with local external shuffle service.    20/01/14 
> 12:15:04 INFO TransportClientFactory: Successfully created connection to 
> matrix-hwork-data-05/10.83.34.25:7447 after 1 ms (0 ms spent in bootstraps)   
>  20/01/14 12:15:04 INFO BlockManager: Initialized BlockManager: 
> 

[jira] [Created] (SPARK-30522) Spark Streaming dynamic executors overried kafka parameters in cluster mode

2020-01-15 Thread phanikumar (Jira)
phanikumar created SPARK-30522:
--

 Summary: Spark Streaming dynamic executors overried kafka 
parameters in cluster mode
 Key: SPARK-30522
 URL: https://issues.apache.org/jira/browse/SPARK-30522
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 2.3.2
Reporter: phanikumar


I have written a Spark Streaming consumer to consume data from Kafka, and I 
found a weird behavior in my logs: the Kafka topic has 3 partitions, and for 
each partition an executor is launched by the Spark Streaming job.
The executor with the first ID always takes the parameters I provided while 
creating the streaming context, but the executors with IDs 2 and 3 always override 
the Kafka parameters.
   
{code:java}
20/01/14 12:15:05 WARN StreamingContext: Dynamic Allocation is enabled for this 
application. Enabling Dynamic allocation for Spark Streaming applications can 
cause data loss if Write Ahead Log is not enabled for non-replayable sour    
ces like Flume. See the programming guide for details on how to enable the 
Write Ahead Log.    20/01/14 12:15:05 INFO 
FileBasedWriteAheadLog_ReceivedBlockTracker: Recovered 2 write ahead log files 
from hdfs://tlabnamenode/checkpoint/receivedBlockMetadata    20/01/14 12:15:05 
INFO DirectKafkaInputDStream: Slide time = 5000 ms    20/01/14 12:15:05 INFO 
DirectKafkaInputDStream: Storage level = Serialized 1x Replicated    20/01/14 
12:15:05 INFO DirectKafkaInputDStream: Checkpoint interval = null    20/01/14 
12:15:05 INFO DirectKafkaInputDStream: Remember interval = 5000 ms    20/01/14 
12:15:05 INFO DirectKafkaInputDStream: Initialized and validated 
org.apache.spark.streaming.kafka010.DirectKafkaInputDStream@12665f3f    
20/01/14 12:15:05 INFO ForEachDStream: Slide time = 5000 ms    20/01/14 
12:15:05 INFO ForEachDStream: Storage level = Serialized 1x Replicated    
20/01/14 12:15:05 INFO ForEachDStream: Checkpoint interval = null    20/01/14 
12:15:05 INFO ForEachDStream: Remember interval = 5000 ms    20/01/14 12:15:05 
INFO ForEachDStream: Initialized and validated 
org.apache.spark.streaming.dstream.ForEachDStream@a4d83ac    20/01/14 12:15:05 
INFO ConsumerConfig: ConsumerConfig values:             auto.commit.interval.ms 
= 5000            auto.offset.reset = latest            bootstrap.servers = 
[1,2,3]            check.crcs = true            client.id = client-0            
connections.max.idle.ms = 54            default.api.timeout.ms = 6      
      enable.auto.commit = false            exclude.internal.topics = true      
      fetch.max.bytes = 52428800            fetch.max.wait.ms = 500            
fetch.min.bytes = 1            group.id = telemetry-streaming-service           
 heartbeat.interval.ms = 3000            interceptor.classes = []            
internal.leave.group.on.close = true            isolation.level = 
read_uncommitted            key.deserializer = class 
org.apache.kafka.common.serialization.StringDeserializer
 
{code}
Here is the log for other executors.
   
{code:java}
 20/01/14 12:15:04 INFO Executor: Starting executor ID 2 on host 1    20/01/14 
12:15:04 INFO Utils: Successfully started service 
'org.apache.spark.network.netty.NettyBlockTransferService' on port 40324.    
20/01/14 12:15:04 INFO NettyBlockTransferService: Server created on 1    
20/01/14 12:15:04 INFO BlockManager: Using 
org.apache.spark.storage.RandomBlockReplicationPolicy for block replication 
policy    20/01/14 12:15:04 INFO BlockManagerMaster: Registering BlockManager 
BlockManagerId(2, matrix-hwork-data-05, 40324, None)    20/01/14 12:15:04 INFO 
BlockManagerMaster: Registered BlockManager BlockManagerId(2, 
matrix-hwork-data-05, 40324, None)    20/01/14 12:15:04 INFO BlockManager: 
external shuffle service port = 7447    20/01/14 12:15:04 INFO BlockManager: 
Registering executor with local external shuffle service.    20/01/14 12:15:04 
INFO TransportClientFactory: Successfully created connection to 
matrix-hwork-data-05/10.83.34.25:7447 after 1 ms (0 ms spent in bootstraps)    
20/01/14 12:15:04 INFO BlockManager: Initialized BlockManager: 
BlockManagerId(2, matrix-hwork-data-05, 40324, None)    20/01/14 12:15:19 INFO 
CoarseGrainedExecutorBackend: Got assigned task 1    20/01/14 12:15:19 INFO 
Executor: Running task 1.0 in stage 0.0 (TID 1)    20/01/14 12:15:19 INFO 
TorrentBroadcast: Started reading broadcast variable 0    20/01/14 12:15:19 
INFO TransportClientFactory: Successfully created connection to 
matrix-hwork-data-05/10.83.34.25:38759 after 2 ms (0 ms spent in bootstraps)    
20/01/14 12:15:20 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in 
memory (estimated size 8.1 KB, free 6.2 GB)    20/01/14 

[jira] [Resolved] (SPARK-30504) OneVsRest and OneVsRestModel _from_java and _to_java should handle weightCol

2020-01-15 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-30504.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27190
[https://github.com/apache/spark/pull/27190]

> OneVsRest and OneVsRestModel _from_java and _to_java should handle weightCol
> 
>
> Key: SPARK-30504
> URL: https://issues.apache.org/jira/browse/SPARK-30504
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.0.0
>
>
> Current behaviour
> {code:python}
> from pyspark.ml.classification import LogisticRegression, OneVsRest, 
> OneVsRestModel
> from pyspark.ml.linalg import DenseVector
> df = spark.createDataFrame([(0, 1, DenseVector([1.0, 0.0])), (0, 1, 
> DenseVector([1.0, 0.0]))], ("label", "w", "features"))
> ovr = OneVsRest(classifier=LogisticRegression()).setWeightCol("w")
> ovrm = ovr.fit(df)
> ovr.getWeightCol()
> ## 'w'
> ovrm.getWeightCol()
> ## 'w'
> ovr.write().overwrite().save("/tmp/ovr")
> ovr_ = OneVsRest.load("/tmp/ovr")
> ovr_.getWeightCol()
> ## KeyError   
> ## ...
> ## KeyError: Param(parent='OneVsRest_5145d56b6bd1', name='weightCol', 
> doc='weight column name. ...)
> ovrm.write().overwrite().save("/tmp/ovrm")
> ovrm_ = OneVsRestModel.load("/tmp/ovrm")
> ovrm_ .getWeightCol()
> ## KeyError   
> ## ...
> ## KeyError: Param(parent='OneVsRestModel_598c6d900fad', name='weightCol', 
> doc='weight column name ...
> {code}
> Expected behaviour:
> {{OneVsRest}} and {{OneVsRestModel}} loaded from disk should have 
> {{weightCol}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30504) OneVsRest and OneVsRestModel _from_java and _to_java should handle weightCol

2020-01-15 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-30504:
-
Priority: Minor  (was: Major)

> OneVsRest and OneVsRestModel _from_java and _to_java should handle weightCol
> 
>
> Key: SPARK-30504
> URL: https://issues.apache.org/jira/browse/SPARK-30504
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Minor
> Fix For: 3.0.0
>
>
> Current behaviour
> {code:python}
> from pyspark.ml.classification import LogisticRegression, OneVsRest, 
> OneVsRestModel
> from pyspark.ml.linalg import DenseVector
> df = spark.createDataFrame([(0, 1, DenseVector([1.0, 0.0])), (0, 1, 
> DenseVector([1.0, 0.0]))], ("label", "w", "features"))
> ovr = OneVsRest(classifier=LogisticRegression()).setWeightCol("w")
> ovrm = ovr.fit(df)
> ovr.getWeightCol()
> ## 'w'
> ovrm.getWeightCol()
> ## 'w'
> ovr.write().overwrite().save("/tmp/ovr")
> ovr_ = OneVsRest.load("/tmp/ovr")
> ovr_.getWeightCol()
> ## KeyError   
> ## ...
> ## KeyError: Param(parent='OneVsRest_5145d56b6bd1', name='weightCol', 
> doc='weight column name. ...)
> ovrm.write().overwrite().save("/tmp/ovrm")
> ovrm_ = OneVsRestModel.load("/tmp/ovrm")
> ovrm_ .getWeightCol()
> ## KeyError   
> ## ...
> ## KeyError: Param(parent='OneVsRestModel_598c6d900fad', name='weightCol', 
> doc='weight column name ...
> {code}
> Expected behaviour:
> {{OneVsRest}} and {{OneVsRestModel}} loaded from disk should have 
> {{weightCol}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30504) OneVsRest and OneVsRestModel _from_java and _to_java should handle weightCol

2020-01-15 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-30504:


Assignee: Maciej Szymkiewicz

> OneVsRest and OneVsRestModel _from_java and _to_java should handle weightCol
> 
>
> Key: SPARK-30504
> URL: https://issues.apache.org/jira/browse/SPARK-30504
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
>
> Current behaviour
> {code:python}
> from pyspark.ml.classification import LogisticRegression, OneVsRest, 
> OneVsRestModel
> from pyspark.ml.linalg import DenseVector
> df = spark.createDataFrame([(0, 1, DenseVector([1.0, 0.0])), (0, 1, 
> DenseVector([1.0, 0.0]))], ("label", "w", "features"))
> ovr = OneVsRest(classifier=LogisticRegression()).setWeightCol("w")
> ovrm = ovr.fit(df)
> ovr.getWeightCol()
> ## 'w'
> ovrm.getWeightCol()
> ## 'w'
> ovr.write().overwrite().save("/tmp/ovr")
> ovr_ = OneVsRest.load("/tmp/ovr")
> ovr_.getWeightCol()
> ## KeyError   
> ## ...
> ## KeyError: Param(parent='OneVsRest_5145d56b6bd1', name='weightCol', 
> doc='weight column name. ...)
> ovrm.write().overwrite().save("/tmp/ovrm")
> ovrm_ = OneVsRestModel.load("/tmp/ovrm")
> ovrm_ .getWeightCol()
> ## KeyError   
> ## ...
> ## KeyError: Param(parent='OneVsRestModel_598c6d900fad', name='weightCol', 
> doc='weight column name ...
> {code}
> Expected behaviour:
> {{OneVsRest}} and {{OneVsRestModel}} loaded from disk should have 
> {{weightCol}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30521) Eliminate deprecation warnings for ExpressionInfo

2020-01-15 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-30521:
--

 Summary: Eliminate deprecation warnings for ExpressionInfo
 Key: SPARK-30521
 URL: https://issues.apache.org/jira/browse/SPARK-30521
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


{code}
Warning:(335, 5) constructor ExpressionInfo in class ExpressionInfo is 
deprecated: see corresponding Javadoc for more information.
new ExpressionInfo("noClass", "myDb", "myFunction", "usage", "extended 
usage"),
Warning:(732, 5) constructor ExpressionInfo in class ExpressionInfo is 
deprecated: see corresponding Javadoc for more information.
new ExpressionInfo("noClass", "myDb", "myFunction2", "usage", "extended 
usage"),
Warning:(751, 5) constructor ExpressionInfo in class ExpressionInfo is 
deprecated: see corresponding Javadoc for more information.
new ExpressionInfo("noClass", "myDb", "myFunction2", "usage", "extended 
usage"),
{code}
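
A typical way to silence these warnings is to move the test helpers off the deprecated five-argument constructor. A minimal hedged sketch in Scala, assuming the three-argument constructor stays non-deprecated and that the dropped usage/extended strings are not needed by the affected tests:

{code:scala}
import org.apache.spark.sql.catalyst.expressions.ExpressionInfo

// Deprecated form flagged above:
//   new ExpressionInfo("noClass", "myDb", "myFunction", "usage", "extended usage")

// Assumed replacement when only the identity fields matter; tests that really
// need usage/examples should switch to the richer non-deprecated constructor
// described in the ExpressionInfo Javadoc.
val info = new ExpressionInfo("noClass", "myDb", "myFunction")
{code}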



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30520) Eliminate deprecation warnings for UserDefinedAggregateFunction

2020-01-15 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-30520:
--

 Summary: Eliminate deprecation warnings for 
UserDefinedAggregateFunction
 Key: SPARK-30520
 URL: https://issues.apache.org/jira/browse/SPARK-30520
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


{code}
/Users/maxim/proj/eliminate-expr-info-warnings/sql/core/src/main/scala/org/apache/spark/sql/UDFRegistration.scala
Warning:Warning:line (718)class UserDefinedAggregateFunction in package 
expressions is deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now be 
registered as a UDF via the functions.udaf(agg) method.
  val udaf = 
clazz.getConstructor().newInstance().asInstanceOf[UserDefinedAggregateFunction]
Warning:Warning:line (719)method register in class UDFRegistration is 
deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now be registered as 
a UDF via the functions.udaf(agg) method.
  register(name, udaf)
/Users/maxim/proj/eliminate-expr-info-warnings/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/udaf.scala
Warning:Warning:line (328)class UserDefinedAggregateFunction in package 
expressions is deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now be 
registered as a UDF via the functions.udaf(agg) method.
udaf: UserDefinedAggregateFunction,
Warning:Warning:line (326)class UserDefinedAggregateFunction in package 
expressions is deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now be 
registered as a UDF via the functions.udaf(agg) method.
case class ScalaUDAF(
/Users/maxim/proj/eliminate-expr-info-warnings/sql/core/src/test/scala/org/apache/spark/sql/DataFrameWindowFunctionsSuite.scala
Warning:Warning:line (363)class UserDefinedAggregateFunction in package 
expressions is deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now be 
registered as a UDF via the functions.udaf(agg) method.
val udaf = new UserDefinedAggregateFunction {
/Users/maxim/proj/eliminate-expr-info-warnings/sql/core/src/test/java/test/org/apache/spark/sql/MyDoubleSum.java
Warning:Warning:line (25)java: 
org.apache.spark.sql.expressions.UserDefinedAggregateFunction in 
org.apache.spark.sql.expressions has been deprecated
Warning:Warning:line (35)java: 
org.apache.spark.sql.expressions.UserDefinedAggregateFunction in 
org.apache.spark.sql.expressions has been deprecated
/Users/maxim/proj/eliminate-expr-info-warnings/sql/core/src/test/java/test/org/apache/spark/sql/MyDoubleAvg.java
Warning:Warning:line (25)java: 
org.apache.spark.sql.expressions.UserDefinedAggregateFunction in 
org.apache.spark.sql.expressions has been deprecated
Warning:Warning:line (36)java: 
org.apache.spark.sql.expressions.UserDefinedAggregateFunction in 
org.apache.spark.sql.expressions has been deprecated
/Users/maxim/proj/eliminate-expr-info-warnings/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/AggregationQuerySuite.scala
Warning:Warning:line (36)class UserDefinedAggregateFunction in package 
expressions is deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now be 
registered as a UDF via the functions.udaf(agg) method.
class ScalaAggregateFunction(schema: StructType) extends 
UserDefinedAggregateFunction {
Warning:Warning:line (73)class UserDefinedAggregateFunction in package 
expressions is deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now be 
registered as a UDF via the functions.udaf(agg) method.
class ScalaAggregateFunctionWithoutInputSchema extends 
UserDefinedAggregateFunction {
Warning:Warning:line (100)class UserDefinedAggregateFunction in package 
expressions is deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now be 
registered as a UDF via the functions.udaf(agg) method.
class LongProductSum extends UserDefinedAggregateFunction {
Warning:Warning:line (189)method register in class UDFRegistration is 
deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now be registered as 
a UDF via the functions.udaf(agg) method.
spark.udf.register("mydoublesum", new MyDoubleSum)
Warning:Warning:line (190)method register in class UDFRegistration is 
deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now be registered as 
a UDF via the functions.udaf(agg) method.
spark.udf.register("mydoubleavg", new MyDoubleAvg)
Warning:Warning:line (191)method register in class UDFRegistration is 
deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now be registered as 
a UDF via the functions.udaf(agg) method.
spark.udf.register("longProductSum", new LongProductSum)
Warning:Warning:line (943)method register in class UDFRegistration is 
deprecated (since 3.0.0): Aggregator[IN, BUF, OUT] should now be registered as 
a UDF via the functions.udaf(agg) method.
  spark.udf.register("noInputSchema", new 
ScalaAggregateFunctionWithoutInputSchema)
{code}
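
For context, the registration path the deprecation messages point at is functions.udaf over a typed Aggregator. A minimal hedged sketch of migrating a simple sum-style UDAF (the object name and data are illustrative, not the suites' classes):

{code:scala}
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.udaf

// Typed Aggregator playing the role of a simple UserDefinedAggregateFunction.
object DoubleSum extends Aggregator[Double, Double, Double] {
  def zero: Double = 0.0
  def reduce(buffer: Double, value: Double): Double = buffer + value
  def merge(b1: Double, b2: Double): Double = b1 + b2
  def finish(buffer: Double): Double = buffer
  def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Registration via functions.udaf, as suggested by the warning text.
spark.udf.register("mydoublesum", udaf(DoubleSum))
spark.sql("SELECT mydoublesum(v) FROM VALUES (1.0), (2.0), (3.5) AS t(v)").show()
{code}

The same Aggregator can also be applied directly to a typed Dataset via DoubleSum.toColumn, so one definition covers both the SQL and the Dataset paths.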

[jira] [Updated] (SPARK-30165) Eliminate compilation warnings

2020-01-15 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-30165:
---
Attachment: spark_warnings.txt

> Eliminate compilation warnings
> --
>
> Key: SPARK-30165
> URL: https://issues.apache.org/jira/browse/SPARK-30165
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Minor
> Attachments: spark_warnings.txt
>
>
> This is an umbrella ticket for sub-tasks that eliminate compilation warnings. 
> I dumped all warnings into the spark_warnings.txt file attached to the ticket.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30165) Eliminate compilation warnings

2020-01-15 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-30165:
---
Attachment: (was: spark_warnings.txt)

> Eliminate compilation warnings
> --
>
> Key: SPARK-30165
> URL: https://issues.apache.org/jira/browse/SPARK-30165
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Minor
> Attachments: spark_warnings.txt
>
>
> This is an umbrella ticket for sub-tasks that eliminate compilation warnings. 
> I dumped all warnings into the spark_warnings.txt file attached to the ticket.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29708) Different answers in aggregates of duplicate grouping sets

2020-01-15 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-29708.
--
Fix Version/s: 3.0.0
 Assignee: Takeshi Yamamuro
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/26961

> Different answers in aggregates of duplicate grouping sets
> --
>
> Key: SPARK-29708
> URL: https://issues.apache.org/jira/browse/SPARK-29708
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
>  Labels: correctness
> Fix For: 3.0.0
>
>
> The query below, which uses duplicate grouping sets, returns different answers 
> in PgSQL and Spark:
> {code:java}
> postgres=# create table gstest4(id integer, v integer, unhashable_col bit(4), 
> unsortable_col xid);
> postgres=# insert into gstest4
> postgres-# values (1,1,b'0000','1'), (2,2,b'0001','1'),
> postgres-#(3,4,b'0010','2'), (4,8,b'0011','2'),
> postgres-#(5,16,b'0000','2'), (6,32,b'0001','2'),
> postgres-#(7,64,b'0010','1'), (8,128,b'0011','1');
> INSERT 0 8
> postgres=# select unsortable_col, count(*)
> postgres-#   from gstest4 group by grouping sets 
> ((unsortable_col),(unsortable_col))
> postgres-#   order by text(unsortable_col);
>  unsortable_col | count 
> ----------------+-------
>               1 |     8
>               1 |     8
>               2 |     8
>               2 |     8
> (4 rows)
> {code}
> {code:java}
> scala> sql("""create table gstest4(id integer, v integer, unhashable_col /* 
> bit(4) */ byte, unsortable_col /* xid */ integer) using parquet""")
> scala> sql("""
>  | insert into gstest4
>  | values (1,1,tinyint('0'),1), (2,2,tinyint('1'),1),
>  |(3,4,tinyint('2'),2), (4,8,tinyint('3'),2),
>  |(5,16,tinyint('0'),2), (6,32,tinyint('1'),2),
>  |(7,64,tinyint('2'),1), (8,128,tinyint('3'),1)
>  | """)
> res21: org.apache.spark.sql.DataFrame = []
> scala> 
> scala> sql("""
>  | select unsortable_col, count(*)
>  |   from gstest4 group by grouping sets 
> ((unsortable_col),(unsortable_col))
>  |   order by string(unsortable_col)
>  | """).show
> +--------------+--------+
> |unsortable_col|count(1)|
> +--------------+--------+
> |             1|       8|
> |             2|       8|
> +--------------+--------+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30515) Refactor SimplifyBinaryComparison to reduce the time complexity

2020-01-15 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-30515.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/27212

> Refactor SimplifyBinaryComparison to reduce the time complexity
> ---
>
> Key: SPARK-30515
> URL: https://issues.apache.org/jira/browse/SPARK-30515
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
> Fix For: 3.0.0
>
>
> The improvement of the rule SimplifyBinaryComparison in PR 
> https://github.com/apache/spark/pull/27008 could introduce a performance 
> regression in the optimizer. The implementation needs to be improved to reduce 
> its time complexity.
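> For orientation, the rule folds self-comparisons on deterministic, non-nullable 
> expressions (for example {{a = a}} to {{TRUE}} and {{a > a}} to {{FALSE}}). 
> A minimal hedged sketch of observing the rewrite, with an illustrative query:
> {code:scala}
> import org.apache.spark.sql.SparkSession
> 
> val spark = SparkSession.builder().master("local[*]").getOrCreate()
> import spark.implicits._
> 
> // `id` from range() is deterministic and non-nullable, so the predicate
> // should be folded to TRUE and the optimized plan should carry no Filter.
> spark.range(10).where($"id" === $"id").explain(true)
> {code}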



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30177) Eliminate warnings: part 7

2020-01-15 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-30177:
---
Summary: Eliminate warnings: part 7  (was: Eliminate warnings: part7)

> Eliminate warnings: part 7
> --
>
> Key: SPARK-30177
> URL: https://issues.apache.org/jira/browse/SPARK-30177
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> /mllib/src/test/scala/org/apache/spark/ml/clustering/BisectingKMeansSuite.scala
> Warning:Warning:line (108)method computeCost in class 
> BisectingKMeansModel is deprecated (since 3.0.0): This method is deprecated 
> and will be removed in future versions. Use ClusteringEvaluator instead. You 
> can also get the cost on the training dataset in the summary.
> assert(model.computeCost(dataset) < 0.1)
> Warning:Warning:line (135)method computeCost in class 
> BisectingKMeansModel is deprecated (since 3.0.0): This method is deprecated 
> and will be removed in future versions. Use ClusteringEvaluator instead. You 
> can also get the cost on the training dataset in the summary.
> assert(model.computeCost(dataset) == summary.trainingCost)
> Warning:Warning:line (195)method computeCost in class 
> BisectingKMeansModel is deprecated (since 3.0.0): This method is deprecated 
> and will be removed in future versions. Use ClusteringEvaluator instead. You 
> can also get the cost on the training dataset in the summary.
>   model.computeCost(dataset)
> 
> /sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
> Warning:Warning:line (105)Java enum ALLOW_UNQUOTED_CONTROL_CHARS in Java 
> enum Feature is deprecated: see corresponding Javadoc for more information.
>   jsonFactory.enable(JsonParser.Feature.ALLOW_UNQUOTED_CONTROL_CHARS)
> /sql/core/src/test/java/test/org/apache/spark/sql/Java8DatasetAggregatorSuite.java
> Warning:Warning:line (28)java: 
> org.apache.spark.sql.expressions.javalang.typed in 
> org.apache.spark.sql.expressions.javalang has been deprecated
> Warning:Warning:line (37)java: 
> org.apache.spark.sql.expressions.javalang.typed in 
> org.apache.spark.sql.expressions.javalang has been deprecated
> Warning:Warning:line (46)java: 
> org.apache.spark.sql.expressions.javalang.typed in 
> org.apache.spark.sql.expressions.javalang has been deprecated
> Warning:Warning:line (55)java: 
> org.apache.spark.sql.expressions.javalang.typed in 
> org.apache.spark.sql.expressions.javalang has been deprecated
> Warning:Warning:line (64)java: 
> org.apache.spark.sql.expressions.javalang.typed in 
> org.apache.spark.sql.expressions.javalang has been deprecated
> /sql/core/src/test/java/test/org/apache/spark/sql/JavaTestUtils.java
> Information:Information:java: 
> /Users/maxim/proj/eliminate-warning/sql/core/src/test/java/test/org/apache/spark/sql/JavaTestUtils.java
>  uses unchecked or unsafe operations.
> Information:Information:java: Recompile with -Xlint:unchecked for details.
> /sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java
> Warning:Warning:line (478)java: 
> json(org.apache.spark.api.java.JavaRDD) in 
> org.apache.spark.sql.DataFrameReader has been deprecated
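> For the {{computeCost}} warnings at the top of this list, the deprecation 
> message itself names the replacements ({{summary.trainingCost}} or 
> {{ClusteringEvaluator}}). A minimal hedged sketch with toy data (feature 
> values are illustrative):
> {code:scala}
> import org.apache.spark.ml.clustering.BisectingKMeans
> import org.apache.spark.ml.evaluation.ClusteringEvaluator
> import org.apache.spark.ml.linalg.Vectors
> import org.apache.spark.sql.SparkSession
> 
> val spark = SparkSession.builder().master("local[*]").getOrCreate()
> 
> // Tiny illustrative dataset with two obvious clusters.
> val dataset = spark.createDataFrame(Seq(
>   (0, Vectors.dense(0.0, 0.0)),
>   (1, Vectors.dense(0.1, 0.1)),
>   (2, Vectors.dense(9.0, 9.0)),
>   (3, Vectors.dense(9.1, 9.1))
> )).toDF("id", "features")
> 
> val model = new BisectingKMeans().setK(2).setSeed(1L).fit(dataset)
> 
> // Replacements suggested by the deprecation message:
> val trainingCost = model.summary.trainingCost
> val silhouette = new ClusteringEvaluator().evaluate(model.transform(dataset))
> println(s"trainingCost=$trainingCost, silhouette=$silhouette")
> {code}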



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30172) Eliminate warnings: part 3

2020-01-15 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-30172:
---
Summary: Eliminate warnings: part 3  (was: Eliminate warnings: part3)

> Eliminate warnings: part 3
> --
>
> Key: SPARK-30172
> URL: https://issues.apache.org/jira/browse/SPARK-30172
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> /sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala
> Warning:Warning:line (422)method initialize in class AbstractSerDe is 
> deprecated: see corresponding Javadoc for more information.
> serde.initialize(null, properties)
> /sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUDFs.scala
> Warning:Warning:line (216)method initialize in class GenericUDTF is 
> deprecated: see corresponding Javadoc for more information.
>   protected lazy val outputInspector = 
> function.initialize(inputInspectors.toArray)
> Warning:Warning:line (342)class UDAF in package exec is deprecated: see 
> corresponding Javadoc for more information.
>   new GenericUDAFBridge(funcWrapper.createFunction[UDAF]())
> Warning:Warning:line (503)trait AggregationBuffer in class 
> GenericUDAFEvaluator is deprecated: see corresponding Javadoc for more 
> information.
> def serialize(buffer: AggregationBuffer): Array[Byte] = {
> Warning:Warning:line (523)trait AggregationBuffer in class 
> GenericUDAFEvaluator is deprecated: see corresponding Javadoc for more 
> information.
> def deserialize(bytes: Array[Byte]): AggregationBuffer = {
> Warning:Warning:line (538)trait AggregationBuffer in class 
> GenericUDAFEvaluator is deprecated: see corresponding Javadoc for more 
> information.
> case class HiveUDAFBuffer(buf: AggregationBuffer, canDoMerge: Boolean)
> Warning:Warning:line (538)trait AggregationBuffer in class 
> GenericUDAFEvaluator is deprecated: see corresponding Javadoc for more 
> information.
> case class HiveUDAFBuffer(buf: AggregationBuffer, canDoMerge: Boolean)
> /sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/orc/SparkOrcNewRecordReader.java
> Warning:Warning:line (44)java: getTypes() in org.apache.orc.Reader has 
> been deprecated
> Warning:Warning:line (47)java: getTypes() in org.apache.orc.Reader has 
> been deprecated
> /sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
> Warning:Warning:line (2,368)method readFooter in class ParquetFileReader 
> is deprecated: see corresponding Javadoc for more information.
> val footer = ParquetFileReader.readFooter(
> /sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUDAFSuite.scala
> Warning:Warning:line (202)trait AggregationBuffer in class 
> GenericUDAFEvaluator is deprecated: see corresponding Javadoc for more 
> information.
>   override def getNewAggregationBuffer: AggregationBuffer = new 
> MockUDAFBuffer(0L, 0L)
> Warning:Warning:line (204)trait AggregationBuffer in class 
> GenericUDAFEvaluator is deprecated: see corresponding Javadoc for more 
> information.
>   override def reset(agg: AggregationBuffer): Unit = {
> Warning:Warning:line (212)trait AggregationBuffer in class 
> GenericUDAFEvaluator is deprecated: see corresponding Javadoc for more 
> information.
>   override def iterate(agg: AggregationBuffer, parameters: Array[AnyRef]): 
> Unit = {
> Warning:Warning:line (221)trait AggregationBuffer in class 
> GenericUDAFEvaluator is deprecated: see corresponding Javadoc for more 
> information.
>   override def merge(agg: AggregationBuffer, partial: Object): Unit = {
> Warning:Warning:line (231)trait AggregationBuffer in class 
> GenericUDAFEvaluator is deprecated: see corresponding Javadoc for more 
> information.
>   override def terminatePartial(agg: AggregationBuffer): AnyRef = {
> Warning:Warning:line (236)trait AggregationBuffer in class 
> GenericUDAFEvaluator is deprecated: see corresponding Javadoc for more 
> information.
>   override def terminate(agg: AggregationBuffer): AnyRef = 
> terminatePartial(agg)
> Warning:Warning:line (257)trait AggregationBuffer in class 
> GenericUDAFEvaluator is deprecated: see corresponding Javadoc for more 
> information.
>   override def getNewAggregationBuffer: AggregationBuffer = {
> Warning:Warning:line (266)trait AggregationBuffer in class 
> GenericUDAFEvaluator is deprecated: see corresponding Javadoc for more 
> information.
>   override def reset(agg: AggregationBuffer): Unit = {
> Warning:Warning:line (277)trait AggregationBuffer in class 
> GenericUDAFEvaluator is deprecated: see corresponding Javadoc for more 
> information.
>   override def iterate(agg: AggregationBuffer, parameters: Array[AnyRef]): 
> Unit = {

[jira] [Updated] (SPARK-30174) Eliminate warnings: part 4

2020-01-15 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-30174:
---
Summary: Eliminate warnings: part 4  (was: Eliminate warnings :part 4)

> Eliminate warnings: part 4
> --
>
> Key: SPARK-30174
> URL: https://issues.apache.org/jira/browse/SPARK-30174
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jobit mathew
>Priority: Minor
>
> sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
> {code:java}
> Warning:Warning:line (127)value ENABLE_JOB_SUMMARY in class 
> ParquetOutputFormat is deprecated: see corresponding Javadoc for more 
> information.
>   && conf.get(ParquetOutputFormat.ENABLE_JOB_SUMMARY) == null) {
> Warning:Warning:line (261)class ParquetInputSplit in package hadoop is 
> deprecated: see corresponding Javadoc for more information.
> new org.apache.parquet.hadoop.ParquetInputSplit(
> Warning:Warning:line (272)method readFooter in class ParquetFileReader is 
> deprecated: see corresponding Javadoc for more information.
> ParquetFileReader.readFooter(sharedConf, filePath, 
> SKIP_ROW_GROUPS).getFileMetaData
> Warning:Warning:line (442)method readFooter in class ParquetFileReader is 
> deprecated: see corresponding Javadoc for more information.
>   ParquetFileReader.readFooter(
> {code}
> sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetWriteBuilder.scala
> {code:java}
>  Warning:Warning:line (91)value ENABLE_JOB_SUMMARY in class 
> ParquetOutputFormat is deprecated: see corresponding Javadoc for more 
> information.
>   && conf.get(ParquetOutputFormat.ENABLE_JOB_SUMMARY) == null) {
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30171) Eliminate warnings: part 2

2020-01-15 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-30171:
---
Summary: Eliminate warnings: part 2  (was: Eliminate warnings: part2)

> Eliminate warnings: part 2
> --
>
> Key: SPARK-30171
> URL: https://issues.apache.org/jira/browse/SPARK-30171
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> AvroFunctionsSuite.scala
> Warning:Warning:line (41)method to_avro in package avro is deprecated (since 
> 3.0.0): Please use 'org.apache.spark.sql.avro.functions.to_avro' instead.
> val avroDF = df.select(to_avro('id).as("a"), to_avro('str).as("b"))
> Warning:Warning:line (41)method to_avro in package avro is deprecated 
> (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.to_avro' 
> instead.
> val avroDF = df.select(to_avro('id).as("a"), to_avro('str).as("b"))
> Warning:Warning:line (54)method from_avro in package avro is deprecated 
> (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.from_avro' 
> instead.
> checkAnswer(avroDF.select(from_avro('a, avroTypeLong), from_avro('b, 
> avroTypeStr)), df)
> Warning:Warning:line (54)method from_avro in package avro is deprecated 
> (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.from_avro' 
> instead.
> checkAnswer(avroDF.select(from_avro('a, avroTypeLong), from_avro('b, 
> avroTypeStr)), df)
> Warning:Warning:line (59)method to_avro in package avro is deprecated 
> (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.to_avro' 
> instead.
> val avroStructDF = df.select(to_avro('struct).as("avro"))
> Warning:Warning:line (70)method from_avro in package avro is deprecated 
> (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.from_avro' 
> instead.
> checkAnswer(avroStructDF.select(from_avro('avro, avroTypeStruct)), df)
> Warning:Warning:line (76)method to_avro in package avro is deprecated 
> (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.to_avro' 
> instead.
> val avroStructDF = df.select(to_avro('struct).as("avro"))
> Warning:Warning:line (118)method to_avro in package avro is deprecated 
> (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.to_avro' 
> instead.
> val readBackOne = dfOne.select(to_avro($"array").as("avro"))
> Warning:Warning:line (119)method from_avro in package avro is deprecated 
> (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.from_avro' 
> instead.
>   .select(from_avro($"avro", avroTypeArrStruct).as("array"))
> AvroPartitionReaderFactory.scala
> Warning:Warning:line (64)value ignoreExtension in class AvroOptions is 
> deprecated (since 3.0): Use the general data source option pathGlobFilter for 
> filtering file names
> if (parsedOptions.ignoreExtension || 
> partitionedFile.filePath.endsWith(".avro")) {
> AvroFileFormat.scala
> Warning:Warning:line (98)value ignoreExtension in class AvroOptions is 
> deprecated (since 3.0): Use the general data source option pathGlobFilter for 
> filtering file names
>   if (parsedOptions.ignoreExtension || file.filePath.endsWith(".avro")) {
> AvroUtils.scala
> Warning:Warning:line (55)value ignoreExtension in class AvroOptions is 
> deprecated (since 3.0): Use the general data source option pathGlobFilter for 
> filtering file names
> inferAvroSchemaFromFiles(files, conf, parsedOptions.ignoreExtension,
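> For the {{to_avro}}/{{from_avro}} warnings above, the non-deprecated functions 
> live in {{org.apache.spark.sql.avro.functions}}. A minimal hedged round-trip 
> sketch, assuming the external spark-avro module is on the classpath; the 
> schema string and data are illustrative:
> {code:scala}
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.avro.functions.{from_avro, to_avro}
> 
> val spark = SparkSession.builder().master("local[*]").getOrCreate()
> import spark.implicits._
> 
> val df = Seq("a", "b", "c").toDF("str")
> val avroTypeStr = """{"type": "string"}"""  // illustrative Avro schema
> 
> val encoded = df.select(to_avro($"str").as("avro"))
> val decoded = encoded.select(from_avro($"avro", avroTypeStr).as("str"))
> decoded.show()
> {code}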



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30174) Eliminate warnings :part 4

2020-01-15 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk resolved SPARK-30174.

Resolution: Duplicate

> Eliminate warnings :part 4
> --
>
> Key: SPARK-30174
> URL: https://issues.apache.org/jira/browse/SPARK-30174
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jobit mathew
>Priority: Minor
>
> sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
> {code:java}
> Warning:Warning:line (127)value ENABLE_JOB_SUMMARY in class 
> ParquetOutputFormat is deprecated: see corresponding Javadoc for more 
> information.
>   && conf.get(ParquetOutputFormat.ENABLE_JOB_SUMMARY) == null) {
> Warning:Warning:line (261)class ParquetInputSplit in package hadoop is 
> deprecated: see corresponding Javadoc for more information.
> new org.apache.parquet.hadoop.ParquetInputSplit(
> Warning:Warning:line (272)method readFooter in class ParquetFileReader is 
> deprecated: see corresponding Javadoc for more information.
> ParquetFileReader.readFooter(sharedConf, filePath, 
> SKIP_ROW_GROUPS).getFileMetaData
> Warning:Warning:line (442)method readFooter in class ParquetFileReader is 
> deprecated: see corresponding Javadoc for more information.
>   ParquetFileReader.readFooter(
> {code}
> sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetWriteBuilder.scala
> {code:java}
>  Warning:Warning:line (91)value ENABLE_JOB_SUMMARY in class 
> ParquetOutputFormat is deprecated: see corresponding Javadoc for more 
> information.
>   && conf.get(ParquetOutputFormat.ENABLE_JOB_SUMMARY) == null) {
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30171) Eliminate warnings: part2

2020-01-15 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk resolved SPARK-30171.

Resolution: Won't Fix

As per discussion in 
[https://github.com/apache/spark/pull/27174#discussion_r365629991] , the 
warning exists till Spark 3.0

> Eliminate warnings: part2
> -
>
> Key: SPARK-30171
> URL: https://issues.apache.org/jira/browse/SPARK-30171
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> AvroFunctionsSuite.scala
> Warning:Warning:line (41)method to_avro in package avro is deprecated (since 
> 3.0.0): Please use 'org.apache.spark.sql.avro.functions.to_avro' instead.
> val avroDF = df.select(to_avro('id).as("a"), to_avro('str).as("b"))
> Warning:Warning:line (41)method to_avro in package avro is deprecated 
> (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.to_avro' 
> instead.
> val avroDF = df.select(to_avro('id).as("a"), to_avro('str).as("b"))
> Warning:Warning:line (54)method from_avro in package avro is deprecated 
> (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.from_avro' 
> instead.
> checkAnswer(avroDF.select(from_avro('a, avroTypeLong), from_avro('b, 
> avroTypeStr)), df)
> Warning:Warning:line (54)method from_avro in package avro is deprecated 
> (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.from_avro' 
> instead.
> checkAnswer(avroDF.select(from_avro('a, avroTypeLong), from_avro('b, 
> avroTypeStr)), df)
> Warning:Warning:line (59)method to_avro in package avro is deprecated 
> (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.to_avro' 
> instead.
> val avroStructDF = df.select(to_avro('struct).as("avro"))
> Warning:Warning:line (70)method from_avro in package avro is deprecated 
> (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.from_avro' 
> instead.
> checkAnswer(avroStructDF.select(from_avro('avro, avroTypeStruct)), df)
> Warning:Warning:line (76)method to_avro in package avro is deprecated 
> (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.to_avro' 
> instead.
> val avroStructDF = df.select(to_avro('struct).as("avro"))
> Warning:Warning:line (118)method to_avro in package avro is deprecated 
> (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.to_avro' 
> instead.
> val readBackOne = dfOne.select(to_avro($"array").as("avro"))
> Warning:Warning:line (119)method from_avro in package avro is deprecated 
> (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.from_avro' 
> instead.
>   .select(from_avro($"avro", avroTypeArrStruct).as("array"))
> AvroPartitionReaderFactory.scala
> Warning:Warning:line (64)value ignoreExtension in class AvroOptions is 
> deprecated (since 3.0): Use the general data source option pathGlobFilter for 
> filtering file names
> if (parsedOptions.ignoreExtension || 
> partitionedFile.filePath.endsWith(".avro")) {
> AvroFileFormat.scala
> Warning:Warning:line (98)value ignoreExtension in class AvroOptions is 
> deprecated (since 3.0): Use the general data source option pathGlobFilter for 
> filtering file names
>   if (parsedOptions.ignoreExtension || file.filePath.endsWith(".avro")) {
> AvroUtils.scala
> Warning:Warning:line (55)value ignoreExtension in class AvroOptions is 
> deprecated (since 3.0): Use the general data source option pathGlobFilter for 
> filtering file names
> inferAvroSchemaFromFiles(files, conf, parsedOptions.ignoreExtension,



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30490) Eliminate warnings in Avro datasource

2020-01-15 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk resolved SPARK-30490.

Resolution: Won't Fix

See [https://github.com/apache/spark/pull/27174#discussion_r365629991]

> Eliminate warnings in Avro datasource
> -
>
> Key: SPARK-30490
> URL: https://issues.apache.org/jira/browse/SPARK-30490
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Here is the list of deprecation warnings in Avro:
> {code}
> /Users/maxim/proj/eliminate-warning/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala
> Warning:Warning:line (98)value ignoreExtension in class AvroOptions is 
> deprecated (since 3.0): Use the general data source option pathGlobFilter for 
> filtering file names
>   if (parsedOptions.ignoreExtension || file.filePath.endsWith(".avro")) {
> /Users/maxim/proj/eliminate-warning/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroUtils.scala
> Warning:Warning:line (55)value ignoreExtension in class AvroOptions is 
> deprecated (since 3.0): Use the general data source option pathGlobFilter for 
> filtering file names
> inferAvroSchemaFromFiles(files, conf, parsedOptions.ignoreExtension,
> /Users/maxim/proj/eliminate-warning/external/avro/src/main/scala/org/apache/spark/sql/v2/avro/AvroPartitionReaderFactory.scala
> Warning:Warning:line (64)value ignoreExtension in class AvroOptions is 
> deprecated (since 3.0): Use the general data source option pathGlobFilter for 
> filtering file names
> if (parsedOptions.ignoreExtension || 
> partitionedFile.filePath.endsWith(".avro")) {
> {code}
>  
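> For reference, the replacement the deprecation message recommends to end users 
> is the general {{pathGlobFilter}} read option. A minimal hedged sketch, assuming 
> the external spark-avro module is on the classpath and an illustrative path:
> {code:scala}
> import org.apache.spark.sql.SparkSession
> 
> val spark = SparkSession.builder().master("local[*]").getOrCreate()
> 
> // pathGlobFilter restricts which files are picked up, replacing the
> // deprecated Avro-specific ignoreExtension option.
> val df = spark.read
>   .format("avro")
>   .option("pathGlobFilter", "*.avro")
>   .load("/path/to/avro/dir")  // illustrative path
> df.printSchema()
> {code}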



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30519) Executor can't use spark.executorEnv.HADOOP_USER_NAME to change the user accessing HDFS

2020-01-15 Thread Xiaoming (Jira)
Xiaoming created SPARK-30519:


 Summary: Executor can't use spark.executorEnv.HADOOP_USER_NAME to 
change the user accessing HDFS
 Key: SPARK-30519
 URL: https://issues.apache.org/jira/browse/SPARK-30519
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.3
Reporter: Xiaoming


Currently, the Hadoop user can be specified by setting HADOOP_USER_NAME in the 
driver's environment when submitting a job. However, setting 
spark.executorEnv.HADOOP_USER_NAME does not change the user that executors use 
to access HDFS.
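
A minimal sketch of the reported setup (the application name and user name are illustrative):

{code:scala}
import org.apache.spark.sql.SparkSession

// Per the report: the driver honours HADOOP_USER_NAME from its environment,
// but executors still access HDFS as the submitting user even with this set.
val spark = SparkSession.builder()
  .appName("executor-env-hadoop-user")                       // illustrative
  .config("spark.executorEnv.HADOOP_USER_NAME", "etl_user")  // illustrative user
  .getOrCreate()
{code}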
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30518) Precision and scale should be same for values between -1.0 and 1.0 in Decimal

2020-01-15 Thread wuyi (Jira)
wuyi created SPARK-30518:


 Summary: Precision and scale should be same for values between 
-1.0 and 1.0 in Decimal
 Key: SPARK-30518
 URL: https://issues.apache.org/jira/browse/SPARK-30518
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: wuyi


Currently, for values between -1.0 and 1.0, precision and scale are inconsistent 
between Decimal and DecimalType. For example, a number like 0.3 has (precision, 
scale) = (2, 1) in Decimal, but (1, 1) in DecimalType. We should make Decimal 
consistent with DecimalType.
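
A minimal sketch of checking the Decimal side of the reported inconsistency (the expected output follows the description above):

{code:scala}
import org.apache.spark.sql.types.Decimal

// Per the report, 0.3 currently carries (precision, scale) = (2, 1) in Decimal,
// while the corresponding DecimalType is (1, 1).
val d = Decimal(BigDecimal("0.3"))
println((d.precision, d.scale))
{code}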



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30517) Support SHOW TABLES EXTENDED

2020-01-15 Thread Ajith S (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015706#comment-17015706
 ] 

Ajith S edited comment on SPARK-30517 at 1/15/20 8:01 AM:
--

[~srowen] [~dongjoon] [~vanzin] Please let me know your opinions about the 
proposal. I would like to work on it if it's acceptable.


was (Author: ajithshetty):
[~srowen] [~dongjoon] [~vanzin] Please let me know about your opinion about 
proposal. I would like to work if its acceptable

> Support SHOW TABLES EXTENDED
> 
>
> Key: SPARK-30517
> URL: https://issues.apache.org/jira/browse/SPARK-30517
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ajith S
>Priority: Major
>
> {{The intention is to support SHOW TABLES with an additional column 'type', 
> which can be MANAGED, EXTERNAL, or VIEW, so that users can query only tables 
> of the required types, e.g. listing only views or only external tables (using 
> a 'where' clause over the 'type' column).}}
> {{Use case example:}}
> {{Currently it is not possible to list all the VIEWS, but other technologies 
> support this: Hive has 'SHOW VIEWS', and MySQL uses the more general query 
> 'SHOW FULL TABLES WHERE table_type = 'VIEW';'.}}
> The proposal follows the MySQL approach, as it provides more flexibility for querying.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30517) Support SHOW TABLES EXTENDED

2020-01-15 Thread Ajith S (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015706#comment-17015706
 ] 

Ajith S commented on SPARK-30517:
-

[~srowen] [~dongjoon] [~vanzin] Please let me know about your opinion about 
proposal. I would like to work if its acceptable

> Support SHOW TABLES EXTENDED
> 
>
> Key: SPARK-30517
> URL: https://issues.apache.org/jira/browse/SPARK-30517
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ajith S
>Priority: Major
>
> {{The intention is to support SHOW TABLES with an additional column 'type', 
> which can be MANAGED, EXTERNAL, or VIEW, so that users can query only tables 
> of the required types, e.g. listing only views or only external tables (using 
> a 'where' clause over the 'type' column).}}
> {{Use case example:}}
> {{Currently it is not possible to list all the VIEWS, but other technologies 
> support this: Hive has 'SHOW VIEWS', and MySQL uses the more general query 
> 'SHOW FULL TABLES WHERE table_type = 'VIEW';'.}}
> The proposal follows the MySQL approach, as it provides more flexibility for querying.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org