[jira] [Created] (SPARK-27438) Increase precision of to_timestamp
Maxim Gekk created SPARK-27438:
----------------------------------

Summary: Increase precision of to_timestamp
Key: SPARK-27438
URL: https://issues.apache.org/jira/browse/SPARK-27438
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk

The to_timestamp() function currently parses input strings only up to second precision, even if the specified pattern contains a second-fraction sub-pattern. This ticket aims to improve the precision of to_timestamp() up to microsecond precision.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
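The precision the ticket asks for can be illustrated outside Spark with Python's datetime module; this is only a conceptual sketch, where `%f` plays the role of a second-fraction sub-pattern (it is Python syntax, not Spark's `to_timestamp` pattern syntax):

```python
from datetime import datetime

# Parse a timestamp string carrying microsecond precision.
# "%f" consumes the fractional-second digits, analogous to the
# fraction sub-pattern the ticket wants to_timestamp() to honor
# beyond whole seconds.
ts = datetime.strptime("2019-04-11 02:47:13.123456", "%Y-%m-%d %H:%M:%S.%f")
print(ts.microsecond)  # 123456 -- the fraction is preserved, not truncated
```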
[jira] [Created] (SPARK-27437) check the parquet column format
xzh_dz created SPARK-27437:
----------------------------------

Summary: check the parquet column format
Key: SPARK-27437
URL: https://issues.apache.org/jira/browse/SPARK-27437
Project: Spark
Issue Type: Test
Components: SQL
Affects Versions: 2.4.1
Reporter: xzh_dz
[jira] [Created] (SPARK-27436) Add spark.sql.optimizer.nonExcludedRules
Dongjoon Hyun created SPARK-27436:
----------------------------------

Summary: Add spark.sql.optimizer.nonExcludedRules
Key: SPARK-27436
URL: https://issues.apache.org/jira/browse/SPARK-27436
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun

This issue aims to add a `spark.sql.optimizer.nonExcludedRules` static configuration to prevent users from accidentally excluding required rules dynamically in a SQL environment.
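The guard the issue proposes can be sketched generically: user-requested rule exclusions are honored only when they do not target a rule on a non-excludable list. The rule names and function below are illustrative only, not Spark's actual rule registry or configuration API:

```python
# Hypothetical non-excludable set, in the spirit of
# spark.sql.optimizer.nonExcludedRules. Names are made up for illustration.
NON_EXCLUDABLE_RULES = {"ResolveReferences", "EnsureRequirements"}

def effective_excluded_rules(user_excluded):
    """Drop exclusions that target rules the engine must always run."""
    return [r for r in user_excluded if r not in NON_EXCLUDABLE_RULES]

# A user tries to exclude one optional and one required rule; only the
# optional one is actually excluded.
print(effective_excluded_rules(["ConstantFolding", "EnsureRequirements"]))
# ['ConstantFolding']
```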
[jira] [Comment Edited] (SPARK-27337) QueryExecutionListener never cleans up listeners from the bus after SparkSession is cleared
[ https://issues.apache.org/jira/browse/SPARK-27337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815028#comment-16815028 ]

Lantao Jin edited comment on SPARK-27337 at 4/11/19 2:47 AM:
-------------------------------------------------------------

[~vinooganesh] Could you point me to the subject of the mail thread on the dev list?

was (Author: cltlfcjin): [~vinooganesh] Could you point me to the title of the mail thread on the dev list?

> QueryExecutionListener never cleans up listeners from the bus after
> SparkSession is cleared
> --------------------------------------------------------------------
>
> Key: SPARK-27337
> URL: https://issues.apache.org/jira/browse/SPARK-27337
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Vinoo Ganesh
> Priority: Critical
> Attachments: image001-1.png
>
> As a result of
> [https://github.com/apache/spark/commit/9690eba16efe6d25261934d8b73a221972b684f3],
> it looks like there is a memory leak (specifically
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/util/QueryExecutionListener.scala#L131]).
>
> Because the Listener Bus on the context still has a reference to the listener
> (even after the SparkSession is cleared), they are never cleaned up. This
> means that if you close and remake Spark sessions fairly frequently, you're
> leaking every single time.
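The leak pattern described in the issue — a bus holding strong references to listeners whose owning session is gone — can be sketched generically. This is a language-neutral illustration of one possible remedy (weak references), not Spark's ListenerBus implementation; all class names here are hypothetical:

```python
import gc
import weakref

class ListenerBus:
    """Holds weak references so a listener dies with its owner."""
    def __init__(self):
        self._listeners = []

    def add(self, listener):
        # Store a weak reference: the bus does not keep the listener alive.
        self._listeners.append(weakref.ref(listener))

    def live_listeners(self):
        # Drop entries whose referent has been garbage-collected.
        self._listeners = [r for r in self._listeners if r() is not None]
        return [r() for r in self._listeners]

class QueryListener:
    pass

bus = ListenerBus()
listener = QueryListener()
bus.add(listener)
assert len(bus.live_listeners()) == 1

del listener       # "session cleared": the only strong reference is gone
gc.collect()
assert len(bus.live_listeners()) == 0  # the bus no longer pins the listener
```

With a strong-reference list instead of `weakref.ref`, the second assertion would fail — that is the leak the ticket describes.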
[jira] [Commented] (SPARK-27337) QueryExecutionListener never cleans up listeners from the bus after SparkSession is cleared
[ https://issues.apache.org/jira/browse/SPARK-27337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815028#comment-16815028 ]

Lantao Jin commented on SPARK-27337:
------------------------------------

[~vinooganesh] Could you point me to the title of the mail thread on the dev list?

> QueryExecutionListener never cleans up listeners from the bus after
> SparkSession is cleared
>
> Key: SPARK-27337
> URL: https://issues.apache.org/jira/browse/SPARK-27337
[jira] [Commented] (SPARK-26568) Too many partitions may cause thriftServer frequently Full GC
[ https://issues.apache.org/jira/browse/SPARK-26568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815016#comment-16815016 ]

zhoukang commented on SPARK-26568:
----------------------------------

It is the Hive client used in Spark that causes this problem [~srowen]

> Too many partitions may cause thriftServer frequently Full GC
> -------------------------------------------------------------
>
> Key: SPARK-26568
> URL: https://issues.apache.org/jira/browse/SPARK-26568
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: zhoukang
> Priority: Major
>
> The reason is that, first, we have a table with many partitions (possibly several hundred); second, we have some concurrent queries. Then the long-running thriftServer may encounter an OOM issue.
> Here is a case:
> call stack of OOM thread:
> {code:java}
> pool-34-thread-10
>   at org.apache.hadoop.hive.metastore.api.StorageDescriptor.<init>(Lorg/apache/hadoop/hive/metastore/api/StorageDescriptor;)V (StorageDescriptor.java:240)
>   at org.apache.hadoop.hive.metastore.api.Partition.<init>(Lorg/apache/hadoop/hive/metastore/api/Partition;)V (Partition.java:216)
>   at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.deepCopy(Lorg/apache/hadoop/hive/metastore/api/Partition;)Lorg/apache/hadoop/hive/metastore/api/Partition; (HiveMetaStoreClient.java:1343)
>   at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.deepCopyPartitions(Ljava/util/Collection;Ljava/util/List;)Ljava/util/List; (HiveMetaStoreClient.java:1409)
>   at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.deepCopyPartitions(Ljava/util/List;)Ljava/util/List; (HiveMetaStoreClient.java:1397)
>   at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.listPartitionsByFilter(Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;S)Ljava/util/List; (HiveMetaStoreClient.java:914)
>   at sun.reflect.GeneratedMethodAccessor98.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (Method.java:606)
>   at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(Ljava/lang/Object;Ljava/lang/reflect/Method;[Ljava/lang/Object;)Ljava/lang/Object; (RetryingMetaStoreClient.java:90)
>   at com.sun.proxy.$Proxy30.listPartitionsByFilter(Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;S)Ljava/util/List; (Unknown Source)
>   at org.apache.hadoop.hive.ql.metadata.Hive.getPartitionsByFilter(Lorg/apache/hadoop/hive/ql/metadata/Table;Ljava/lang/String;)Ljava/util/List; (Hive.java:1967)
>   at sun.reflect.GeneratedMethodAccessor97.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (Method.java:606)
>   at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(Lorg/apache/hadoop/hive/ql/metadata/Hive;Lorg/apache/hadoop/hive/ql/metadata/Table;Lscala/collection/Seq;)Lscala/collection/Seq; (HiveShim.scala:602)
>   at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply()Lscala/collection/Seq; (HiveClientImpl.scala:608)
>   at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply()Ljava/lang/Object; (HiveClientImpl.scala:606)
>   at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply()Ljava/lang/Object; (HiveClientImpl.scala:321)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(Lscala/Function0;Lscala/runtime/IntRef;Lscala/runtime/ObjectRef;Ljava/lang/Object;)V (HiveClientImpl.scala:264)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(Lscala/Function0;)Ljava/lang/Object; (HiveClientImpl.scala:263)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(Lscala/Function0;)Ljava/lang/Object; (HiveClientImpl.scala:307)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(Lorg/apache/spark/sql/catalyst/catalog/CatalogTable;Lscala/collection/Seq;)Lscala/collection/Seq; (HiveClientImpl.scala:606)
>   at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply()Lscala/collection/Seq; (HiveExternalCatalog.scala:1017)
>   at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply()Ljava/lang/Object; (HiveExternalCatalog.scala:1000)
>   at
[jira] [Commented] (SPARK-26703) Hive record writer will always depends on parquet-1.6 writer should fix it
[ https://issues.apache.org/jira/browse/SPARK-26703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815017#comment-16815017 ]

zhoukang commented on SPARK-26703:
----------------------------------

I can make a PR for this [~hyukjin.kwon]

> Hive record writer will always depends on parquet-1.6 writer should fix it
> ---------------------------------------------------------------------------
>
> Key: SPARK-26703
> URL: https://issues.apache.org/jira/browse/SPARK-26703
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.2, 2.4.0
> Reporter: zhoukang
> Priority: Major
>
> Currently, when we use an insert-into-Hive-table related command, the generated parquet file will always be version 1.6. The reason is below:
> 1. we rely on hive-exec HiveFileFormatUtils to get the recordWriter
> {code:java}
> private val hiveWriter = HiveFileFormatUtils.getHiveRecordWriter(
>   jobConf,
>   tableDesc,
>   serializer.getSerializedClass,
>   fileSinkConf,
>   new Path(path),
>   Reporter.NULL)
> {code}
> 2. we will call
> {code:java}
> public static RecordWriter getHiveRecordWriter(JobConf jc,
>     TableDesc tableInfo, Class outputClass,
>     FileSinkDesc conf, Path outPath, Reporter reporter) throws HiveException {
>   HiveOutputFormat hiveOutputFormat = getHiveOutputFormat(jc, tableInfo);
>   try {
>     boolean isCompressed = conf.getCompressed();
>     JobConf jc_output = jc;
>     if (isCompressed) {
>       jc_output = new JobConf(jc);
>       String codecStr = conf.getCompressCodec();
>       if (codecStr != null && !codecStr.trim().equals("")) {
>         Class codec = (Class) JavaUtils.loadClass(codecStr);
>         FileOutputFormat.setOutputCompressorClass(jc_output, codec);
>       }
>       String type = conf.getCompressType();
>       if (type != null && !type.trim().equals("")) {
>         CompressionType style = CompressionType.valueOf(type);
>         SequenceFileOutputFormat.setOutputCompressionType(jc, style);
>       }
>     }
>     return getRecordWriter(jc_output, hiveOutputFormat, outputClass,
>         isCompressed, tableInfo.getProperties(), outPath, reporter);
>   } catch (Exception e) {
>     throw new HiveException(e);
>   }
> }
>
> public static RecordWriter getRecordWriter(JobConf jc,
>     OutputFormat outputFormat,
>     Class valueClass, boolean isCompressed,
>     Properties tableProp, Path outPath, Reporter reporter
>     ) throws IOException, HiveException {
>   if (!(outputFormat instanceof HiveOutputFormat)) {
>     outputFormat = new HivePassThroughOutputFormat(outputFormat);
>   }
>   return ((HiveOutputFormat) outputFormat).getHiveRecordWriter(
>       jc, outPath, valueClass, isCompressed, tableProp, reporter);
> }
> {code}
> 3. then in MapredParquetOutPutFormat
> {code:java}
> public org.apache.hadoop.hive.ql.exec.FileSinkOperator.RecordWriter getHiveRecordWriter(
>     final JobConf jobConf,
>     final Path finalOutPath,
>     final Class valueClass,
>     final boolean isCompressed,
>     final Properties tableProperties,
>     final Progressable progress) throws IOException {
>   LOG.info("creating new record writer..." + this);
>   final String columnNameProperty = tableProperties.getProperty(IOConstants.COLUMNS);
>   final String columnTypeProperty = tableProperties.getProperty(IOConstants.COLUMNS_TYPES);
>   List columnNames;
>   List columnTypes;
>   if (columnNameProperty.length() == 0) {
>     columnNames = new ArrayList();
>   } else {
>     columnNames = Arrays.asList(columnNameProperty.split(","));
>   }
>   if (columnTypeProperty.length() == 0) {
>     columnTypes = new ArrayList();
>   } else {
>     columnTypes = TypeInfoUtils.getTypeInfosFromTypeString(columnTypeProperty);
>   }
>   DataWritableWriteSupport.setSchema(HiveSchemaConverter.convert(columnNames, columnTypes), jobConf);
>   return getParquerRecordWriterWrapper(realOutputFormat, jobConf, finalOutPath.toString(),
>       progress, tableProperties);
> }
> {code}
> 4. then call
> {code:java}
> public ParquetRecordWriterWrapper(
>     final OutputFormat realOutputFormat,
>     final JobConf jobConf,
>     final String name,
>     final Progressable progress, Properties tableProperties) throws IOException {
>   try {
>     // create a TaskInputOutputContext
>     TaskAttemptID taskAttemptID = TaskAttemptID.forName(jobConf.get("mapred.task.id"));
>     if (taskAttemptID == null) {
>       taskAttemptID = new TaskAttemptID();
>     }
>     taskContext = ContextUtil.newTaskAttemptContext(jobConf, taskAttemptID);
>     LOG.info("initialize serde with table properties.");
>     initializeSerProperties(taskContext, tableProperties);
>
[jira] [Commented] (SPARK-26533) Support query auto cancel on thriftserver
[ https://issues.apache.org/jira/browse/SPARK-26533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815015#comment-16815015 ]

zhoukang commented on SPARK-26533:
----------------------------------

I will work on this

> Support query auto cancel on thriftserver
> ------------------------------------------
>
> Key: SPARK-26533
> URL: https://issues.apache.org/jira/browse/SPARK-26533
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: zhoukang
> Priority: Major
>
> Support automatically cancelling queries that run too long on the thriftServer.
> In some cases we use the thriftServer as a long-running application.
> Sometimes we want no query to run longer than a given time.
> In such cases we can enable auto-cancel for time-consuming queries, which lets us release resources for other queries to run.
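The auto-cancel idea the issue describes — a watchdog that cancels any query exceeding a configured timeout — can be sketched with a cooperative cancellation flag. This is a generic illustration, not the Thrift server's actual API; all names below are hypothetical:

```python
import threading
import time

def run_with_timeout(query_fn, timeout_s):
    """Run query_fn in a worker thread; signal cancellation if it overruns."""
    cancelled = threading.Event()
    result = {}

    def worker():
        result["value"] = query_fn(cancelled)

    t = threading.Thread(target=worker)
    t.start()
    t.join(timeout_s)
    if t.is_alive():
        cancelled.set()  # cooperative cancel: the query must check this flag
        t.join()
    return result.get("value"), cancelled.is_set()

def slow_query(cancelled):
    # Stands in for a long-running query that periodically checks the flag.
    for _ in range(100):
        if cancelled.is_set():
            return "cancelled"
        time.sleep(0.05)
    return "done"

value, was_cancelled = run_with_timeout(slow_query, timeout_s=0.2)
print(value, was_cancelled)  # cancelled True
```

Cancellation here is cooperative, mirroring how a server would interrupt a running statement: the timeout frees the slot so other queries can run.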
[jira] [Commented] (SPARK-27068) Support failed jobs ui and completed jobs ui use different queue
[ https://issues.apache.org/jira/browse/SPARK-27068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815010#comment-16815010 ]

shahid commented on SPARK-27068:
--------------------------------

Failed jobs are already in a single table, right? I meant: for running jobs there is one table, for completed jobs another table, and for failed jobs a separate table as well, right? Do you mean that, while cleaning up jobs, we shouldn't remove them from the failed jobs table?

> Support failed jobs ui and completed jobs ui use different queue
> -----------------------------------------------------------------
>
> Key: SPARK-27068
> URL: https://issues.apache.org/jira/browse/SPARK-27068
> Project: Spark
> Issue Type: Improvement
> Components: Web UI
> Affects Versions: 2.4.0
> Reporter: zhoukang
> Priority: Major
>
> For some long-running jobs, we may want to check the cause of some failed jobs.
> But once most jobs have completed, the failed jobs UI may disappear; we can use different queues for these two kinds of jobs.
[jira] [Commented] (SPARK-27068) Support failed jobs ui and completed jobs ui use different queue
[ https://issues.apache.org/jira/browse/SPARK-27068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815005#comment-16815005 ]

zhoukang commented on SPARK-27068:
----------------------------------

For a long-running application like Spark SQL, showing failed jobs in a single table will keep more clues for later debugging [~shahid]

> Support failed jobs ui and completed jobs ui use different queue
>
> Key: SPARK-27068
> URL: https://issues.apache.org/jira/browse/SPARK-27068
[jira] [Assigned] (SPARK-27088) Apply conf "spark.sql.optimizer.planChangeLog.level" to batch plan change in RuleExecutor
[ https://issues.apache.org/jira/browse/SPARK-27088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-27088:
------------------------------------

Assignee: Chakravarthi

> Apply conf "spark.sql.optimizer.planChangeLog.level" to batch plan change in
> RuleExecutor
> -----------------------------------------------------------------------------
>
> Key: SPARK-27088
> URL: https://issues.apache.org/jira/browse/SPARK-27088
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Maryann Xue
> Assignee: Chakravarthi
> Priority: Minor
>
> Similar to SPARK-25415, which made the log level for plan changes by each rule configurable, we can make the log level for plan changes by each batch configurable too, and can reuse the same configuration: "spark.sql.optimizer.planChangeLog.level".
[jira] [Resolved] (SPARK-27088) Apply conf "spark.sql.optimizer.planChangeLog.level" to batch plan change in RuleExecutor
[ https://issues.apache.org/jira/browse/SPARK-27088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-27088.
----------------------------------

Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24136
[https://github.com/apache/spark/pull/24136]

> Apply conf "spark.sql.optimizer.planChangeLog.level" to batch plan change in
> RuleExecutor
>
> Key: SPARK-27088
> URL: https://issues.apache.org/jira/browse/SPARK-27088
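The per-batch logging being added can be sketched generically: after a whole batch of rules runs, log the before/after plans at a configurable level, just as SPARK-25415 did per rule. The rule executor below is illustrative only, not Spark's RuleExecutor, and plans are modeled as plain strings:

```python
import logging

log = logging.getLogger("planChangeLog")

def run_batch(batch_name, rules, plan, level=logging.INFO):
    """Apply a batch of rewrite rules; log only if the batch changed the plan."""
    before = plan
    for rule in rules:
        plan = rule(plan)
    if plan != before:
        # In the spirit of spark.sql.optimizer.planChangeLog.level:
        # the level is configurable, and unchanged batches stay silent.
        log.log(level, "Batch %s changed plan:\n  before: %s\n  after:  %s",
                batch_name, before, plan)
    return plan

# A toy constant-folding rule applied as a one-rule batch.
fold_constants = lambda p: p.replace("1 + 1", "2")
result = run_batch("Operator Optimization", [fold_constants], "Filter(1 + 1 = x)")
print(result)  # Filter(2 = x)
```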
[jira] [Updated] (SPARK-27394) The staleness of UI may last minutes or hours when no tasks start or finish
[ https://issues.apache.org/jira/browse/SPARK-27394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu updated SPARK-27394:
---------------------------------

Fix Version/s: 2.4.2

> The staleness of UI may last minutes or hours when no tasks start or finish
> ----------------------------------------------------------------------------
>
> Key: SPARK-27394
> URL: https://issues.apache.org/jira/browse/SPARK-27394
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 2.4.0, 2.4.1
> Reporter: Shixiong Zhu
> Assignee: Shixiong Zhu
> Priority: Major
> Fix For: 2.4.2, 3.0.0
>
> Run the following code on a cluster that has at least 2 cores.
> {code}
> sc.makeRDD(1 to 1000, 1000).foreach { i =>
>   Thread.sleep(30)
> }
> {code}
> The jobs page will show just one running task.
> This is because when the second task event calls "AppStatusListener.maybeUpdate" for a job, it will simply be ignored, since the gap between the two events is smaller than `spark.ui.liveUpdate.period`.
> After the second task event, in the above case, because there won't be any other task events, the Spark UI will always be stale until the next task event gets fired (after 300 seconds).
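The throttling behavior the issue describes can be sketched generically: an update arriving within the live-update period is dropped, so if no later event ever arrives, the displayed state stays stale indefinitely. This is an illustrative model, not Spark's AppStatusListener:

```python
class ThrottledUpdater:
    """Drops updates that arrive within `period` of the last applied one."""
    def __init__(self, period):
        self.period = period
        self.last_update = float("-inf")
        self.state = None

    def maybe_update(self, now, state):
        # Mirrors the described maybeUpdate logic: skip if inside the period.
        if now - self.last_update >= self.period:
            self.last_update = now
            self.state = state

u = ThrottledUpdater(period=100)
u.maybe_update(0, "1 running task")
u.maybe_update(50, "2 running tasks")  # dropped: arrives inside the period
print(u.state)  # 1 running task -- stale until another event fires later
```

This is exactly the failure mode above: with no third event, nothing ever refreshes the dropped state.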
[jira] [Updated] (SPARK-27337) QueryExecutionListener never cleans up listeners from the bus after SparkSession is cleared
[ https://issues.apache.org/jira/browse/SPARK-27337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoo Ganesh updated SPARK-27337:
---------------------------------

Summary: QueryExecutionListener never cleans up listeners from the bus after SparkSession is cleared (was: QueryExecutionListener never cleans up listeners from the bus after SparkSession is closed)

> QueryExecutionListener never cleans up listeners from the bus after
> SparkSession is cleared
>
> Key: SPARK-27337
> URL: https://issues.apache.org/jira/browse/SPARK-27337
[jira] [Commented] (SPARK-27337) QueryExecutionListener never cleans up listeners from the bus after SparkSession is cleared
[ https://issues.apache.org/jira/browse/SPARK-27337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814885#comment-16814885 ]

Vinoo Ganesh commented on SPARK-27337:
--------------------------------------

Clearing the Spark session* Updated the ticket title, and there is a thread on the Spark dev list about this too

> QueryExecutionListener never cleans up listeners from the bus after
> SparkSession is cleared
>
> Key: SPARK-27337
> URL: https://issues.apache.org/jira/browse/SPARK-27337
[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814883#comment-16814883 ]

Robert Joseph Evans commented on SPARK-27396:
---------------------------------------------

This SPIP has been up for 5 days and I see 10 people following it, but there has been no discussion since. Are you still digesting the proposal? Are you just busy with an upcoming conference and haven't had time to look at it? From the previous discussion it sounded like the core of what I am proposing is not that controversial, so I would like to move forward with it sooner rather than later, but I also want to give everyone time to understand it and ask questions.

Also, I am looking for someone to be the shepherd. I can technically do it, being a PMC member, but I have not been that active until recently, so to avoid any concerns I would prefer to find someone else to be the shepherd.

> SPIP: Public APIs for extended Columnar Processing Support
> ----------------------------------------------------------
>
> Key: SPARK-27396
> URL: https://issues.apache.org/jira/browse/SPARK-27396
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Robert Joseph Evans
> Priority: Major
>
> *Q1.* What are you trying to do? Articulate your objectives using absolutely no jargon.
>
> The Dataset/DataFrame API in Spark currently exposes data to users only one row at a time when processing it. The goals of this proposal are to
>
> # Expose to end users a new option of processing the data in a columnar format, multiple rows at a time, with the data organized into contiguous arrays in memory.
> # Make any transitions between the columnar memory layout and a row-based layout transparent to the end user.
> # Allow for simple data exchange with other systems, DL/ML libraries, pandas, etc. by having clean APIs to transform the columnar data into an Apache Arrow compatible layout.
> # Provide a plugin mechanism for columnar processing support so an advanced user could avoid data transitions between columnar and row-based processing even through shuffles. This means we should at least support pluggable APIs so an advanced end user can implement the columnar partitioning themselves, and provide the glue necessary to shuffle the data still in a columnar format.
> # Expose new APIs that allow advanced users or frameworks to implement columnar processing either as UDFs, or by adjusting the physical plan to do columnar processing. If the latter is too controversial we can move it to another SPIP, but we plan to implement some accelerated computing in parallel with this feature to be sure the APIs work, and without this feature that would be impossible.
>
> Not requirements, but things that would be nice to have:
> # Provide default implementations for partitioning columnar data, so users don't have to.
> # Transition the existing in-memory columnar layouts to be compatible with Apache Arrow. This would make the transformations to the Apache Arrow format a no-op. The existing formats are already very close to those layouts in many cases. This would not use the Apache Arrow Java library, but instead be compatible with the memory [layout|https://arrow.apache.org/docs/format/Layout.html] and possibly only a subset of that layout.
> # Provide a clean transition from the existing code to the new one. The existing APIs, which are public but evolving, are not that far off from what is being proposed. We should be able to create a new parallel API that can wrap the existing one. This means any file format that is trying to support columnar can still do so until we make a conscious decision to deprecate and then turn off the old APIs.
>
> *Q2.* What problem is this proposal NOT designed to solve?
>
> This is not trying to implement any of the processing itself in a columnar way, with the exception of examples for documentation, and possibly default implementations for partitioning of columnar shuffle.
>
> *Q3.* How is it done today, and what are the limits of current practice?
>
> The current columnar support is limited to 3 areas.
> # Input formats can optionally return a ColumnarBatch instead of rows. The code generation phase knows how to take that columnar data and iterate through it as rows for stages that want rows, which currently is almost everything. The limitations here are mostly implementation specific. The current standard is to abuse Scala's type erasure to return ColumnarBatches as the elements of an RDD[InternalRow]. The code generation can handle this because it is generating Java code, so it bypasses Scala's type checking and just casts the InternalRow to the desired ColumnarBatch. This makes it difficult for others to
[jira] [Updated] (SPARK-27421) RuntimeException when querying a view on a partitioned parquet table
[ https://issues.apache.org/jira/browse/SPARK-27421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Maynard updated SPARK-27421: - Affects Version/s: 2.4.1 > RuntimeException when querying a view on a partitioned parquet table > > > Key: SPARK-27421 > URL: https://issues.apache.org/jira/browse/SPARK-27421 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 2.4.1 > Environment: Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit > Server VM, Java 1.8.0_141) >Reporter: Eric Maynard >Priority: Minor > > When running a simple query, I get the following stacktrace: > {code} > java.lang.RuntimeException: Caught Hive MetaException attempting to get > partition metadata by filter from Hive. You can set the Spark configuration > setting spark.sql.hive.manageFilesourcePartitions to false to work around > this problem, however this will result in degraded performance. Please report > a bug: https://issues.apache.org/jira/browse/SPARK > at > org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:772) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:686) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:684) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:221) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:220) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:266) > at > org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:684) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1268) > at > 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1261) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99) > at > org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1261) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:262) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:957) > at > org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73) > at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:63) > at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:27) > at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:26) > at > 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84) > at > scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) > at > scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) > at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76) > at scala.collection.immutable.List.foreach(List.scala:392) > at >
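The exception text above names its own workaround. As a hedged sketch only (the application jar name is hypothetical, and the message itself warns this trades away performance), the setting would be supplied when the application is launched, e.g.:

```shell
# Sketch: disable Hive filesource partition management, as suggested by the
# error message. Set it at launch time so it is in effect before the
# SparkSession is created.
spark-submit \
  --conf spark.sql.hive.manageFilesourcePartitions=false \
  your-application.jar   # hypothetical application jar
```

This is a configuration fragment, not a fix: partition pruning will no longer be pushed to the metastore, so queries may list all partitions.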
[jira] [Commented] (SPARK-27421) RuntimeException when querying a view on a partitioned parquet table
[ https://issues.apache.org/jira/browse/SPARK-27421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814872#comment-16814872 ] Eric Maynard commented on SPARK-27421: -- [~shivuson...@gmail.com] Any hiccup confirming the issue with the 3 lines in the Jira? I am able to replicate this pretty reliably on 2.4.0 > RuntimeException when querying a view on a partitioned parquet table > > > Key: SPARK-27421 > URL: https://issues.apache.org/jira/browse/SPARK-27421 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit > Server VM, Java 1.8.0_141) >Reporter: Eric Maynard >Priority: Minor > > When running a simple query, I get the following stacktrace: > {code} > java.lang.RuntimeException: Caught Hive MetaException attempting to get > partition metadata by filter from Hive. You can set the Spark configuration > setting spark.sql.hive.manageFilesourcePartitions to false to work around > this problem, however this will result in degraded performance. 
Please report > a bug: https://issues.apache.org/jira/browse/SPARK > at > org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:772) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:686) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:684) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:221) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:220) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:266) > at > org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:684) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1268) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1261) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99) > at > org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1261) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:262) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:957) > at > org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73) > at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:63) > at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:27) > at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:26) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84) > at > scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) > at > scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) > at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84) > at >
[jira] [Commented] (SPARK-25079) [PYTHON] upgrade python 3.4 -> 3.6
[ https://issues.apache.org/jira/browse/SPARK-25079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814762#comment-16814762 ] shane knapp commented on SPARK-25079: - pyarrow 0.12.1 is out, i've updated my conda test environment and am currently waiting for the PRB to finish: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/104490 > [PYTHON] upgrade python 3.4 -> 3.6 > -- > > Key: SPARK-25079 > URL: https://issues.apache.org/jira/browse/SPARK-25079 > Project: Spark > Issue Type: Improvement > Components: Build, PySpark >Affects Versions: 2.3.1 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > > for the impending arrow upgrade > (https://issues.apache.org/jira/browse/SPARK-23874) we need to bump python > 3.4 -> 3.5. > i have been testing this here: > [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/|https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69] > my methodology: > 1) upgrade python + arrow to 3.5 and 0.10.0 > 2) run python tests > 3) when i'm happy that Things Won't Explode Spectacularly, pause jenkins and > upgrade centos workers to python3.5 > 4) simultaneously do the following: > - create a symlink in /home/anaconda/envs/py3k/bin for python3.4 that > points to python3.5 (this is currently being tested here: > [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69)] > - push a change to python/run-tests.py replacing 3.4 with 3.5 > 5) once the python3.5 change to run-tests.py is merged, we will need to > back-port this to all existing branches > 6) then and only then can i remove the python3.4 -> python3.5 symlink -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
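The symlink trick in step 4 of the methodology above can be sketched as follows (the temp directory is a stand-in for the real /home/anaconda/envs/py3k/bin, used here only so the sketch is self-contained):

```shell
# Sketch of step 4: point the old interpreter name at the new one, so jobs
# still invoking "python3.4" transparently pick up python3.5.
BIN_DIR="$(mktemp -d)"             # stand-in for /home/anaconda/envs/py3k/bin
touch "$BIN_DIR/python3.5"         # stand-in for the real python3.5 binary
chmod +x "$BIN_DIR/python3.5"
ln -s "$BIN_DIR/python3.5" "$BIN_DIR/python3.4"
readlink "$BIN_DIR/python3.4"      # resolves to the python3.5 path
```

Once run-tests.py no longer references 3.4 anywhere (step 5), the symlink can be removed (step 6).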
[jira] [Commented] (SPARK-26218) Throw exception on overflow for integers
[ https://issues.apache.org/jira/browse/SPARK-26218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814709#comment-16814709 ] Reynold Xin commented on SPARK-26218: - The no-exception is by design. Imagine you have an ETL job that runs for hours, and then it suddenly throws an exception because one row overflows ... > Throw exception on overflow for integers > > > Key: SPARK-26218 > URL: https://issues.apache.org/jira/browse/SPARK-26218 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Marco Gaido >Priority: Major > > SPARK-24598 just updated the documentation in order to state that our > addition is a Java style one and not a SQL style. But in order to follow the > SQL standard we should instead throw an exception if an overflow occurs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
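For context, the "Java style" addition mentioned in the description means 32-bit arithmetic that silently wraps on overflow rather than raising, which is exactly the long-running-ETL concern above. A minimal Python sketch of that wrapping behaviour (the helper `int32_add` is illustrative, not Spark code):

```python
def int32_add(a: int, b: int) -> int:
    """Add two ints with Java/Spark-style 32-bit wraparound semantics."""
    s = (a + b) & 0xFFFFFFFF                            # keep the low 32 bits
    return s - 0x100000000 if s >= 0x80000000 else s    # reinterpret as signed

INT_MAX = 2**31 - 1                                     # 2147483647
print(int32_add(INT_MAX, 1))   # wraps to -2147483648 instead of raising
```

Following the SQL standard instead would mean raising an error at the `INT_MAX + 1` step rather than returning the wrapped value.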
[jira] [Created] (SPARK-27435) Support schema pruning in file source V2
Gengliang Wang created SPARK-27435: -- Summary: Support schema pruning in file source V2 Key: SPARK-27435 URL: https://issues.apache.org/jira/browse/SPARK-27435 Project: Spark Issue Type: Task Components: SQL Affects Versions: 3.0.0 Reporter: Gengliang Wang Currently, the optimization rule `SchemaPruning` only works for Parquet/Orc V1. We should have the same optimization in file source V2. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27419) When setting spark.executor.heartbeatInterval to a value less than 1 seconds, it will always fail
[ https://issues.apache.org/jira/browse/SPARK-27419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-27419. -- Resolution: Fixed Fix Version/s: 2.4.2 > When setting spark.executor.heartbeatInterval to a value less than 1 seconds, > it will always fail > - > > Key: SPARK-27419 > URL: https://issues.apache.org/jira/browse/SPARK-27419 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 2.4.1 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Major > Fix For: 2.4.2 > > > When setting spark.executor.heartbeatInterval to a value less than 1 seconds > in branch-2.4, it will always fail because the value will be converted to 0 > and the heartbeat will always timeout and finally kill the executor. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
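The failure mode described above is plain integer truncation when a millisecond interval is converted to whole seconds. A hedged sketch (the helper below mirrors the behaviour described in the ticket, not the actual branch-2.4 code path):

```python
def to_whole_seconds(interval_ms: int) -> int:
    """Convert a millisecond interval to seconds with integer truncation."""
    return interval_ms // 1000

# Any heartbeat interval under 1000 ms truncates to 0, so every heartbeat
# is immediately considered timed out and the executor eventually gets killed.
print(to_whole_seconds(500))   # 0
print(to_whole_seconds(1500))  # 1
```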
[jira] [Updated] (SPARK-27434) memory leak in spark driver
[ https://issues.apache.org/jira/browse/SPARK-27434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryne Yang updated SPARK-27434: -- Description: we got a OOM exception on the driver after driver has completed multiple jobs(we are reusing spark context). so we took a heap dump and looked at the leak analysis, found out that under AsyncEventQueue there are 3.5GB of heap allocated. Possibly a leak. can someone take a look at? here is the heap analysis: !Screen Shot 2019-04-10 at 12.11.35 PM.png! was: we got a OOM exception on the driver after driver has completed multiple jobs(we are reusing spark context). so we took a heap dump and looked at the leak analysis, found out that under AsyncEventQueue there are 3.5GB of heap allocated. Possibly a leak. can someone take a look at? here is the heap analysis: !image-2019-04-10-12-24-03-184.png! > memory leak in spark driver > --- > > Key: SPARK-27434 > URL: https://issues.apache.org/jira/browse/SPARK-27434 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 > Environment: OS: Centos 7 > JVM: > **_openjdk version "1.8.0_201"_ > _OpenJDK Runtime Environment (IcedTea 3.11.0) (Alpine 8.201.08-r0)_ > _OpenJDK 64-Bit Server VM (build 25.201-b08, mixed mode)_ > Spark version: 2.4.0 >Reporter: Ryne Yang >Priority: Major > Attachments: Screen Shot 2019-04-10 at 12.11.35 PM.png > > > we got a OOM exception on the driver after driver has completed multiple > jobs(we are reusing spark context). > so we took a heap dump and looked at the leak analysis, found out that under > AsyncEventQueue there are 3.5GB of heap allocated. Possibly a leak. > > can someone take a look at? > here is the heap analysis: > !Screen Shot 2019-04-10 at 12.11.35 PM.png! > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27434) memory leak in spark driver
[ https://issues.apache.org/jira/browse/SPARK-27434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryne Yang updated SPARK-27434: -- Attachment: Screen Shot 2019-04-10 at 12.11.35 PM.png > memory leak in spark driver > --- > > Key: SPARK-27434 > URL: https://issues.apache.org/jira/browse/SPARK-27434 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 > Environment: OS: Centos 7 > JVM: > **_openjdk version "1.8.0_201"_ > _OpenJDK Runtime Environment (IcedTea 3.11.0) (Alpine 8.201.08-r0)_ > _OpenJDK 64-Bit Server VM (build 25.201-b08, mixed mode)_ > Spark version: 2.4.0 >Reporter: Ryne Yang >Priority: Major > Attachments: Screen Shot 2019-04-10 at 12.11.35 PM.png > > > we got a OOM exception on the driver after driver has completed multiple > jobs(we are reusing spark context). > so we took a heap dump and looked at the leak analysis, found out that under > AsyncEventQueue there are 3.5GB of heap allocated. Possibly a leak. > > can someone take a look at? > here is the heap analysis: > !image-2019-04-10-12-24-03-184.png! > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27434) memory leak in spark driver
Ryne Yang created SPARK-27434: - Summary: memory leak in spark driver Key: SPARK-27434 URL: https://issues.apache.org/jira/browse/SPARK-27434 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.0 Environment: OS: Centos 7 JVM: **_openjdk version "1.8.0_201"_ _OpenJDK Runtime Environment (IcedTea 3.11.0) (Alpine 8.201.08-r0)_ _OpenJDK 64-Bit Server VM (build 25.201-b08, mixed mode)_ Spark version: 2.4.0 Reporter: Ryne Yang We got an OOM exception on the driver after it had completed multiple jobs (we are reusing the Spark context). So we took a heap dump and ran leak analysis, which found that 3.5 GB of heap is allocated under AsyncEventQueue. Possibly a leak. Can someone take a look? Here is the heap analysis: !image-2019-04-10-12-24-03-184.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
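One plausible mechanism for this kind of growth (purely illustrative; the `ListenerBus` class below is a toy, not Spark's AsyncEventQueue) is a long-lived event bus accumulating listeners that short-lived jobs or sessions register but never remove:

```python
class ListenerBus:
    """Toy event bus: listeners that are registered but never removed
    accumulate for the lifetime of the bus."""
    def __init__(self):
        self.listeners = []

    def add(self, listener):
        self.listeners.append(listener)

def run_job(bus):
    # Each job registers a listener; nothing ever removes it, so the bus
    # retains the listener (and everything it references) indefinitely.
    bus.add(lambda event: None)

bus = ListenerBus()          # lives as long as the reused context
for _ in range(1000):
    run_job(bus)
print(len(bus.listeners))    # 1000 listeners retained after all jobs finish
```

If each retained listener holds on to sizable per-job state, heap usage grows with every job even though no job is running.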
[jira] [Created] (SPARK-27433) Spark Structured Streaming left outer joins returns outer nulls for already matched rows
Binit created SPARK-27433: - Summary: Spark Structured Streaming left outer joins returns outer nulls for already matched rows Key: SPARK-27433 URL: https://issues.apache.org/jira/browse/SPARK-27433 Project: Spark Issue Type: Question Components: Structured Streaming Affects Versions: 2.3.0 Reporter: Binit I'm basically using the example given in Spark's documentation here: [https://spark.apache.org/docs/2.3.0/structured-streaming-programming-guide.html#outer-joins-with-watermarking] with the built-in rate test stream, in which one stream is ahead by 3 seconds (I was originally using Kafka but ran into the same issue). The matched columns are returned correctly; however, after a while the same key is returned again with outer nulls. Is this the expected behavior? Is there a way to exclude the duplicate outer-null results when there was a match?

Code:

{code}
val testStream = session.readStream.format("rate")
  .option("rowsPerSecond", "5").option("numPartitions", "1").load()

val impressions = testStream
  .select(
    (col("value") + 15).as("impressionAdId"),
    col("timestamp").as("impressionTime"))

val clicks = testStream
  .select(
    col("value").as("clickAdId"),
    col("timestamp").as("clickTime"))

// Apply watermarks on event-time columns
val impressionsWithWatermark = impressions.withWatermark("impressionTime", "20 seconds")
val clicksWithWatermark = clicks.withWatermark("clickTime", "30 seconds")

// Join with event-time constraints
val result = impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 10 seconds
    """),
  joinType = "leftOuter" // can be "inner", "leftOuter", "rightOuter"
)

val query = result.writeStream.outputMode("update").format("console").option("truncate", false).start()
query.awaitTermination()
{code}

Result:

{code}
--- Batch: 19 ---
+--------------+-----------------------+---------+-----------------------+
|impressionAdId|impressionTime         |clickAdId|clickTime              |
+--------------+-----------------------+---------+-----------------------+
|100           |2018-05-23 22:18:38.362|100      |2018-05-23 22:18:41.362|
|101           |2018-05-23 22:18:38.562|101      |2018-05-23 22:18:41.562|
|102           |2018-05-23 22:18:38.762|102      |2018-05-23 22:18:41.762|
|103           |2018-05-23 22:18:38.962|103      |2018-05-23 22:18:41.962|
|104           |2018-05-23 22:18:39.162|104      |2018-05-23 22:18:42.162|
+--------------+-----------------------+---------+-----------------------+

--- Batch: 57 ---
+--------------+-----------------------+---------+-----------------------+
|impressionAdId|impressionTime         |clickAdId|clickTime              |
+--------------+-----------------------+---------+-----------------------+
|290           |2018-05-23 22:19:16.362|290      |2018-05-23 22:19:19.362|
|291           |2018-05-23 22:19:16.562|291      |2018-05-23 22:19:19.562|
|292           |2018-05-23 22:19:16.762|292      |2018-05-23 22:19:19.762|
|293           |2018-05-23 22:19:16.962|293      |2018-05-23 22:19:19.962|
|294           |2018-05-23 22:19:17.162|294      |2018-05-23 22:19:20.162|
|100           |2018-05-23 22:18:38.362|null     |null                   |
|99            |2018-05-23 22:18:38.162|null     |null                   |
|103           |2018-05-23 22:18:38.962|null     |null                   |
|101           |2018-05-23 22:18:38.562|null     |null                   |
|102           |2018-05-23 22:18:38.762|null     |null                   |
+--------------+-----------------------+---------+-----------------------+
{code}

This question is also asked on Stack Overflow: [https://stackoverflow.com/questions/50500111/spark-structured-streaming-left-outer-joins-returns-outer-nulls-for-already-matc/55616902#55616902]

101 and 103 have already appeared as matches in the join, but they still come back as outer-null rows in the left outer join.

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
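For reference, in a plain (non-streaming) left outer join a key that found a match never additionally produces a null row, which is why 101 and 103 reappearing with nulls in the output above looks wrong. A small Python sketch of the expected semantics (toy data, not the streaming implementation):

```python
def left_outer_join(left, right):
    """Plain left outer join on key: matched keys yield matched rows only;
    a key gets a single null row only when it has no match at all."""
    right_by_key = {}
    for k, v in right:
        right_by_key.setdefault(k, []).append(v)
    out = []
    for k, v in left:
        if k in right_by_key:
            out.extend((k, v, rv) for rv in right_by_key[k])  # matched rows
        else:
            out.append((k, v, None))  # only unmatched keys get a null row
    return out

impressions = [(100, "i100"), (101, "i101"), (999, "i999")]
clicks = [(100, "c100"), (101, "c101")]
print(left_outer_join(impressions, clicks))
# 100 and 101 appear only as matches; only the unmatched 999 gets nulls
```

In the streaming case the outer-null row is emitted once the watermark passes, so emitting it for a key that already matched is the surprising part.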
[jira] [Updated] (SPARK-27432) Spark job stuck when no jobs/stages are pending
[ https://issues.apache.org/jira/browse/SPARK-27432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajashekar updated SPARK-27432: --- Attachment: jobs.pdf stages.pdf > Spark job stuck when no jobs/stages are pending > --- > > Key: SPARK-27432 > URL: https://issues.apache.org/jira/browse/SPARK-27432 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Submit >Affects Versions: 2.2.1 >Reporter: Rajashekar >Priority: Major > Attachments: jobs.pdf, stages.pdf > > > Spark job is never ending and when i check in spark UI there are no pending > jobs and no pending stages. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27432) Spark job stuck when no jobs/stages are pending
Rajashekar created SPARK-27432: -- Summary: Spark job stuck when no jobs/stages are pending Key: SPARK-27432 URL: https://issues.apache.org/jira/browse/SPARK-27432 Project: Spark Issue Type: Bug Components: Spark Core, Spark Submit Affects Versions: 2.2.1 Reporter: Rajashekar The Spark job never ends, and when I check the Spark UI there are no pending jobs and no pending stages. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27423) Cast DATE to/from TIMESTAMP according to SQL standard
[ https://issues.apache.org/jira/browse/SPARK-27423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-27423: --- Assignee: Maxim Gekk > Cast DATE to/from TIMESTAMP according to SQL standard > - > > Key: SPARK-27423 > URL: https://issues.apache.org/jira/browse/SPARK-27423 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > > According to SQL standard, DATE is union of (year, month, day). To convert it > to Spark's TIMESTAMP which is TIMESTAMP WITH TIME ZONE, the date should be > extended by time at midnight - (year, month, day, hour = 0, minute = 0, > seconds = 0). The former timestamp should be considered as a timestamp at the > session time zone, and transformed to microseconds since epoch in UTC. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27423) Cast DATE to/from TIMESTAMP according to SQL standard
[ https://issues.apache.org/jira/browse/SPARK-27423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-27423. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24332 [https://github.com/apache/spark/pull/24332] > Cast DATE to/from TIMESTAMP according to SQL standard > - > > Key: SPARK-27423 > URL: https://issues.apache.org/jira/browse/SPARK-27423 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > According to SQL standard, DATE is union of (year, month, day). To convert it > to Spark's TIMESTAMP which is TIMESTAMP WITH TIME ZONE, the date should be > extended by time at midnight - (year, month, day, hour = 0, minute = 0, > seconds = 0). The former timestamp should be considered as a timestamp at the > session time zone, and transformed to microseconds since epoch in UTC. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
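The cast rule in the description above can be sanity-checked with a short sketch: extend the date with midnight in the session time zone, then express that instant as microseconds since the Unix epoch in UTC (the session zones below are arbitrary examples, and this is Python standing in for Spark's internal TIMESTAMP representation):

```python
from datetime import date, datetime
from zoneinfo import ZoneInfo

def date_to_timestamp_micros(d: date, session_tz: str) -> int:
    """DATE -> TIMESTAMP per the rule above: (year, month, day) extended with
    time at midnight in the session zone, then microseconds since epoch in UTC."""
    midnight = datetime(d.year, d.month, d.day, tzinfo=ZoneInfo(session_tz))
    return int(midnight.timestamp() * 1_000_000)

# Midnight 1970-01-02 in UTC is exactly one day (86400 s) after the epoch.
print(date_to_timestamp_micros(date(1970, 1, 2), "UTC"))  # 86400000000
```

Note that the same DATE maps to different microsecond values under different session zones, which is exactly why the standard's rule has to name the session time zone.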
[jira] [Assigned] (SPARK-27422) CurrentDate should return local date
[ https://issues.apache.org/jira/browse/SPARK-27422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-27422: --- Assignee: Maxim Gekk > CurrentDate should return local date > > > Key: SPARK-27422 > URL: https://issues.apache.org/jira/browse/SPARK-27422 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > > According to SQL standard, DATE type is union of (year, month, days), and > current date should return a triple of (year, month, days) in session local > time zone. The ticket aims to follow the requirement, and calculate a local > date for session time zone. The local date should be converted to epoch day, > and stored internally in as DATE value. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27422) CurrentDate should return local date
[ https://issues.apache.org/jira/browse/SPARK-27422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-27422. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24330 [https://github.com/apache/spark/pull/24330] > CurrentDate should return local date > > > Key: SPARK-27422 > URL: https://issues.apache.org/jira/browse/SPARK-27422 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > According to SQL standard, DATE type is union of (year, month, days), and > current date should return a triple of (year, month, days) in session local > time zone. The ticket aims to follow the requirement, and calculate a local > date for session time zone. The local date should be converted to epoch day, > and stored internally in as DATE value. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
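The "epoch day" storage mentioned in the description above is simply the count of days since 1970-01-01. A sketch (Python standing in for the internal DATE representation):

```python
from datetime import date

def to_epoch_day(d: date) -> int:
    """Local date stored internally as days since the Unix epoch (1970-01-01)."""
    return d.toordinal() - date(1970, 1, 1).toordinal()

print(to_epoch_day(date(1970, 1, 1)))   # 0
print(to_epoch_day(date(2019, 4, 10)))  # 17996
```

Because the stored value is a whole day, CurrentDate only needs the session time zone at evaluation time, to decide which local date "today" is.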
[jira] [Resolved] (SPARK-27412) Add a new shuffle manager to use Persistent Memory as shuffle and spilling storage
[ https://issues.apache.org/jira/browse/SPARK-27412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-27412. --- Resolution: Won't Fix > Add a new shuffle manager to use Persistent Memory as shuffle and spilling > storage > -- > > Key: SPARK-27412 > URL: https://issues.apache.org/jira/browse/SPARK-27412 > Project: Spark > Issue Type: New Feature > Components: Shuffle, Spark Core >Affects Versions: 3.0.0 >Reporter: Chendi.Xue >Priority: Minor > Labels: core > Attachments: PmemShuffleManager-DesignDoc.pdf > > > Add a new shuffle manager called "PmemShuffleManager", by using which, we can > use Persistent Memory Device as storage for shuffle and external sorter > spilling. > In this implementation, we leveraged Persistent Memory Development Kit(PMDK) > to support transaction write with high performance. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27431) move HashedRelation to global UnifiedMemoryManager and enable offheap
Xiaoju Wu created SPARK-27431: - Summary: move HashedRelation to global UnifiedMemoryManager and enable offheap Key: SPARK-27431 URL: https://issues.apache.org/jira/browse/SPARK-27431 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Xiaoju Wu Why is HashedRelation currently managed by a newly created MemoryManager, with off-heap disabled? Can we improve this part?

{code}
/**
 * Create a HashedRelation from an Iterator of InternalRow.
 */
def apply(
    input: Iterator[InternalRow],
    key: Seq[Expression],
    sizeEstimate: Int = 64,
    taskMemoryManager: TaskMemoryManager = null): HashedRelation = {
  val mm = Option(taskMemoryManager).getOrElse {
    new TaskMemoryManager(
      new UnifiedMemoryManager(
        new SparkConf().set(MEMORY_OFFHEAP_ENABLED.key, "false"),
        Long.MaxValue,
        Long.MaxValue / 2,
        1),
      0)
  }
  if (key.length == 1 && key.head.dataType == LongType) {
    LongHashedRelation(input, key, sizeEstimate, mm)
  } else {
    UnsafeHashedRelation(input, key, sizeEstimate, mm)
  }
}
{code}

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27428) Test "metrics StatsD sink with Timer " fails on BigEndian
[ https://issues.apache.org/jira/browse/SPARK-27428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anuja Jakhade updated SPARK-27428: -- Description: Test case "metrics StatsD sink with Timer *** FAILED ***" fails with error java.net.SocketTimeoutException: Receive timed out at java.net.AbstractPlainDatagramSocketImpl.receive(AbstractPlainDatagramSocketImpl.java:143) at java.net.DatagramSocket.receive(DatagramSocket.java:812) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4$$anonfun$apply$3.apply(StatsdSinkSuite.scala:155) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4$$anonfun$apply$3.apply(StatsdSinkSuite.scala:154) at scala.collection.immutable.Range.foreach(Range.scala:160) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4.apply(StatsdSinkSuite.scala:154) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4.apply(StatsdSinkSuite.scala:123) at org.apache.spark.metrics.sink.StatsdSinkSuite.org$apache$spark$metrics$sink$StatsdSinkSuite$$withSocketAndSink(StatsdSinkSuite.scala:51) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4.apply$mcV$sp(StatsdSinkSuite.scala:123) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4.apply(StatsdSinkSuite.scala:123) On debugging observed that the last packet is not received at "socket.receive(p)". Hence the assert fails. Also I want to know, which feature of Apache Spark is tested in this test. 
was: Test case "metrics StatsD sink with Timer *** FAILED ***" fails with error java.net.SocketTimeoutException: Receive timed out at java.net.AbstractPlainDatagramSocketImpl.receive(AbstractPlainDatagramSocketImpl.java:143) at java.net.DatagramSocket.receive(DatagramSocket.java:812) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4$$anonfun$apply$3.apply(StatsdSinkSuite.scala:155) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4$$anonfun$apply$3.apply(StatsdSinkSuite.scala:154) at scala.collection.immutable.Range.foreach(Range.scala:160) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4.apply(StatsdSinkSuite.scala:154) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4.apply(StatsdSinkSuite.scala:123) at org.apache.spark.metrics.sink.StatsdSinkSuite.org$apache$spark$metrics$sink$StatsdSinkSuite$$withSocketAndSink(StatsdSinkSuite.scala:51) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4.apply$mcV$sp(StatsdSinkSuite.scala:123) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4.apply(StatsdSinkSuite.scala:123) On debugging observed that the last packet is not received at "socket.receive(p)". Hence the assert fails. Also I want to know, which feature of Apache Spark is tested in this this test. 
> Test "metrics StatsD sink with Timer " fails on BigEndian > - > > Key: SPARK-27428 > URL: https://issues.apache.org/jira/browse/SPARK-27428 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2, 2.3.3, 2.3.4 > Environment: Working on Ubuntu16.04, Linux > Java versions : > Eclipse OpenJ9 VM (build openj9-0.12.1, JRE 1.8.0 Linux s390x-64-Bit > Compressed References 20190205_218 (JIT enabled, AOT enabled) > and > openjdk version "1.8.0_191" > OpenJDK Runtime Environment (build 1.8.0_191-8u191-b12-2ubuntu0.18.04.1-b12) > OpenJDK 64-Bit Zero VM (build 25.191-b12, interpreted mode) >Reporter: Anuja Jakhade >Priority: Major > > Test case "metrics StatsD sink with Timer *** FAILED ***" fails with error > java.net.SocketTimeoutException: Receive timed out > at > java.net.AbstractPlainDatagramSocketImpl.receive(AbstractPlainDatagramSocketImpl.java:143) > at java.net.DatagramSocket.receive(DatagramSocket.java:812) > at > org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4$$anonfun$apply$3.apply(StatsdSinkSuite.scala:155) > at > org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4$$anonfun$apply$3.apply(StatsdSinkSuite.scala:154) > at scala.collection.immutable.Range.foreach(Range.scala:160) > at > org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4.apply(StatsdSinkSuite.scala:154) > at > org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4.apply(StatsdSinkSuite.scala:123) > at > org.apache.spark.metrics.sink.StatsdSinkSuite.org$apache$spark$metrics$sink$StatsdSinkSuite$$withSocketAndSink(StatsdSinkSuite.scala:51) > at > org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4.apply$mcV$sp(StatsdSinkSuite.scala:123) > at >
[jira] [Updated] (SPARK-27428) Test "metrics StatsD sink with Timer " fails on BigEndian
[ https://issues.apache.org/jira/browse/SPARK-27428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anuja Jakhade updated SPARK-27428: -- Description: Test case "metrics StatsD sink with Timer *** FAILED ***" fails with error ```java.net.SocketTimeoutException: Receive timed out at java.net.AbstractPlainDatagramSocketImpl.receive(AbstractPlainDatagramSocketImpl.java:143) at java.net.DatagramSocket.receive(DatagramSocket.java:812) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4$$anonfun$apply$3.apply(StatsdSinkSuite.scala:155) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4$$anonfun$apply$3.apply(StatsdSinkSuite.scala:154) at scala.collection.immutable.Range.foreach(Range.scala:160) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4.apply(StatsdSinkSuite.scala:154) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4.apply(StatsdSinkSuite.scala:123) at org.apache.spark.metrics.sink.StatsdSinkSuite.org$apache$spark$metrics$sink$StatsdSinkSuite$$withSocketAndSink(StatsdSinkSuite.scala:51) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4.apply$mcV$sp(StatsdSinkSuite.scala:123) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4.apply(StatsdSinkSuite.scala:123) On debugging observed that the last packet is not received at "socket.receive(p)". Hence the assert fails. ``` Also I want to know, which feature of Apache Spark is tested in this this test. 
was: Test case "metrics StatsD sink with Timer *** FAILED ***" fails with error java.net.SocketTimeoutException: Receive timed out at java.net.AbstractPlainDatagramSocketImpl.receive(AbstractPlainDatagramSocketImpl.java:143) at java.net.DatagramSocket.receive(DatagramSocket.java:812) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4$$anonfun$apply$3.apply(StatsdSinkSuite.scala:155) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4$$anonfun$apply$3.apply(StatsdSinkSuite.scala:154) at scala.collection.immutable.Range.foreach(Range.scala:160) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4.apply(StatsdSinkSuite.scala:154) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4.apply(StatsdSinkSuite.scala:123) at org.apache.spark.metrics.sink.StatsdSinkSuite.org$apache$spark$metrics$sink$StatsdSinkSuite$$withSocketAndSink(StatsdSinkSuite.scala:51) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4.apply$mcV$sp(StatsdSinkSuite.scala:123) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4.apply(StatsdSinkSuite.scala:123) On debugging observed that the last packet is not received at "socket.receive(p)". Hence the assert fails. Also I want to know, which feature of Apache Spark is tested in this test. 
> Test "metrics StatsD sink with Timer " fails on BigEndian > - > > Key: SPARK-27428 > URL: https://issues.apache.org/jira/browse/SPARK-27428 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2, 2.3.3, 2.3.4 > Environment: Working on Ubuntu16.04, Linux > Java versions : > Eclipse OpenJ9 VM (build openj9-0.12.1, JRE 1.8.0 Linux s390x-64-Bit > Compressed References 20190205_218 (JIT enabled, AOT enabled) > and > openjdk version "1.8.0_191" > OpenJDK Runtime Environment (build 1.8.0_191-8u191-b12-2ubuntu0.18.04.1-b12) > OpenJDK 64-Bit Zero VM (build 25.191-b12, interpreted mode) >Reporter: Anuja Jakhade >Priority: Major > > Test case "metrics StatsD sink with Timer *** FAILED ***" fails with error > ```java.net.SocketTimeoutException: Receive timed out > at > java.net.AbstractPlainDatagramSocketImpl.receive(AbstractPlainDatagramSocketImpl.java:143) > at java.net.DatagramSocket.receive(DatagramSocket.java:812) > at > org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4$$anonfun$apply$3.apply(StatsdSinkSuite.scala:155) > at > org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4$$anonfun$apply$3.apply(StatsdSinkSuite.scala:154) > at scala.collection.immutable.Range.foreach(Range.scala:160) > at > org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4.apply(StatsdSinkSuite.scala:154) > at > org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4.apply(StatsdSinkSuite.scala:123) > at > org.apache.spark.metrics.sink.StatsdSinkSuite.org$apache$spark$metrics$sink$StatsdSinkSuite$$withSocketAndSink(StatsdSinkSuite.scala:51) > at > org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4.apply$mcV$sp(StatsdSinkSuite.scala:123) > at >
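The SocketTimeoutException in the report above comes from a plain UDP receive with a socket timeout expiring before the expected datagram arrives. A minimal Python sketch of that failure mode, independent of Spark (the 200 ms timeout here is an illustrative value, not the suite's actual setting):

```python
import socket

# A UDP socket with a receive timeout raises socket.timeout when no
# datagram arrives in time -- the Java analogue is SocketTimeoutException.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("127.0.0.1", 0))   # bind to any free local port
sock.settimeout(0.2)          # 200 ms receive timeout (illustrative)

try:
    data, addr = sock.recvfrom(1024)  # nothing is ever sent, so this times out
    received = data
except socket.timeout:
    received = None               # the branch the failing assertion corresponds to
finally:
    sock.close()

print(received)  # None: the "last packet" never arrived
```

If the StatsD sink really did emit all expected packets, the receive would return before the timeout; the test failure indicates the last datagram is missing or late on that platform.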
[jira] [Updated] (SPARK-27430) BroadcastNestedLoopJoinExec can build any side whatever the join type is
[ https://issues.apache.org/jira/browse/SPARK-27430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-27430: Summary: BroadcastNestedLoopJoinExec can build any side whatever the join type is (was: BroadcastNestedLoopJoinExec should support all join types) > BroadcastNestedLoopJoinExec can build any side whatever the join type is > > > Key: SPARK-27430 > URL: https://issues.apache.org/jira/browse/SPARK-27430 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27430) BroadcastNestedLoopJoinExec should support all join types
Wenchen Fan created SPARK-27430: --- Summary: BroadcastNestedLoopJoinExec should support all join types Key: SPARK-27430 URL: https://issues.apache.org/jira/browse/SPARK-27430 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27429) [SQL] to_timestamp function with additional argument flag that will allow exception if value could not be cast
t oo created SPARK-27429: Summary: [SQL] to_timestamp function with additional argument flag that will allow exception if value could not be cast Key: SPARK-27429 URL: https://issues.apache.org/jira/browse/SPARK-27429 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.1 Reporter: t oo If I am running SQL on a CSV-based dataframe and my query contains to_timestamp(input_col, 'yyyy-MM-dd HH:mm:ss'), and the values in input_col are not really timestamps (e.g. 'ABC'), then I would like the to_timestamp function to throw an exception rather than happily (silently) return the values as null. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
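The request above asks for a strict mode on top of the current permissive behavior (unparseable input silently becomes NULL). A minimal Python sketch of the two modes, using datetime.strptime as a stand-in parser; the `strict` flag is a hypothetical name for illustration, not an existing Spark API:

```python
from datetime import datetime

def to_timestamp(value, fmt="%Y-%m-%d %H:%M:%S", strict=False):
    """Parse value as a timestamp. Return None on failure (today's
    silent behavior), or re-raise if strict=True (the requested mode)."""
    try:
        return datetime.strptime(value, fmt)
    except (ValueError, TypeError):
        if strict:
            raise
        return None

print(to_timestamp("2019-04-11 10:30:00"))  # parses in both modes
print(to_timestamp("ABC"))                  # None: the silent null the report complains about
```

With `strict=True`, `to_timestamp("ABC")` would raise a ValueError instead, surfacing bad data at parse time rather than as unexplained nulls downstream.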
[jira] [Resolved] (SPARK-26012) Dynamic partition will fail when both '' and null values are taken as dynamic partition values simultaneously.
[ https://issues.apache.org/jira/browse/SPARK-26012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-26012. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24334 [https://github.com/apache/spark/pull/24334] > Dynamic partition will fail when both '' and null values are taken as dynamic > partition values simultaneously. > -- > > Key: SPARK-26012 > URL: https://issues.apache.org/jira/browse/SPARK-26012 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2, 2.4.0 >Reporter: eaton >Assignee: eaton >Priority: Major > Fix For: 3.0.0 > > > Dynamic partition will fail when both '' and null values are taken as dynamic > partition values simultaneously. > For example, the test below will fail. > test("Null and '' values should not cause dynamic partition failure of string > types") { > withTable("t1", "t2") > { spark.range(3).write.saveAsTable("t1") spark.sql("select id, cast(case when > id = 1 then '' else null end as string) as p" + " from > t1").write.partitionBy("p").saveAsTable("t2") > checkAnswer(spark.table("t2").sort("id"), Seq(Row(0, null), Row(1, null), > Row(2, null))) } > } > The error is: 'org.apache.hadoop.fs.FileAlreadyExistsException: File already > exists'. 
> > Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: File already > exists: > [file:/F:/learning/spark/spark_master/spark_compile/spark-warehouse/t2/_temporary/0/_temporary/attempt_2018204354_0001_m_00_0/p=__HIVE_DEFAULT_PARTITION__/part-0-96217c96-3695-4f18-b0db-4f35a9078a3d.c000.snappy.parquet|file:///F:/learning/spark/spark_master/spark_compile/spark-warehouse/t2/_temporary/0/_temporary/attempt_2018204354_0001_m_00_0/p=__HIVE_DEFAULT_PARTITION__/part-0-96217c96-3695-4f18-b0db-4f35a9078a3d.c000.snappy.parquet] > at > org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:289) > at > org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:328) > at > org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.(ChecksumFileSystem.java:398) > at > org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:461) > at > org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:440) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:892) > at > org.apache.parquet.hadoop.util.HadoopOutputFile.create(HadoopOutputFile.java:74) > at > org.apache.parquet.hadoop.ParquetFileWriter.(ParquetFileWriter.java:248) > at > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:390) > at > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetOutputWriter.scala:37) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:151) > at > org.apache.spark.sql.execution.datasources.DynamicPartitionDataWriter.newOutputWriter(FileFormatDataWriter.scala:236) > at > org.apache.spark.sql.execution.datasources.DynamicPartitionDataWriter.write(FileFormatDataWriter.scala:260) > at > 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:239) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:245) > ... 10 more > 20:43:55.460 WARN > org.apache.spark.sql.execution.datasources.FileFormatWriterSuite: -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
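The FileAlreadyExistsException above arises because Hive-style partition paths have a single placeholder for "no value": both NULL and (for string columns) the empty string collapse into the same __HIVE_DEFAULT_PARTITION__ directory, so two writers race to create the same file. A rough Python sketch of the path mapping (directory naming is illustrative of the Hive convention, not Spark's exact layout code):

```python
HIVE_DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__"

def partition_dir(col, value):
    # Both None and '' collapse to the same default partition directory,
    # which is why '' and NULL partition values collide in a single write.
    if value is None or value == "":
        return f"{col}={HIVE_DEFAULT_PARTITION}"
    return f"{col}={value}"

paths = {partition_dir("p", v) for v in [None, "", "x"]}
print(paths)  # only two distinct directories for three distinct values
```

Three distinct input values map to only two output directories, so the writer must treat '' and NULL as the same partition (or deduplicate writers), which is what the fix addresses.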
[jira] [Assigned] (SPARK-26012) Dynamic partition will fail when both '' and null values are taken as dynamic partition values simultaneously.
[ https://issues.apache.org/jira/browse/SPARK-26012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-26012: --- Assignee: eaton > Dynamic partition will fail when both '' and null values are taken as dynamic > partition values simultaneously. > -- > > Key: SPARK-26012 > URL: https://issues.apache.org/jira/browse/SPARK-26012 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2, 2.4.0 >Reporter: eaton >Assignee: eaton >Priority: Major > > Dynamic partition will fail when both '' and null values are taken as dynamic > partition values simultaneously. > For example, the test below will fail. > test("Null and '' values should not cause dynamic partition failure of string > types") { > withTable("t1", "t2") > { spark.range(3).write.saveAsTable("t1") spark.sql("select id, cast(case when > id = 1 then '' else null end as string) as p" + " from > t1").write.partitionBy("p").saveAsTable("t2") > checkAnswer(spark.table("t2").sort("id"), Seq(Row(0, null), Row(1, null), > Row(2, null))) } > } > The error is: 'org.apache.hadoop.fs.FileAlreadyExistsException: File already > exists'. 
[jira] [Created] (SPARK-27428) Test "metrics StatsD sink with Timer " fails on BigEndian
Anuja Jakhade created SPARK-27428: - Summary: Test "metrics StatsD sink with Timer " fails on BigEndian Key: SPARK-27428 URL: https://issues.apache.org/jira/browse/SPARK-27428 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.3, 2.3.2, 2.3.4 Environment: Working on Ubuntu 16.04, Linux Java versions: Eclipse OpenJ9 VM (build openj9-0.12.1, JRE 1.8.0 Linux s390x-64-Bit Compressed References 20190205_218 (JIT enabled, AOT enabled) and openjdk version "1.8.0_191" OpenJDK Runtime Environment (build 1.8.0_191-8u191-b12-2ubuntu0.18.04.1-b12) OpenJDK 64-Bit Zero VM (build 25.191-b12, interpreted mode) Reporter: Anuja Jakhade Test case "metrics StatsD sink with Timer *** FAILED ***" fails with error java.net.SocketTimeoutException: Receive timed out at java.net.AbstractPlainDatagramSocketImpl.receive(AbstractPlainDatagramSocketImpl.java:143) at java.net.DatagramSocket.receive(DatagramSocket.java:812) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4$$anonfun$apply$3.apply(StatsdSinkSuite.scala:155) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4$$anonfun$apply$3.apply(StatsdSinkSuite.scala:154) at scala.collection.immutable.Range.foreach(Range.scala:160) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4.apply(StatsdSinkSuite.scala:154) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4.apply(StatsdSinkSuite.scala:123) at org.apache.spark.metrics.sink.StatsdSinkSuite.org$apache$spark$metrics$sink$StatsdSinkSuite$$withSocketAndSink(StatsdSinkSuite.scala:51) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4.apply$mcV$sp(StatsdSinkSuite.scala:123) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4.apply(StatsdSinkSuite.scala:123) On debugging, we observed that the last packet is not received at "socket.receive(p)". Hence the assert fails. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27427) cannot change list of packages from PYSPARK_SUBMIT_ARGS after restarting context
Natalino Busa created SPARK-27427: - Summary: cannot change list of packages from PYSPARK_SUBMIT_ARGS after restarting context Key: SPARK-27427 URL: https://issues.apache.org/jira/browse/SPARK-27427 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.4.0 Reporter: Natalino Busa Upon creating the gateway, the env variable PYSPARK_SUBMIT_ARGS is read; if packages are provided, the jars will be downloaded and added to the list of files to distribute to the cluster. However, this mechanism only works once, because the gateway is kept when the SparkContext is stopped and created again. Possible root cause: the gateway is only created once. This more aggressive SparkSession stop forces the gateway to be recreated:

```python
def stop(spark_session=None):
    try:
        sc = None
        if spark_session:
            sc = spark_session.sparkContext
            spark_session.stop()
        cls = pyspark.SparkContext
        sc = sc or cls._active_spark_context
        if sc:
            sc.stop()
            sc._gateway.shutdown()
        # Clear the class-level caches so the next context launches a new gateway.
        cls._active_spark_context = None
        cls._gateway = None
        cls._jvm = None
    except Exception as e:
        print(e)
        logging.warning('Could not fully stop the engine context')
```

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
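The "only works once" behavior follows from the gateway being cached as a class-level attribute rather than per-context state. A stripped-down, self-contained Python sketch of that caching pattern (the `Context` class here is purely illustrative; only the attribute names mirror pyspark's private ones):

```python
import os

class Context:
    _gateway = None  # class-level cache: survives individual stop() calls

    @classmethod
    def launch_gateway(cls):
        # The env variable is only consulted when no cached gateway exists,
        # so changing PYSPARK_SUBMIT_ARGS after the first launch has no effect.
        if cls._gateway is None:
            cls._gateway = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
        return cls._gateway

os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages a"
first = Context.launch_gateway()
os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages b"
second = Context.launch_gateway()  # cached: still reflects "--packages a"
Context._gateway = None            # the workaround: drop the cached gateway
third = Context.launch_gateway()   # now picks up "--packages b"
```

This mirrors why the more aggressive stop() in the report works: clearing the cached gateway forces the next context creation to re-read PYSPARK_SUBMIT_ARGS.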
[jira] [Commented] (SPARK-27409) Micro-batch support for Kafka Source in Spark 2.3
[ https://issues.apache.org/jira/browse/SPARK-27409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814314#comment-16814314 ] Gabor Somogyi commented on SPARK-27409: --- ping [~pbharaj] > Micro-batch support for Kafka Source in Spark 2.3 > - > > Key: SPARK-27409 > URL: https://issues.apache.org/jira/browse/SPARK-27409 > Project: Spark > Issue Type: Question > Components: Structured Streaming >Affects Versions: 2.3.2 >Reporter: Prabhjot Singh Bharaj >Priority: Major > > It seems with this change - > [https://github.com/apache/spark/commit/0a441d2edb0a3f6c6c7c370db8917e1c07f211e7#diff-eeac5bdf3a1ecd7b9f8aaf10fff37f05R50] > in Spark 2.3 for Kafka Source Provider, a Kafka source can not be run in > micro-batch mode but only in continuous mode. Is that understanding correct ? > {code:java} > E Py4JJavaError: An error occurred while calling o217.load. > E : org.apache.kafka.common.KafkaException: Failed to construct kafka consumer > E at > org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:717) > E at > org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:566) > E at > org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:549) > E at > org.apache.spark.sql.kafka010.SubscribeStrategy.createConsumer(ConsumerStrategy.scala:62) > E at > org.apache.spark.sql.kafka010.KafkaOffsetReader.createConsumer(KafkaOffsetReader.scala:314) > E at > org.apache.spark.sql.kafka010.KafkaOffsetReader.(KafkaOffsetReader.scala:78) > E at > org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:130) > E at > org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:43) > E at > org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:185) > E at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > E at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > E at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > E at java.lang.reflect.Method.invoke(Method.java:498) > E at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > E at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > E at py4j.Gateway.invoke(Gateway.java:282) > E at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > E at py4j.commands.CallCommand.execute(CallCommand.java:79) > E at py4j.GatewayConnection.run(GatewayConnection.java:238) > E at java.lang.Thread.run(Thread.java:748) > E Caused by: org.apache.kafka.common.KafkaException: > org.apache.kafka.common.KafkaException: java.io.FileNotFoundException: > non-existent (No such file or directory) > E at > org.apache.kafka.common.network.SslChannelBuilder.configure(SslChannelBuilder.java:44) > E at > org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:93) > E at > org.apache.kafka.common.network.ChannelBuilders.clientChannelBuilder(ChannelBuilders.java:51) > E at > org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:84) > E at > org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:657) > E ... 19 more > E Caused by: org.apache.kafka.common.KafkaException: > java.io.FileNotFoundException: non-existent (No such file or directory) > E at > org.apache.kafka.common.security.ssl.SslFactory.configure(SslFactory.java:121) > E at > org.apache.kafka.common.network.SslChannelBuilder.configure(SslChannelBuilder.java:41) > E ... 
23 more > E Caused by: java.io.FileNotFoundException: non-existent (No such file or > directory) > E at java.io.FileInputStream.open0(Native Method) > E at java.io.FileInputStream.open(FileInputStream.java:195) > E at java.io.FileInputStream.(FileInputStream.java:138) > E at java.io.FileInputStream.(FileInputStream.java:93) > E at > org.apache.kafka.common.security.ssl.SslFactory$SecurityStore.load(SslFactory.java:216) > E at > org.apache.kafka.common.security.ssl.SslFactory$SecurityStore.access$000(SslFactory.java:201) > E at > org.apache.kafka.common.security.ssl.SslFactory.createSSLContext(SslFactory.java:137) > E at > org.apache.kafka.common.security.ssl.SslFactory.configure(SslFactory.java:119) > E ... 24 more{code} > When running a simple data stream loader for kafka without an SSL cert, it > goes through this code block - > > {code:java} > ... > ... > org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:130) > E at > org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:43) > E at > org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:185) > ... > ...{code} > > Note that I haven't selected `trigger=continuous...` when
[jira] [Updated] (SPARK-27425) Add count_if functions
[ https://issues.apache.org/jira/browse/SPARK-27425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chaerim Yeo updated SPARK-27425: Description: Add aggregation function which returns the number of records satisfying a given condition. For Presto, [{{count_if}}|https://prestodb.github.io/docs/current/functions/aggregate.html] function is supported, we can write concisely. However, Spark does not support yet, we need to write like {{COUNT(CASE WHEN some_condition THEN 1 END)}} or {{SUM(CASE WHEN some_condition THEN 1 END)}}, which looks painful. was: Add aggregation function which returns the number of records satisfying a given condition. For Presto, [{{count_if}}|https://prestodb.github.io/docs/current/functions/aggregate.html] function is supported, we can write concisely. However, Spark does not support yet, we need to write like {{count(case when some_condition then 1 end)}} or {{sum(case when some_condition then 1 end)}}, which looks painful. > Add count_if functions > -- > > Key: SPARK-27425 > URL: https://issues.apache.org/jira/browse/SPARK-27425 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.1 >Reporter: Chaerim Yeo >Priority: Minor > > Add aggregation function which returns the number of records satisfying a > given condition. > For Presto, > [{{count_if}}|https://prestodb.github.io/docs/current/functions/aggregate.html] > function is supported, we can write concisely. > However, Spark does not support yet, we need to write like {{COUNT(CASE WHEN > some_condition THEN 1 END)}} or {{SUM(CASE WHEN some_condition THEN 1 END)}}, > which looks painful. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27425) Add count_if functions
[ https://issues.apache.org/jira/browse/SPARK-27425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chaerim Yeo updated SPARK-27425: Description: Add aggregation function which returns the number of records satisfying a given condition. For Presto, [{{count_if}}|https://prestodb.github.io/docs/current/functions/aggregate.html] function is supported, we can write concisely. However, Spark does not support yet, we need to write like {{count(case when some_condition then 1 end)}} or {{sum(case when some_condition then 1 end)}}, which looks painful. was: Add aggregation function which returns the number of records satisfying a given condition. For Presto, [{{count_if}}|https://prestodb.github.io/docs/current/functions/aggregate.html] function is supported, we can write concisely. However, Spark does not support yet, we need to write like {{count(case when some_condition then 1)}} or {{sum(case when some_condition then 1 end)}}, which looks painful. > Add count_if functions > -- > > Key: SPARK-27425 > URL: https://issues.apache.org/jira/browse/SPARK-27425 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.1 >Reporter: Chaerim Yeo >Priority: Minor > > Add aggregation function which returns the number of records satisfying a > given condition. > For Presto, > [{{count_if}}|https://prestodb.github.io/docs/current/functions/aggregate.html] > function is supported, we can write concisely. > However, Spark does not support yet, we need to write like {{count(case when > some_condition then 1 end)}} or {{sum(case when some_condition then 1 end)}}, > which looks painful. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27425) Add count_if functions
[ https://issues.apache.org/jira/browse/SPARK-27425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chaerim Yeo updated SPARK-27425: Summary: Add count_if functions (was: SQL count_if functions) > Add count_if functions > -- > > Key: SPARK-27425 > URL: https://issues.apache.org/jira/browse/SPARK-27425 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.1 >Reporter: Chaerim Yeo >Priority: Minor > > Add aggregation function which returns the number of records satisfying a > given condition. > For Presto, > [{{count_if}}|https://prestodb.github.io/docs/current/functions/aggregate.html] > function is supported, we can write concisely. > However, Spark does not support yet, we need to write like {{count(case when > some_condition then 1)}} or {{sum(case when some_condition then 1 end)}}, > which looks painful. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27337) QueryExecutionListener never cleans up listeners from the bus after SparkSession is closed
[ https://issues.apache.org/jira/browse/SPARK-27337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16814247#comment-16814247 ] Lantao Jin commented on SPARK-27337: {quote} This means that if you close and remake spark sessions fairly frequently, you're leaking every single time. {quote} What does "close spark sessions" mean here? SparkSession.close() will stop the underlying SparkContext. https://github.com/apache/spark/blob/85e5d4f141eedd571dfa0dcdabedced19736a351/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L718 > QueryExecutionListener never cleans up listeners from the bus after > SparkSession is closed > -- > > Key: SPARK-27337 > URL: https://issues.apache.org/jira/browse/SPARK-27337 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Vinoo Ganesh >Priority: Critical > Attachments: image001-1.png > > > As a result of > [https://github.com/apache/spark/commit/9690eba16efe6d25261934d8b73a221972b684f3], > it looks like there is a memory leak (specifically > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/util/QueryExecutionListener.scala#L131]). > Because the Listener Bus on the context still has a reference to the listener > (even after the SparkSession is cleared), they are never cleaned up. This > means that if you close and remake spark sessions fairly frequently, you're > leaking every single time.
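The leak pattern under discussion reduces to a few lines. This is a sketch of the mechanism only, not Spark's actual code: a bus that holds strong references keeps every listener alive until something explicitly unregisters it, which is exactly what never happens when the session is cleared without removing its listener.

```python
class ListenerBus:
    """Sketch of a bus that holds strong references to its listeners."""
    def __init__(self):
        self._listeners = []
    def add(self, listener):
        self._listeners.append(listener)
    def remove(self, listener):
        self._listeners.remove(listener)
    def count(self):
        return len(self._listeners)

class Session:
    """Stand-in for a SparkSession registering a QueryExecutionListener."""
    def __init__(self, bus):
        self._bus = bus
        self._listener = object()
        bus.add(self._listener)
    def close(self, unregister=True):
        if unregister:
            self._bus.remove(self._listener)

clean, leaky = ListenerBus(), ListenerBus()
for _ in range(100):
    Session(clean).close()                   # unregisters: nothing accumulates
for _ in range(100):
    Session(leaky).close(unregister=False)   # the reported behavior: refs pile up
print(clean.count(), leaky.count())  # 0 100
```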
[jira] [Created] (SPARK-27426) SparkAppHandle states not getting updated in Kubernetes
Nishant Ranjan created SPARK-27426: -- Summary: SparkAppHandle states not getting updated in Kubernetes Key: SPARK-27426 URL: https://issues.apache.org/jira/browse/SPARK-27426 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 2.4.0 Environment: # CentOS 7 # Kubernetes 1.14 # Spark 2.4.0 Reporter: Nishant Ranjan While launching a Spark application through startApplication(), the SparkAppHandle state is not getting updated.
{code:java}
sparkLaunch = new SparkLauncher()
    .setSparkHome("/root/test/spark-2.4.0-bin-hadoop2.7")
    .setMaster("k8s://https://172.16.23.30:6443")
    .setVerbose(true)
    .addSparkArg("--verbose")
    .setAppResource("local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar")
    .setConf("spark.app.name", "spark-pi")
    .setMainClass("org.apache.spark.examples.SparkPi")
    .setConf("spark.executor.instances", "5")
    .setConf("spark.kubernetes.container.image", "registry.renovite.com/spark:v2")
    .setConf("spark.kubernetes.driver.pod.name", "spark-pi-driver")
    .setConf("spark.kubernetes.container.image.pullSecrets", "dev-registry-key")
    .setConf("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
    .setDeployMode("cluster");
SparkAppHandle handle = sparkLaunch.startApplication();
{code}
Observations: # I tried listeners etc., but handle.getState() returns UNKNOWN, and when the Spark application completes, the state changes to LOST. # SparkAppHandle is not null. # handle.getAppId() is always null. My best guess is that communication is not working properly between the listener and the Spark driver in Kubernetes.
[jira] [Created] (SPARK-27425) SQL count_if functions
Chaerim Yeo created SPARK-27425: --- Summary: SQL count_if functions Key: SPARK-27425 URL: https://issues.apache.org/jira/browse/SPARK-27425 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.1 Reporter: Chaerim Yeo Add aggregation function which returns the number of records satisfying a given condition. For Presto, [{{count_if}}|https://prestodb.github.io/docs/current/functions/aggregate.html] function is supported, we can write concisely. However, Spark does not support yet, we need to write like {{count(case when some_condition then 1)}} or {{sum(case when some_condition then 1 end)}}, which looks painful. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
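For reference, the requested function and the workaround the reporter mentions are equivalent. A small Python model of the semantics (an illustration, not Spark or Presto code): `count_if` counts rows satisfying a predicate, while `COUNT(CASE WHEN ... THEN 1 END)` relies on COUNT ignoring the NULLs produced by the CASE's missing ELSE branch.

```python
def count_if(rows, predicate):
    """Presto-style count_if: number of rows for which the predicate holds."""
    return sum(1 for r in rows if predicate(r))

rows = [1, 5, 10, 15]
# COUNT(CASE WHEN r > 4 THEN 1 END): the CASE yields NULL (None) for
# non-matching rows, and COUNT skips NULLs.
workaround = sum(1 for r in rows if (1 if r > 4 else None) is not None)
assert count_if(rows, lambda r: r > 4) == workaround == 3
```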
[jira] [Resolved] (SPARK-27406) UnsafeArrayData serialization breaks when two machines have different Oops size
[ https://issues.apache.org/jira/browse/SPARK-27406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-27406. - Resolution: Fixed Assignee: peng bo Fix Version/s: 3.0.0 2.4.2 > UnsafeArrayData serialization breaks when two machines have different Oops > size > --- > > Key: SPARK-27406 > URL: https://issues.apache.org/jira/browse/SPARK-27406 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.1 >Reporter: peng bo >Assignee: peng bo >Priority: Major > Fix For: 2.4.2, 3.0.0 > > > ApproxCountDistinctForIntervals holds the UnsafeArrayData data to initialize > endpoints. When the UnsafeArrayData is serialized with Java serialization, > the BYTE_ARRAY_OFFSET in memory can change if two machines have different > pointer width (Oops in JVM). > It's similar to SPARK-10914. > {code:java} > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals$$anonfun$endpoints$1.apply(ApproxCountDistinctForIntervals.scala:69) > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals$$anonfun$endpoints$1.apply(ApproxCountDistinctForIntervals.scala:69) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.endpoints$lzycompute(ApproxCountDistinctForIntervals.scala:69) > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.endpoints(ApproxCountDistinctForIntervals.scala:66) > at > 
org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.org$apache$spark$sql$catalyst$expressions$aggregate$ApproxCountDistinctForIntervals$$hllppArray$lzycompute(ApproxCountDistinctForIntervals.scala:94) > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.org$apache$spark$sql$catalyst$expressions$aggregate$ApproxCountDistinctForIntervals$$hllppArray(ApproxCountDistinctForIntervals.scala:93) > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.org$apache$spark$sql$catalyst$expressions$aggregate$ApproxCountDistinctForIntervals$$numWordsPerHllpp$lzycompute(ApproxCountDistinctForIntervals.scala:104) > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.org$apache$spark$sql$catalyst$expressions$aggregate$ApproxCountDistinctForIntervals$$numWordsPerHllpp(ApproxCountDistinctForIntervals.scala:104) > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.totalNumWords$lzycompute(ApproxCountDistinctForIntervals.scala:106) > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.totalNumWords(ApproxCountDistinctForIntervals.scala:106) > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.createAggregationBuffer(ApproxCountDistinctForIntervals.scala:110) > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.createAggregationBuffer(ApproxCountDistinctForIntervals.scala:44) > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.initialize(interfaces.scala:528) > at > org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator$$anonfun$initAggregationBuffer$2.apply(ObjectAggregationIterator.scala:120) > at > org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator$$anonfun$initAggregationBuffer$2.apply(ObjectAggregationIterator.scala:120) > at > 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.initAggregationBuffer(ObjectAggregationIterator.scala:120) > at > org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.org$apache$spark$sql$execution$aggregate$ObjectAggregationIterator$$createNewAggregationBuffer(ObjectAggregationIterator.scala:112) > at >
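The failure mode described (writer and reader JVMs with different Oops sizes, hence different BYTE_ARRAY_OFFSET values) can be illustrated without a JVM. This is a sketch with made-up base offsets, not UnsafeArrayData itself: serializing positions that bake in the writer's base offset corrupts them on a machine with a different base, while base-relative offsets round-trip correctly.

```python
# Stand-ins for BYTE_ARRAY_OFFSET on two JVMs with different Oops sizes.
BASE_WRITER, BASE_READER = 16, 24

def serialize_absolute(relative_offsets):
    # Buggy scheme: bake the writer's base into the serialized positions.
    return [BASE_WRITER + off for off in relative_offsets]

def deserialize(absolute_offsets, base):
    # The reader subtracts *its own* base to recover relative offsets.
    return [a - base for a in absolute_offsets]

rel = [0, 8, 16]
corrupted = deserialize(serialize_absolute(rel), BASE_READER)
assert corrupted != rel  # every offset is off by BASE_READER - BASE_WRITER
assert deserialize(serialize_absolute(rel), BASE_WRITER) == rel  # same machine works
# Fix: serialize base-independent (relative) offsets in the first place.
```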
[jira] [Assigned] (SPARK-24872) Remove the symbol “||” of the “OR” operation
[ https://issues.apache.org/jira/browse/SPARK-24872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-24872: --- Assignee: hantiantian > Remove the symbol “||” of the “OR” operation > > > Key: SPARK-24872 > URL: https://issues.apache.org/jira/browse/SPARK-24872 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: hantiantian >Assignee: hantiantian >Priority: Minor > > “||” will perform the function of STRING concat, and it is also the symbol of > the "OR" operation. > When I want use "||" as "OR" operation, I find that it perform the function > of STRING concat, > spark-sql> explain extended select * from aa where id==1 || id==2; > == Parsed Logical Plan == > 'Project [*] > +- 'Filter (('id = concat(1, 'id)) = 2) > +- 'UnresolvedRelation `aa` > spark-sql> select "abc" || "DFF" ; > And the result is "abcDFF". > In predicates.scala, "||" is the symbol of "Or" operation. Could we remove it? > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24872) Remove the symbol “||” of the “OR” operation
[ https://issues.apache.org/jira/browse/SPARK-24872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-24872. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 21826 [https://github.com/apache/spark/pull/21826] > Remove the symbol “||” of the “OR” operation > > > Key: SPARK-24872 > URL: https://issues.apache.org/jira/browse/SPARK-24872 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: hantiantian >Assignee: hantiantian >Priority: Minor > Fix For: 3.0.0 > > > “||” will perform the function of STRING concat, and it is also the symbol of > the "OR" operation. > When I want use "||" as "OR" operation, I find that it perform the function > of STRING concat, > spark-sql> explain extended select * from aa where id==1 || id==2; > == Parsed Logical Plan == > 'Project [*] > +- 'Filter (('id = concat(1, 'id)) = 2) > +- 'UnresolvedRelation `aa` > spark-sql> select "abc" || "DFF" ; > And the result is "abcDFF". > In predicates.scala, "||" is the symbol of "Or" operation. Could we remove it? > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
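The ambiguity behind this ticket is easy to state: ANSI SQL reserves `||` for string concatenation, so once the token is bound to concat it cannot also serve as logical OR. A tiny Python model of the two readings of `a || b` (not Spark's parser, just an illustration) shows why a single meaning had to win:

```python
def pipes_as_concat(a, b):
    # The ANSI SQL meaning Spark keeps: operands are coerced to strings.
    return str(a) + str(b)

def pipes_as_or(a, b):
    # The meaning removed from predicates.scala: logical OR.
    return bool(a) or bool(b)

# The reporter's example: the same token gives incompatible results.
assert pipes_as_concat("abc", "DFF") == "abcDFF"
assert pipes_as_or(True, False) is True
assert pipes_as_concat(True, False) == "TrueFalse"  # the surprise OR-users hit
```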
[jira] [Resolved] (SPARK-27414) make it clear that date type is timezone independent
[ https://issues.apache.org/jira/browse/SPARK-27414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-27414. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24325 [https://github.com/apache/spark/pull/24325] > make it clear that date type is timezone independent > > > Key: SPARK-27414 > URL: https://issues.apache.org/jira/browse/SPARK-27414 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27424) Joining of one stream against the most recent update in another stream
[ https://issues.apache.org/jira/browse/SPARK-27424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814120#comment-16814120 ] Thilo Schneider commented on SPARK-27424: - Attached is a - not fully detailed - sketch of the improvement. I would be willing to work on this further but would like to get your feedback before going into more detail. Do we need a SPIP for this proposal? > Joining of one stream against the most recent update in another stream > -- > > Key: SPARK-27424 > URL: https://issues.apache.org/jira/browse/SPARK-27424 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.1 >Reporter: Thilo Schneider >Priority: Major > Attachments: join-last-update-design.pdf > > > Currently, adding the most recent update of a row with a given key to another > stream is not possible. This situation arises if one wants to use the current > state, of one object, for example when joining the room temperature with the > current weather. > This ticket covers creation of a {{stream_lead}} and modification of the > streaming join logic (and state store) to additionally allow joins of the > form > {code:sql} > SELECT * > FROM A, B > WHERE > A.key = B.key > AND A.time >= B.time > AND A.time < stream_lead(B.time) > {code} > The major aspect of this change is that we actually need a third watermark to > cover how late updates may come. > A rough sketch may be found in the attached document. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27424) Joining of one stream against the most recent update in another stream
[ https://issues.apache.org/jira/browse/SPARK-27424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thilo Schneider updated SPARK-27424: Attachment: join-last-update-design.pdf > Joining of one stream against the most recent update in another stream > -- > > Key: SPARK-27424 > URL: https://issues.apache.org/jira/browse/SPARK-27424 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.1 >Reporter: Thilo Schneider >Priority: Major > Attachments: join-last-update-design.pdf > > > Currently, adding the most recent update of a row with a given key to another > stream is not possible. This situation arises if one wants to use the current > state, of one object, for example when joining the room temperature with the > current weather. > This ticket covers creation of a {{stream_lead}} and modification of the > streaming join logic (and state store) to additionally allow joins of the > form > {code:sql} > SELECT * > FROM A, B > WHERE > A.key = B.key > AND A.time >= B.time > AND A.time < stream_lead(B.time) > {code} > The major aspect of this change is that we actually need a third watermark to > cover how late updates may come. > A rough sketch may be found in the attached document. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27424) Joining of one stream against the most recent update in another stream
Thilo Schneider created SPARK-27424: --- Summary: Joining of one stream against the most recent update in another stream Key: SPARK-27424 URL: https://issues.apache.org/jira/browse/SPARK-27424 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.4.1 Reporter: Thilo Schneider Attachments: join-last-update-design.pdf Currently, adding the most recent update of a row with a given key to another stream is not possible. This situation arises if one wants to use the current state, of one object, for example when joining the room temperature with the current weather. This ticket covers creation of a {{stream_lead}} and modification of the streaming join logic (and state store) to additionally allow joins of the form {code:sql} SELECT * FROM A, B WHERE A.key = B.key AND A.time >= B.time AND A.time < stream_lead(B.time) {code} The major aspect of this change is that we actually need a third watermark to cover how late updates may come. A rough sketch may be found in the attached document. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
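The batch analogue of the proposed join can be sketched in a few lines of Python (an illustration of the intended semantics only, not the proposed streaming implementation): for each row of A, take the B row with the greatest B.time <= A.time for the same key, which is exactly the window B.time <= A.time < stream_lead(B.time).

```python
from bisect import bisect_right

def join_most_recent(a_rows, b_rows):
    """Join each (key, time, value) row of A against the most recent
    earlier-or-equal update in B for the same key, or None if there is none."""
    by_key = {}
    for key, time, value in sorted(b_rows, key=lambda r: (r[0], r[1])):
        times, values = by_key.setdefault(key, ([], []))
        times.append(time)
        values.append(value)
    out = []
    for key, time, value in a_rows:
        times, values = by_key.get(key, ([], []))
        i = bisect_right(times, time) - 1  # index of latest B.time <= A.time
        out.append((key, time, value, values[i] if i >= 0 else None))
    return out

# Room temperatures joined against the weather in effect at each reading:
weather = [("room1", 0, "sunny"), ("room1", 10, "rain")]
temps = [("room1", 5, 21.0), ("room1", 12, 20.5)]
print(join_most_recent(temps, weather))
# [('room1', 5, 21.0, 'sunny'), ('room1', 12, 20.5, 'rain')]
```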
[jira] [Resolved] (SPARK-27181) Add public expression and transform API for DSv2 partitioning
[ https://issues.apache.org/jira/browse/SPARK-27181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-27181. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24117 [https://github.com/apache/spark/pull/24117] > Add public expression and transform API for DSv2 partitioning > - > > Key: SPARK-27181 > URL: https://issues.apache.org/jira/browse/SPARK-27181 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27181) Add public expression and transform API for DSv2 partitioning
[ https://issues.apache.org/jira/browse/SPARK-27181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-27181: --- Assignee: Ryan Blue > Add public expression and transform API for DSv2 partitioning > - > > Key: SPARK-27181 > URL: https://issues.apache.org/jira/browse/SPARK-27181 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27289) spark-submit explicit configuration does not take effect but Spark UI shows it's effective
[ https://issues.apache.org/jira/browse/SPARK-27289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16814094#comment-16814094 ] KaiXu commented on SPARK-27289: --- I have verified that the intermediate data is written to the spark.local.dir configured in spark-defaults.conf, while the value set through --conf shows on the web UI. That means the value passed through --conf does not override the value in spark-defaults.conf; what's more, the value shown on the Web UI is not the value that actually takes effect (the UI shows the value from --conf, but the real working dir is the value from spark-defaults.conf). [~Udbhav Agrawal], I'm using Spark 2.3.3, not sure if that matters. > spark-submit explicit configuration does not take effect but Spark UI shows > it's effective > -- > > Key: SPARK-27289 > URL: https://issues.apache.org/jira/browse/SPARK-27289 > Project: Spark > Issue Type: Bug > Components: Deploy, Documentation, Spark Submit, Web UI >Affects Versions: 2.3.3 >Reporter: KaiXu >Priority: Minor > Attachments: Capture.PNG > > > The [doc|https://spark.apache.org/docs/latest/submitting-applications.html] says that > "In general, configuration values explicitly set on a {{SparkConf}} take the > highest precedence, then flags passed to {{spark-submit}}, then values in the > defaults file", but when setting spark.local.dir through --conf with > spark-submit, it still uses the values from > ${SPARK_HOME}/conf/spark-defaults.conf; what's more, the Spark runtime UI > environment variables show the value from --conf, which is really misleading. > e.g. 
> I set submit my application through the command: > /opt/spark233/bin/spark-submit --properties-file /opt/spark.conf --conf > spark.local.dir=/tmp/spark_local -v --class > org.apache.spark.examples.mllib.SparseNaiveBayes --master > spark://bdw-slave20:7077 > /opt/sparkbench/assembly/target/sparkbench-assembly-7.1-SNAPSHOT-dist.jar > hdfs://bdw-slave20:8020/Bayes/Input > > the spark.local.dir in ${SPARK_HOME}/conf/spark-defaults.conf is: > spark.local.dir=/mnt/nvme1/spark_local > when the application is running, I found the intermediate shuffle data was > wrote to /mnt/nvme1/spark_local, which is set through > ${SPARK_HOME}/conf/spark-defaults.conf, but the Web UI shows that the > environment value spark.local.dir=/tmp/spark_local. > The spark-submit verbose also shows spark.local.dir=/tmp/spark_local, it's > misleading. > > !image-2019-03-27-10-59-38-377.png! > spark-submit verbose: > > Spark properties used, including those specified through > --conf and those from the properties file /opt/spark.conf: > (spark.local.dir,/tmp/spark_local) > (spark.default.parallelism,132) > (spark.driver.memory,10g) > (spark.executor.memory,352g) > X -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
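The precedence the documentation promises is a simple ordered merge. A sketch of what should happen (a hypothetical helper, not Spark's config loader), which matches what the reporter saw on the UI but not on disk:

```python
def effective_conf(defaults_file, submit_flags, spark_conf):
    """Documented precedence: SparkConf > --conf flags > spark-defaults.conf."""
    merged = dict(defaults_file)   # lowest precedence
    merged.update(submit_flags)    # --conf overrides the defaults file
    merged.update(spark_conf)      # explicit SparkConf wins
    return merged

defaults = {"spark.local.dir": "/mnt/nvme1/spark_local"}
flags = {"spark.local.dir": "/tmp/spark_local"}
conf = effective_conf(defaults, flags, {})
# Per the docs (and the UI), --conf should win:
assert conf["spark.local.dir"] == "/tmp/spark_local"
```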