[jira] [Created] (SPARK-27438) Increase precision of to_timestamp
Maxim Gekk created SPARK-27438:
----------------------------------

Summary: Increase precision of to_timestamp
Key: SPARK-27438
URL: https://issues.apache.org/jira/browse/SPARK-27438
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk

The to_timestamp() function currently parses input strings only up to second precision, even if the specified pattern contains a second-fraction sub-pattern. This ticket aims to improve the precision of to_timestamp() up to microsecond precision.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
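The precision the ticket asks for can be illustrated outside Spark with Python's datetime module; this is only a conceptual sketch, where `%f` plays the role of a second-fraction sub-pattern (it is Python syntax, not Spark's `to_timestamp` pattern syntax):

```python
from datetime import datetime

# Parse a timestamp string carrying microsecond precision.
# "%f" consumes the fractional-second digits, analogous to the
# fraction sub-pattern the ticket wants to_timestamp() to honor
# beyond whole seconds.
ts = datetime.strptime("2019-04-11 02:47:13.123456", "%Y-%m-%d %H:%M:%S.%f")
print(ts.microsecond)  # 123456 -- the fraction is preserved, not truncated
```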
[jira] [Created] (SPARK-27437) check the parquet column format
xzh_dz created SPARK-27437:
----------------------------------

Summary: check the parquet column format
Key: SPARK-27437
URL: https://issues.apache.org/jira/browse/SPARK-27437
Project: Spark
Issue Type: Test
Components: SQL
Affects Versions: 2.4.1
Reporter: xzh_dz
[jira] [Created] (SPARK-27436) Add spark.sql.optimizer.nonExcludedRules
Dongjoon Hyun created SPARK-27436:
----------------------------------

Summary: Add spark.sql.optimizer.nonExcludedRules
Key: SPARK-27436
URL: https://issues.apache.org/jira/browse/SPARK-27436
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun

This issue aims to add a `spark.sql.optimizer.nonExcludedRules` static configuration to prevent users from accidentally excluding required rules dynamically in a SQL environment.
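The guard the issue proposes can be sketched generically: user-requested rule exclusions are honored only when they do not target a rule on a non-excludable list. The rule names and function below are illustrative only, not Spark's actual rule registry or configuration API:

```python
# Hypothetical non-excludable set, in the spirit of
# spark.sql.optimizer.nonExcludedRules. Names are made up for illustration.
NON_EXCLUDABLE_RULES = {"ResolveReferences", "EnsureRequirements"}

def effective_excluded_rules(user_excluded):
    """Drop exclusions that target rules the engine must always run."""
    return [r for r in user_excluded if r not in NON_EXCLUDABLE_RULES]

# A user tries to exclude one optional and one required rule; only the
# optional one is actually excluded.
print(effective_excluded_rules(["ConstantFolding", "EnsureRequirements"]))
# ['ConstantFolding']
```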
[jira] [Comment Edited] (SPARK-27337) QueryExecutionListener never cleans up listeners from the bus after SparkSession is cleared
[ https://issues.apache.org/jira/browse/SPARK-27337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815028#comment-16815028 ]

Lantao Jin edited comment on SPARK-27337 at 4/11/19 2:47 AM:
-------------------------------------------------------------

[~vinooganesh] Could you point me to the subject of the mail thread on the dev list?

was (Author: cltlfcjin): [~vinooganesh] Could you point me to the title of the mail thread on the dev list?

> QueryExecutionListener never cleans up listeners from the bus after
> SparkSession is cleared
> --------------------------------------------------------------------
>
> Key: SPARK-27337
> URL: https://issues.apache.org/jira/browse/SPARK-27337
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Vinoo Ganesh
> Priority: Critical
> Attachments: image001-1.png
>
> As a result of
> [https://github.com/apache/spark/commit/9690eba16efe6d25261934d8b73a221972b684f3],
> it looks like there is a memory leak (specifically
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/util/QueryExecutionListener.scala#L131]).
>
> Because the Listener Bus on the context still has a reference to the listener
> (even after the SparkSession is cleared), they are never cleaned up. This
> means that if you close and remake Spark sessions fairly frequently, you're
> leaking every single time.
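The leak pattern described in the issue — a bus holding strong references to listeners whose owning session is gone — can be sketched generically. This is a language-neutral illustration of one possible remedy (weak references), not Spark's ListenerBus implementation; all class names here are hypothetical:

```python
import gc
import weakref

class ListenerBus:
    """Holds weak references so a listener dies with its owner."""
    def __init__(self):
        self._listeners = []

    def add(self, listener):
        # Store a weak reference: the bus does not keep the listener alive.
        self._listeners.append(weakref.ref(listener))

    def live_listeners(self):
        # Drop entries whose referent has been garbage-collected.
        self._listeners = [r for r in self._listeners if r() is not None]
        return [r() for r in self._listeners]

class QueryListener:
    pass

bus = ListenerBus()
listener = QueryListener()
bus.add(listener)
assert len(bus.live_listeners()) == 1

del listener       # "session cleared": the only strong reference is gone
gc.collect()
assert len(bus.live_listeners()) == 0  # the bus no longer pins the listener
```

With a strong-reference list instead of `weakref.ref`, the second assertion would fail — that is the leak the ticket describes.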
[jira] [Commented] (SPARK-27337) QueryExecutionListener never cleans up listeners from the bus after SparkSession is cleared
[ https://issues.apache.org/jira/browse/SPARK-27337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815028#comment-16815028 ]

Lantao Jin commented on SPARK-27337:
------------------------------------

[~vinooganesh] Could you point me to the title of the mail thread on the dev list?

> QueryExecutionListener never cleans up listeners from the bus after
> SparkSession is cleared
>
> Key: SPARK-27337
> URL: https://issues.apache.org/jira/browse/SPARK-27337
[jira] [Commented] (SPARK-26568) Too many partitions may cause thriftServer frequently Full GC
[ https://issues.apache.org/jira/browse/SPARK-26568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815016#comment-16815016 ]

zhoukang commented on SPARK-26568:
----------------------------------

It is the Hive client used in Spark that causes this problem [~srowen]

> Too many partitions may cause thriftServer frequently Full GC
> -------------------------------------------------------------
>
> Key: SPARK-26568
> URL: https://issues.apache.org/jira/browse/SPARK-26568
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: zhoukang
> Priority: Major
>
> The reason is that, first, we have a table with many partitions (possibly several hundred); second, we have some concurrent queries. Then the long-running thriftServer may encounter an OOM issue.
> Here is a case:
> call stack of OOM thread:
> {code:java}
> pool-34-thread-10
>   at org.apache.hadoop.hive.metastore.api.StorageDescriptor.<init>(Lorg/apache/hadoop/hive/metastore/api/StorageDescriptor;)V (StorageDescriptor.java:240)
>   at org.apache.hadoop.hive.metastore.api.Partition.<init>(Lorg/apache/hadoop/hive/metastore/api/Partition;)V (Partition.java:216)
>   at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.deepCopy(Lorg/apache/hadoop/hive/metastore/api/Partition;)Lorg/apache/hadoop/hive/metastore/api/Partition; (HiveMetaStoreClient.java:1343)
>   at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.deepCopyPartitions(Ljava/util/Collection;Ljava/util/List;)Ljava/util/List; (HiveMetaStoreClient.java:1409)
>   at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.deepCopyPartitions(Ljava/util/List;)Ljava/util/List; (HiveMetaStoreClient.java:1397)
>   at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.listPartitionsByFilter(Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;S)Ljava/util/List; (HiveMetaStoreClient.java:914)
>   at sun.reflect.GeneratedMethodAccessor98.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (Method.java:606)
>   at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(Ljava/lang/Object;Ljava/lang/reflect/Method;[Ljava/lang/Object;)Ljava/lang/Object; (RetryingMetaStoreClient.java:90)
>   at com.sun.proxy.$Proxy30.listPartitionsByFilter(Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;S)Ljava/util/List; (Unknown Source)
>   at org.apache.hadoop.hive.ql.metadata.Hive.getPartitionsByFilter(Lorg/apache/hadoop/hive/ql/metadata/Table;Ljava/lang/String;)Ljava/util/List; (Hive.java:1967)
>   at sun.reflect.GeneratedMethodAccessor97.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (Method.java:606)
>   at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(Lorg/apache/hadoop/hive/ql/metadata/Hive;Lorg/apache/hadoop/hive/ql/metadata/Table;Lscala/collection/Seq;)Lscala/collection/Seq; (HiveShim.scala:602)
>   at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply()Lscala/collection/Seq; (HiveClientImpl.scala:608)
>   at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply()Ljava/lang/Object; (HiveClientImpl.scala:606)
>   at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply()Ljava/lang/Object; (HiveClientImpl.scala:321)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(Lscala/Function0;Lscala/runtime/IntRef;Lscala/runtime/ObjectRef;Ljava/lang/Object;)V (HiveClientImpl.scala:264)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(Lscala/Function0;)Ljava/lang/Object; (HiveClientImpl.scala:263)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(Lscala/Function0;)Ljava/lang/Object; (HiveClientImpl.scala:307)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(Lorg/apache/spark/sql/catalyst/catalog/CatalogTable;Lscala/collection/Seq;)Lscala/collection/Seq; (HiveClientImpl.scala:606)
>   at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply()Lscala/collection/Seq; (HiveExternalCatalog.scala:1017)
>   at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply()Ljava/lang/Object; (HiveExternalCatalog.scala:1000)
>   at
[jira] [Commented] (SPARK-26703) Hive record writer will always depends on parquet-1.6 writer should fix it
[ https://issues.apache.org/jira/browse/SPARK-26703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815017#comment-16815017 ]

zhoukang commented on SPARK-26703:
----------------------------------

I can make a PR for this [~hyukjin.kwon]

> Hive record writer will always depends on parquet-1.6 writer should fix it
> ---------------------------------------------------------------------------
>
> Key: SPARK-26703
> URL: https://issues.apache.org/jira/browse/SPARK-26703
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.2, 2.4.0
> Reporter: zhoukang
> Priority: Major
>
> Currently, when we use an insert-into-Hive-table related command, the generated parquet file will always be version 1.6. The reason is below:
> 1. we rely on hive-exec HiveFileFormatUtils to get the recordWriter
> {code:java}
> private val hiveWriter = HiveFileFormatUtils.getHiveRecordWriter(
>   jobConf,
>   tableDesc,
>   serializer.getSerializedClass,
>   fileSinkConf,
>   new Path(path),
>   Reporter.NULL)
> {code}
> 2. we will call
> {code:java}
> public static RecordWriter getHiveRecordWriter(JobConf jc,
>     TableDesc tableInfo, Class outputClass,
>     FileSinkDesc conf, Path outPath, Reporter reporter) throws HiveException {
>   HiveOutputFormat hiveOutputFormat = getHiveOutputFormat(jc, tableInfo);
>   try {
>     boolean isCompressed = conf.getCompressed();
>     JobConf jc_output = jc;
>     if (isCompressed) {
>       jc_output = new JobConf(jc);
>       String codecStr = conf.getCompressCodec();
>       if (codecStr != null && !codecStr.trim().equals("")) {
>         Class codec = (Class) JavaUtils.loadClass(codecStr);
>         FileOutputFormat.setOutputCompressorClass(jc_output, codec);
>       }
>       String type = conf.getCompressType();
>       if (type != null && !type.trim().equals("")) {
>         CompressionType style = CompressionType.valueOf(type);
>         SequenceFileOutputFormat.setOutputCompressionType(jc, style);
>       }
>     }
>     return getRecordWriter(jc_output, hiveOutputFormat, outputClass,
>         isCompressed, tableInfo.getProperties(), outPath, reporter);
>   } catch (Exception e) {
>     throw new HiveException(e);
>   }
> }
>
> public static RecordWriter getRecordWriter(JobConf jc,
>     OutputFormat outputFormat,
>     Class valueClass, boolean isCompressed,
>     Properties tableProp, Path outPath, Reporter reporter
>     ) throws IOException, HiveException {
>   if (!(outputFormat instanceof HiveOutputFormat)) {
>     outputFormat = new HivePassThroughOutputFormat(outputFormat);
>   }
>   return ((HiveOutputFormat) outputFormat).getHiveRecordWriter(
>       jc, outPath, valueClass, isCompressed, tableProp, reporter);
> }
> {code}
> 3. then in MapredParquetOutPutFormat
> {code:java}
> public org.apache.hadoop.hive.ql.exec.FileSinkOperator.RecordWriter getHiveRecordWriter(
>     final JobConf jobConf,
>     final Path finalOutPath,
>     final Class valueClass,
>     final boolean isCompressed,
>     final Properties tableProperties,
>     final Progressable progress) throws IOException {
>   LOG.info("creating new record writer..." + this);
>   final String columnNameProperty = tableProperties.getProperty(IOConstants.COLUMNS);
>   final String columnTypeProperty = tableProperties.getProperty(IOConstants.COLUMNS_TYPES);
>   List columnNames;
>   List columnTypes;
>   if (columnNameProperty.length() == 0) {
>     columnNames = new ArrayList();
>   } else {
>     columnNames = Arrays.asList(columnNameProperty.split(","));
>   }
>   if (columnTypeProperty.length() == 0) {
>     columnTypes = new ArrayList();
>   } else {
>     columnTypes = TypeInfoUtils.getTypeInfosFromTypeString(columnTypeProperty);
>   }
>   DataWritableWriteSupport.setSchema(HiveSchemaConverter.convert(columnNames, columnTypes), jobConf);
>   return getParquerRecordWriterWrapper(realOutputFormat, jobConf, finalOutPath.toString(),
>       progress, tableProperties);
> }
> {code}
> 4. then call
> {code:java}
> public ParquetRecordWriterWrapper(
>     final OutputFormat realOutputFormat,
>     final JobConf jobConf,
>     final String name,
>     final Progressable progress, Properties tableProperties) throws IOException {
>   try {
>     // create a TaskInputOutputContext
>     TaskAttemptID taskAttemptID = TaskAttemptID.forName(jobConf.get("mapred.task.id"));
>     if (taskAttemptID == null) {
>       taskAttemptID = new TaskAttemptID();
>     }
>     taskContext = ContextUtil.newTaskAttemptContext(jobConf, taskAttemptID);
>     LOG.info("initialize serde with table properties.");
>     initializeSerProperties(taskContext, tableProperties);
>
[jira] [Commented] (SPARK-26533) Support query auto cancel on thriftserver
[ https://issues.apache.org/jira/browse/SPARK-26533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815015#comment-16815015 ]

zhoukang commented on SPARK-26533:
----------------------------------

I will work on this

> Support query auto cancel on thriftserver
> ------------------------------------------
>
> Key: SPARK-26533
> URL: https://issues.apache.org/jira/browse/SPARK-26533
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: zhoukang
> Priority: Major
>
> Support automatically cancelling queries that run too long on the thriftServer.
> In some cases we use the thriftServer as a long-running application.
> Sometimes we want no query to run longer than a given time.
> In such cases we can enable auto-cancel for time-consuming queries, which lets us release resources for other queries to run.
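The auto-cancel idea the issue describes — a watchdog that cancels any query exceeding a configured timeout — can be sketched with a cooperative cancellation flag. This is a generic illustration, not the Thrift server's actual API; all names below are hypothetical:

```python
import threading
import time

def run_with_timeout(query_fn, timeout_s):
    """Run query_fn in a worker thread; signal cancellation if it overruns."""
    cancelled = threading.Event()
    result = {}

    def worker():
        result["value"] = query_fn(cancelled)

    t = threading.Thread(target=worker)
    t.start()
    t.join(timeout_s)
    if t.is_alive():
        cancelled.set()  # cooperative cancel: the query must check this flag
        t.join()
    return result.get("value"), cancelled.is_set()

def slow_query(cancelled):
    # Stands in for a long-running query that periodically checks the flag.
    for _ in range(100):
        if cancelled.is_set():
            return "cancelled"
        time.sleep(0.05)
    return "done"

value, was_cancelled = run_with_timeout(slow_query, timeout_s=0.2)
print(value, was_cancelled)  # cancelled True
```

Cancellation here is cooperative, mirroring how a server would interrupt a running statement: the timeout frees the slot so other queries can run.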
[jira] [Commented] (SPARK-27068) Support failed jobs ui and completed jobs ui use different queue
[ https://issues.apache.org/jira/browse/SPARK-27068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815010#comment-16815010 ]

shahid commented on SPARK-27068:
--------------------------------

Failed jobs are already in a single table, right? I meant: for running jobs there is one table, for completed jobs another table, and for failed jobs a separate table as well, right? Do you mean that, while cleaning up jobs, we shouldn't remove them from the failed jobs table?

> Support failed jobs ui and completed jobs ui use different queue
> -----------------------------------------------------------------
>
> Key: SPARK-27068
> URL: https://issues.apache.org/jira/browse/SPARK-27068
> Project: Spark
> Issue Type: Improvement
> Components: Web UI
> Affects Versions: 2.4.0
> Reporter: zhoukang
> Priority: Major
>
> For some long-running jobs, we may want to check the cause of some failed jobs.
> But once most jobs have completed, the failed jobs UI may disappear; we can use different queues for these two kinds of jobs.
[jira] [Commented] (SPARK-27068) Support failed jobs ui and completed jobs ui use different queue
[ https://issues.apache.org/jira/browse/SPARK-27068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815005#comment-16815005 ]

zhoukang commented on SPARK-27068:
----------------------------------

For a long-running application like Spark SQL, showing failed jobs in a single table will keep more clues for later debugging [~shahid]

> Support failed jobs ui and completed jobs ui use different queue
>
> Key: SPARK-27068
> URL: https://issues.apache.org/jira/browse/SPARK-27068
[jira] [Assigned] (SPARK-27088) Apply conf "spark.sql.optimizer.planChangeLog.level" to batch plan change in RuleExecutor
[ https://issues.apache.org/jira/browse/SPARK-27088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-27088:
------------------------------------

Assignee: Chakravarthi

> Apply conf "spark.sql.optimizer.planChangeLog.level" to batch plan change in
> RuleExecutor
> -----------------------------------------------------------------------------
>
> Key: SPARK-27088
> URL: https://issues.apache.org/jira/browse/SPARK-27088
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Maryann Xue
> Assignee: Chakravarthi
> Priority: Minor
>
> Similar to SPARK-25415, which made the log level for plan changes by each rule configurable, we can make the log level for plan changes by each batch configurable too, and can reuse the same configuration: "spark.sql.optimizer.planChangeLog.level".
[jira] [Resolved] (SPARK-27088) Apply conf "spark.sql.optimizer.planChangeLog.level" to batch plan change in RuleExecutor
[ https://issues.apache.org/jira/browse/SPARK-27088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-27088.
----------------------------------

Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24136
[https://github.com/apache/spark/pull/24136]

> Apply conf "spark.sql.optimizer.planChangeLog.level" to batch plan change in
> RuleExecutor
>
> Key: SPARK-27088
> URL: https://issues.apache.org/jira/browse/SPARK-27088
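The per-batch logging being added can be sketched generically: after a whole batch of rules runs, log the before/after plans at a configurable level, just as SPARK-25415 did per rule. The rule executor below is illustrative only, not Spark's RuleExecutor, and plans are modeled as plain strings:

```python
import logging

log = logging.getLogger("planChangeLog")

def run_batch(batch_name, rules, plan, level=logging.INFO):
    """Apply a batch of rewrite rules; log only if the batch changed the plan."""
    before = plan
    for rule in rules:
        plan = rule(plan)
    if plan != before:
        # In the spirit of spark.sql.optimizer.planChangeLog.level:
        # the level is configurable, and unchanged batches stay silent.
        log.log(level, "Batch %s changed plan:\n  before: %s\n  after:  %s",
                batch_name, before, plan)
    return plan

# A toy constant-folding rule applied as a one-rule batch.
fold_constants = lambda p: p.replace("1 + 1", "2")
result = run_batch("Operator Optimization", [fold_constants], "Filter(1 + 1 = x)")
print(result)  # Filter(2 = x)
```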
[jira] [Updated] (SPARK-27394) The staleness of UI may last minutes or hours when no tasks start or finish
[ https://issues.apache.org/jira/browse/SPARK-27394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu updated SPARK-27394:
---------------------------------

Fix Version/s: 2.4.2

> The staleness of UI may last minutes or hours when no tasks start or finish
> ----------------------------------------------------------------------------
>
> Key: SPARK-27394
> URL: https://issues.apache.org/jira/browse/SPARK-27394
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 2.4.0, 2.4.1
> Reporter: Shixiong Zhu
> Assignee: Shixiong Zhu
> Priority: Major
> Fix For: 2.4.2, 3.0.0
>
> Run the following code on a cluster that has at least 2 cores.
> {code}
> sc.makeRDD(1 to 1000, 1000).foreach { i =>
>   Thread.sleep(30)
> }
> {code}
> The jobs page will show just one running task.
> This is because when the second task event calls "AppStatusListener.maybeUpdate" for a job, it will simply be ignored, since the gap between the two events is smaller than `spark.ui.liveUpdate.period`.
> After the second task event, in the above case, because there won't be any other task events, the Spark UI will always be stale until the next task event gets fired (after 300 seconds).
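The throttling behavior the issue describes can be sketched generically: an update arriving within the live-update period is dropped, so if no later event ever arrives, the displayed state stays stale indefinitely. This is an illustrative model, not Spark's AppStatusListener:

```python
class ThrottledUpdater:
    """Drops updates that arrive within `period` of the last applied one."""
    def __init__(self, period):
        self.period = period
        self.last_update = float("-inf")
        self.state = None

    def maybe_update(self, now, state):
        # Mirrors the described maybeUpdate logic: skip if inside the period.
        if now - self.last_update >= self.period:
            self.last_update = now
            self.state = state

u = ThrottledUpdater(period=100)
u.maybe_update(0, "1 running task")
u.maybe_update(50, "2 running tasks")  # dropped: arrives inside the period
print(u.state)  # 1 running task -- stale until another event fires later
```

This is exactly the failure mode above: with no third event, nothing ever refreshes the dropped state.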
[jira] [Updated] (SPARK-27337) QueryExecutionListener never cleans up listeners from the bus after SparkSession is cleared
[ https://issues.apache.org/jira/browse/SPARK-27337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoo Ganesh updated SPARK-27337:
---------------------------------

Summary: QueryExecutionListener never cleans up listeners from the bus after SparkSession is cleared (was: QueryExecutionListener never cleans up listeners from the bus after SparkSession is closed)

> QueryExecutionListener never cleans up listeners from the bus after
> SparkSession is cleared
>
> Key: SPARK-27337
> URL: https://issues.apache.org/jira/browse/SPARK-27337
[jira] [Commented] (SPARK-27337) QueryExecutionListener never cleans up listeners from the bus after SparkSession is cleared
[ https://issues.apache.org/jira/browse/SPARK-27337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814885#comment-16814885 ]

Vinoo Ganesh commented on SPARK-27337:
--------------------------------------

Clearing the Spark session* Updated the ticket title, and there is a thread on the Spark dev list about this too

> QueryExecutionListener never cleans up listeners from the bus after
> SparkSession is cleared
>
> Key: SPARK-27337
> URL: https://issues.apache.org/jira/browse/SPARK-27337
[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support
[ https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814883#comment-16814883 ]

Robert Joseph Evans commented on SPARK-27396:
---------------------------------------------

This SPIP has been up for 5 days and I see 10 people following it, but there has been no discussion since. Are you still digesting the proposal? Are you just busy with an upcoming conference and haven't had time to look at it? From the previous discussion it sounded like the core of what I am proposing is not that controversial, so I would like to move forward with it sooner rather than later, but I also want to give everyone time to understand it and ask questions.

Also, I am looking for someone to be the shepherd. I can technically do it, being a PMC member, but I have not been that active until recently, so to avoid any concerns I would prefer to find someone else to be the shepherd.

> SPIP: Public APIs for extended Columnar Processing Support
> ----------------------------------------------------------
>
> Key: SPARK-27396
> URL: https://issues.apache.org/jira/browse/SPARK-27396
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Robert Joseph Evans
> Priority: Major
>
> *Q1.* What are you trying to do? Articulate your objectives using absolutely no jargon.
>
> The Dataset/DataFrame API in Spark currently exposes data to users only one row at a time when processing it. The goals of this proposal are to
>
> # Expose to end users a new option of processing the data in a columnar format, multiple rows at a time, with the data organized into contiguous arrays in memory.
> # Make any transitions between the columnar memory layout and a row-based layout transparent to the end user.
> # Allow for simple data exchange with other systems, DL/ML libraries, pandas, etc. by having clean APIs to transform the columnar data into an Apache Arrow compatible layout.
> # Provide a plugin mechanism for columnar processing support so an advanced user could avoid data transitions between columnar and row-based processing even through shuffles. This means we should at least support pluggable APIs so an advanced end user can implement the columnar partitioning themselves, and provide the glue necessary to shuffle the data still in a columnar format.
> # Expose new APIs that allow advanced users or frameworks to implement columnar processing either as UDFs, or by adjusting the physical plan to do columnar processing. If the latter is too controversial we can move it to another SPIP, but we plan to implement some accelerated computing in parallel with this feature to be sure the APIs work, and without this feature that would be impossible.
>
> Not requirements, but things that would be nice to have:
> # Provide default implementations for partitioning columnar data, so users don't have to.
> # Transition the existing in-memory columnar layouts to be compatible with Apache Arrow. This would make the transformations to the Apache Arrow format a no-op. The existing formats are already very close to those layouts in many cases. This would not use the Apache Arrow Java library, but instead be compatible with the memory [layout|https://arrow.apache.org/docs/format/Layout.html] and possibly only a subset of that layout.
> # Provide a clean transition from the existing code to the new one. The existing APIs, which are public but evolving, are not that far off from what is being proposed. We should be able to create a new parallel API that can wrap the existing one. This means any file format that is trying to support columnar can still do so until we make a conscious decision to deprecate and then turn off the old APIs.
>
> *Q2.* What problem is this proposal NOT designed to solve?
>
> This is not trying to implement any of the processing itself in a columnar way, with the exception of examples for documentation, and possibly default implementations for partitioning of columnar shuffle.
>
> *Q3.* How is it done today, and what are the limits of current practice?
>
> The current columnar support is limited to 3 areas.
> # Input formats can optionally return a ColumnarBatch instead of rows. The code generation phase knows how to take that columnar data and iterate through it as rows for stages that want rows, which currently is almost everything. The limitations here are mostly implementation specific. The current standard is to abuse Scala's type erasure to return ColumnarBatches as the elements of an RDD[InternalRow]. The code generation can handle this because it is generating Java code, so it bypasses Scala's type checking and just casts the InternalRow to the desired ColumnarBatch. This makes it difficult for others to
[jira] [Updated] (SPARK-27421) RuntimeException when querying a view on a partitioned parquet table
[ https://issues.apache.org/jira/browse/SPARK-27421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Maynard updated SPARK-27421: - Affects Version/s: 2.4.1 > RuntimeException when querying a view on a partitioned parquet table > > > Key: SPARK-27421 > URL: https://issues.apache.org/jira/browse/SPARK-27421 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 2.4.1 > Environment: Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit > Server VM, Java 1.8.0_141) >Reporter: Eric Maynard >Priority: Minor > > When running a simple query, I get the following stacktrace: > {code} > java.lang.RuntimeException: Caught Hive MetaException attempting to get > partition metadata by filter from Hive. You can set the Spark configuration > setting spark.sql.hive.manageFilesourcePartitions to false to work around > this problem, however this will result in degraded performance. Please report > a bug: https://issues.apache.org/jira/browse/SPARK > at > org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:772) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:686) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:684) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:221) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:220) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:266) > at > org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:684) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1268) > at > 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1261) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99) > at > org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1261) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:262) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:957) > at > org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73) > at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:63) > at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:27) > at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:26) > at > 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84) > at > scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) > at > scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) > at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76) > at scala.collection.immutable.List.foreach(List.scala:392) > at >
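The exception text above names its own workaround. As a hedged sketch only (the application jar name is hypothetical, and the message itself warns this trades away performance), the setting would be supplied when the application is launched, e.g.:

```shell
# Sketch: disable Hive filesource partition management, as suggested by the
# error message. Set it at launch time so it is in effect before the
# SparkSession is created.
spark-submit \
  --conf spark.sql.hive.manageFilesourcePartitions=false \
  your-application.jar   # hypothetical application jar
```

This is a configuration fragment, not a fix: partition pruning will no longer be pushed to the metastore, so queries may list all partitions.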
[jira] [Commented] (SPARK-27421) RuntimeException when querying a view on a partitioned parquet table
[ https://issues.apache.org/jira/browse/SPARK-27421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814872#comment-16814872 ] Eric Maynard commented on SPARK-27421: -- [~shivuson...@gmail.com] Any hiccup confirming the issue with the 3 lines in the Jira? I am able to replicate this pretty reliably on 2.4.0 > RuntimeException when querying a view on a partitioned parquet table > > > Key: SPARK-27421 > URL: https://issues.apache.org/jira/browse/SPARK-27421 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit > Server VM, Java 1.8.0_141) >Reporter: Eric Maynard >Priority: Minor > > When running a simple query, I get the following stacktrace: > {code} > java.lang.RuntimeException: Caught Hive MetaException attempting to get > partition metadata by filter from Hive. You can set the Spark configuration > setting spark.sql.hive.manageFilesourcePartitions to false to work around > this problem, however this will result in degraded performance. 
Please report > a bug: https://issues.apache.org/jira/browse/SPARK > at > org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:772) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:686) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:684) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:221) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:220) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:266) > at > org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:684) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1268) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1261) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99) > at > org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1261) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:262) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:957) > at > org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73) > at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:63) > at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:27) > at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:26) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84) > at > scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) > at > scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) > at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84) > at >
[jira] [Commented] (SPARK-25079) [PYTHON] upgrade python 3.4 -> 3.6
[ https://issues.apache.org/jira/browse/SPARK-25079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814762#comment-16814762 ] shane knapp commented on SPARK-25079: - pyarrow 0.12.1 is out, i've updated my conda test environment and am currently waiting for the PRB to finish: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/104490 > [PYTHON] upgrade python 3.4 -> 3.6 > -- > > Key: SPARK-25079 > URL: https://issues.apache.org/jira/browse/SPARK-25079 > Project: Spark > Issue Type: Improvement > Components: Build, PySpark >Affects Versions: 2.3.1 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > > for the impending arrow upgrade > (https://issues.apache.org/jira/browse/SPARK-23874) we need to bump python > 3.4 -> 3.5. > i have been testing this here: > [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/|https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69] > my methodology: > 1) upgrade python + arrow to 3.5 and 0.10.0 > 2) run python tests > 3) when i'm happy that Things Won't Explode Spectacularly, pause jenkins and > upgrade centos workers to python3.5 > 4) simultaneously do the following: > - create a symlink in /home/anaconda/envs/py3k/bin for python3.4 that > points to python3.5 (this is currently being tested here: > [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69)] > - push a change to python/run-tests.py replacing 3.4 with 3.5 > 5) once the python3.5 change to run-tests.py is merged, we will need to > back-port this to all existing branches > 6) then and only then can i remove the python3.4 -> python3.5 symlink -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
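The symlink trick in step 4 of the methodology above can be sketched as follows (the temp directory is a stand-in for the real /home/anaconda/envs/py3k/bin, used here only so the sketch is self-contained):

```shell
# Sketch of step 4: point the old interpreter name at the new one, so jobs
# still invoking "python3.4" transparently pick up python3.5.
BIN_DIR="$(mktemp -d)"             # stand-in for /home/anaconda/envs/py3k/bin
touch "$BIN_DIR/python3.5"         # stand-in for the real python3.5 binary
chmod +x "$BIN_DIR/python3.5"
ln -s "$BIN_DIR/python3.5" "$BIN_DIR/python3.4"
readlink "$BIN_DIR/python3.4"      # resolves to the python3.5 path
```

Once run-tests.py no longer references 3.4 anywhere (step 5), the symlink can be removed (step 6).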
[jira] [Commented] (SPARK-26218) Throw exception on overflow for integers
[ https://issues.apache.org/jira/browse/SPARK-26218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814709#comment-16814709 ] Reynold Xin commented on SPARK-26218: - The no-exception is by design. Imagine you have an ETL job that runs for hours, and then it suddenly throws an exception because one row overflows ... > Throw exception on overflow for integers > > > Key: SPARK-26218 > URL: https://issues.apache.org/jira/browse/SPARK-26218 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Marco Gaido >Priority: Major > > SPARK-24598 just updated the documentation in order to state that our > addition is a Java style one and not a SQL style. But in order to follow the > SQL standard we should instead throw an exception if an overflow occurs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
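For context, the "Java style" addition mentioned in the description means 32-bit arithmetic that silently wraps on overflow rather than raising, which is exactly the long-running-ETL concern above. A minimal Python sketch of that wrapping behaviour (the helper `int32_add` is illustrative, not Spark code):

```python
def int32_add(a: int, b: int) -> int:
    """Add two ints with Java/Spark-style 32-bit wraparound semantics."""
    s = (a + b) & 0xFFFFFFFF                            # keep the low 32 bits
    return s - 0x100000000 if s >= 0x80000000 else s    # reinterpret as signed

INT_MAX = 2**31 - 1                                     # 2147483647
print(int32_add(INT_MAX, 1))   # wraps to -2147483648 instead of raising
```

Following the SQL standard instead would mean raising an error at the `INT_MAX + 1` step rather than returning the wrapped value.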
[jira] [Created] (SPARK-27435) Support schema pruning in file source V2
Gengliang Wang created SPARK-27435: -- Summary: Support schema pruning in file source V2 Key: SPARK-27435 URL: https://issues.apache.org/jira/browse/SPARK-27435 Project: Spark Issue Type: Task Components: SQL Affects Versions: 3.0.0 Reporter: Gengliang Wang Currently, the optimization rule `SchemaPruning` only works for Parquet/Orc V1. We should have the same optimization in file source V2. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27419) When setting spark.executor.heartbeatInterval to a value less than 1 seconds, it will always fail
[ https://issues.apache.org/jira/browse/SPARK-27419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-27419. -- Resolution: Fixed Fix Version/s: 2.4.2 > When setting spark.executor.heartbeatInterval to a value less than 1 seconds, > it will always fail > - > > Key: SPARK-27419 > URL: https://issues.apache.org/jira/browse/SPARK-27419 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 2.4.1 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Major > Fix For: 2.4.2 > > > When setting spark.executor.heartbeatInterval to a value less than 1 seconds > in branch-2.4, it will always fail because the value will be converted to 0 > and the heartbeat will always timeout and finally kill the executor. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
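The failure mode described above is plain integer truncation when a millisecond interval is converted to whole seconds. A hedged sketch (the helper below mirrors the behaviour described in the ticket, not the actual branch-2.4 code path):

```python
def to_whole_seconds(interval_ms: int) -> int:
    """Convert a millisecond interval to seconds with integer truncation."""
    return interval_ms // 1000

# Any heartbeat interval under 1000 ms truncates to 0, so every heartbeat
# is immediately considered timed out and the executor eventually gets killed.
print(to_whole_seconds(500))   # 0
print(to_whole_seconds(1500))  # 1
```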
[jira] [Updated] (SPARK-27434) memory leak in spark driver
[ https://issues.apache.org/jira/browse/SPARK-27434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryne Yang updated SPARK-27434: -- Description: we got a OOM exception on the driver after driver has completed multiple jobs(we are reusing spark context). so we took a heap dump and looked at the leak analysis, found out that under AsyncEventQueue there are 3.5GB of heap allocated. Possibly a leak. can someone take a look at? here is the heap analysis: !Screen Shot 2019-04-10 at 12.11.35 PM.png! was: we got a OOM exception on the driver after driver has completed multiple jobs(we are reusing spark context). so we took a heap dump and looked at the leak analysis, found out that under AsyncEventQueue there are 3.5GB of heap allocated. Possibly a leak. can someone take a look at? here is the heap analysis: !image-2019-04-10-12-24-03-184.png! > memory leak in spark driver > --- > > Key: SPARK-27434 > URL: https://issues.apache.org/jira/browse/SPARK-27434 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 > Environment: OS: Centos 7 > JVM: > **_openjdk version "1.8.0_201"_ > _OpenJDK Runtime Environment (IcedTea 3.11.0) (Alpine 8.201.08-r0)_ > _OpenJDK 64-Bit Server VM (build 25.201-b08, mixed mode)_ > Spark version: 2.4.0 >Reporter: Ryne Yang >Priority: Major > Attachments: Screen Shot 2019-04-10 at 12.11.35 PM.png > > > we got a OOM exception on the driver after driver has completed multiple > jobs(we are reusing spark context). > so we took a heap dump and looked at the leak analysis, found out that under > AsyncEventQueue there are 3.5GB of heap allocated. Possibly a leak. > > can someone take a look at? > here is the heap analysis: > !Screen Shot 2019-04-10 at 12.11.35 PM.png! > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27434) memory leak in spark driver
[ https://issues.apache.org/jira/browse/SPARK-27434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryne Yang updated SPARK-27434: -- Attachment: Screen Shot 2019-04-10 at 12.11.35 PM.png > memory leak in spark driver > --- > > Key: SPARK-27434 > URL: https://issues.apache.org/jira/browse/SPARK-27434 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 > Environment: OS: Centos 7 > JVM: > **_openjdk version "1.8.0_201"_ > _OpenJDK Runtime Environment (IcedTea 3.11.0) (Alpine 8.201.08-r0)_ > _OpenJDK 64-Bit Server VM (build 25.201-b08, mixed mode)_ > Spark version: 2.4.0 >Reporter: Ryne Yang >Priority: Major > Attachments: Screen Shot 2019-04-10 at 12.11.35 PM.png > > > we got a OOM exception on the driver after driver has completed multiple > jobs(we are reusing spark context). > so we took a heap dump and looked at the leak analysis, found out that under > AsyncEventQueue there are 3.5GB of heap allocated. Possibly a leak. > > can someone take a look at? > here is the heap analysis: > !image-2019-04-10-12-24-03-184.png! > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27434) memory leak in spark driver
Ryne Yang created SPARK-27434: - Summary: memory leak in spark driver Key: SPARK-27434 URL: https://issues.apache.org/jira/browse/SPARK-27434 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.0 Environment: OS: Centos 7 JVM: **_openjdk version "1.8.0_201"_ _OpenJDK Runtime Environment (IcedTea 3.11.0) (Alpine 8.201.08-r0)_ _OpenJDK 64-Bit Server VM (build 25.201-b08, mixed mode)_ Spark version: 2.4.0 Reporter: Ryne Yang We got an OOM exception on the driver after it had completed multiple jobs (we are reusing the Spark context). So we took a heap dump and ran leak analysis, which found that 3.5 GB of heap is allocated under AsyncEventQueue. Possibly a leak. Can someone take a look? Here is the heap analysis: !image-2019-04-10-12-24-03-184.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
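One plausible mechanism for this kind of growth (purely illustrative; the `ListenerBus` class below is a toy, not Spark's AsyncEventQueue) is a long-lived event bus accumulating listeners that short-lived jobs or sessions register but never remove:

```python
class ListenerBus:
    """Toy event bus: listeners that are registered but never removed
    accumulate for the lifetime of the bus."""
    def __init__(self):
        self.listeners = []

    def add(self, listener):
        self.listeners.append(listener)

def run_job(bus):
    # Each job registers a listener; nothing ever removes it, so the bus
    # retains the listener (and everything it references) indefinitely.
    bus.add(lambda event: None)

bus = ListenerBus()          # lives as long as the reused context
for _ in range(1000):
    run_job(bus)
print(len(bus.listeners))    # 1000 listeners retained after all jobs finish
```

If each retained listener holds on to sizable per-job state, heap usage grows with every job even though no job is running.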
[jira] [Created] (SPARK-27433) Spark Structured Streaming left outer joins returns outer nulls for already matched rows
Binit created SPARK-27433: - Summary: Spark Structured Streaming left outer joins returns outer nulls for already matched rows Key: SPARK-27433 URL: https://issues.apache.org/jira/browse/SPARK-27433 Project: Spark Issue Type: Question Components: Structured Streaming Affects Versions: 2.3.0 Reporter: Binit I'm basically using the example given in Spark's documentation here: [https://spark.apache.org/docs/2.3.0/structured-streaming-programming-guide.html#outer-joins-with-watermarking] with the built-in rate test stream, in which one stream is ahead by 3 seconds (I was originally using Kafka but ran into the same issue). The matched columns are returned correctly; however, after a while the same key is returned again with outer nulls. Is this the expected behavior? Is there a way to exclude the duplicate outer-null results when there was a match?

Code:

{code}
val testStream = session.readStream.format("rate")
  .option("rowsPerSecond", "5").option("numPartitions", "1").load()

val impressions = testStream
  .select(
    (col("value") + 15).as("impressionAdId"),
    col("timestamp").as("impressionTime"))

val clicks = testStream
  .select(
    col("value").as("clickAdId"),
    col("timestamp").as("clickTime"))

// Apply watermarks on event-time columns
val impressionsWithWatermark = impressions.withWatermark("impressionTime", "20 seconds")
val clicksWithWatermark = clicks.withWatermark("clickTime", "30 seconds")

// Join with event-time constraints
val result = impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 10 seconds
    """),
  joinType = "leftOuter" // can be "inner", "leftOuter", "rightOuter"
)

val query = result.writeStream.outputMode("update").format("console").option("truncate", false).start()
query.awaitTermination()
{code}

Result:

{code}
--- Batch: 19 ---
+--------------+-----------------------+---------+-----------------------+
|impressionAdId|impressionTime         |clickAdId|clickTime              |
+--------------+-----------------------+---------+-----------------------+
|100           |2018-05-23 22:18:38.362|100      |2018-05-23 22:18:41.362|
|101           |2018-05-23 22:18:38.562|101      |2018-05-23 22:18:41.562|
|102           |2018-05-23 22:18:38.762|102      |2018-05-23 22:18:41.762|
|103           |2018-05-23 22:18:38.962|103      |2018-05-23 22:18:41.962|
|104           |2018-05-23 22:18:39.162|104      |2018-05-23 22:18:42.162|
+--------------+-----------------------+---------+-----------------------+

--- Batch: 57 ---
+--------------+-----------------------+---------+-----------------------+
|impressionAdId|impressionTime         |clickAdId|clickTime              |
+--------------+-----------------------+---------+-----------------------+
|290           |2018-05-23 22:19:16.362|290      |2018-05-23 22:19:19.362|
|291           |2018-05-23 22:19:16.562|291      |2018-05-23 22:19:19.562|
|292           |2018-05-23 22:19:16.762|292      |2018-05-23 22:19:19.762|
|293           |2018-05-23 22:19:16.962|293      |2018-05-23 22:19:19.962|
|294           |2018-05-23 22:19:17.162|294      |2018-05-23 22:19:20.162|
|100           |2018-05-23 22:18:38.362|null     |null                   |
|99            |2018-05-23 22:18:38.162|null     |null                   |
|103           |2018-05-23 22:18:38.962|null     |null                   |
|101           |2018-05-23 22:18:38.562|null     |null                   |
|102           |2018-05-23 22:18:38.762|null     |null                   |
+--------------+-----------------------+---------+-----------------------+
{code}

This question is also asked on Stack Overflow: [https://stackoverflow.com/questions/50500111/spark-structured-streaming-left-outer-joins-returns-outer-nulls-for-already-matc/55616902#55616902]

101 and 103 have already appeared as matches in the join, but they still come back as outer-null rows in the left outer join.

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
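For reference, in a plain (non-streaming) left outer join a key that found a match never additionally produces a null row, which is why 101 and 103 reappearing with nulls in the output above looks wrong. A small Python sketch of the expected semantics (toy data, not the streaming implementation):

```python
def left_outer_join(left, right):
    """Plain left outer join on key: matched keys yield matched rows only;
    a key gets a single null row only when it has no match at all."""
    right_by_key = {}
    for k, v in right:
        right_by_key.setdefault(k, []).append(v)
    out = []
    for k, v in left:
        if k in right_by_key:
            out.extend((k, v, rv) for rv in right_by_key[k])  # matched rows
        else:
            out.append((k, v, None))  # only unmatched keys get a null row
    return out

impressions = [(100, "i100"), (101, "i101"), (999, "i999")]
clicks = [(100, "c100"), (101, "c101")]
print(left_outer_join(impressions, clicks))
# 100 and 101 appear only as matches; only the unmatched 999 gets nulls
```

In the streaming case the outer-null row is emitted once the watermark passes, so emitting it for a key that already matched is the surprising part.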
[jira] [Updated] (SPARK-27432) Spark job stuck when no jobs/stages are pending
[ https://issues.apache.org/jira/browse/SPARK-27432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajashekar updated SPARK-27432: --- Attachment: jobs.pdf stages.pdf > Spark job stuck when no jobs/stages are pending > --- > > Key: SPARK-27432 > URL: https://issues.apache.org/jira/browse/SPARK-27432 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Submit >Affects Versions: 2.2.1 >Reporter: Rajashekar >Priority: Major > Attachments: jobs.pdf, stages.pdf > > > Spark job is never ending and when i check in spark UI there are no pending > jobs and no pending stages. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27432) Spark job stuck when no jobs/stages are pending
Rajashekar created SPARK-27432: -- Summary: Spark job stuck when no jobs/stages are pending Key: SPARK-27432 URL: https://issues.apache.org/jira/browse/SPARK-27432 Project: Spark Issue Type: Bug Components: Spark Core, Spark Submit Affects Versions: 2.2.1 Reporter: Rajashekar The Spark job never ends, and when I check the Spark UI there are no pending jobs and no pending stages. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27423) Cast DATE to/from TIMESTAMP according to SQL standard
[ https://issues.apache.org/jira/browse/SPARK-27423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-27423: --- Assignee: Maxim Gekk > Cast DATE to/from TIMESTAMP according to SQL standard > - > > Key: SPARK-27423 > URL: https://issues.apache.org/jira/browse/SPARK-27423 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > > According to SQL standard, DATE is union of (year, month, day). To convert it > to Spark's TIMESTAMP which is TIMESTAMP WITH TIME ZONE, the date should be > extended by time at midnight - (year, month, day, hour = 0, minute = 0, > seconds = 0). The former timestamp should be considered as a timestamp at the > session time zone, and transformed to microseconds since epoch in UTC. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27423) Cast DATE to/from TIMESTAMP according to SQL standard
[ https://issues.apache.org/jira/browse/SPARK-27423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-27423. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24332 [https://github.com/apache/spark/pull/24332] > Cast DATE to/from TIMESTAMP according to SQL standard > - > > Key: SPARK-27423 > URL: https://issues.apache.org/jira/browse/SPARK-27423 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > According to SQL standard, DATE is union of (year, month, day). To convert it > to Spark's TIMESTAMP which is TIMESTAMP WITH TIME ZONE, the date should be > extended by time at midnight - (year, month, day, hour = 0, minute = 0, > seconds = 0). The former timestamp should be considered as a timestamp at the > session time zone, and transformed to microseconds since epoch in UTC. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
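The cast rule in the description above can be sanity-checked with a short sketch: extend the date with midnight in the session time zone, then express that instant as microseconds since the Unix epoch in UTC (the session zones below are arbitrary examples, and this is Python standing in for Spark's internal TIMESTAMP representation):

```python
from datetime import date, datetime
from zoneinfo import ZoneInfo

def date_to_timestamp_micros(d: date, session_tz: str) -> int:
    """DATE -> TIMESTAMP per the rule above: (year, month, day) extended with
    time at midnight in the session zone, then microseconds since epoch in UTC."""
    midnight = datetime(d.year, d.month, d.day, tzinfo=ZoneInfo(session_tz))
    return int(midnight.timestamp() * 1_000_000)

# Midnight 1970-01-02 in UTC is exactly one day (86400 s) after the epoch.
print(date_to_timestamp_micros(date(1970, 1, 2), "UTC"))  # 86400000000
```

Note that the same DATE maps to different microsecond values under different session zones, which is exactly why the standard's rule has to name the session time zone.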
[jira] [Assigned] (SPARK-27422) CurrentDate should return local date
[ https://issues.apache.org/jira/browse/SPARK-27422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-27422: --- Assignee: Maxim Gekk > CurrentDate should return local date > > > Key: SPARK-27422 > URL: https://issues.apache.org/jira/browse/SPARK-27422 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > > According to SQL standard, DATE type is union of (year, month, days), and > current date should return a triple of (year, month, days) in session local > time zone. The ticket aims to follow the requirement, and calculate a local > date for session time zone. The local date should be converted to epoch day, > and stored internally in as DATE value. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27422) CurrentDate should return local date
[ https://issues.apache.org/jira/browse/SPARK-27422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-27422. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24330 [https://github.com/apache/spark/pull/24330] > CurrentDate should return local date > > > Key: SPARK-27422 > URL: https://issues.apache.org/jira/browse/SPARK-27422 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > According to SQL standard, DATE type is union of (year, month, days), and > current date should return a triple of (year, month, days) in session local > time zone. The ticket aims to follow the requirement, and calculate a local > date for session time zone. The local date should be converted to epoch day, > and stored internally in as DATE value. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
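The "epoch day" storage mentioned in the description above is simply the count of days since 1970-01-01. A sketch (Python standing in for the internal DATE representation):

```python
from datetime import date

def to_epoch_day(d: date) -> int:
    """Local date stored internally as days since the Unix epoch (1970-01-01)."""
    return d.toordinal() - date(1970, 1, 1).toordinal()

print(to_epoch_day(date(1970, 1, 1)))   # 0
print(to_epoch_day(date(2019, 4, 10)))  # 17996
```

Because the stored value is a whole day, CurrentDate only needs the session time zone at evaluation time, to decide which local date "today" is.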
[jira] [Resolved] (SPARK-27412) Add a new shuffle manager to use Persistent Memory as shuffle and spilling storage
[ https://issues.apache.org/jira/browse/SPARK-27412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-27412. --- Resolution: Won't Fix > Add a new shuffle manager to use Persistent Memory as shuffle and spilling > storage > -- > > Key: SPARK-27412 > URL: https://issues.apache.org/jira/browse/SPARK-27412 > Project: Spark > Issue Type: New Feature > Components: Shuffle, Spark Core >Affects Versions: 3.0.0 >Reporter: Chendi.Xue >Priority: Minor > Labels: core > Attachments: PmemShuffleManager-DesignDoc.pdf > > > Add a new shuffle manager called "PmemShuffleManager", by using which, we can > use Persistent Memory Device as storage for shuffle and external sorter > spilling. > In this implementation, we leveraged Persistent Memory Development Kit(PMDK) > to support transaction write with high performance. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27431) move HashedRelation to global UnifiedMemoryManager and enable offheap
Xiaoju Wu created SPARK-27431: - Summary: move HashedRelation to global UnifiedMemoryManager and enable offheap Key: SPARK-27431 URL: https://issues.apache.org/jira/browse/SPARK-27431 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Xiaoju Wu Why is HashedRelation currently managed by a newly created MemoryManager, with off-heap disabled? Can we improve this part?

{code}
/**
 * Create a HashedRelation from an Iterator of InternalRow.
 */
def apply(
    input: Iterator[InternalRow],
    key: Seq[Expression],
    sizeEstimate: Int = 64,
    taskMemoryManager: TaskMemoryManager = null): HashedRelation = {
  val mm = Option(taskMemoryManager).getOrElse {
    new TaskMemoryManager(
      new UnifiedMemoryManager(
        new SparkConf().set(MEMORY_OFFHEAP_ENABLED.key, "false"),
        Long.MaxValue,
        Long.MaxValue / 2,
        1),
      0)
  }
  if (key.length == 1 && key.head.dataType == LongType) {
    LongHashedRelation(input, key, sizeEstimate, mm)
  } else {
    UnsafeHashedRelation(input, key, sizeEstimate, mm)
  }
}
{code}

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27428) Test "metrics StatsD sink with Timer " fails on BigEndian
[ https://issues.apache.org/jira/browse/SPARK-27428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anuja Jakhade updated SPARK-27428: -- Description: Test case "metrics StatsD sink with Timer *** FAILED ***" fails with error java.net.SocketTimeoutException: Receive timed out at java.net.AbstractPlainDatagramSocketImpl.receive(AbstractPlainDatagramSocketImpl.java:143) at java.net.DatagramSocket.receive(DatagramSocket.java:812) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4$$anonfun$apply$3.apply(StatsdSinkSuite.scala:155) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4$$anonfun$apply$3.apply(StatsdSinkSuite.scala:154) at scala.collection.immutable.Range.foreach(Range.scala:160) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4.apply(StatsdSinkSuite.scala:154) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4.apply(StatsdSinkSuite.scala:123) at org.apache.spark.metrics.sink.StatsdSinkSuite.org$apache$spark$metrics$sink$StatsdSinkSuite$$withSocketAndSink(StatsdSinkSuite.scala:51) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4.apply$mcV$sp(StatsdSinkSuite.scala:123) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4.apply(StatsdSinkSuite.scala:123) On debugging observed that the last packet is not received at "socket.receive(p)". Hence the assert fails. Also I want to know, which feature of Apache Spark is tested in this test. 
was: Test case "metrics StatsD sink with Timer *** FAILED ***" fails with error java.net.SocketTimeoutException: Receive timed out at java.net.AbstractPlainDatagramSocketImpl.receive(AbstractPlainDatagramSocketImpl.java:143) at java.net.DatagramSocket.receive(DatagramSocket.java:812) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4$$anonfun$apply$3.apply(StatsdSinkSuite.scala:155) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4$$anonfun$apply$3.apply(StatsdSinkSuite.scala:154) at scala.collection.immutable.Range.foreach(Range.scala:160) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4.apply(StatsdSinkSuite.scala:154) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4.apply(StatsdSinkSuite.scala:123) at org.apache.spark.metrics.sink.StatsdSinkSuite.org$apache$spark$metrics$sink$StatsdSinkSuite$$withSocketAndSink(StatsdSinkSuite.scala:51) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4.apply$mcV$sp(StatsdSinkSuite.scala:123) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4.apply(StatsdSinkSuite.scala:123) On debugging observed that the last packet is not received at "socket.receive(p)". Hence the assert fails. Also I want to know, which feature of Apache Spark is tested in this this test. 
> Test "metrics StatsD sink with Timer " fails on BigEndian > - > > Key: SPARK-27428 > URL: https://issues.apache.org/jira/browse/SPARK-27428 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2, 2.3.3, 2.3.4 > Environment: Working on Ubuntu16.04, Linux > Java versions : > Eclipse OpenJ9 VM (build openj9-0.12.1, JRE 1.8.0 Linux s390x-64-Bit > Compressed References 20190205_218 (JIT enabled, AOT enabled) > and > openjdk version "1.8.0_191" > OpenJDK Runtime Environment (build 1.8.0_191-8u191-b12-2ubuntu0.18.04.1-b12) > OpenJDK 64-Bit Zero VM (build 25.191-b12, interpreted mode) >Reporter: Anuja Jakhade >Priority: Major > > Test case "metrics StatsD sink with Timer *** FAILED ***" fails with error > java.net.SocketTimeoutException: Receive timed out > at > java.net.AbstractPlainDatagramSocketImpl.receive(AbstractPlainDatagramSocketImpl.java:143) > at java.net.DatagramSocket.receive(DatagramSocket.java:812) > at > org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4$$anonfun$apply$3.apply(StatsdSinkSuite.scala:155) > at > org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4$$anonfun$apply$3.apply(StatsdSinkSuite.scala:154) > at scala.collection.immutable.Range.foreach(Range.scala:160) > at > org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4.apply(StatsdSinkSuite.scala:154) > at > org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4.apply(StatsdSinkSuite.scala:123) > at > org.apache.spark.metrics.sink.StatsdSinkSuite.org$apache$spark$metrics$sink$StatsdSinkSuite$$withSocketAndSink(StatsdSinkSuite.scala:51) > at > org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4.apply$mcV$sp(StatsdSinkSuite.scala:123) > at >
[jira] [Updated] (SPARK-27428) Test "metrics StatsD sink with Timer " fails on BigEndian
[ https://issues.apache.org/jira/browse/SPARK-27428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anuja Jakhade updated SPARK-27428: -- Description: Test case "metrics StatsD sink with Timer *** FAILED ***" fails with error ```java.net.SocketTimeoutException: Receive timed out at java.net.AbstractPlainDatagramSocketImpl.receive(AbstractPlainDatagramSocketImpl.java:143) at java.net.DatagramSocket.receive(DatagramSocket.java:812) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4$$anonfun$apply$3.apply(StatsdSinkSuite.scala:155) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4$$anonfun$apply$3.apply(StatsdSinkSuite.scala:154) at scala.collection.immutable.Range.foreach(Range.scala:160) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4.apply(StatsdSinkSuite.scala:154) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4.apply(StatsdSinkSuite.scala:123) at org.apache.spark.metrics.sink.StatsdSinkSuite.org$apache$spark$metrics$sink$StatsdSinkSuite$$withSocketAndSink(StatsdSinkSuite.scala:51) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4.apply$mcV$sp(StatsdSinkSuite.scala:123) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4.apply(StatsdSinkSuite.scala:123) On debugging observed that the last packet is not received at "socket.receive(p)". Hence the assert fails. ``` Also I want to know, which feature of Apache Spark is tested in this this test. 
was: Test case "metrics StatsD sink with Timer *** FAILED ***" fails with error java.net.SocketTimeoutException: Receive timed out at java.net.AbstractPlainDatagramSocketImpl.receive(AbstractPlainDatagramSocketImpl.java:143) at java.net.DatagramSocket.receive(DatagramSocket.java:812) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4$$anonfun$apply$3.apply(StatsdSinkSuite.scala:155) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4$$anonfun$apply$3.apply(StatsdSinkSuite.scala:154) at scala.collection.immutable.Range.foreach(Range.scala:160) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4.apply(StatsdSinkSuite.scala:154) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4.apply(StatsdSinkSuite.scala:123) at org.apache.spark.metrics.sink.StatsdSinkSuite.org$apache$spark$metrics$sink$StatsdSinkSuite$$withSocketAndSink(StatsdSinkSuite.scala:51) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4.apply$mcV$sp(StatsdSinkSuite.scala:123) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4.apply(StatsdSinkSuite.scala:123) On debugging observed that the last packet is not received at "socket.receive(p)". Hence the assert fails. Also I want to know, which feature of Apache Spark is tested in this test. 
> Test "metrics StatsD sink with Timer " fails on BigEndian > - > > Key: SPARK-27428 > URL: https://issues.apache.org/jira/browse/SPARK-27428 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2, 2.3.3, 2.3.4 > Environment: Working on Ubuntu16.04, Linux > Java versions : > Eclipse OpenJ9 VM (build openj9-0.12.1, JRE 1.8.0 Linux s390x-64-Bit > Compressed References 20190205_218 (JIT enabled, AOT enabled) > and > openjdk version "1.8.0_191" > OpenJDK Runtime Environment (build 1.8.0_191-8u191-b12-2ubuntu0.18.04.1-b12) > OpenJDK 64-Bit Zero VM (build 25.191-b12, interpreted mode) >Reporter: Anuja Jakhade >Priority: Major > > Test case "metrics StatsD sink with Timer *** FAILED ***" fails with error > ```java.net.SocketTimeoutException: Receive timed out > at > java.net.AbstractPlainDatagramSocketImpl.receive(AbstractPlainDatagramSocketImpl.java:143) > at java.net.DatagramSocket.receive(DatagramSocket.java:812) > at > org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4$$anonfun$apply$3.apply(StatsdSinkSuite.scala:155) > at > org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4$$anonfun$apply$3.apply(StatsdSinkSuite.scala:154) > at scala.collection.immutable.Range.foreach(Range.scala:160) > at > org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4.apply(StatsdSinkSuite.scala:154) > at > org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4.apply(StatsdSinkSuite.scala:123) > at > org.apache.spark.metrics.sink.StatsdSinkSuite.org$apache$spark$metrics$sink$StatsdSinkSuite$$withSocketAndSink(StatsdSinkSuite.scala:51) > at > org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4.apply$mcV$sp(StatsdSinkSuite.scala:123) > at >
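The SocketTimeoutException in the report above comes from a plain UDP receive with a socket timeout expiring before the expected datagram arrives. A minimal Python sketch of that failure mode, independent of Spark (the 200 ms timeout here is an illustrative value, not the suite's actual setting):

```python
import socket

# A UDP socket with a receive timeout raises socket.timeout when no
# datagram arrives in time -- the Java analogue is SocketTimeoutException.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("127.0.0.1", 0))   # bind to any free local port
sock.settimeout(0.2)          # 200 ms receive timeout (illustrative)

try:
    data, addr = sock.recvfrom(1024)  # nothing is ever sent, so this times out
    received = data
except socket.timeout:
    received = None               # the branch the failing assertion corresponds to
finally:
    sock.close()

print(received)  # None: the "last packet" never arrived
```

If the StatsD sink really did emit all expected packets, the receive would return before the timeout; the test failure indicates the last datagram is missing or late on that platform.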
[jira] [Updated] (SPARK-27430) BroadcastNestedLoopJoinExec can build any side whatever the join type is
[ https://issues.apache.org/jira/browse/SPARK-27430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-27430: Summary: BroadcastNestedLoopJoinExec can build any side whatever the join type is (was: BroadcastNestedLoopJoinExec should support all join types) > BroadcastNestedLoopJoinExec can build any side whatever the join type is > > > Key: SPARK-27430 > URL: https://issues.apache.org/jira/browse/SPARK-27430 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27430) BroadcastNestedLoopJoinExec should support all join types
Wenchen Fan created SPARK-27430: --- Summary: BroadcastNestedLoopJoinExec should support all join types Key: SPARK-27430 URL: https://issues.apache.org/jira/browse/SPARK-27430 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27429) [SQL] to_timestamp function with additional argument flag that will allow exception if value could not be cast
t oo created SPARK-27429: Summary: [SQL] to_timestamp function with additional argument flag that will allow exception if value could not be cast Key: SPARK-27429 URL: https://issues.apache.org/jira/browse/SPARK-27429 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.1 Reporter: t oo If I am running SQL on a CSV-based dataframe and my query contains to_timestamp(input_col, 'yyyy-MM-dd HH:mm:ss'), and the values in input_col are not really timestamps (e.g. 'ABC'), then I would like the to_timestamp function to throw an exception rather than happily (silently) return the values as null. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
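The request above asks for a strict mode on top of the current permissive behavior (unparseable input silently becomes NULL). A minimal Python sketch of the two modes, using datetime.strptime as a stand-in parser; the `strict` flag is a hypothetical name for illustration, not an existing Spark API:

```python
from datetime import datetime

def to_timestamp(value, fmt="%Y-%m-%d %H:%M:%S", strict=False):
    """Parse value as a timestamp. Return None on failure (today's
    silent behavior), or re-raise if strict=True (the requested mode)."""
    try:
        return datetime.strptime(value, fmt)
    except (ValueError, TypeError):
        if strict:
            raise
        return None

print(to_timestamp("2019-04-11 10:30:00"))  # parses in both modes
print(to_timestamp("ABC"))                  # None: the silent null the report complains about
```

With `strict=True`, `to_timestamp("ABC")` would raise a ValueError instead, surfacing bad data at parse time rather than as unexplained nulls downstream.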
[jira] [Resolved] (SPARK-26012) Dynamic partition will fail when both '' and null values are taken as dynamic partition values simultaneously.
[ https://issues.apache.org/jira/browse/SPARK-26012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-26012. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24334 [https://github.com/apache/spark/pull/24334] > Dynamic partition will fail when both '' and null values are taken as dynamic > partition values simultaneously. > -- > > Key: SPARK-26012 > URL: https://issues.apache.org/jira/browse/SPARK-26012 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2, 2.4.0 >Reporter: eaton >Assignee: eaton >Priority: Major > Fix For: 3.0.0 > > > Dynamic partition will fail when both '' and null values are taken as dynamic > partition values simultaneously. > For example, the test below will fail. > test("Null and '' values should not cause dynamic partition failure of string > types") { > withTable("t1", "t2") > { spark.range(3).write.saveAsTable("t1") spark.sql("select id, cast(case when > id = 1 then '' else null end as string) as p" + " from > t1").write.partitionBy("p").saveAsTable("t2") > checkAnswer(spark.table("t2").sort("id"), Seq(Row(0, null), Row(1, null), > Row(2, null))) } > } > The error is: 'org.apache.hadoop.fs.FileAlreadyExistsException: File already > exists'. 
> > Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: File already > exists: > [file:/F:/learning/spark/spark_master/spark_compile/spark-warehouse/t2/_temporary/0/_temporary/attempt_2018204354_0001_m_00_0/p=__HIVE_DEFAULT_PARTITION__/part-0-96217c96-3695-4f18-b0db-4f35a9078a3d.c000.snappy.parquet|file:///F:/learning/spark/spark_master/spark_compile/spark-warehouse/t2/_temporary/0/_temporary/attempt_2018204354_0001_m_00_0/p=__HIVE_DEFAULT_PARTITION__/part-0-96217c96-3695-4f18-b0db-4f35a9078a3d.c000.snappy.parquet] > at > org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:289) > at > org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:328) > at > org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.(ChecksumFileSystem.java:398) > at > org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:461) > at > org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:440) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:892) > at > org.apache.parquet.hadoop.util.HadoopOutputFile.create(HadoopOutputFile.java:74) > at > org.apache.parquet.hadoop.ParquetFileWriter.(ParquetFileWriter.java:248) > at > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:390) > at > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetOutputWriter.scala:37) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:151) > at > org.apache.spark.sql.execution.datasources.DynamicPartitionDataWriter.newOutputWriter(FileFormatDataWriter.scala:236) > at > org.apache.spark.sql.execution.datasources.DynamicPartitionDataWriter.write(FileFormatDataWriter.scala:260) > at > 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:239) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:245) > ... 10 more > 20:43:55.460 WARN > org.apache.spark.sql.execution.datasources.FileFormatWriterSuite: -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
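The FileAlreadyExistsException above arises because Hive-style partition paths have a single placeholder for "no value": both NULL and (for string columns) the empty string collapse into the same __HIVE_DEFAULT_PARTITION__ directory, so two writers race to create the same file. A rough Python sketch of the path mapping (directory naming is illustrative of the Hive convention, not Spark's exact layout code):

```python
HIVE_DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__"

def partition_dir(col, value):
    # Both None and '' collapse to the same default partition directory,
    # which is why '' and NULL partition values collide in a single write.
    if value is None or value == "":
        return f"{col}={HIVE_DEFAULT_PARTITION}"
    return f"{col}={value}"

paths = {partition_dir("p", v) for v in [None, "", "x"]}
print(paths)  # only two distinct directories for three distinct values
```

Three distinct input values map to only two output directories, so the writer must treat '' and NULL as the same partition (or deduplicate writers), which is what the fix addresses.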
[jira] [Assigned] (SPARK-26012) Dynamic partition will fail when both '' and null values are taken as dynamic partition values simultaneously.
[ https://issues.apache.org/jira/browse/SPARK-26012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-26012: --- Assignee: eaton > Dynamic partition will fail when both '' and null values are taken as dynamic > partition values simultaneously. > -- > > Key: SPARK-26012 > URL: https://issues.apache.org/jira/browse/SPARK-26012 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2, 2.4.0 >Reporter: eaton >Assignee: eaton >Priority: Major > > Dynamic partition will fail when both '' and null values are taken as dynamic > partition values simultaneously. > For example, the test below will fail. > test("Null and '' values should not cause dynamic partition failure of string > types") { > withTable("t1", "t2") > { spark.range(3).write.saveAsTable("t1") spark.sql("select id, cast(case when > id = 1 then '' else null end as string) as p" + " from > t1").write.partitionBy("p").saveAsTable("t2") > checkAnswer(spark.table("t2").sort("id"), Seq(Row(0, null), Row(1, null), > Row(2, null))) } > } > The error is: 'org.apache.hadoop.fs.FileAlreadyExistsException: File already > exists'. 
[jira] [Created] (SPARK-27428) Test "metrics StatsD sink with Timer " fails on BigEndian
Anuja Jakhade created SPARK-27428: - Summary: Test "metrics StatsD sink with Timer " fails on BigEndian Key: SPARK-27428 URL: https://issues.apache.org/jira/browse/SPARK-27428 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.3, 2.3.2, 2.3.4 Environment: Working on Ubuntu 16.04, Linux Java versions: Eclipse OpenJ9 VM (build openj9-0.12.1, JRE 1.8.0 Linux s390x-64-Bit Compressed References 20190205_218 (JIT enabled, AOT enabled) and openjdk version "1.8.0_191" OpenJDK Runtime Environment (build 1.8.0_191-8u191-b12-2ubuntu0.18.04.1-b12) OpenJDK 64-Bit Zero VM (build 25.191-b12, interpreted mode) Reporter: Anuja Jakhade Test case "metrics StatsD sink with Timer *** FAILED ***" fails with error java.net.SocketTimeoutException: Receive timed out at java.net.AbstractPlainDatagramSocketImpl.receive(AbstractPlainDatagramSocketImpl.java:143) at java.net.DatagramSocket.receive(DatagramSocket.java:812) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4$$anonfun$apply$3.apply(StatsdSinkSuite.scala:155) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4$$anonfun$apply$3.apply(StatsdSinkSuite.scala:154) at scala.collection.immutable.Range.foreach(Range.scala:160) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4.apply(StatsdSinkSuite.scala:154) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4$$anonfun$apply$mcV$sp$4.apply(StatsdSinkSuite.scala:123) at org.apache.spark.metrics.sink.StatsdSinkSuite.org$apache$spark$metrics$sink$StatsdSinkSuite$$withSocketAndSink(StatsdSinkSuite.scala:51) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4.apply$mcV$sp(StatsdSinkSuite.scala:123) at org.apache.spark.metrics.sink.StatsdSinkSuite$$anonfun$4.apply(StatsdSinkSuite.scala:123) On debugging, we observed that the last packet is not received at "socket.receive(p)". Hence the assert fails. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27427) cannot change list of packages from PYSPARK_SUBMIT_ARGS after restarting context
Natalino Busa created SPARK-27427: - Summary: cannot change list of packages from PYSPARK_SUBMIT_ARGS after restarting context Key: SPARK-27427 URL: https://issues.apache.org/jira/browse/SPARK-27427 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.4.0 Reporter: Natalino Busa Upon creating the gateway, the env variable PYSPARK_SUBMIT_ARGS is read; if packages are provided, the jars will be downloaded and added to the list of files to distribute to the cluster. However, this mechanism only works once, because the gateway is kept when the SparkContext is stopped and created again. Possible root cause: the gateway is only created once. This more aggressive SparkSession stop forces the gateway to be recreated:

```python
def stop(spark_session=None):
    try:
        sc = None
        if spark_session:
            sc = spark_session.sparkContext
            spark_session.stop()
        cls = pyspark.SparkContext
        sc = sc or cls._active_spark_context
        if sc:
            sc.stop()
            sc._gateway.shutdown()
        # Clear the class-level caches so the next context launches a new gateway.
        cls._active_spark_context = None
        cls._gateway = None
        cls._jvm = None
    except Exception as e:
        print(e)
        logging.warning('Could not fully stop the engine context')
```

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
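The "only works once" behavior follows from the gateway being cached as a class-level attribute rather than per-context state. A stripped-down, self-contained Python sketch of that caching pattern (the `Context` class here is purely illustrative; only the attribute names mirror pyspark's private ones):

```python
import os

class Context:
    _gateway = None  # class-level cache: survives individual stop() calls

    @classmethod
    def launch_gateway(cls):
        # The env variable is only consulted when no cached gateway exists,
        # so changing PYSPARK_SUBMIT_ARGS after the first launch has no effect.
        if cls._gateway is None:
            cls._gateway = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
        return cls._gateway

os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages a"
first = Context.launch_gateway()
os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages b"
second = Context.launch_gateway()  # cached: still reflects "--packages a"
Context._gateway = None            # the workaround: drop the cached gateway
third = Context.launch_gateway()   # now picks up "--packages b"
```

This mirrors why the more aggressive stop() in the report works: clearing the cached gateway forces the next context creation to re-read PYSPARK_SUBMIT_ARGS.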
[jira] [Commented] (SPARK-27409) Micro-batch support for Kafka Source in Spark 2.3
[ https://issues.apache.org/jira/browse/SPARK-27409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814314#comment-16814314 ] Gabor Somogyi commented on SPARK-27409: --- ping [~pbharaj] > Micro-batch support for Kafka Source in Spark 2.3 > - > > Key: SPARK-27409 > URL: https://issues.apache.org/jira/browse/SPARK-27409 > Project: Spark > Issue Type: Question > Components: Structured Streaming >Affects Versions: 2.3.2 >Reporter: Prabhjot Singh Bharaj >Priority: Major > > It seems with this change - > [https://github.com/apache/spark/commit/0a441d2edb0a3f6c6c7c370db8917e1c07f211e7#diff-eeac5bdf3a1ecd7b9f8aaf10fff37f05R50] > in Spark 2.3 for Kafka Source Provider, a Kafka source can not be run in > micro-batch mode but only in continuous mode. Is that understanding correct ? > {code:java} > E Py4JJavaError: An error occurred while calling o217.load. > E : org.apache.kafka.common.KafkaException: Failed to construct kafka consumer > E at > org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:717) > E at > org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:566) > E at > org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:549) > E at > org.apache.spark.sql.kafka010.SubscribeStrategy.createConsumer(ConsumerStrategy.scala:62) > E at > org.apache.spark.sql.kafka010.KafkaOffsetReader.createConsumer(KafkaOffsetReader.scala:314) > E at > org.apache.spark.sql.kafka010.KafkaOffsetReader.(KafkaOffsetReader.scala:78) > E at > org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:130) > E at > org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:43) > E at > org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:185) > E at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > E at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > E at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > E at java.lang.reflect.Method.invoke(Method.java:498) > E at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > E at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > E at py4j.Gateway.invoke(Gateway.java:282) > E at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > E at py4j.commands.CallCommand.execute(CallCommand.java:79) > E at py4j.GatewayConnection.run(GatewayConnection.java:238) > E at java.lang.Thread.run(Thread.java:748) > E Caused by: org.apache.kafka.common.KafkaException: > org.apache.kafka.common.KafkaException: java.io.FileNotFoundException: > non-existent (No such file or directory) > E at > org.apache.kafka.common.network.SslChannelBuilder.configure(SslChannelBuilder.java:44) > E at > org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:93) > E at > org.apache.kafka.common.network.ChannelBuilders.clientChannelBuilder(ChannelBuilders.java:51) > E at > org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:84) > E at > org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:657) > E ... 19 more > E Caused by: org.apache.kafka.common.KafkaException: > java.io.FileNotFoundException: non-existent (No such file or directory) > E at > org.apache.kafka.common.security.ssl.SslFactory.configure(SslFactory.java:121) > E at > org.apache.kafka.common.network.SslChannelBuilder.configure(SslChannelBuilder.java:41) > E ... 
23 more > E Caused by: java.io.FileNotFoundException: non-existent (No such file or > directory) > E at java.io.FileInputStream.open0(Native Method) > E at java.io.FileInputStream.open(FileInputStream.java:195) > E at java.io.FileInputStream.(FileInputStream.java:138) > E at java.io.FileInputStream.(FileInputStream.java:93) > E at > org.apache.kafka.common.security.ssl.SslFactory$SecurityStore.load(SslFactory.java:216) > E at > org.apache.kafka.common.security.ssl.SslFactory$SecurityStore.access$000(SslFactory.java:201) > E at > org.apache.kafka.common.security.ssl.SslFactory.createSSLContext(SslFactory.java:137) > E at > org.apache.kafka.common.security.ssl.SslFactory.configure(SslFactory.java:119) > E ... 24 more{code} > When running a simple data stream loader for kafka without an SSL cert, it > goes through this code block - > > {code:java} > ... > ... > org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:130) > E at > org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:43) > E at > org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:185) > ... > ...{code} > > Note that I haven't selected `trigger=continuous...` when
[jira] [Updated] (SPARK-27425) Add count_if functions
[ https://issues.apache.org/jira/browse/SPARK-27425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chaerim Yeo updated SPARK-27425: Description: Add aggregation function which returns the number of records satisfying a given condition. For Presto, [{{count_if}}|https://prestodb.github.io/docs/current/functions/aggregate.html] function is supported, we can write concisely. However, Spark does not support yet, we need to write like {{COUNT(CASE WHEN some_condition THEN 1 END)}} or {{SUM(CASE WHEN some_condition THEN 1 END)}}, which looks painful. was: Add aggregation function which returns the number of records satisfying a given condition. For Presto, [{{count_if}}|https://prestodb.github.io/docs/current/functions/aggregate.html] function is supported, we can write concisely. However, Spark does not support yet, we need to write like {{count(case when some_condition then 1 end)}} or {{sum(case when some_condition then 1 end)}}, which looks painful. > Add count_if functions > -- > > Key: SPARK-27425 > URL: https://issues.apache.org/jira/browse/SPARK-27425 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.1 >Reporter: Chaerim Yeo >Priority: Minor > > Add aggregation function which returns the number of records satisfying a > given condition. > For Presto, > [{{count_if}}|https://prestodb.github.io/docs/current/functions/aggregate.html] > function is supported, we can write concisely. > However, Spark does not support yet, we need to write like {{COUNT(CASE WHEN > some_condition THEN 1 END)}} or {{SUM(CASE WHEN some_condition THEN 1 END)}}, > which looks painful. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27425) Add count_if functions
[ https://issues.apache.org/jira/browse/SPARK-27425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chaerim Yeo updated SPARK-27425: Description: Add aggregation function which returns the number of records satisfying a given condition. For Presto, [{{count_if}}|https://prestodb.github.io/docs/current/functions/aggregate.html] function is supported, we can write concisely. However, Spark does not support yet, we need to write like {{count(case when some_condition then 1 end)}} or {{sum(case when some_condition then 1 end)}}, which looks painful. was: Add aggregation function which returns the number of records satisfying a given condition. For Presto, [{{count_if}}|https://prestodb.github.io/docs/current/functions/aggregate.html] function is supported, we can write concisely. However, Spark does not support yet, we need to write like {{count(case when some_condition then 1)}} or {{sum(case when some_condition then 1 end)}}, which looks painful. > Add count_if functions > -- > > Key: SPARK-27425 > URL: https://issues.apache.org/jira/browse/SPARK-27425 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.1 >Reporter: Chaerim Yeo >Priority: Minor > > Add aggregation function which returns the number of records satisfying a > given condition. > For Presto, > [{{count_if}}|https://prestodb.github.io/docs/current/functions/aggregate.html] > function is supported, we can write concisely. > However, Spark does not support yet, we need to write like {{count(case when > some_condition then 1 end)}} or {{sum(case when some_condition then 1 end)}}, > which looks painful. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27425) Add count_if functions
[ https://issues.apache.org/jira/browse/SPARK-27425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chaerim Yeo updated SPARK-27425: Summary: Add count_if functions (was: SQL count_if functions) > Add count_if functions > -- > > Key: SPARK-27425 > URL: https://issues.apache.org/jira/browse/SPARK-27425 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.1 >Reporter: Chaerim Yeo >Priority: Minor > > Add aggregation function which returns the number of records satisfying a > given condition. > For Presto, > [{{count_if}}|https://prestodb.github.io/docs/current/functions/aggregate.html] > function is supported, we can write concisely. > However, Spark does not support yet, we need to write like {{count(case when > some_condition then 1)}} or {{sum(case when some_condition then 1 end)}}, > which looks painful. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27337) QueryExecutionListener never cleans up listeners from the bus after SparkSession is closed
[ https://issues.apache.org/jira/browse/SPARK-27337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16814247#comment-16814247 ] Lantao Jin commented on SPARK-27337: {quote} This means that if you close and remake spark sessions fairly frequently, you're leaking every single time. {quote} What does "close spark sessions" mean here? SparkSession.close() will stop the underlying SparkContext. https://github.com/apache/spark/blob/85e5d4f141eedd571dfa0dcdabedced19736a351/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L718 > QueryExecutionListener never cleans up listeners from the bus after > SparkSession is closed > -- > > Key: SPARK-27337 > URL: https://issues.apache.org/jira/browse/SPARK-27337 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Vinoo Ganesh >Priority: Critical > Attachments: image001-1.png > > > As a result of > [https://github.com/apache/spark/commit/9690eba16efe6d25261934d8b73a221972b684f3], > it looks like there is a memory leak (specifically > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/util/QueryExecutionListener.scala#L131]). > Because the Listener Bus on the context still has a reference to the listener > (even after the SparkSession is cleared), they are never cleaned up. This > means that if you close and remake spark sessions fairly frequently, you're > leaking every single time.
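The leak pattern under discussion reduces to a few lines. This is a sketch of the mechanism only, not Spark's actual code: a bus that holds strong references keeps every listener alive until something explicitly unregisters it, which is exactly what never happens when the session is cleared without removing its listener.

```python
class ListenerBus:
    """Sketch of a bus that holds strong references to its listeners."""
    def __init__(self):
        self._listeners = []
    def add(self, listener):
        self._listeners.append(listener)
    def remove(self, listener):
        self._listeners.remove(listener)
    def count(self):
        return len(self._listeners)

class Session:
    """Stand-in for a SparkSession registering a QueryExecutionListener."""
    def __init__(self, bus):
        self._bus = bus
        self._listener = object()
        bus.add(self._listener)
    def close(self, unregister=True):
        if unregister:
            self._bus.remove(self._listener)

clean, leaky = ListenerBus(), ListenerBus()
for _ in range(100):
    Session(clean).close()                   # unregisters: nothing accumulates
for _ in range(100):
    Session(leaky).close(unregister=False)   # the reported behavior: refs pile up
print(clean.count(), leaky.count())  # 0 100
```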
[jira] [Created] (SPARK-27426) SparkAppHandle states not getting updated in Kubernetes
Nishant Ranjan created SPARK-27426: -- Summary: SparkAppHandle states not getting updated in Kubernetes Key: SPARK-27426 URL: https://issues.apache.org/jira/browse/SPARK-27426 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 2.4.0 Environment: # CentOS 7 # Kubernetes 1.14 # Spark 2.4.0 Reporter: Nishant Ranjan While launching a Spark application through startApplication(), the SparkAppHandle state is not getting updated.
{code:java}
sparkLaunch = new SparkLauncher()
    .setSparkHome("/root/test/spark-2.4.0-bin-hadoop2.7")
    .setMaster("k8s://https://172.16.23.30:6443")
    .setVerbose(true)
    .addSparkArg("--verbose")
    .setAppResource("local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar")
    .setConf("spark.app.name", "spark-pi")
    .setMainClass("org.apache.spark.examples.SparkPi")
    .setConf("spark.executor.instances", "5")
    .setConf("spark.kubernetes.container.image", "registry.renovite.com/spark:v2")
    .setConf("spark.kubernetes.driver.pod.name", "spark-pi-driver")
    .setConf("spark.kubernetes.container.image.pullSecrets", "dev-registry-key")
    .setConf("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
    .setDeployMode("cluster");
SparkAppHandle handle = sparkLaunch.startApplication();
{code}
Observations: # I tried listeners etc., but handle.getState() returns UNKNOWN, and when the Spark application completes, the state changes to LOST. # SparkAppHandle is not null. # handle.getAppId() is always null. My best guess is that communication is not working properly between the listener and the Spark driver in Kubernetes.
[jira] [Created] (SPARK-27425) SQL count_if functions
Chaerim Yeo created SPARK-27425: --- Summary: SQL count_if functions Key: SPARK-27425 URL: https://issues.apache.org/jira/browse/SPARK-27425 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.1 Reporter: Chaerim Yeo Add aggregation function which returns the number of records satisfying a given condition. For Presto, [{{count_if}}|https://prestodb.github.io/docs/current/functions/aggregate.html] function is supported, we can write concisely. However, Spark does not support yet, we need to write like {{count(case when some_condition then 1)}} or {{sum(case when some_condition then 1 end)}}, which looks painful. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
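For reference, the requested function and the workaround the reporter mentions are equivalent. A small Python model of the semantics (an illustration, not Spark or Presto code): `count_if` counts rows satisfying a predicate, while `COUNT(CASE WHEN ... THEN 1 END)` relies on COUNT ignoring the NULLs produced by the CASE's missing ELSE branch.

```python
def count_if(rows, predicate):
    """Presto-style count_if: number of rows for which the predicate holds."""
    return sum(1 for r in rows if predicate(r))

rows = [1, 5, 10, 15]
# COUNT(CASE WHEN r > 4 THEN 1 END): the CASE yields NULL (None) for
# non-matching rows, and COUNT skips NULLs.
workaround = sum(1 for r in rows if (1 if r > 4 else None) is not None)
assert count_if(rows, lambda r: r > 4) == workaround == 3
```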
[jira] [Resolved] (SPARK-27406) UnsafeArrayData serialization breaks when two machines have different Oops size
[ https://issues.apache.org/jira/browse/SPARK-27406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-27406. - Resolution: Fixed Assignee: peng bo Fix Version/s: 3.0.0 2.4.2 > UnsafeArrayData serialization breaks when two machines have different Oops > size > --- > > Key: SPARK-27406 > URL: https://issues.apache.org/jira/browse/SPARK-27406 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.1 >Reporter: peng bo >Assignee: peng bo >Priority: Major > Fix For: 2.4.2, 3.0.0 > > > ApproxCountDistinctForIntervals holds the UnsafeArrayData data to initialize > endpoints. When the UnsafeArrayData is serialized with Java serialization, > the BYTE_ARRAY_OFFSET in memory can change if two machines have different > pointer width (Oops in JVM). > It's similar to SPARK-10914. > {code:java} > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals$$anonfun$endpoints$1.apply(ApproxCountDistinctForIntervals.scala:69) > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals$$anonfun$endpoints$1.apply(ApproxCountDistinctForIntervals.scala:69) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.endpoints$lzycompute(ApproxCountDistinctForIntervals.scala:69) > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.endpoints(ApproxCountDistinctForIntervals.scala:66) > at > 
org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.org$apache$spark$sql$catalyst$expressions$aggregate$ApproxCountDistinctForIntervals$$hllppArray$lzycompute(ApproxCountDistinctForIntervals.scala:94) > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.org$apache$spark$sql$catalyst$expressions$aggregate$ApproxCountDistinctForIntervals$$hllppArray(ApproxCountDistinctForIntervals.scala:93) > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.org$apache$spark$sql$catalyst$expressions$aggregate$ApproxCountDistinctForIntervals$$numWordsPerHllpp$lzycompute(ApproxCountDistinctForIntervals.scala:104) > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.org$apache$spark$sql$catalyst$expressions$aggregate$ApproxCountDistinctForIntervals$$numWordsPerHllpp(ApproxCountDistinctForIntervals.scala:104) > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.totalNumWords$lzycompute(ApproxCountDistinctForIntervals.scala:106) > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.totalNumWords(ApproxCountDistinctForIntervals.scala:106) > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.createAggregationBuffer(ApproxCountDistinctForIntervals.scala:110) > at > org.apache.spark.sql.catalyst.expressions.aggregate.ApproxCountDistinctForIntervals.createAggregationBuffer(ApproxCountDistinctForIntervals.scala:44) > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.initialize(interfaces.scala:528) > at > org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator$$anonfun$initAggregationBuffer$2.apply(ObjectAggregationIterator.scala:120) > at > org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator$$anonfun$initAggregationBuffer$2.apply(ObjectAggregationIterator.scala:120) > at > 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.initAggregationBuffer(ObjectAggregationIterator.scala:120) > at > org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.org$apache$spark$sql$execution$aggregate$ObjectAggregationIterator$$createNewAggregationBuffer(ObjectAggregationIterator.scala:112) > at >
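The failure mode described (writer and reader JVMs with different Oops sizes, hence different BYTE_ARRAY_OFFSET values) can be illustrated without a JVM. This is a sketch with made-up base offsets, not UnsafeArrayData itself: serializing positions that bake in the writer's base offset corrupts them on a machine with a different base, while base-relative offsets round-trip correctly.

```python
# Stand-ins for BYTE_ARRAY_OFFSET on two JVMs with different Oops sizes.
BASE_WRITER, BASE_READER = 16, 24

def serialize_absolute(relative_offsets):
    # Buggy scheme: bake the writer's base into the serialized positions.
    return [BASE_WRITER + off for off in relative_offsets]

def deserialize(absolute_offsets, base):
    # The reader subtracts *its own* base to recover relative offsets.
    return [a - base for a in absolute_offsets]

rel = [0, 8, 16]
corrupted = deserialize(serialize_absolute(rel), BASE_READER)
assert corrupted != rel  # every offset is off by BASE_READER - BASE_WRITER
assert deserialize(serialize_absolute(rel), BASE_WRITER) == rel  # same machine works
# Fix: serialize base-independent (relative) offsets in the first place.
```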
[jira] [Assigned] (SPARK-24872) Remove the symbol “||” of the “OR” operation
[ https://issues.apache.org/jira/browse/SPARK-24872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-24872: --- Assignee: hantiantian > Remove the symbol “||” of the “OR” operation > > > Key: SPARK-24872 > URL: https://issues.apache.org/jira/browse/SPARK-24872 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: hantiantian >Assignee: hantiantian >Priority: Minor > > “||” will perform the function of STRING concat, and it is also the symbol of > the "OR" operation. > When I want use "||" as "OR" operation, I find that it perform the function > of STRING concat, > spark-sql> explain extended select * from aa where id==1 || id==2; > == Parsed Logical Plan == > 'Project [*] > +- 'Filter (('id = concat(1, 'id)) = 2) > +- 'UnresolvedRelation `aa` > spark-sql> select "abc" || "DFF" ; > And the result is "abcDFF". > In predicates.scala, "||" is the symbol of "Or" operation. Could we remove it? > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24872) Remove the symbol “||” of the “OR” operation
[ https://issues.apache.org/jira/browse/SPARK-24872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-24872. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 21826 [https://github.com/apache/spark/pull/21826] > Remove the symbol “||” of the “OR” operation > > > Key: SPARK-24872 > URL: https://issues.apache.org/jira/browse/SPARK-24872 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: hantiantian >Assignee: hantiantian >Priority: Minor > Fix For: 3.0.0 > > > “||” will perform the function of STRING concat, and it is also the symbol of > the "OR" operation. > When I want use "||" as "OR" operation, I find that it perform the function > of STRING concat, > spark-sql> explain extended select * from aa where id==1 || id==2; > == Parsed Logical Plan == > 'Project [*] > +- 'Filter (('id = concat(1, 'id)) = 2) > +- 'UnresolvedRelation `aa` > spark-sql> select "abc" || "DFF" ; > And the result is "abcDFF". > In predicates.scala, "||" is the symbol of "Or" operation. Could we remove it? > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
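The ambiguity behind this ticket is easy to state: ANSI SQL reserves `||` for string concatenation, so once the token is bound to concat it cannot also serve as logical OR. A tiny Python model of the two readings of `a || b` (not Spark's parser, just an illustration) shows why a single meaning had to win:

```python
def pipes_as_concat(a, b):
    # The ANSI SQL meaning Spark keeps: operands are coerced to strings.
    return str(a) + str(b)

def pipes_as_or(a, b):
    # The meaning removed from predicates.scala: logical OR.
    return bool(a) or bool(b)

# The reporter's example: the same token gives incompatible results.
assert pipes_as_concat("abc", "DFF") == "abcDFF"
assert pipes_as_or(True, False) is True
assert pipes_as_concat(True, False) == "TrueFalse"  # the surprise OR-users hit
```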
[jira] [Resolved] (SPARK-27414) make it clear that date type is timezone independent
[ https://issues.apache.org/jira/browse/SPARK-27414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-27414. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24325 [https://github.com/apache/spark/pull/24325] > make it clear that date type is timezone independent > > > Key: SPARK-27414 > URL: https://issues.apache.org/jira/browse/SPARK-27414 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27424) Joining of one stream against the most recent update in another stream
[ https://issues.apache.org/jira/browse/SPARK-27424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814120#comment-16814120 ] Thilo Schneider commented on SPARK-27424: - Attached is a - not fully detailed - sketch of the improvement. I would be willing to work on this further but would like to get your feedback before going into more detail. Do we need a SPIP for this proposal? > Joining of one stream against the most recent update in another stream > -- > > Key: SPARK-27424 > URL: https://issues.apache.org/jira/browse/SPARK-27424 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.1 >Reporter: Thilo Schneider >Priority: Major > Attachments: join-last-update-design.pdf > > > Currently, adding the most recent update of a row with a given key to another > stream is not possible. This situation arises if one wants to use the current > state, of one object, for example when joining the room temperature with the > current weather. > This ticket covers creation of a {{stream_lead}} and modification of the > streaming join logic (and state store) to additionally allow joins of the > form > {code:sql} > SELECT * > FROM A, B > WHERE > A.key = B.key > AND A.time >= B.time > AND A.time < stream_lead(B.time) > {code} > The major aspect of this change is that we actually need a third watermark to > cover how late updates may come. > A rough sketch may be found in the attached document. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27424) Joining of one stream against the most recent update in another stream
[ https://issues.apache.org/jira/browse/SPARK-27424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thilo Schneider updated SPARK-27424: Attachment: join-last-update-design.pdf > Joining of one stream against the most recent update in another stream > -- > > Key: SPARK-27424 > URL: https://issues.apache.org/jira/browse/SPARK-27424 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.1 >Reporter: Thilo Schneider >Priority: Major > Attachments: join-last-update-design.pdf > > > Currently, adding the most recent update of a row with a given key to another > stream is not possible. This situation arises if one wants to use the current > state, of one object, for example when joining the room temperature with the > current weather. > This ticket covers creation of a {{stream_lead}} and modification of the > streaming join logic (and state store) to additionally allow joins of the > form > {code:sql} > SELECT * > FROM A, B > WHERE > A.key = B.key > AND A.time >= B.time > AND A.time < stream_lead(B.time) > {code} > The major aspect of this change is that we actually need a third watermark to > cover how late updates may come. > A rough sketch may be found in the attached document. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27424) Joining of one stream against the most recent update in another stream
Thilo Schneider created SPARK-27424: --- Summary: Joining of one stream against the most recent update in another stream Key: SPARK-27424 URL: https://issues.apache.org/jira/browse/SPARK-27424 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.4.1 Reporter: Thilo Schneider Attachments: join-last-update-design.pdf Currently, adding the most recent update of a row with a given key to another stream is not possible. This situation arises if one wants to use the current state, of one object, for example when joining the room temperature with the current weather. This ticket covers creation of a {{stream_lead}} and modification of the streaming join logic (and state store) to additionally allow joins of the form {code:sql} SELECT * FROM A, B WHERE A.key = B.key AND A.time >= B.time AND A.time < stream_lead(B.time) {code} The major aspect of this change is that we actually need a third watermark to cover how late updates may come. A rough sketch may be found in the attached document. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
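The batch analogue of the proposed join can be sketched in a few lines of Python (an illustration of the intended semantics only, not the proposed streaming implementation): for each row of A, take the B row with the greatest B.time <= A.time for the same key, which is exactly the window B.time <= A.time < stream_lead(B.time).

```python
from bisect import bisect_right

def join_most_recent(a_rows, b_rows):
    """Join each (key, time, value) row of A against the most recent
    earlier-or-equal update in B for the same key, or None if there is none."""
    by_key = {}
    for key, time, value in sorted(b_rows, key=lambda r: (r[0], r[1])):
        times, values = by_key.setdefault(key, ([], []))
        times.append(time)
        values.append(value)
    out = []
    for key, time, value in a_rows:
        times, values = by_key.get(key, ([], []))
        i = bisect_right(times, time) - 1  # index of latest B.time <= A.time
        out.append((key, time, value, values[i] if i >= 0 else None))
    return out

# Room temperatures joined against the weather in effect at each reading:
weather = [("room1", 0, "sunny"), ("room1", 10, "rain")]
temps = [("room1", 5, 21.0), ("room1", 12, 20.5)]
print(join_most_recent(temps, weather))
# [('room1', 5, 21.0, 'sunny'), ('room1', 12, 20.5, 'rain')]
```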
[jira] [Resolved] (SPARK-27181) Add public expression and transform API for DSv2 partitioning
[ https://issues.apache.org/jira/browse/SPARK-27181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-27181. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24117 [https://github.com/apache/spark/pull/24117] > Add public expression and transform API for DSv2 partitioning > - > > Key: SPARK-27181 > URL: https://issues.apache.org/jira/browse/SPARK-27181 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27181) Add public expression and transform API for DSv2 partitioning
[ https://issues.apache.org/jira/browse/SPARK-27181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-27181: --- Assignee: Ryan Blue > Add public expression and transform API for DSv2 partitioning > - > > Key: SPARK-27181 > URL: https://issues.apache.org/jira/browse/SPARK-27181 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27289) spark-submit explicit configuration does not take effect but Spark UI shows it's effective
[ https://issues.apache.org/jira/browse/SPARK-27289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16814094#comment-16814094 ] KaiXu commented on SPARK-27289: --- I have verified that the intermediate data is written to the spark.local.dir configured in spark-defaults.conf, while the value set through --conf shows on the web UI. That means the value passed through --conf does not override the value in spark-defaults.conf; what's more, the value shown on the Web UI is not the value that actually takes effect (the UI shows the value from --conf, but the real working dir is the value from spark-defaults.conf). [~Udbhav Agrawal], I'm using Spark 2.3.3, not sure if that matters. > spark-submit explicit configuration does not take effect but Spark UI shows > it's effective > -- > > Key: SPARK-27289 > URL: https://issues.apache.org/jira/browse/SPARK-27289 > Project: Spark > Issue Type: Bug > Components: Deploy, Documentation, Spark Submit, Web UI >Affects Versions: 2.3.3 >Reporter: KaiXu >Priority: Minor > Attachments: Capture.PNG > > > The [doc|https://spark.apache.org/docs/latest/submitting-applications.html] says that > "In general, configuration values explicitly set on a {{SparkConf}} take the > highest precedence, then flags passed to {{spark-submit}}, then values in the > defaults file", but when setting spark.local.dir through --conf with > spark-submit, it still uses the values from > ${SPARK_HOME}/conf/spark-defaults.conf; what's more, the Spark runtime UI > environment variables show the value from --conf, which is really misleading. > e.g. 
> I set submit my application through the command: > /opt/spark233/bin/spark-submit --properties-file /opt/spark.conf --conf > spark.local.dir=/tmp/spark_local -v --class > org.apache.spark.examples.mllib.SparseNaiveBayes --master > spark://bdw-slave20:7077 > /opt/sparkbench/assembly/target/sparkbench-assembly-7.1-SNAPSHOT-dist.jar > hdfs://bdw-slave20:8020/Bayes/Input > > the spark.local.dir in ${SPARK_HOME}/conf/spark-defaults.conf is: > spark.local.dir=/mnt/nvme1/spark_local > when the application is running, I found the intermediate shuffle data was > wrote to /mnt/nvme1/spark_local, which is set through > ${SPARK_HOME}/conf/spark-defaults.conf, but the Web UI shows that the > environment value spark.local.dir=/tmp/spark_local. > The spark-submit verbose also shows spark.local.dir=/tmp/spark_local, it's > misleading. > > !image-2019-03-27-10-59-38-377.png! > spark-submit verbose: > > Spark properties used, including those specified through > --conf and those from the properties file /opt/spark.conf: > (spark.local.dir,/tmp/spark_local) > (spark.default.parallelism,132) > (spark.driver.memory,10g) > (spark.executor.memory,352g) > X -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
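The precedence the documentation promises is a simple ordered merge. A sketch of what should happen (a hypothetical helper, not Spark's config loader), which matches what the reporter saw on the UI but not on disk:

```python
def effective_conf(defaults_file, submit_flags, spark_conf):
    """Documented precedence: SparkConf > --conf flags > spark-defaults.conf."""
    merged = dict(defaults_file)   # lowest precedence
    merged.update(submit_flags)    # --conf overrides the defaults file
    merged.update(spark_conf)      # explicit SparkConf wins
    return merged

defaults = {"spark.local.dir": "/mnt/nvme1/spark_local"}
flags = {"spark.local.dir": "/tmp/spark_local"}
conf = effective_conf(defaults, flags, {})
# Per the docs (and the UI), --conf should win:
assert conf["spark.local.dir"] == "/tmp/spark_local"
```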