[jira] [Updated] (SPARK-24008) SQL/Hive Context fails with NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-24008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated SPARK-24008: -- Attachment: Repro > SQL/Hive Context fails with NullPointerException > - > > Key: SPARK-24008 > URL: https://issues.apache.org/jira/browse/SPARK-24008 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.3 >Reporter: Prabhu Joseph >Priority: Major > Attachments: Repro > > > SQL / Hive Context fails with a NullPointerException while getting > configuration from SQLConf. This happens when the MemoryStore fills up with > many broadcasts and starts dropping blocks, and a SQL / Hive Context is then created > and broadcast. Using this Context to access a table then fails with the > NullPointerException below. > A repro is attached - a Spark example which fills the MemoryStore with > broadcasts and then creates and accesses a SQL Context. > {code} > java.lang.NullPointerException > at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638) > at > org.apache.spark.sql.SQLConf.defaultDataSourceName(SQLConf.scala:558) > at > org.apache.spark.sql.DataFrameReader.<init>(DataFrameReader.scala:362) > at org.apache.spark.sql.SQLContext.read(SQLContext.scala:623) > at SparkHiveExample$.main(SparkHiveExample.scala:76) > at SparkHiveExample.main(SparkHiveExample.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > 18/04/06 14:17:42 ERROR ApplicationMaster: User class threw exception: > java.lang.NullPointerException > java.lang.NullPointerException > at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638) > at org.apache.spark.sql.SQLContext.getConf(SQLContext.scala:153) > at > org.apache.spark.sql.hive.HiveContext.hiveMetastoreVersion(HiveContext.scala:166) > > at
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:258) > > at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:255) > at > org.apache.spark.sql.hive.HiveContext$$anon$2.<init>(HiveContext.scala:475) > at > org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:475) > > at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:474) > at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:90) > at org.apache.spark.sql.SQLContext.table(SQLContext.scala:831) > at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827) > {code} > The MemoryStore filled up and started dropping blocks. > {code} > 18/04/17 08:03:43 INFO MemoryStore: 2 blocks selected for dropping > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_14 stored as values in > memory (estimated size 78.1 MB, free 64.4 MB) > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_14_piece0 stored as bytes > in memory (estimated size 1522.0 B, free 64.4 MB) > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_15 stored as values in > memory (estimated size 350.9 KB, free 64.1 MB) > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_15_piece0 stored as bytes > in memory (estimated size 29.9 KB, free 64.0 MB) > 18/04/17 08:03:43 INFO MemoryStore: 10 blocks selected for dropping > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_16 stored as values in > memory (estimated size 78.1 MB, free 64.7 MB) > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_16_piece0 stored as bytes > in memory (estimated size 1522.0 B, free 64.7 MB) > 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_1 stored as values in > memory (estimated size 136.0 B, free 64.7 MB) > 18/04/17 08:03:44 INFO MemoryStore: MemoryStore cleared > 18/04/17 08:03:20 INFO MemoryStore: MemoryStore started with capacity 511.1 MB > 18/04/17 08:03:44 INFO MemoryStore: MemoryStore cleared > 18/04/17 08:03:57 INFO MemoryStore: MemoryStore started with capacity 511.1 MB
> 18/04/17 08:04:23 INFO MemoryStore: MemoryStore cleared > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
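The failure mode described in this report — a value silently evicted from a size-bounded in-memory store, then dereferenced as if it were still present — can be illustrated without Spark. The following is a hypothetical simulation (a plain LRU map standing in for the MemoryStore; none of these names are Spark APIs):

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Minimal simulation of the SPARK-24008 failure mode: a bounded store evicts
 * older entries under pressure, and a reader that assumes its entry is still
 * present dereferences null. Hypothetical names, not Spark's MemoryStore.
 */
public class EvictionNpeDemo {
    // Tiny LRU store: evicts the least-recently-used entry once more than 2 are held.
    static Map<String, int[]> store = new LinkedHashMap<String, int[]>(16, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<String, int[]> eldest) {
            return size() > 2;
        }
    };

    public static void main(String[] args) {
        store.put("conf", new int[] {638});     // stands in for the context's conf state
        store.put("broadcast_14", new int[1]);  // later broadcasts fill the store...
        store.put("broadcast_15", new int[1]);  // ...and "conf" is evicted here

        int[] conf = store.get("conf");         // returns null after eviction
        System.out.println(conf == null);       // true
        try {
            System.out.println(conf[0]);        // dereferencing the evicted entry
        } catch (NullPointerException e) {
            System.out.println("NullPointerException, analogous to the reported stack trace");
        }
    }
}
```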
[jira] [Created] (SPARK-24008) SQL/Hive Context fails with NullPointerException
Prabhu Joseph created SPARK-24008: - Summary: SQL/Hive Context fails with NullPointerException Key: SPARK-24008 URL: https://issues.apache.org/jira/browse/SPARK-24008 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.6.3 Reporter: Prabhu Joseph SQL / Hive Context fails with a NullPointerException while getting configuration from SQLConf. This happens when the MemoryStore fills up with many broadcasts and starts dropping blocks, and a SQL / Hive Context is then created and broadcast. Using this Context to access a table then fails with the NullPointerException below. A repro is attached - a Spark example which fills the MemoryStore with broadcasts and then creates and accesses a SQL Context. {code} java.lang.NullPointerException at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638) at org.apache.spark.sql.SQLConf.defaultDataSourceName(SQLConf.scala:558) at org.apache.spark.sql.DataFrameReader.<init>(DataFrameReader.scala:362) at org.apache.spark.sql.SQLContext.read(SQLContext.scala:623) at SparkHiveExample$.main(SparkHiveExample.scala:76) at SparkHiveExample.main(SparkHiveExample.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) 18/04/06 14:17:42 ERROR ApplicationMaster: User class threw exception: java.lang.NullPointerException java.lang.NullPointerException at org.apache.spark.sql.SQLConf.getConf(SQLConf.scala:638) at org.apache.spark.sql.SQLContext.getConf(SQLContext.scala:153) at org.apache.spark.sql.hive.HiveContext.hiveMetastoreVersion(HiveContext.scala:166) at org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:258) at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:255) at org.apache.spark.sql.hive.HiveContext$$anon$2.<init>(HiveContext.scala:475) at
org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:475) at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:474) at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:90) at org.apache.spark.sql.SQLContext.table(SQLContext.scala:831) at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827) {code} MemoryStore got filled and started dropping the blocks. {code} 18/04/17 08:03:43 INFO MemoryStore: 2 blocks selected for dropping 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_14 stored as values in memory (estimated size 78.1 MB, free 64.4 MB) 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_14_piece0 stored as bytes in memory (estimated size 1522.0 B, free 64.4 MB) 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_15 stored as values in memory (estimated size 350.9 KB, free 64.1 MB) 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_15_piece0 stored as bytes in memory (estimated size 29.9 KB, free 64.0 MB) 18/04/17 08:03:43 INFO MemoryStore: 10 blocks selected for dropping 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_16 stored as values in memory (estimated size 78.1 MB, free 64.7 MB) 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_16_piece0 stored as bytes in memory (estimated size 1522.0 B, free 64.7 MB) 18/04/17 08:03:43 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 136.0 B, free 64.7 MB) 18/04/17 08:03:44 INFO MemoryStore: MemoryStore cleared 18/04/17 08:03:20 INFO MemoryStore: MemoryStore started with capacity 511.1 MB 18/04/17 08:03:44 INFO MemoryStore: MemoryStore cleared 18/04/17 08:03:57 INFO MemoryStore: MemoryStore started with capacity 511.1 MB 18/04/17 08:04:23 INFO MemoryStore: MemoryStore cleared {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23843) Deploy yarn meets incorrect LOCALIZED_CONF_DIR
[ https://issues.apache.org/jira/browse/SPARK-23843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441930#comment-16441930 ] Saisai Shao commented on SPARK-23843: - I think this issue is due to your "new Hadoop-compatible filesystem". I cannot reproduce the issue with HDFS. > Deploy yarn meets incorrect LOCALIZED_CONF_DIR > -- > > Key: SPARK-23843 > URL: https://issues.apache.org/jira/browse/SPARK-23843 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.3.0 > Environment: spark-2.3.0-bin-hadoop2.7 >Reporter: zhoutai.zt >Priority: Major > > We have implemented a new Hadoop-compatible filesystem and run Spark on it. The > command is: > {quote}./bin/spark-submit --class org.apache.spark.examples.SparkPi --master > yarn --deploy-mode cluster --executor-memory 1G --num-executors 1 > /home/hadoop/app/spark-2.3.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.3.0.jar > 10 > {quote} > The result is: > {quote}Exception in thread "main" org.apache.spark.SparkException: > Application application_1522399820301_0020 finished with failed status > at org.apache.spark.deploy.yarn.Client.run(Client.scala:1159) > {quote} > We set the log level to DEBUG and find: > {quote}2018-04-02 09:36:09,603 DEBUG org.apache.spark.deploy.yarn.Client: > __app__.jar -> resource \{ scheme: "dfs" host: > "f-63a47d43wh98.cn-neimeng-env10-d01.dfs.aliyuncs.com" port: 10290 file: > "/user/hadoop/.sparkStaging/application_1522399820301_0006/spark-examples_2.11-2.3.0.jar" > } size: 1997548 timestamp: 1522632978000 type: FILE visibility: PRIVATE > 2018-04-02 09:36:09,603 DEBUG org.apache.spark.deploy.yarn.Client: > __spark_libs__ -> resource \{ scheme: "dfs" host: > "f-63a47d43wh98.cn-neimeng-env10-d01.dfs.aliyuncs.com" port: 10290 file: > "/user/hadoop/.sparkStaging/application_1522399820301_0006/__spark_libs__924155631753698276.zip" > } size: 242801307 timestamp: 1522632977000 type: ARCHIVE visibility: PRIVATE > 2018-04-02 09:36:09,603 DEBUG
org.apache.spark.deploy.yarn.Client: > __spark_conf__ -> resource \{ port: -1 file: > "/user/hadoop/.sparkStaging/application_1522399820301_0006/__spark_conf__.zip" > } size: 185531 timestamp: 1522632978000 type: ARCHIVE visibility: PRIVATE > {quote} > As shown, the information for __app__.jar and __spark_libs__ is correct, BUT > __spark_conf__ has no port or scheme. > Exploring the source code, addResource appears twice in Client.scala: > {code:java} > val destPath = copyFileToRemote(destDir, localPath, replication, symlinkCache) > val destFs = FileSystem.get(destPath.toUri(), hadoopConf) > distCacheMgr.addResource( > destFs, hadoopConf, destPath, localResources, resType, linkname, statCache, > appMasterOnly = appMasterOnly) > {code} > {code:java} > > val remoteConfArchivePath = new Path(destDir, LOCALIZED_CONF_ARCHIVE) val > remoteFs = FileSystem.get(remoteConfArchivePath.toUri(), hadoopConf) > sparkConf.set(CACHED_CONF_ARCHIVE, remoteConfArchivePath.toString()) val > localConfArchive = new Path(createConfArchive().toURI()) > copyFileToRemote(destDir, localConfArchive, replication, symlinkCache, force > = true, destName = Some(LOCALIZED_CONF_ARCHIVE)) // Manually add the config > archive to the cache manager so that the AM is launched with // the proper > files set up. > distCacheMgr.addResource( remoteFs, hadoopConf, remoteConfArchivePath, > localResources, LocalResourceType.ARCHIVE, LOCALIZED_CONF_DIR, statCache, > appMasterOnly = false) > {code} > As shown in the source code, the destination paths are constructed differently.
And > this is confirmed by self added debug log > {quote}2018-04-02 15:18:46,357 ERROR > org.apache.hadoop.yarn.util.ConverterUtils: getYarnUrlFromURI > URI:/user/root/.sparkStaging/application_1522399820301_0020/__spark_conf__.zip > 2018-04-02 15:18:46,357 ERROR org.apache.hadoop.yarn.util.ConverterUtils: > getYarnUrlFromURI URL:null; > null;-1;null;/user/root/.sparkStaging/application_1522399820301_0020/__spark_conf__.zip{quote} > Log messages on YARN NM: > {quote}2018-04-02 09:36:11,958 WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: > Failed to parse resource-request > java.net.URISyntaxException: Expected scheme name at index 0: > :///user/hadoop/.sparkStaging/application_1522399820301_0006/__spark_conf__.zip > {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
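The NodeManager error quoted above is reproducible with plain java.net.URI: a path-only URI carries no scheme or port, and naively gluing `scheme://host/path` back together from those null/-1 parts produces a string that cannot be re-parsed. This is a sketch of the symptom, not the actual Client.scala code path:

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Sketch of why a scheme-less resource URI breaks YARN localization,
 *  mirroring the symptom in the logs rather than Spark's actual code. */
public class SchemelessUriDemo {
    public static void main(String[] args) throws URISyntaxException {
        // The conf archive path as logged: no scheme, no host, no port.
        URI conf = new URI("/user/hadoop/.sparkStaging/application_1522399820301_0006/__spark_conf__.zip");
        System.out.println(conf.getScheme()); // null
        System.out.println(conf.getPort());   // -1

        // Re-assembling "scheme://host/path" from those parts yields ":///..."
        String glued = (conf.getScheme() == null ? "" : conf.getScheme())
                + "://" + (conf.getHost() == null ? "" : conf.getHost()) + conf.getPath();
        System.out.println(glued);

        try {
            new URI(glued); // roughly what the NodeManager then tries to parse
        } catch (URISyntaxException e) {
            System.out.println(e.getMessage()); // "Expected scheme name at index 0: ..."
        }
    }
}
```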
[jira] [Commented] (SPARK-23340) Upgrade Apache ORC to 1.4.3
[ https://issues.apache.org/jira/browse/SPARK-23340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441926#comment-16441926 ] Apache Spark commented on SPARK-23340: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/21093 > Upgrade Apache ORC to 1.4.3 > --- > > Key: SPARK-23340 > URL: https://issues.apache.org/jira/browse/SPARK-23340 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.4.0 > > > This issue updates Apache ORC dependencies to 1.4.3 released on February 9th. > Apache ORC 1.4.2 release removes unnecessary dependencies and 1.4.3 has 5 > more patches including bug fixes (https://s.apache.org/Fll8). > Especially, the following ORC-285 is fixed at 1.4.3. > {code} > scala> val df = Seq(Array.empty[Float]).toDF() > scala> df.write.format("orc").save("/tmp/floatarray") > scala> spark.read.orc("/tmp/floatarray") > res1: org.apache.spark.sql.DataFrame = [value: array] > scala> spark.read.orc("/tmp/floatarray").show() > 18/02/12 22:09:10 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > java.io.IOException: Error reading file: > file:/tmp/floatarray/part-0-9c0b461b-4df1-4c23-aac1-3e4f349ac7d6-c000.snappy.orc > at > org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1191) > at > org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78) > ... > Caused by: java.io.EOFException: Read past EOF for compressed stream Stream > for column 2 kind DATA position: 0 length: 0 range: 0 offset: 0 limit: 0 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23984) PySpark Bindings for K8S
[ https://issues.apache.org/jira/browse/SPARK-23984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23984: Assignee: Apache Spark > PySpark Bindings for K8S > > > Key: SPARK-23984 > URL: https://issues.apache.org/jira/browse/SPARK-23984 > Project: Spark > Issue Type: New Feature > Components: Kubernetes, PySpark >Affects Versions: 2.3.0 >Reporter: Ilan Filonenko >Assignee: Apache Spark >Priority: Major > > This ticket is tracking the ongoing work of moving the upsteam work from > [https://github.com/apache-spark-on-k8s/spark] specifically regarding Python > bindings for Spark on Kubernetes. > The points of focus are: dependency management, increased non-JVM memory > overhead default values, and modified Docker images to include Python > Support. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23984) PySpark Bindings for K8S
[ https://issues.apache.org/jira/browse/SPARK-23984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441921#comment-16441921 ] Apache Spark commented on SPARK-23984: -- User 'ifilonenko' has created a pull request for this issue: https://github.com/apache/spark/pull/21092 > PySpark Bindings for K8S > > > Key: SPARK-23984 > URL: https://issues.apache.org/jira/browse/SPARK-23984 > Project: Spark > Issue Type: New Feature > Components: Kubernetes, PySpark >Affects Versions: 2.3.0 >Reporter: Ilan Filonenko >Priority: Major > > This ticket is tracking the ongoing work of moving the upsteam work from > [https://github.com/apache-spark-on-k8s/spark] specifically regarding Python > bindings for Spark on Kubernetes. > The points of focus are: dependency management, increased non-JVM memory > overhead default values, and modified Docker images to include Python > Support. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23984) PySpark Bindings for K8S
[ https://issues.apache.org/jira/browse/SPARK-23984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23984: Assignee: (was: Apache Spark) > PySpark Bindings for K8S > > > Key: SPARK-23984 > URL: https://issues.apache.org/jira/browse/SPARK-23984 > Project: Spark > Issue Type: New Feature > Components: Kubernetes, PySpark >Affects Versions: 2.3.0 >Reporter: Ilan Filonenko >Priority: Major > > This ticket is tracking the ongoing work of moving the upsteam work from > [https://github.com/apache-spark-on-k8s/spark] specifically regarding Python > bindings for Spark on Kubernetes. > The points of focus are: dependency management, increased non-JVM memory > overhead default values, and modified Docker images to include Python > Support. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23830) Spark on YARN in cluster deploy mode fail with NullPointerException when a Spark application is a Scala class not object
[ https://issues.apache.org/jira/browse/SPARK-23830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441919#comment-16441919 ] Saisai Shao commented on SPARK-23830: - What is the reason to use {{class}} instead of {{object}}? This doesn't seem to follow our convention for writing a Spark application. I don't think we need to support this. > Spark on YARN in cluster deploy mode fail with NullPointerException when a > Spark application is a Scala class not object > > > Key: SPARK-23830 > URL: https://issues.apache.org/jira/browse/SPARK-23830 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Priority: Trivial > > As reported on StackOverflow in [Why does Spark on YARN fail with “Exception > in thread ”Driver“ > java.lang.NullPointerException”?|https://stackoverflow.com/q/49564334/1305344] > the following Spark application fails with {{Exception in thread "Driver" > java.lang.NullPointerException}} with Spark on YARN in cluster deploy mode: > {code} > class MyClass { > def main(args: Array[String]): Unit = { > val c = new MyClass() > c.process() > } > def process(): Unit = { > val sparkConf = new SparkConf().setAppName("my-test") > val sparkSession: SparkSession = > SparkSession.builder().config(sparkConf).getOrCreate() > import sparkSession.implicits._ > > } > ... > } > {code} > The exception is as follows: > {code} > 18/03/29 20:07:52 INFO ApplicationMaster: Starting the user application in a > separate Thread > 18/03/29 20:07:52 INFO ApplicationMaster: Waiting for spark context > initialization...
> Exception in thread "Driver" java.lang.NullPointerException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637) > {code} > I think the reason for the exception {{Exception in thread "Driver" > java.lang.NullPointerException}} is [the following > code|https://github.com/apache/spark/blob/v2.3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L700-L701]: > {code} > val mainMethod = userClassLoader.loadClass(args.userClass) > .getMethod("main", classOf[Array[String]]) > {code} > So when {{mainMethod}} is used in [the following > code|https://github.com/apache/spark/blob/v2.3.0/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L706] > it simply gives an NPE: > {code} > mainMethod.invoke(null, userArgs.toArray) > {code} > That could easily be avoided with an extra check that {{mainMethod}} is > static, giving the user a message about what the reason may have been.
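The NPE above is the documented behavior of Method.invoke when an instance method is invoked with a null receiver — which is exactly what happens when main lives on a Scala class (no static forwarder is generated, unlike for an object). A plain-Java illustration of that mechanism:

```java
import java.lang.reflect.Method;

// Instance-level main: analogous to `class MyClass { def main(...) }` in Scala,
// which compiles to a non-static method with no static forwarder.
class UserApp {
    public void main(String[] args) {
        System.out.println("app ran");
    }
}

/** Shows why mainMethod.invoke(null, ...) throws NullPointerException
 *  when the loaded class's main is not static, as in the AM stack trace. */
public class InstanceMainDemo {
    public static void main(String[] args) throws Exception {
        Method m = UserApp.class.getMethod("main", String[].class); // lookup succeeds
        try {
            // The ApplicationMaster effectively does this; a null receiver on a
            // non-static method is specified to throw NullPointerException.
            m.invoke(null, (Object) new String[0]);
        } catch (NullPointerException e) {
            System.out.println("NullPointerException: non-static main, null receiver");
        }
    }
}
```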
[jira] [Commented] (SPARK-24001) Multinode cluster
[ https://issues.apache.org/jira/browse/SPARK-24001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441918#comment-16441918 ] Saisai Shao commented on SPARK-24001: - Questions should go to the mailing list. > Multinode cluster > -- > > Key: SPARK-24001 > URL: https://issues.apache.org/jira/browse/SPARK-24001 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Direselign >Priority: Major > Attachments: Screenshot from 2018-04-17 22-47-39.png > > > I was trying to configure an Apache Spark cluster on two Ubuntu 16.04 machines > using YARN, but it is not working when I submit a task to the cluster
[jira] [Resolved] (SPARK-24001) Multinode cluster
[ https://issues.apache.org/jira/browse/SPARK-24001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao resolved SPARK-24001. - Resolution: Invalid > Multinode cluster > -- > > Key: SPARK-24001 > URL: https://issues.apache.org/jira/browse/SPARK-24001 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Direselign >Priority: Major > Attachments: Screenshot from 2018-04-17 22-47-39.png > > > I was trying to configure an Apache Spark cluster on two Ubuntu 16.04 machines > using YARN, but it is not working when I submit a task to the cluster
[jira] [Updated] (SPARK-24007) EqualNullSafe for FloatType and DoubleType might generate a wrong result by codegen.
[ https://issues.apache.org/jira/browse/SPARK-24007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-24007: Labels: correctness (was: ) > EqualNullSafe for FloatType and DoubleType might generate a wrong result by > codegen. > > > Key: SPARK-24007 > URL: https://issues.apache.org/jira/browse/SPARK-24007 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Labels: correctness > > {{EqualNullSafe}} for {{FloatType}} and {{DoubleType}} might generate a wrong > result by codegen. > {noformat} > scala> val df = Seq((Some(-1.0d), None), (None, Some(-1.0d))).toDF() > df: org.apache.spark.sql.DataFrame = [_1: double, _2: double] > scala> df.show() > +----+----+ > | _1| _2| > +----+----+ > |-1.0|null| > |null|-1.0| > +----+----+ > scala> df.filter("_1 <=> _2").show() > +----+----+ > | _1| _2| > +----+----+ > |-1.0|null| > |null|-1.0| > +----+----+ > {noformat} > The result should be empty, but the two rows remain.
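For reference, null-safe equality (`<=>`) is defined so that two NULLs compare true and a NULL against a non-NULL compares false — so the filter above should match neither row. A plain-Java model of the intended semantics (an illustration, not Spark's generated code):

```java
/** Model of SQL null-safe equality (<=>) for boxed doubles.
 *  Illustrates the expected semantics, not Spark's codegen. */
public class NullSafeEq {
    static boolean eqNullSafe(Double a, Double b) {
        if (a == null && b == null) return true;   // NULL <=> NULL is true
        if (a == null || b == null) return false;  // NULL <=> x is false
        return a.doubleValue() == b.doubleValue(); // otherwise plain equality
    }

    public static void main(String[] args) {
        // The two rows from the report: (-1.0, null) and (null, -1.0).
        System.out.println(eqNullSafe(-1.0, null)); // false -> row should be filtered out
        System.out.println(eqNullSafe(null, -1.0)); // false -> row should be filtered out
        System.out.println(eqNullSafe(null, null)); // true
    }
}
```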
[jira] [Assigned] (SPARK-24007) EqualNullSafe for FloatType and DoubleType might generate a wrong result by codegen.
[ https://issues.apache.org/jira/browse/SPARK-24007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-24007: --- Assignee: Takuya Ueshin > EqualNullSafe for FloatType and DoubleType might generate a wrong result by > codegen. > > > Key: SPARK-24007 > URL: https://issues.apache.org/jira/browse/SPARK-24007 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > > {{EqualNullSafe}} for {{FloatType}} and {{DoubleType}} might generate a wrong > result by codegen. > {noformat} > scala> val df = Seq((Some(-1.0d), None), (None, Some(-1.0d))).toDF() > df: org.apache.spark.sql.DataFrame = [_1: double, _2: double] > scala> df.show() > +----+----+ > | _1| _2| > +----+----+ > |-1.0|null| > |null|-1.0| > +----+----+ > scala> df.filter("_1 <=> _2").show() > +----+----+ > | _1| _2| > +----+----+ > |-1.0|null| > |null|-1.0| > +----+----+ > {noformat} > The result should be empty, but the two rows remain.
[jira] [Created] (SPARK-24006) ExecutorAllocationManager.onExecutorAdded is an O(n) operation
Xianjin YE created SPARK-24006: -- Summary: ExecutorAllocationManager.onExecutorAdded is an O(n) operation Key: SPARK-24006 URL: https://issues.apache.org/jira/browse/SPARK-24006 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.3.0 Reporter: Xianjin YE ExecutorAllocationManager.onExecutorAdded is an O(n) operation. I believe it will be a problem when scaling out with a large number of executors, as it effectively makes adding N executors an O(N^2) operation overall. I propose to invoke onExecutorIdle guarded by {code:java} if (executorIds.size - executorsPendingToRemove.size >= minNumExecutors + 1) { // Since we only need to re-mark idle executors near the lower bound executorIds.filter(listener.isExecutorIdle).foreach(onExecutorIdle) }{code} cc [~zsxwing]
[jira] [Created] (SPARK-24007) EqualNullSafe for FloatType and DoubleType might generate a wrong result by codegen.
Takuya Ueshin created SPARK-24007: - Summary: EqualNullSafe for FloatType and DoubleType might generate a wrong result by codegen. Key: SPARK-24007 URL: https://issues.apache.org/jira/browse/SPARK-24007 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0, 2.2.1, 2.1.2, 2.0.2 Reporter: Takuya Ueshin {{EqualNullSafe}} for {{FloatType}} and {{DoubleType}} might generate a wrong result by codegen. {noformat} scala> val df = Seq((Some(-1.0d), None), (None, Some(-1.0d))).toDF() df: org.apache.spark.sql.DataFrame = [_1: double, _2: double] scala> df.show() +----+----+ | _1| _2| +----+----+ |-1.0|null| |null|-1.0| +----+----+ scala> df.filter("_1 <=> _2").show() +----+----+ | _1| _2| +----+----+ |-1.0|null| |null|-1.0| +----+----+ {noformat} The result should be empty, but the two rows remain.
[jira] [Updated] (SPARK-24006) ExecutorAllocationManager.onExecutorAdded is an O(n) operation
[ https://issues.apache.org/jira/browse/SPARK-24006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xianjin YE updated SPARK-24006: --- Description: ExecutorAllocationManager.onExecutorAdded is an O(n) operation. I believe it will be a problem when scaling out with a large number of executors, as it effectively makes adding N executors an O(N^2) operation overall. I propose to invoke onExecutorIdle guarded by {code:java} if (executorIds.size - executorsPendingToRemove.size >= minNumExecutors + 1) { // Since we only need to re-mark idle executors near the lower bound executorIds.filter(listener.isExecutorIdle).foreach(onExecutorIdle) } else { onExecutorIdle(executorId) }{code} cc [~zsxwing] was: ExecutorAllocationManager.onExecutorAdded is an O(n) operation. I believe it will be a problem when scaling out with a large number of executors, as it effectively makes adding N executors an O(N^2) operation overall. I propose to invoke onExecutorIdle guarded by {code:java} if (executorIds.size - executorsPendingToRemove.size >= minNumExecutors + 1) { // Since we only need to re-mark idle executors near the lower bound executorIds.filter(listener.isExecutorIdle).foreach(onExecutorIdle) }{code} cc [~zsxwing] > ExecutorAllocationManager.onExecutorAdded is an O(n) operation > -- > > Key: SPARK-24006 > URL: https://issues.apache.org/jira/browse/SPARK-24006 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Xianjin YE >Priority: Major > > ExecutorAllocationManager.onExecutorAdded is an O(n) operation. I > believe it will be a problem when scaling out with a large number of executors, > as it effectively makes adding N executors an O(N^2) operation overall. 
> > I propose to invoke onExecutorIdle guarded by > {code:java} > if (executorIds.size - executorsPendingToRemove.size >= minNumExecutors + 1) { > // Since we only need to re-mark idle executors near the lower bound > executorIds.filter(listener.isExecutorIdle).foreach(onExecutorIdle) > } else { > onExecutorIdle(executorId) > }{code} > cc [~zsxwing]
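The complexity claim above can be made concrete with a toy cost model: if every onExecutorAdded re-checks all known executors for idleness, adding N executors costs about N^2/2 checks in total, while the proposed guard (full re-check only near the lower bound) keeps it linear. This is a hypothetical simulation, not ExecutorAllocationManager itself:

```java
import java.util.ArrayList;
import java.util.List;

/** Toy cost model for SPARK-24006: counts idle-checks when adding n executors,
 *  with and without the proposed lower-bound guard. Hypothetical, not Spark code. */
public class AddCostDemo {
    public static long addExecutors(int n, int minNumExecutors, boolean guarded) {
        List<String> executorIds = new ArrayList<>();
        long idleChecks = 0;
        for (int i = 0; i < n; i++) {
            executorIds.add("exec-" + i);
            // Unguarded: every add re-marks all known executors.
            // Guarded (the proposal): full re-mark only near the lower bound.
            if (!guarded || executorIds.size() <= minNumExecutors + 1) {
                idleChecks += executorIds.size(); // full scan of current executors
            } else {
                idleChecks += 1;                  // only the newly added executor
            }
        }
        return idleChecks;
    }

    public static void main(String[] args) {
        System.out.println(addExecutors(1000, 2, false)); // 500500 checks, ~n^2/2
        System.out.println(addExecutors(1000, 2, true));  // 1003 checks, ~n
    }
}
```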
[jira] [Commented] (SPARK-23989) When using `SortShuffleWriter`, the data will be overwritten
[ https://issues.apache.org/jira/browse/SPARK-23989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441906#comment-16441906 ] Saisai Shao commented on SPARK-23989: - Please provide a reproducible case. Did you reuse the object in your code? I think we already handle such cases on the Spark side. > When using `SortShuffleWriter`, the data will be overwritten > > > Key: SPARK-23989 > URL: https://issues.apache.org/jira/browse/SPARK-23989 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Priority: Critical > > When using `SortShuffleWriter`, we only insert 'AnyRef' into > 'PartitionedAppendOnlyMap' or 'PartitionedPairBuffer'. > For this function: > override def write(records: Iterator[Product2[K, V]]) > the values of 'records' are `UnsafeRow`s, so the value will be overwritten
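The suspected overwrite above is the classic object-reuse pitfall: if an iterator hands out the same mutable record (such as a reused row buffer) on every step, buffering the reference instead of a copy makes every buffered entry alias the latest value. A minimal Java illustration, with a reused array standing in for the row (hypothetical, not Spark's shuffle writer):

```java
import java.util.ArrayList;
import java.util.List;

/** Object-reuse pitfall behind reports like SPARK-23989: buffering a reused
 *  mutable record without copying makes all entries show the last value. */
public class ReusedRowDemo {
    public static void main(String[] args) {
        int[] reusedRow = new int[1];          // stands in for a reused row buffer
        List<int[]> noCopy = new ArrayList<>();
        List<int[]> withCopy = new ArrayList<>();

        for (int v = 1; v <= 3; v++) {
            reusedRow[0] = v;                  // the iterator overwrites the same object
            noCopy.add(reusedRow);             // buffers the alias -> data "overwritten"
            withCopy.add(reusedRow.clone());   // buffers a copy -> data preserved
        }

        // All aliased entries show the final value; the copies keep 1, 2, 3.
        System.out.println(noCopy.get(0)[0] + " " + noCopy.get(1)[0] + " " + noCopy.get(2)[0]);     // 3 3 3
        System.out.println(withCopy.get(0)[0] + " " + withCopy.get(1)[0] + " " + withCopy.get(2)[0]); // 1 2 3
    }
}
```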
[jira] [Commented] (SPARK-23982) NoSuchMethodException: There is no startCredentialUpdater method in the object YarnSparkHadoopUtil
[ https://issues.apache.org/jira/browse/SPARK-23982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441881#comment-16441881 ] Saisai Shao commented on SPARK-23982: - This method should exist. Would you please paste your exception stack here (if you really hit the issue)? > NoSuchMethodException: There is no startCredentialUpdater method in the > object YarnSparkHadoopUtil > -- > > Key: SPARK-23982 > URL: https://issues.apache.org/jira/browse/SPARK-23982 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: John >Priority: Major > > At line 219 of the CoarseGrainedExecutorBackend class: > Utils.classForName("org.apache.spark.deploy.yarn.YarnSparkHadoopUtil").getMethod("startCredentialUpdater", > classOf[SparkConf]).invoke(null, driverConf) > But there is no startCredentialUpdater method in the object > YarnSparkHadoopUtil.
[jira] [Commented] (SPARK-7132) Add fit with validation set to spark.ml GBT
[ https://issues.apache.org/jira/browse/SPARK-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441864#comment-16441864 ] Weichen Xu commented on SPARK-7132: --- I discussed this with [~josephkb] and pasted the proposal on the JIRA. [~yanboliang] Do you agree with it, or do you have other thoughts? > Add fit with validation set to spark.ml GBT > --- > > Key: SPARK-7132 > URL: https://issues.apache.org/jira/browse/SPARK-7132 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > In spark.mllib GradientBoostedTrees, we have a method runWithValidation which > takes a validation set. We should add that to the spark.ml API. > This will require a bit of thinking about how the Pipelines API should handle > a validation set (since Transformers and Estimators only take 1 input > DataFrame). The current plan is to include an extra column in the input > DataFrame which indicates whether the row is for training, validation, etc. > Goals > A [P0] Support efficient validation during training > B [P1] Support early stopping based on validation metrics > C [P0] Ensure validation data are preprocessed identically to training data > D [P1] Support complex Pipelines with multiple models using validation data > Proposal: column with indicator for train vs validation > Include an extra column in the input DataFrame which indicates whether the > row is for training or validation. Add a Param “validationFlagCol” used to > specify the extra column name. > A, B, C are easy. > D is doable. > Each estimator would need to have its validationFlagCol Param set to the same > column. > Complication: It would be ideal if we could prevent different estimators from > using different validation sets. (Joseph: There is not an obvious way IMO. > Maybe we can address this later by, e.g., having Pipelines take a > validationFlagCol Param and pass that to the sub-models in the Pipeline. > Let’s not worry about this for now.) 
[jira] [Updated] (SPARK-7132) Add fit with validation set to spark.ml GBT
[ https://issues.apache.org/jira/browse/SPARK-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-7132: -- Description: In spark.mllib GradientBoostedTrees, we have a method runWithValidation which takes a validation set. We should add that to the spark.ml API. This will require a bit of thinking about how the Pipelines API should handle a validation set (since Transformers and Estimators only take 1 input DataFrame). The current plan is to include an extra column in the input DataFrame which indicates whether the row is for training, validation, etc. Goals A [P0] Support efficient validation during training B [P1] Support early stopping based on validation metrics C [P0] Ensure validation data are preprocessed identically to training data D [P1] Support complex Pipelines with multiple models using validation data Proposal: column with indicator for train vs validation Include an extra column in the input DataFrame which indicates whether the row is for training or validation. Add a Param “validationFlagCol” used to specify the extra column name. A, B, C are easy. D is doable. Each estimator would need to have its validationFlagCol Param set to the same column. Complication: It would be ideal if we could prevent different estimators from using different validation sets. (Joseph: There is not an obvious way IMO. Maybe we can address this later by, e.g., having Pipelines take a validationFlagCol Param and pass that to the sub-models in the Pipeline. Let’s not worry about this for now.) was: In spark.mllib GradientBoostedTrees, we have a method runWithValidation which takes a validation set. We should add that to the spark.ml API. This will require a bit of thinking about how the Pipelines API should handle a validation set (since Transformers and Estimators only take 1 input DataFrame). The current plan is to include an extra column in the input DataFrame which indicates whether the row is for training, validation, etc. 
> Add fit with validation set to spark.ml GBT > --- > > Key: SPARK-7132 > URL: https://issues.apache.org/jira/browse/SPARK-7132 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > In spark.mllib GradientBoostedTrees, we have a method runWithValidation which > takes a validation set. We should add that to the spark.ml API. > This will require a bit of thinking about how the Pipelines API should handle > a validation set (since Transformers and Estimators only take 1 input > DataFrame). The current plan is to include an extra column in the input > DataFrame which indicates whether the row is for training, validation, etc. > Goals > A [P0] Support efficient validation during training > B [P1] Support early stopping based on validation metrics > C [P0] Ensure validation data are preprocessed identically to training data > D [P1] Support complex Pipelines with multiple models using validation data > Proposal: column with indicator for train vs validation > Include an extra column in the input DataFrame which indicates whether the > row is for training or validation. Add a Param “validationFlagCol” used to > specify the extra column name. > A, B, C are easy. > D is doable. > Each estimator would need to have its validationFlagCol Param set to the same > column. > Complication: It would be ideal if we could prevent different estimators from > using different validation sets. (Joseph: There is not an obvious way IMO. > Maybe we can address this later by, e.g., having Pipelines take a > validationFlagCol Param and pass that to the sub-models in the Pipeline. > Let’s not worry about this for now.)
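The proposed `validationFlagCol` split can be sketched in plain Python (hypothetical dicts standing in for DataFrame rows; this is not the actual spark.ml API, just the idea of carrying the train/validation indicator in an extra column):

```python
# Rows carry a boolean flag column, here named "isVal" (a hypothetical
# column name an estimator's validationFlagCol Param could point at).
rows = [
    {"features": [0.1], "label": 0, "isVal": False},
    {"features": [0.9], "label": 1, "isVal": True},
    {"features": [0.5], "label": 1, "isVal": False},
]

def split_by_flag(rows, flag_col):
    """Partition rows into (training, validation) by the flag column."""
    train = [r for r in rows if not r[flag_col]]
    validation = [r for r in rows if r[flag_col]]
    return train, validation

train, validation = split_by_flag(rows, "isVal")
assert len(train) == 2 and len(validation) == 1
```

Because the flag travels with the rows through the Pipeline, every upstream Transformer preprocesses training and validation data identically (goal C), and the estimator can evaluate validation metrics per boosting iteration for early stopping (goals A and B).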
[jira] [Assigned] (SPARK-23341) DataSourceOptions should handle path and table names to avoid confusion.
[ https://issues.apache.org/jira/browse/SPARK-23341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-23341: --- Assignee: Wenchen Fan > DataSourceOptions should handle path and table names to avoid confusion. > > > Key: SPARK-23341 > URL: https://issues.apache.org/jira/browse/SPARK-23341 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.4.0 > > > DataSourceOptions should have getters for path and table, which should be > passed in when creating the options.
[jira] [Resolved] (SPARK-23341) DataSourceOptions should handle path and table names to avoid confusion.
[ https://issues.apache.org/jira/browse/SPARK-23341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-23341. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20535 [https://github.com/apache/spark/pull/20535] > DataSourceOptions should handle path and table names to avoid confusion. > > > Key: SPARK-23341 > URL: https://issues.apache.org/jira/browse/SPARK-23341 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > Fix For: 2.4.0 > > > DataSourceOptions should have getters for path and table, which should be > passed in when creating the options.
[jira] [Comment Edited] (SPARK-24000) S3A: Create Table should fail on invalid AK/SK
[ https://issues.apache.org/jira/browse/SPARK-24000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440841#comment-16440841 ] Brahma Reddy Battula edited comment on SPARK-24000 at 4/18/18 3:41 AM: --- Discussed offline with [~ste...@apache.org]; we feel the create-table code should do a getFileStatus() on the path to see what's there, and so fail in any no-permissions situation. Please forgive me if I am not clear; I am a newbie to Spark. [~ste...@apache.org] can you please pitch in here? was (Author: brahmareddy): Discussed [~ste...@apache.org] with offline, create table code should be thinking of doing a getFileStatus() on the path to see what's there and so fail on any no-permissions situation. Please forgive me,if I am not clear. I am a newbie to spark. > S3A: Create Table should fail on invalid AK/SK > -- > > Key: SPARK-24000 > URL: https://issues.apache.org/jira/browse/SPARK-24000 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.3.0 >Reporter: Brahma Reddy Battula >Priority: Major > > Currently, when we pass an invalid AK/SK, *create table* will still *succeed*. > When the S3AFileSystem is initialized, *verifyBucketExists()* is called, which returns *true* because status code 403 (_BUCKET_ACCESS_FORBIDDEN_STATUS_CODE_) is treated as "bucket exists" in the following code: > {code:java}
> public boolean doesBucketExist(String bucketName)
>     throws AmazonClientException, AmazonServiceException {
>   try {
>     headBucket(new HeadBucketRequest(bucketName));
>     return true;
>   } catch (AmazonServiceException ase) {
>     // A redirect error or a forbidden error means the bucket exists. So
>     // returning true.
>     if ((ase.getStatusCode() == Constants.BUCKET_REDIRECT_STATUS_CODE)
>         || (ase.getStatusCode() == Constants.BUCKET_ACCESS_FORBIDDEN_STATUS_CODE)) {
>       return true;
>     }
>     if (ase.getStatusCode() == Constants.NO_SUCH_BUCKET_STATUS_CODE) {
>       return false;
>     }
>     throw ase;
>   }
> }
> {code}
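To illustrate the mismatch between the two probes, here is a hypothetical plain-Python sketch (the constants and functions mirror the quoted SDK logic but are stand-ins, not the real AWS SDK): a bucket-existence probe treats 403 as "exists", so bad credentials pass, while a getFileStatus-style probe surfaces the permission failure.

```python
# Hypothetical status-code constants (stand-ins for Constants.*).
BUCKET_REDIRECT = 301
BUCKET_ACCESS_FORBIDDEN = 403
NO_SUCH_BUCKET = 404

def does_bucket_exist(status_code):
    # Mirrors the quoted SDK logic: redirect or forbidden => bucket exists.
    if status_code in (BUCKET_REDIRECT, BUCKET_ACCESS_FORBIDDEN):
        return True
    if status_code == NO_SUCH_BUCKET:
        return False
    raise RuntimeError(f"unexpected status {status_code}")

def get_file_status(status_code):
    # A stricter probe: anything but success is an error the caller sees.
    if status_code != 200:
        raise PermissionError(f"status {status_code}")
    return {"exists": True}

# Invalid AK/SK produce 403: the existence probe happily returns True...
assert does_bucket_exist(403) is True
# ...while the stricter probe fails loudly, which is what create-table wants.
try:
    get_file_status(403)
    ok = True
except PermissionError:
    ok = False
assert ok is False
```

This is the gap the comment above proposes to close: create table should use a probe that actually exercises the credentials.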
[jira] [Commented] (SPARK-22676) Avoid iterating all partition paths when spark.sql.hive.verifyPartitionPath=true
[ https://issues.apache.org/jira/browse/SPARK-22676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441849#comment-16441849 ] Apache Spark commented on SPARK-22676: -- User 'jinxing64' has created a pull request for this issue: https://github.com/apache/spark/pull/21091 > Avoid iterating all partition paths when > spark.sql.hive.verifyPartitionPath=true > > > Key: SPARK-22676 > URL: https://issues.apache.org/jira/browse/SPARK-22676 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: jin xing >Assignee: jin xing >Priority: Major > Fix For: 2.4.0 > > > In the current code, all partition paths are scanned when > spark.sql.hive.verifyPartitionPath=true. > e.g. for a table like below: > {code} > CREATE TABLE `test`( > `id` int, > `age` int, > `name` string) > PARTITIONED BY ( > `A` string, > `B` string) > {code} > load data local inpath '/tmp/data1' into table test partition(A='00', B='00') > load data local inpath '/tmp/data1' into table test partition(A='01', B='01') > load data local inpath '/tmp/data1' into table test partition(A='10', B='10') > load data local inpath '/tmp/data1' into table test partition(A='11', B='11') > If I query with SQL -- "select * from test where A='00' and B='00'" -- the current code will scan all partition paths: '/data/A=00/B=00', '/data/A=01/B=01', '/data/A=10/B=10', '/data/A=11/B=11'. This costs a lot of time and memory.
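The idea behind the fix can be sketched in plain Python (hypothetical paths; not Spark's actual pruning code): match each partition path against the query predicate and verify only the survivors, instead of listing every path.

```python
# Partition paths as in the example above.
paths = ["/data/A=00/B=00", "/data/A=01/B=01",
         "/data/A=10/B=10", "/data/A=11/B=11"]

def matches(path, predicate):
    """Parse 'A=00/B=00'-style components into a dict and test the predicate."""
    parts = dict(p.split("=") for p in path.strip("/").split("/")[1:])
    return all(parts.get(k) == v for k, v in predicate.items())

# Predicate from "... WHERE A='01' AND B='01'": only one path needs checking.
pruned = [p for p in paths if matches(p, {"A": "01", "B": "01"})]
assert pruned == ["/data/A=01/B=01"]
```

With pruning, verification work scales with the partitions the query actually touches rather than with the total partition count.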
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441826#comment-16441826 ] Jordan Moore commented on SPARK-18057: -- Based on my GitHub searches, it looks like it's Spark 2.1.1, possibly upgradable to at most 2.2.0 based on the available parent POM. > Update structured streaming kafka from 0.10.0.1 to 1.1.0 > > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger >Priority: Major > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441818#comment-16441818 ] Cody Koeninger commented on SPARK-18057: OK, if you can figure out what version of Spark it is, I can help with getting a branch with the updated dependency. > Update structured streaming kafka from 0.10.0.1 to 1.1.0 > > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger >Priority: Major > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html
[jira] [Comment Edited] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441802#comment-16441802 ] Richard Yu edited comment on SPARK-18057 at 4/18/18 2:46 AM: - Kafka contributors / developers are currently still discussing [KIP-266|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=75974886], so we don't have any working models for the hanging position() calls just yet. was (Author: yohan123): Kafka contributors / developers are currently still discussing over KIP-266, so we don't have any working models for the hanging position() calls just yet. > Update structured streaming kafka from 0.10.0.1 to 1.1.0 > > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger >Priority: Major > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441802#comment-16441802 ] Richard Yu commented on SPARK-18057: Kafka contributors / developers are currently still discussing over KIP-266, so we don't have any working models for the hanging position() calls just yet. > Update structured streaming kafka from 0.10.0.1 to 1.1.0 > > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger >Priority: Major > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441803#comment-16441803 ] Jordan Moore commented on SPARK-18057: -- {quote}probably won't work {quote} I figured as much. Nope, not actively deleting topics myself. I can't control what others are doing, of course. {quote}[change] just the app assembly ... [I] can try out {quote} I should mention that I am not actually the one running the code; otherwise I would offer more logs. I work on the infrastructure team, and I believe the other team with the code is using Qubole's cloud-managed Spark environment, or possibly EMR. I know where the assembly jar is on our HDFS cluster; I'm not really sure how that is all managed in AWS. I doubt it's running Spark 2.3.0, though. > Update structured streaming kafka from 0.10.0.1 to 1.1.0 > > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger >Priority: Major > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html
[jira] [Assigned] (SPARK-21479) Outer join filter pushdown in null supplying table when condition is on one of the joined columns
[ https://issues.apache.org/jira/browse/SPARK-21479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-21479: --- Assignee: Maryann Xue > Outer join filter pushdown in null supplying table when condition is on one > of the joined columns > - > > Key: SPARK-21479 > URL: https://issues.apache.org/jira/browse/SPARK-21479 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 2.1.0, 2.1.1, 2.2.0 >Reporter: Abhijit Bhole >Assignee: Maryann Xue >Priority: Major > Fix For: 2.4.0 > > > Here are two different query plans - > {code:java} > df1 = spark.createDataFrame([{ "a": 1, "b" : 2}, { "a": 3, "b" : 4}]) > df2 = spark.createDataFrame([{ "a": 1, "c" : 5}, { "a": 3, "c" : 6}, { "a": > 5, "c" : 8}]) > df1.join(df2, ['a'], 'right_outer').where("b = 2").explain() > == Physical Plan == > *Project [a#16299L, b#16295L, c#16300L] > +- *SortMergeJoin [a#16294L], [a#16299L], Inner >:- *Sort [a#16294L ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(a#16294L, 4) >: +- *Filter ((isnotnull(b#16295L) && (b#16295L = 2)) && > isnotnull(a#16294L)) >:+- Scan ExistingRDD[a#16294L,b#16295L] >+- *Sort [a#16299L ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(a#16299L, 4) > +- *Filter isnotnull(a#16299L) > +- Scan ExistingRDD[a#16299L,c#16300L] > df1 = spark.createDataFrame([{ "a": 1, "b" : 2}, { "a": 3, "b" : 4}]) > df2 = spark.createDataFrame([{ "a": 1, "c" : 5}, { "a": 3, "c" : 6}, { "a": > 5, "c" : 8}]) > df1.join(df2, ['a'], 'right_outer').where("a = 1").explain() > == Physical Plan == > *Project [a#16314L, b#16310L, c#16315L] > +- SortMergeJoin [a#16309L], [a#16314L], RightOuter >:- *Sort [a#16309L ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(a#16309L, 4) >: +- Scan ExistingRDD[a#16309L,b#16310L] >+- *Sort [a#16314L ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(a#16314L, 4) > +- *Filter (isnotnull(a#16314L) && (a#16314L = 1)) > +- Scan ExistingRDD[a#16314L,c#16315L] > 
{code} > If the condition on b can be pushed down to df1, why not the condition on a?
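Why the predicate on b licenses the pushdown can be sketched in plain Python (a toy right outer join, not Spark's implementation): null-extended rows can never satisfy `b = 2`, so filtering df1 before an inner join is equivalent to filtering the outer-join result.

```python
left  = [{"a": 1, "b": 2}, {"a": 3, "b": 4}]
right = [{"a": 1, "c": 5}, {"a": 3, "c": 6}, {"a": 5, "c": 8}]

def right_outer(left, right):
    """Toy right outer join on key 'a'; unmatched right rows get b=None."""
    out = []
    for r in right:
        matched = [l for l in left if l["a"] == r["a"]]
        if matched:
            out.extend({**l, **r} for l in matched)
        else:
            out.append({"b": None, **r})  # null-extended (left side missing)
    return out

joined = right_outer(left, right)
# b = 2 rejects every null-extended row, so the optimizer may push the filter
# into the null-supplying side and run an inner join instead.
filtered = [row for row in joined if row["b"] == 2]
assert filtered == [{"a": 1, "b": 2, "c": 5}]
```

The question in the issue is why `a = 1`, which references the join column of the preserved side, does not trigger the same outer-to-inner conversion and pushdown into df1.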
[jira] [Resolved] (SPARK-21479) Outer join filter pushdown in null supplying table when condition is on one of the joined columns
[ https://issues.apache.org/jira/browse/SPARK-21479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-21479. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20816 [https://github.com/apache/spark/pull/20816] > Outer join filter pushdown in null supplying table when condition is on one > of the joined columns > - > > Key: SPARK-21479 > URL: https://issues.apache.org/jira/browse/SPARK-21479 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 2.1.0, 2.1.1, 2.2.0 >Reporter: Abhijit Bhole >Priority: Major > Fix For: 2.4.0 > > > Here are two different query plans - > {code:java} > df1 = spark.createDataFrame([{ "a": 1, "b" : 2}, { "a": 3, "b" : 4}]) > df2 = spark.createDataFrame([{ "a": 1, "c" : 5}, { "a": 3, "c" : 6}, { "a": > 5, "c" : 8}]) > df1.join(df2, ['a'], 'right_outer').where("b = 2").explain() > == Physical Plan == > *Project [a#16299L, b#16295L, c#16300L] > +- *SortMergeJoin [a#16294L], [a#16299L], Inner >:- *Sort [a#16294L ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(a#16294L, 4) >: +- *Filter ((isnotnull(b#16295L) && (b#16295L = 2)) && > isnotnull(a#16294L)) >:+- Scan ExistingRDD[a#16294L,b#16295L] >+- *Sort [a#16299L ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(a#16299L, 4) > +- *Filter isnotnull(a#16299L) > +- Scan ExistingRDD[a#16299L,c#16300L] > df1 = spark.createDataFrame([{ "a": 1, "b" : 2}, { "a": 3, "b" : 4}]) > df2 = spark.createDataFrame([{ "a": 1, "c" : 5}, { "a": 3, "c" : 6}, { "a": > 5, "c" : 8}]) > df1.join(df2, ['a'], 'right_outer').where("a = 1").explain() > == Physical Plan == > *Project [a#16314L, b#16310L, c#16315L] > +- SortMergeJoin [a#16309L], [a#16314L], RightOuter >:- *Sort [a#16309L ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(a#16309L, 4) >: +- Scan ExistingRDD[a#16309L,b#16310L] >+- *Sort [a#16314L ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(a#16314L, 4) > +- *Filter 
(isnotnull(a#16314L) && (a#16314L = 1)) > +- Scan ExistingRDD[a#16314L,c#16315L] > {code} > If the condition on b can be pushed down to df1, why not the condition on a?
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441793#comment-16441793 ] Cody Koeninger commented on SPARK-18057: Just adding the extra dependency on 0.11 probably won't work, since there were some actual code changes, not just a dependency bump. As long as you aren't deleting topics, the blocker KAFKA-4879 issue shouldn't affect you trying out a branch with the changes, and it wouldn't require changing your Spark deployment, just the app assembly. [~sliebau] [~Yohan123] do you have an up-to-date branch of Spark with the Kafka client dependency change that Jordan can try out? If not, I can make one. > Update structured streaming kafka from 0.10.0.1 to 1.1.0 > > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger >Priority: Major > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html
[jira] [Resolved] (SPARK-22968) java.lang.IllegalStateException: No current assignment for partition kssh-2
[ https://issues.apache.org/jira/browse/SPARK-22968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cody Koeninger resolved SPARK-22968. Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21038 [https://github.com/apache/spark/pull/21038] > java.lang.IllegalStateException: No current assignment for partition kssh-2 > --- > > Key: SPARK-22968 > URL: https://issues.apache.org/jira/browse/SPARK-22968 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 > Environment: Kafka: 0.10.0 (CDH5.12.0) > Apache Spark 2.1.1 > Spark streaming+Kafka >Reporter: Jepson >Assignee: Saisai Shao >Priority: Major > Fix For: 2.4.0 > > Original Estimate: 96h > Remaining Estimate: 96h > > *Kafka Broker:* > {code:java} >message.max.bytes : 2621440 > {code} > *Spark Streaming+Kafka Code:* > {code:java} > , "max.partition.fetch.bytes" -> (5242880: java.lang.Integer) //default: > 1048576 > , "request.timeout.ms" -> (9: java.lang.Integer) //default: 6 > , "session.timeout.ms" -> (6: java.lang.Integer) //default: 3 > , "heartbeat.interval.ms" -> (5000: java.lang.Integer) > , "receive.buffer.bytes" -> (10485760: java.lang.Integer) > {code} > *Error message:* > {code:java} > 8/01/05 09:48:27 INFO internals.ConsumerCoordinator: Revoking previously > assigned partitions [kssh-7, kssh-4, kssh-3, kssh-6, kssh-5, kssh-0, kssh-2, > kssh-1] for group use_a_separate_group_id_for_each_stream > 18/01/05 09:48:27 INFO internals.AbstractCoordinator: (Re-)joining group > use_a_separate_group_id_for_each_stream > 18/01/05 09:48:27 INFO internals.AbstractCoordinator: Successfully joined > group use_a_separate_group_id_for_each_stream with generation 4 > 18/01/05 09:48:27 INFO internals.ConsumerCoordinator: Setting newly assigned > partitions [kssh-7, kssh-4, kssh-6, kssh-5] for group > use_a_separate_group_id_for_each_stream > 18/01/05 09:48:27 ERROR scheduler.JobScheduler: Error generating jobs for > time 1515116907000 ms > 
java.lang.IllegalStateException: No current assignment for partition kssh-2 > at > org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:231) > at > org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:295) > at > org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1169) > at > org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:197) > at > org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:214) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330) > at > org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) > at > org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116) > at > org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249) > at >
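The invariant behind the exception above can be sketched in plain Python (a hypothetical stand-in for the Kafka consumer's SubscriptionState, not the real client): after a rebalance replaces the assignment, seeking a partition that was in the old assignment but not the new one fails.

```python
class SubscriptionState:
    """Toy model of the consumer's assignment bookkeeping."""
    def __init__(self):
        self.assigned = set()

    def assign(self, partitions):
        self.assigned = set(partitions)  # a rebalance replaces the assignment

    def seek_to_end(self, partition):
        if partition not in self.assigned:
            raise RuntimeError(
                f"No current assignment for partition {partition}")

state = SubscriptionState()
# Newly assigned partitions after the rebalance logged above:
state.assign(["kssh-7", "kssh-4", "kssh-6", "kssh-5"])
try:
    state.seek_to_end("kssh-2")   # stale partition from the old assignment
    failed = False
except RuntimeError:
    failed = True
assert failed
```

In the stack trace, DirectKafkaInputDStream.latestOffsets calls seekToEnd with the partitions it remembers, which can include partitions the consumer no longer owns after the group rebalance; the fix resolved here makes the stream tolerate that.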
[jira] [Assigned] (SPARK-22968) java.lang.IllegalStateException: No current assignment for partition kssh-2
[ https://issues.apache.org/jira/browse/SPARK-22968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cody Koeninger reassigned SPARK-22968: -- Assignee: Saisai Shao > java.lang.IllegalStateException: No current assignment for partition kssh-2 > --- > > Key: SPARK-22968 > URL: https://issues.apache.org/jira/browse/SPARK-22968 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 > Environment: Kafka: 0.10.0 (CDH5.12.0) > Apache Spark 2.1.1 > Spark streaming+Kafka >Reporter: Jepson >Assignee: Saisai Shao >Priority: Major > Original Estimate: 96h > Remaining Estimate: 96h > > *Kafka Broker:* > {code:java} >message.max.bytes : 2621440 > {code} > *Spark Streaming+Kafka Code:* > {code:java} > , "max.partition.fetch.bytes" -> (5242880: java.lang.Integer) //default: > 1048576 > , "request.timeout.ms" -> (9: java.lang.Integer) //default: 6 > , "session.timeout.ms" -> (6: java.lang.Integer) //default: 3 > , "heartbeat.interval.ms" -> (5000: java.lang.Integer) > , "receive.buffer.bytes" -> (10485760: java.lang.Integer) > {code} > *Error message:* > {code:java} > 8/01/05 09:48:27 INFO internals.ConsumerCoordinator: Revoking previously > assigned partitions [kssh-7, kssh-4, kssh-3, kssh-6, kssh-5, kssh-0, kssh-2, > kssh-1] for group use_a_separate_group_id_for_each_stream > 18/01/05 09:48:27 INFO internals.AbstractCoordinator: (Re-)joining group > use_a_separate_group_id_for_each_stream > 18/01/05 09:48:27 INFO internals.AbstractCoordinator: Successfully joined > group use_a_separate_group_id_for_each_stream with generation 4 > 18/01/05 09:48:27 INFO internals.ConsumerCoordinator: Setting newly assigned > partitions [kssh-7, kssh-4, kssh-6, kssh-5] for group > use_a_separate_group_id_for_each_stream > 18/01/05 09:48:27 ERROR scheduler.JobScheduler: Error generating jobs for > time 1515116907000 ms > java.lang.IllegalStateException: No current assignment for partition kssh-2 > at > 
org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:231) > at > org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:295) > at > org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1169) > at > org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:197) > at > org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:214) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330) > at > org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117) > at > org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) > at > org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116) > at > org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:249) > at > org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247) > at scala.util.Try$.apply(Try.scala:192) >
[jira] [Comment Edited] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441718#comment-16441718 ] Jordan Moore edited comment on SPARK-18057 at 4/18/18 12:53 AM: Hello Cody, No it is not. And no other specific configurations are applied to the topic. I have reason to believe the problem is not Spark specific (other than maybe the committed offset management by a producer). On the consumer side, it seems to be an issue in just the Java client because that offset output given is from a plain Java consumer. All versions of kafka-clients 0.10.x demonstrated the same behavior. was (Author: cricket007): Hello Cody, No it is not. And no other specific configurations are applied to the topic. I have reason to believe the problem is not Spark specific. It is an issue in just the Java client because that offset output given is from a plain Java consumer. All versions of kafka-clients 0.10.x demonstrated the same behavior. > Update structured streaming kafka from 0.10.0.1 to 1.1.0 > > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger >Priority: Major > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441718#comment-16441718 ] Jordan Moore commented on SPARK-18057: -- Hello Cody, No it is not. And no other specific configurations are applied to the topic. I have reason to believe the problem is not Spark specific. It is an issue in just the Java client because that offset output given is from a plain Java consumer. All versions of kafka-clients 0.10.x demonstrated the same behavior. > Update structured streaming kafka from 0.10.0.1 to 1.1.0 > > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger >Priority: Major > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18693) BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator should use sample weight data
[ https://issues.apache.org/jira/browse/SPARK-18693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441713#comment-16441713 ] Joseph K. Bradley commented on SPARK-18693: --- [~imatiach] Would you mind creating JIRA subtasks so that we have 1 PR per JIRA? That helps with tracking. Thanks! > BinaryClassificationEvaluator, RegressionEvaluator, and > MulticlassClassificationEvaluator should use sample weight data > --- > > Key: SPARK-18693 > URL: https://issues.apache.org/jira/browse/SPARK-18693 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.2 >Reporter: Devesh Parekh >Priority: Major > > The LogisticRegression and LinearRegression models support training with a > weight column, but the corresponding evaluators do not support computing > metrics using those weights. This breaks model selection using CrossValidator. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23990) Instruments logging improvements - ML regression package
[ https://issues.apache.org/jira/browse/SPARK-23990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441701#comment-16441701 ] Joseph K. Bradley commented on SPARK-23990: --- A complication was brought up by this PR: Some logging occurs in classes which are not Estimators (WeightedLeastSquares, IterativelyReweightedLeastSquares) and in static objects (RandomForest, GradientBoostedTrees). These may have an Instrumentation instance available (when used from an Estimator) or may not (when used in a unit test). Options include: 1. Make these require Instrumentation instances. This would require slightly awkward changes to unit tests. 2. Create something similar to Instrumentation or Logging which can store an Optional Instrumentation instance. If the Instrumentation is available, it can log via that; otherwise, it can call into regular Logging. 2a. This could be a trait like Logging. This is nice in that it requires fewer changes to existing logging code. 2b. This could be a class like Instrumentation. This is nice in that it standardizes all of MLlib around Instrumentation instead of Logging. I'd vote for 2b to standardize what we do in MLlib. Thoughts? > Instruments logging improvements - ML regression package > > > Key: SPARK-23990 > URL: https://issues.apache.org/jira/browse/SPARK-23990 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.3.0 > Environment: Instruments logging improvements - ML regression package >Reporter: Weichen Xu >Priority: Major > Original Estimate: 120h > Remaining Estimate: 120h > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
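Option 2a from the comment above could be sketched roughly as follows. This is a hypothetical illustration only: the trait name `OptionalInstrumentation` and its method names are made up for this sketch and are not existing Spark API, and the real `Instrumentation` class has a more constrained type signature than shown here.

{code:java}
// Hypothetical sketch of option 2a: a Logging-like trait that routes log
// messages through an Instrumentation instance when one is available,
// and falls back to regular Logging otherwise (e.g. in unit tests).
trait OptionalInstrumentation extends Logging {
  // Set by the driving Estimator; left as None when none is available.
  @volatile protected var instr: Option[Instrumentation[_]] = None

  protected def instrLogInfo(msg: => String): Unit = instr match {
    case Some(i) => i.logInfo(msg) // prefer the Instrumentation sink
    case None    => logInfo(msg)   // fall back to plain Logging
  }
}
{code}

Option 2b would instead move this fallback into an Instrumentation-like class, which is what standardizing MLlib on Instrumentation would amount to.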
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441606#comment-16441606 ] Cody Koeninger commented on SPARK-18057: Out of curiosity, was that a compacted topic? > Update structured streaming kafka from 0.10.0.1 to 1.1.0 > > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger >Priority: Major > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24005) Remove usage of Scala’s parallel collection
Xiao Li created SPARK-24005: --- Summary: Remove usage of Scala’s parallel collection Key: SPARK-24005 URL: https://issues.apache.org/jira/browse/SPARK-24005 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Affects Versions: 2.3.0 Reporter: Xiao Li {noformat} val par = (1 to 100).par.flatMap { i => Thread.sleep(1000) 1 to 1000 }.toSeq {noformat} We are unable to interrupt the execution of parallel collections. We need to create a common utility function to do this instead of using Scala parallel collections.
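A utility along those lines could be sketched as below. This is an illustrative sketch only, not the utility the ticket proposes: it replaces `.par` with an explicit thread pool, whose `shutdownNow()` interrupts in-flight tasks — something Scala parallel collections do not support.

{code:java}
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// An explicit pool instead of .par: cancellation is shutting the pool down.
val pool = Executors.newFixedThreadPool(8)
implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

val futures = (1 to 100).map { i =>
  Future {
    Thread.sleep(1000) // Thread.interrupt() aborts this sleep
    1 to 1000
  }
}

val par =
  try Await.result(Future.sequence(futures), 10.minutes).flatten
  finally pool.shutdownNow() // interrupts any tasks still running
{code}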
[jira] [Updated] (SPARK-23948) Trigger mapstage's job listener in submitMissingTasks
[ https://issues.apache.org/jira/browse/SPARK-23948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-23948: - Fix Version/s: 2.3.1 > Trigger mapstage's job listener in submitMissingTasks > - > > Key: SPARK-23948 > URL: https://issues.apache.org/jira/browse/SPARK-23948 > Project: Spark > Issue Type: New Feature > Components: Scheduler, Spark Core >Affects Versions: 2.3.0 >Reporter: jin xing >Assignee: jin xing >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > SparkContext submitted a map stage from "submitMapStage" to DAGScheduler, > "markMapStageJobAsFinished" is called only in > (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L933 > and > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1314); > But think about below scenario: > 1. stage0 and stage1 are all "ShuffleMapStage" and stage1 depends on stage0; > 2. We submit stage1 by "submitMapStage"; > 3. When stage 1 running, "FetchFailed" happened, stage0 and stage1 got > resubmitted as stage0_1 and stage1_1; > 4. When stage0_1 running, speculated tasks in old stage1 come as succeeded, > but stage1 is not inside "runningStages". So even though all splits(including > the speculated tasks) in stage1 succeeded, job listener in stage1 will not be > called; > 5. stage0_1 finished, stage1_1 starts running. When "submitMissingTasks", > there is no missing tasks. But in current code, job listener is not triggered -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 1.1.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441485#comment-16441485 ] Jordan Moore commented on SPARK-18057: -- Hi all, chiming in here to point out a production issue we are currently seeing. We recently upgraded from Confluent 3.1.2 (Kafka 0.10.1.1) to Confluent 3.3.1 (Kafka 0.11.0.1), and seeing messages such as {code:java} Tried to fetch 473151075 but the returned record offset was 473151072{code} Full Stacktrace below So, looking into the raw topic with a 0.10.x Java consumer, we see that there are duplicated offsets (see those ending in 72-74 below), however when we deserialize the messages, the record values are actually different. {code:java} offset = 473151070, timestamp=1523743598312, data=[B@5ae9a829 offset = 473151071, timestamp=1523743598211, data=[B@6d8a00e3 offset = 473151072, timestamp=1523743598213, data=[B@548b7f67 offset = 473151073, timestamp=1523743598215, data=[B@7ac7a4e4 offset = 473151074, timestamp=1523743598423, data=[B@6d78f375 offset = 473151072, timestamp=1523743598831, data=[B@50c87b21 offset = 473151073, timestamp=1523743598837, data=[B@5f375618 offset = 473151074, timestamp=1523743599017, data=[B@1810399e offset = 473151075, timestamp=1523743599020, data=[B@32d992b2 offset = 473151076, timestamp=1523743599710, data=[B@215be6bb offset = 473151077, timestamp=1523743599714, data=[B@4439f31e{code} Running the same simple consumer with at least Kafka 0.11.x libraries *fixed* the issue. So, my question here is - even if there isn't a big push towards jumping all the way onto 1.1.0, what about simply upgrading to at least 0.11? Or how safe is it to just add {{org.apache.kafka:kafka-clients:0.11.0.0 }}? Full Stacktrace... 
{code:java} GScheduler: ResultStage 0 (start at SparkStreamingTask.java:222) failed in 77.546 s due to Job aborted due to stage failure: Task 86 in stage 0.0 failed 4 times, most recent failure: Lost task 86.3 in stage 0.0 (TID 96, ip-10-120-12-52.ec2.internal, executor 11): java.lang.IllegalStateException: Tried to fetch 473151075 but the returned record offset was 473151072 at org.apache.spark.sql.kafka010.CachedKafkaConsumer.fetchData(CachedKafkaConsumer.scala:234) at org.apache.spark.sql.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:106) at org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:158) at org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:149) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30) at com.domain.spark.KafkaConnectorTask.lambda$run$97bfadb$1(KafkaConnectorTask.java:81) at org.apache.spark.sql.Dataset$$anonfun$48.apply(Dataset.scala:2269) at org.apache.spark.sql.Dataset$$anonfun$48.apply(Dataset.scala:2269) at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:196) at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:193) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748){code} > Update structured streaming kafka from 0.10.0.1 to 1.1.0 >
[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441448#comment-16441448 ] Apache Spark commented on SPARK-15784: -- User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/21090 > Add Power Iteration Clustering to spark.ml > -- > > Key: SPARK-15784 > URL: https://issues.apache.org/jira/browse/SPARK-15784 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xinh Huynh >Assignee: Miao Wang >Priority: Major > > Adding this algorithm is required as part of SPARK-4591: Algorithm/model > parity for spark.ml. The review JIRA for clustering is SPARK-14380. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23963) Queries on text-based Hive tables grow disproportionately slower as the number of columns increase
[ https://issues.apache.org/jira/browse/SPARK-23963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441378#comment-16441378 ] Bruce Robbins commented on SPARK-23963: --- [~Tagar] Yes, although I am a little fuzzy on the process for back porting. I will try to find out. > Queries on text-based Hive tables grow disproportionately slower as the > number of columns increase > -- > > Key: SPARK-23963 > URL: https://issues.apache.org/jira/browse/SPARK-23963 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Assignee: Bruce Robbins >Priority: Minor > Fix For: 2.4.0 > > > TableReader gets disproportionately slower as the number of columns in the > query increase. > For example, reading a table with 6000 columns is 4 times more expensive per > record than reading a table with 3000 columns, rather than twice as expensive. > The increase in processing time is due to several Lists (fieldRefs, > fieldOrdinals, and unwrappers), each of which the reader accesses by column > number for each column in a record. Because each List has O\(n\) time for > lookup by column number, these lookups grow increasingly expensive as the > column count increases. > When I patched the code to change those 3 Lists to Arrays, the query times > became proportional. > > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
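The asymptotic difference described in the issue can be seen with plain Scala collections. This is an illustrative micro-example, not the actual TableReader code; the `fieldRefs*` names echo the ticket but the values are dummies.

{code:java}
// List.apply(i) walks i cons cells, so lookup by ordinal is O(n);
// Array indexing is a constant-time memory access, O(1).
val fieldRefsList:  List[Int]  = (0 until 6000).toList
val fieldRefsArray: Array[Int] = (0 until 6000).toArray

// Per-record work in TableReader touches every column by its ordinal:
def readRecord(lookup: Int => Int, numCols: Int): Int =
  (0 until numCols).map(lookup).sum

readRecord(fieldRefsList.apply, 6000)  // O(n^2) per record
readRecord(fieldRefsArray.apply, 6000) // O(n) per record
{code}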
[jira] [Commented] (SPARK-24004) Tests of from_json for MapType
[ https://issues.apache.org/jira/browse/SPARK-24004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441369#comment-16441369 ] Apache Spark commented on SPARK-24004: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/21089 > Tests of from_json for MapType > -- > > Key: SPARK-24004 > URL: https://issues.apache.org/jira/browse/SPARK-24004 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Priority: Trivial > > There are no tests for *from_json* that check *MapType* as a value type of > struct fields. The MapType should be supported as non-root type according to > current implementation of JacksonParser but the functionality is not checked. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24004) Tests of from_json for MapType
[ https://issues.apache.org/jira/browse/SPARK-24004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24004: Assignee: Apache Spark > Tests of from_json for MapType > -- > > Key: SPARK-24004 > URL: https://issues.apache.org/jira/browse/SPARK-24004 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Trivial > > There are no tests for *from_json* that check *MapType* as a value type of > struct fields. The MapType should be supported as non-root type according to > current implementation of JacksonParser but the functionality is not checked.
[jira] [Assigned] (SPARK-24004) Tests of from_json for MapType
[ https://issues.apache.org/jira/browse/SPARK-24004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24004: Assignee: (was: Apache Spark) > Tests of from_json for MapType > -- > > Key: SPARK-24004 > URL: https://issues.apache.org/jira/browse/SPARK-24004 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Priority: Trivial > > There are no tests for *from_json* that check *MapType* as a value type of > struct fields. The MapType should be supported as non-root type according to > current implementation of JacksonParser but the functionality is not checked.
[jira] [Created] (SPARK-24004) Tests of from_json for MapType
Maxim Gekk created SPARK-24004: -- Summary: Tests of from_json for MapType Key: SPARK-24004 URL: https://issues.apache.org/jira/browse/SPARK-24004 Project: Spark Issue Type: Test Components: SQL Affects Versions: 2.3.0 Reporter: Maxim Gekk There are no tests for *from_json* that check *MapType* as a value type of struct fields. The MapType should be supported as non-root type according to current implementation of JacksonParser but the functionality is not checked.
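One shape such a test could take is sketched below, against the Spark 2.3 DataFrame API. The JSON payload and column names are made up for illustration, and a SparkSession named `spark` is assumed to be in scope.

{code:java}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._
import spark.implicits._ // assumes an active SparkSession named `spark`

// MapType used as the value type of a struct field, i.e. below the root.
val schema = new StructType()
  .add("id", IntegerType)
  .add("props", MapType(StringType, StringType))

val df = Seq("""{"id": 1, "props": {"a": "x", "b": "y"}}""").toDF("json")
val parsed = df.select(from_json(col("json"), schema).as("s"))

// Expect the nested map to round-trip as Map(a -> x, b -> y)
parsed.select("s.props").show(truncate = false)
{code}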
[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441353#comment-16441353 ] Miao Wang commented on SPARK-15784: --- [~josephkb] You can start the new PR now. :) > Add Power Iteration Clustering to spark.ml > -- > > Key: SPARK-15784 > URL: https://issues.apache.org/jira/browse/SPARK-15784 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xinh Huynh >Assignee: Miao Wang >Priority: Major > > Adding this algorithm is required as part of SPARK-4591: Algorithm/model > parity for spark.ml. The review JIRA for clustering is SPARK-14380.
[jira] [Assigned] (SPARK-24003) Add support to provide spark.executor.extraJavaOptions in terms of App Id and/or Executor Id's
[ https://issues.apache.org/jira/browse/SPARK-24003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24003: Assignee: (was: Apache Spark) > Add support to provide spark.executor.extraJavaOptions in terms of App Id > and/or Executor Id's > -- > > Key: SPARK-24003 > URL: https://issues.apache.org/jira/browse/SPARK-24003 > Project: Spark > Issue Type: Improvement > Components: Mesos, Spark Core, YARN >Affects Versions: 2.3.0 >Reporter: Devaraj K >Priority: Major > > Users may want to enable gc logging or heap dump for the executors, but there > is a chance of overwriting it by other executors since the paths cannot be > expressed dynamically. This improvement would enable to express the > spark.executor.extraJavaOptions paths in terms of App Id and Executor Id's to > avoid the overwriting by other executors. > There was a discussion about this in SPARK-3767, but it never fixed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24003) Add support to provide spark.executor.extraJavaOptions in terms of App Id and/or Executor Id's
[ https://issues.apache.org/jira/browse/SPARK-24003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24003: Assignee: Apache Spark > Add support to provide spark.executor.extraJavaOptions in terms of App Id > and/or Executor Id's > -- > > Key: SPARK-24003 > URL: https://issues.apache.org/jira/browse/SPARK-24003 > Project: Spark > Issue Type: Improvement > Components: Mesos, Spark Core, YARN >Affects Versions: 2.3.0 >Reporter: Devaraj K >Assignee: Apache Spark >Priority: Major > > Users may want to enable gc logging or heap dump for the executors, but there > is a chance of overwriting it by other executors since the paths cannot be > expressed dynamically. This improvement would enable to express the > spark.executor.extraJavaOptions paths in terms of App Id and Executor Id's to > avoid the overwriting by other executors. > There was a discussion about this in SPARK-3767, but it never fixed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24003) Add support to provide spark.executor.extraJavaOptions in terms of App Id and/or Executor Id's
[ https://issues.apache.org/jira/browse/SPARK-24003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441328#comment-16441328 ] Apache Spark commented on SPARK-24003: -- User 'devaraj-kavali' has created a pull request for this issue: https://github.com/apache/spark/pull/21088 > Add support to provide spark.executor.extraJavaOptions in terms of App Id > and/or Executor Id's > -- > > Key: SPARK-24003 > URL: https://issues.apache.org/jira/browse/SPARK-24003 > Project: Spark > Issue Type: Improvement > Components: Mesos, Spark Core, YARN >Affects Versions: 2.3.0 >Reporter: Devaraj K >Priority: Major > > Users may want to enable gc logging or heap dump for the executors, but there > is a chance of overwriting it by other executors since the paths cannot be > expressed dynamically. This improvement would enable to express the > spark.executor.extraJavaOptions paths in terms of App Id and Executor Id's to > avoid the overwriting by other executors. > There was a discussion about this in SPARK-3767, but it never fixed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22884) ML test for StructuredStreaming: spark.ml.clustering
[ https://issues.apache.org/jira/browse/SPARK-22884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-22884: -- Shepherd: Joseph K. Bradley > ML test for StructuredStreaming: spark.ml.clustering > > > Key: SPARK-22884 > URL: https://issues.apache.org/jira/browse/SPARK-22884 > Project: Spark > Issue Type: Test > Components: ML, Tests >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Priority: Major > > Task for adding Structured Streaming tests for all Models/Transformers in a > sub-module in spark.ml > For an example, see LinearRegressionSuite.scala in > https://github.com/apache/spark/pull/19843 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24003) Add support to provide spark.executor.extraJavaOptions in terms of App Id and/or Executor Id's
Devaraj K created SPARK-24003: - Summary: Add support to provide spark.executor.extraJavaOptions in terms of App Id and/or Executor Id's Key: SPARK-24003 URL: https://issues.apache.org/jira/browse/SPARK-24003 Project: Spark Issue Type: Improvement Components: Mesos, Spark Core, YARN Affects Versions: 2.3.0 Reporter: Devaraj K Users may want to enable gc logging or heap dump for the executors, but there is a chance of overwriting it by other executors since the paths cannot be expressed dynamically. This improvement would enable to express the spark.executor.extraJavaOptions paths in terms of App Id and Executor Id's to avoid the overwriting by other executors. There was a discussion about this in SPARK-3767, but it never fixed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23933) High-order function: map(array, array) → map<K,V>
[ https://issues.apache.org/jira/browse/SPARK-23933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441268#comment-16441268 ] Kazuaki Ishizaki commented on SPARK-23933: -- ping [~smilegator] > High-order function: map(array, array) → map<K,V> > --- > > Key: SPARK-23933 > URL: https://issues.apache.org/jira/browse/SPARK-23933 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/map.html > Returns a map created using the given key/value arrays. > {noformat} > SELECT map(ARRAY[1,3], ARRAY[2,4]); -- {1 -> 2, 3 -> 4} > {noformat}
[jira] [Commented] (SPARK-21063) Spark return an empty result from remote hadoop cluster
[ https://issues.apache.org/jira/browse/SPARK-21063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441257#comment-16441257 ] Carlos Bribiescas commented on SPARK-21063: --- Any update or workarounds for this? > Spark return an empty result from remote hadoop cluster > --- > > Key: SPARK-21063 > URL: https://issues.apache.org/jira/browse/SPARK-21063 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.1.0, 2.1.1 >Reporter: Peter Bykov >Priority: Major > > Spark returning empty result from when querying remote hadoop cluster. > All firewall settings removed. > Querying using JDBC working properly using hive-jdbc driver from version 1.1.1 > Code snippet is: > {code:java} > val spark = SparkSession.builder > .appName("RemoteSparkTest") > .master("local") > .getOrCreate() > val df = spark.read > .option("url", "jdbc:hive2://remote.hive.local:1/default") > .option("user", "user") > .option("password", "pass") > .option("dbtable", "test_table") > .option("driver", "org.apache.hive.jdbc.HiveDriver") > .format("jdbc") > .load() > > df.show() > {code} > Result: > {noformat} > +---+ > |test_table.test_col| > +---+ > +---+ > {noformat} > All manipulations like: > {code:java} > df.select(*).show() > {code} > returns empty result too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8799) OneVsRestModel should extend ClassificationModel
[ https://issues.apache.org/jira/browse/SPARK-8799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-8799: - Shepherd: Joseph K. Bradley > OneVsRestModel should extend ClassificationModel > > > Key: SPARK-8799 > URL: https://issues.apache.org/jira/browse/SPARK-8799 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Feynman Liang >Priority: Minor > > Many parts of `OneVsRestModel` can be generalized to `ClassificationModel`. > For example: > * `accColName` can be used to populate `ClassificationModel#predictRaw` and > share implementations of `transform` > * SPARK-8092 adds `setFeaturesCol` and `setPredictionCol` which could be > gotten for free through subclassing > `ClassificationModel` is the correct supertype (e.g. not `PredictionModel`) > because the labels for a `OneVsRest` will always be discrete and finite. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8799) OneVsRestModel should extend ClassificationModel
[ https://issues.apache.org/jira/browse/SPARK-8799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441207#comment-16441207 ] Joseph K. Bradley commented on SPARK-8799: -- The missing functionality was added in [SPARK-9312], but we cannot fix this JIRA until 3.0.0 since it will require breaking APIs (changing OneVsRest's inheritance structure and supported FeatureTypes). Let's target this fix for 3.0.0, for which I'll recommend: * Rename current OneVsRest to GenericOneVsRest or something like that. Have it inherit from Classifier and take a type parameter for FeaturesType. * Add a specialization of GenericOneVsRest with fixed FeaturesType = VectorUDT, and call this new one OneVsRest. I _think_ that will avoid breaking most user code (but I have not thought it through carefully). > OneVsRestModel should extend ClassificationModel > > > Key: SPARK-8799 > URL: https://issues.apache.org/jira/browse/SPARK-8799 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Feynman Liang >Priority: Minor > > Many parts of `OneVsRestModel` can be generalized to `ClassificationModel`. > For example: > * `accColName` can be used to populate `ClassificationModel#predictRaw` and > share implementations of `transform` > * SPARK-8092 adds `setFeaturesCol` and `setPredictionCol` which could be > gotten for free through subclassing > `ClassificationModel` is the correct supertype (e.g. not `PredictionModel`) > because the labels for a `OneVsRest` will always be discrete and finite. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8799) OneVsRestModel should extend ClassificationModel
[ https://issues.apache.org/jira/browse/SPARK-8799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-8799: - Target Version/s: 3.0.0 > OneVsRestModel should extend ClassificationModel > > > Key: SPARK-8799 > URL: https://issues.apache.org/jira/browse/SPARK-8799 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Feynman Liang >Priority: Minor > > Many parts of `OneVsRestModel` can be generalized to `ClassificationModel`. > For example: > * `accColName` can be used to populate `ClassificationModel#predictRaw` and > share implementations of `transform` > * SPARK-8092 adds `setFeaturesCol` and `setPredictionCol` which could be > gotten for free through subclassing > `ClassificationModel` is the correct supertype (e.g. not `PredictionModel`) > because the labels for a `OneVsRest` will always be discrete and finite.
[jira] [Assigned] (SPARK-21741) Python API for DataFrame-based multivariate summarizer
[ https://issues.apache.org/jira/browse/SPARK-21741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-21741: - Assignee: Weichen Xu > Python API for DataFrame-based multivariate summarizer > -- > > Key: SPARK-21741 > URL: https://issues.apache.org/jira/browse/SPARK-21741 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Yanbo Liang >Assignee: Weichen Xu >Priority: Major > Fix For: 2.4.0 > > > We support a multivariate summarizer for the DataFrame API as of SPARK-19634; we > should also make PySpark support it.
[jira] [Resolved] (SPARK-21741) Python API for DataFrame-based multivariate summarizer
[ https://issues.apache.org/jira/browse/SPARK-21741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-21741. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20695 [https://github.com/apache/spark/pull/20695] > Python API for DataFrame-based multivariate summarizer > -- > > Key: SPARK-21741 > URL: https://issues.apache.org/jira/browse/SPARK-21741 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Yanbo Liang >Assignee: Weichen Xu >Priority: Major > Fix For: 2.4.0 > > > We support a multivariate summarizer for the DataFrame API as of SPARK-19634; we > should also make PySpark support it.
[jira] [Commented] (SPARK-23997) Configurable max number of buckets
[ https://issues.apache.org/jira/browse/SPARK-23997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441146#comment-16441146 ] Apache Spark commented on SPARK-23997: -- User 'ferdonline' has created a pull request for this issue: https://github.com/apache/spark/pull/21087 > Configurable max number of buckets > -- > > Key: SPARK-23997 > URL: https://issues.apache.org/jira/browse/SPARK-23997 > Project: Spark > Issue Type: Bug > Components: Input/Output, SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: Fernando Pereira >Priority: Major > > When exporting data as a table the user can choose to split data in buckets > by choosing the columns and the number of buckets. Currently there is a > hard-coded limit of 99'999 buckets. > However, for heavy workloads this limit might be too restrictive, a situation > that will eventually become more common as workloads grow. > As per the comments in SPARK-19618 this limit could be made configurable.
[jira] [Assigned] (SPARK-23997) Configurable max number of buckets
[ https://issues.apache.org/jira/browse/SPARK-23997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23997: Assignee: Apache Spark > Configurable max number of buckets > -- > > Key: SPARK-23997 > URL: https://issues.apache.org/jira/browse/SPARK-23997 > Project: Spark > Issue Type: Bug > Components: Input/Output, SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: Fernando Pereira >Assignee: Apache Spark >Priority: Major > > When exporting data as a table the user can choose to split data in buckets > by choosing the columns and the number of buckets. Currently there is a > hard-coded limit of 99'999 buckets. > However, for heavy workloads this limit might be too restrictive, a situation > that will eventually become more common as workloads grow. > As per the comments in SPARK-19618 this limit could be made configurable.
[jira] [Assigned] (SPARK-23997) Configurable max number of buckets
[ https://issues.apache.org/jira/browse/SPARK-23997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23997: Assignee: (was: Apache Spark) > Configurable max number of buckets > -- > > Key: SPARK-23997 > URL: https://issues.apache.org/jira/browse/SPARK-23997 > Project: Spark > Issue Type: Bug > Components: Input/Output, SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: Fernando Pereira >Priority: Major > > When exporting data as a table the user can choose to split data in buckets > by choosing the columns and the number of buckets. Currently there is a > hard-coded limit of 99'999 buckets. > However, for heavy workloads this limit might be too restrictive, a situation > that will eventually become more common as workloads grow. > As per the comments in SPARK-19618 this limit could be made configurable.
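As a minimal illustration of the change proposed in this ticket, the sketch below replaces a hard-coded bucket ceiling with one read from configuration. The config key, default, and function name are hypothetical, not Spark's actual API:

```python
# Hypothetical sketch: a bucket-count check whose ceiling comes from
# configuration instead of a hard-coded constant. Key and default are
# illustrative only.
DEFAULT_MAX_BUCKETS = 100000

def validate_num_buckets(num_buckets, conf=None):
    """Reject a bucket count outside (0, max], where max is configurable."""
    conf = conf or {}
    max_buckets = int(conf.get("spark.sql.sources.bucketing.maxBuckets",
                               DEFAULT_MAX_BUCKETS))
    if not 0 < num_buckets <= max_buckets:
        raise ValueError(
            f"Number of buckets should be greater than 0 but less than or "
            f"equal to {max_buckets} (got {num_buckets})")
    return num_buckets

# A heavy workload could then raise the ceiling without a code change:
validate_num_buckets(500000,
                     conf={"spark.sql.sources.bucketing.maxBuckets": "1000000"})
```

The point is only that the limit moves from a compile-time constant to a per-session setting, as SPARK-19618's comments suggested.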
[jira] [Assigned] (SPARK-23986) CompileException when using too many avg aggregation after joining
[ https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-23986: --- Assignee: Marco Gaido > CompileException when using too many avg aggregation after joining > -- > > Key: SPARK-23986 > URL: https://issues.apache.org/jira/browse/SPARK-23986 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Michel Davit >Assignee: Marco Gaido >Priority: Major > Fix For: 2.3.1, 2.4.0 > > Attachments: spark-generated.java > > > Considering the following code: > {code:java} > val df1: DataFrame = sparkSession.sparkContext > .makeRDD(Seq((0, 1, 2, 3, 4, 5, 6))) > .toDF("key", "col1", "col2", "col3", "col4", "col5", "col6") > val df2: DataFrame = sparkSession.sparkContext > .makeRDD(Seq((0, "val1", "val2"))) > .toDF("key", "dummy1", "dummy2") > val agg = df1 > .join(df2, df1("key") === df2("key"), "leftouter") > .groupBy(df1("key")) > .agg( > avg("col2").as("avg2"), > avg("col3").as("avg3"), > avg("col4").as("avg4"), > avg("col1").as("avg1"), > avg("col5").as("avg5"), > avg("col6").as("avg6") > ) > val head = agg.take(1) > {code} > This logs the following exception: > {code:java} > ERROR CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 467, Column 28: Redefinition of parameter "agg_expr_11" > {code} > I am not a Spark expert, but after investigation I realized that the > generated {{doConsume}} method is responsible for the exception. > Indeed, {{avg}} calls > {{org.apache.spark.sql.execution.CodegenSupport.constructDoConsumeFunction}} several times: > the first time with the 'avg' Expr and a second time for the base aggregation > Expr (count and sum). > The problem comes from the generation of parameters in CodeGenerator: > {code:java} > /** >* Returns a term name that is unique within this instance of a > `CodegenContext`. 
>*/ > def freshName(name: String): String = synchronized { > val fullName = if (freshNamePrefix == "") { > name > } else { > s"${freshNamePrefix}_$name" > } > if (freshNameIds.contains(fullName)) { > val id = freshNameIds(fullName) > freshNameIds(fullName) = id + 1 > s"$fullName$id" > } else { > freshNameIds += fullName -> 1 > fullName > } > } > {code} > The {{freshNameIds}} map already contains {{agg_expr_[1..6]}} from the 1st call. > The second call is made with {{agg_expr_[1..12]}} and generates the > following names: > {{agg_expr_[11|21|31|41|51|61|11|12]}}. We then have a parameter name > conflict in the generated code: {{agg_expr_11}}. > Appending the 'id' in s"$fullName$id" to generate a unique term name is the source > of the conflict. Maybe simply using an underscore can solve this issue: > s"${fullName}_$id"
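The collision described in the report is easy to reproduce outside Spark. The sketch below mirrors the quoted Scala {{freshName}} logic in Python: appending the counter with no separator lets a suffixed name ("agg_expr_1" + id 1) collide with a genuinely new name ("agg_expr_11"), while an underscore separator keeps the two namespaces apart:

```python
def make_fresh_name(ids, separator=""):
    """Mimics CodegenContext.freshName: a reused name gets a counter suffix."""
    def fresh(name):
        if name in ids:
            i = ids[name]
            ids[name] = i + 1
            return f"{name}{separator}{i}"
        ids[name] = 1
        return name
    return fresh

# First codegen pass registers agg_expr_1 .. agg_expr_6 as brand-new names.
ids = {}
fresh = make_fresh_name(ids)
first = [fresh(f"agg_expr_{i}") for i in range(1, 7)]

# Second pass asks for agg_expr_1 .. agg_expr_12; the reused six get suffixes.
second = [fresh(f"agg_expr_{i}") for i in range(1, 13)]

# Suffixed "agg_expr_1" + 1 == "agg_expr_11" collides with the new name.
assert second.count("agg_expr_11") == 2

# With an underscore separator, suffixed names ("agg_expr_1_1") can no
# longer collide with plain ones ("agg_expr_11").
ids2 = {}
fresh2 = make_fresh_name(ids2, separator="_")
[fresh2(f"agg_expr_{i}") for i in range(1, 7)]
second2 = [fresh2(f"agg_expr_{i}") for i in range(1, 13)]
assert second2.count("agg_expr_11") == 1
```

This reproduces exactly the name list from the report ({{agg_expr_[11|21|31|41|51|61|...|11|12]}}) and shows why the underscore fix works.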
[jira] [Resolved] (SPARK-23986) CompileException when using too many avg aggregation after joining
[ https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-23986. - Resolution: Fixed Fix Version/s: 2.3.1 2.4.0 Issue resolved by pull request 21080 [https://github.com/apache/spark/pull/21080] > CompileException when using too many avg aggregation after joining > -- > > Key: SPARK-23986 > URL: https://issues.apache.org/jira/browse/SPARK-23986 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Michel Davit >Assignee: Marco Gaido >Priority: Major > Fix For: 2.4.0, 2.3.1 > > Attachments: spark-generated.java > > > Considering the following code: > {code:java} > val df1: DataFrame = sparkSession.sparkContext > .makeRDD(Seq((0, 1, 2, 3, 4, 5, 6))) > .toDF("key", "col1", "col2", "col3", "col4", "col5", "col6") > val df2: DataFrame = sparkSession.sparkContext > .makeRDD(Seq((0, "val1", "val2"))) > .toDF("key", "dummy1", "dummy2") > val agg = df1 > .join(df2, df1("key") === df2("key"), "leftouter") > .groupBy(df1("key")) > .agg( > avg("col2").as("avg2"), > avg("col3").as("avg3"), > avg("col4").as("avg4"), > avg("col1").as("avg1"), > avg("col5").as("avg5"), > avg("col6").as("avg6") > ) > val head = agg.take(1) > {code} > This logs the following exception: > {code:java} > ERROR CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 467, Column 28: Redefinition of parameter "agg_expr_11" > {code} > I am not a Spark expert, but after investigation I realized that the > generated {{doConsume}} method is responsible for the exception. > Indeed, {{avg}} calls > {{org.apache.spark.sql.execution.CodegenSupport.constructDoConsumeFunction}} several times: > the first time with the 'avg' Expr and a second time for the base aggregation > Expr (count and sum). 
> The problem comes from the generation of parameters in CodeGenerator: > {code:java} > /** >* Returns a term name that is unique within this instance of a > `CodegenContext`. >*/ > def freshName(name: String): String = synchronized { > val fullName = if (freshNamePrefix == "") { > name > } else { > s"${freshNamePrefix}_$name" > } > if (freshNameIds.contains(fullName)) { > val id = freshNameIds(fullName) > freshNameIds(fullName) = id + 1 > s"$fullName$id" > } else { > freshNameIds += fullName -> 1 > fullName > } > } > {code} > The {{freshNameIds}} map already contains {{agg_expr_[1..6]}} from the 1st call. > The second call is made with {{agg_expr_[1..12]}} and generates the > following names: > {{agg_expr_[11|21|31|41|51|61|11|12]}}. We then have a parameter name > conflict in the generated code: {{agg_expr_11}}. > Appending the 'id' in s"$fullName$id" to generate a unique term name is the source > of the conflict. Maybe simply using an underscore can solve this issue: > s"${fullName}_$id"
[jira] [Resolved] (SPARK-23999) Spark SQL shell is a Stable one ? Can we use Spark SQL shell in our production environment?
[ https://issues.apache.org/jira/browse/SPARK-23999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-23999. Resolution: Invalid Fix Version/s: (was: 2.3.0) (was: 3.0.0) Please use the mailing lists for asking questions. http://spark.apache.org/community.html > Spark SQL shell is a Stable one ? Can we use Spark SQL shell in our > production environment? > --- > > Key: SPARK-23999 > URL: https://issues.apache.org/jira/browse/SPARK-23999 > Project: Spark > Issue Type: Question > Components: Spark Core, Spark Shell, Spark Submit >Affects Versions: 2.3.0 >Reporter: Prabhu Bentick >Priority: Major > Labels: features > Original Estimate: 336h > Remaining Estimate: 336h > > Hi Team, > I understand from one of the developers working on Apache Spark that the Spark SQL shell > is not a stable one and it is not recommended to use the Spark SQL shell in a > production environment. Is it true? Is the Apache Spark 3.0 version going to release > the stable version? > Can anyone please clarify. > Regards > Prabhu Bentick
[jira] [Assigned] (SPARK-24002) Task not serializable caused by org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes
[ https://issues.apache.org/jira/browse/SPARK-24002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24002: Assignee: Xiao Li (was: Apache Spark) > Task not serializable caused by > org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes > > > Key: SPARK-24002 > URL: https://issues.apache.org/jira/browse/SPARK-24002 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > > Having two queries one is a 1000-line SQL query and a 3000-line SQL query. > Need to run at least one hour with a heavy write workload to reproduce once. > {code} > Py4JJavaError: An error occurred while calling o153.sql. > : org.apache.spark.SparkException: Job aborted. > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:223) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:189) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190) > at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190) > at org.apache.spark.sql.Dataset$$anonfun$59.apply(Dataset.scala:3021) > at > org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:89) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:127) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3020) > at org.apache.spark.sql.Dataset.(Dataset.scala:190) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:646) > at 
sun.reflect.GeneratedMethodAccessor153.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380) > at py4j.Gateway.invoke(Gateway.java:293) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:226) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.spark.SparkException: Exception thrown in Future.get: > at > org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:190) > at > org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:267) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.doConsume(BroadcastNestedLoopJoinExec.scala:530) > at > org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155) > at > org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:37) > at > org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:69) > at > org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155) > at > org.apache.spark.sql.execution.FilterExec.consume(basicPhysicalOperators.scala:144) > ... > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:190) > ... 
23 more > Caused by: java.util.concurrent.ExecutionException: > org.apache.spark.SparkException: Task not serializable > at java.util.concurrent.FutureTask.report(FutureTask.java:122) > at java.util.concurrent.FutureTask.get(FutureTask.java:206) > at > org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:179) > ... 276 more > Caused by: org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340) > at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2380) > at >
[jira] [Assigned] (SPARK-24002) Task not serializable caused by org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes
[ https://issues.apache.org/jira/browse/SPARK-24002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24002: Assignee: Apache Spark (was: Xiao Li) > Task not serializable caused by > org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes > > > Key: SPARK-24002 > URL: https://issues.apache.org/jira/browse/SPARK-24002 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Major > > Having two queries one is a 1000-line SQL query and a 3000-line SQL query. > Need to run at least one hour with a heavy write workload to reproduce once. > {code} > Py4JJavaError: An error occurred while calling o153.sql. > : org.apache.spark.SparkException: Job aborted. > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:223) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:189) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190) > at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190) > at org.apache.spark.sql.Dataset$$anonfun$59.apply(Dataset.scala:3021) > at > org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:89) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:127) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3020) > at org.apache.spark.sql.Dataset.(Dataset.scala:190) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:646) > at 
sun.reflect.GeneratedMethodAccessor153.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380) > at py4j.Gateway.invoke(Gateway.java:293) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:226) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.spark.SparkException: Exception thrown in Future.get: > at > org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:190) > at > org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:267) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.doConsume(BroadcastNestedLoopJoinExec.scala:530) > at > org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155) > at > org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:37) > at > org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:69) > at > org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155) > at > org.apache.spark.sql.execution.FilterExec.consume(basicPhysicalOperators.scala:144) > ... > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:190) > ... 
23 more > Caused by: java.util.concurrent.ExecutionException: > org.apache.spark.SparkException: Task not serializable > at java.util.concurrent.FutureTask.report(FutureTask.java:122) > at java.util.concurrent.FutureTask.get(FutureTask.java:206) > at > org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:179) > ... 276 more > Caused by: org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340) > at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2380) > at >
[jira] [Commented] (SPARK-24002) Task not serializable caused by org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes
[ https://issues.apache.org/jira/browse/SPARK-24002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441071#comment-16441071 ] Apache Spark commented on SPARK-24002: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/21086 > Task not serializable caused by > org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes > > > Key: SPARK-24002 > URL: https://issues.apache.org/jira/browse/SPARK-24002 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > > Having two queries one is a 1000-line SQL query and a 3000-line SQL query. > Need to run at least one hour with a heavy write workload to reproduce once. > {code} > Py4JJavaError: An error occurred while calling o153.sql. > : org.apache.spark.SparkException: Job aborted. > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:223) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:189) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190) > at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190) > at org.apache.spark.sql.Dataset$$anonfun$59.apply(Dataset.scala:3021) > at > org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:89) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:127) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3020) > at org.apache.spark.sql.Dataset.(Dataset.scala:190) > at 
org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:646) > at sun.reflect.GeneratedMethodAccessor153.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380) > at py4j.Gateway.invoke(Gateway.java:293) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:226) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.spark.SparkException: Exception thrown in Future.get: > at > org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:190) > at > org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:267) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.doConsume(BroadcastNestedLoopJoinExec.scala:530) > at > org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155) > at > org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:37) > at > org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:69) > at > org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155) > at > org.apache.spark.sql.execution.FilterExec.consume(basicPhysicalOperators.scala:144) > ... > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:190) > ... 
23 more > Caused by: java.util.concurrent.ExecutionException: > org.apache.spark.SparkException: Task not serializable > at java.util.concurrent.FutureTask.report(FutureTask.java:122) > at java.util.concurrent.FutureTask.get(FutureTask.java:206) > at > org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:179) > ... 276 more > Caused by: org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340) > at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156) > at
[jira] [Updated] (SPARK-24002) Task not serializable caused by org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes
[ https://issues.apache.org/jira/browse/SPARK-24002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-24002: Description: Having two queries one is a 1000-line SQL query and a 3000-line SQL query. Need to run at least one hour with a heavy write workload to reproduce once. {code} Py4JJavaError: An error occurred while calling o153.sql. : org.apache.spark.SparkException: Job aborted. at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:223) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:189) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190) at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190) at org.apache.spark.sql.Dataset$$anonfun$59.apply(Dataset.scala:3021) at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:89) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:127) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3020) at org.apache.spark.sql.Dataset.(Dataset.scala:190) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:646) at sun.reflect.GeneratedMethodAccessor153.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380) at py4j.Gateway.invoke(Gateway.java:293) at 
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:226) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.spark.SparkException: Exception thrown in Future.get: at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:190) at org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:267) at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.doConsume(BroadcastNestedLoopJoinExec.scala:530) at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155) at org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:37) at org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:69) at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155) at org.apache.spark.sql.execution.FilterExec.consume(basicPhysicalOperators.scala:144) ... at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:190) ... 23 more Caused by: java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Task not serializable at java.util.concurrent.FutureTask.report(FutureTask.java:122) at java.util.concurrent.FutureTask.get(FutureTask.java:206) at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:179) ... 
276 more Caused by: org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340) at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156) at org.apache.spark.SparkContext.clean(SparkContext.scala:2380) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:850) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:849) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:371) at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:849) at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:417) at
[jira] [Updated] (SPARK-24002) Task not serializable caused by org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes
[ https://issues.apache.org/jira/browse/SPARK-24002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-24002: Description: {code} java.lang.IllegalArgumentException at java.nio.Buffer.position(Buffer.java:244) at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:153) at java.nio.ByteBuffer.get(ByteBuffer.java:715) at org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes(Binary.java:405) at org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytesUnsafe(Binary.java:414) at org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.writeObject(Binary.java:484) at sun.reflect.GeneratedMethodAccessor190.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1128) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) {code} was: {nocode} java.lang.IllegalArgumentException at java.nio.Buffer.position(Buffer.java:244) at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:153) at java.nio.ByteBuffer.get(ByteBuffer.java:715) at org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes(Binary.java:405) at org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytesUnsafe(Binary.java:414) at org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.writeObject(Binary.java:484) at sun.reflect.GeneratedMethodAccessor190.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1128) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) {nocode} > Task not serializable 
caused by > org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes > > > Key: SPARK-24002 > URL: https://issues.apache.org/jira/browse/SPARK-24002 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > > {code} > java.lang.IllegalArgumentException > at java.nio.Buffer.position(Buffer.java:244) > at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:153) > at java.nio.ByteBuffer.get(ByteBuffer.java:715) > at > org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes(Binary.java:405) > at > org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytesUnsafe(Binary.java:414) > at > org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.writeObject(Binary.java:484) > at sun.reflect.GeneratedMethodAccessor190.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1128) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24002) Task not serializable caused by org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes
Xiao Li created SPARK-24002: --- Summary: Task not serializable caused by org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes Key: SPARK-24002 URL: https://issues.apache.org/jira/browse/SPARK-24002 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Xiao Li Assignee: Xiao Li {nocode} java.lang.IllegalArgumentException at java.nio.Buffer.position(Buffer.java:244) at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:153) at java.nio.ByteBuffer.get(ByteBuffer.java:715) at org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes(Binary.java:405) at org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytesUnsafe(Binary.java:414) at org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.writeObject(Binary.java:484) at sun.reflect.GeneratedMethodAccessor190.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1128) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) {nocode} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics
[ https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441006#comment-16441006 ] Edwina Lu commented on SPARK-23206: --- [~assia6], could you please try the new link, [https://docs.google.com/document/d/1fIL2XMHPnqs6kaeHr822iTvs08uuYnjP5roSGZfejyA/edit?usp=sharing] > Additional Memory Tuning Metrics > > > Key: SPARK-23206 > URL: https://issues.apache.org/jira/browse/SPARK-23206 > Project: Spark > Issue Type: Umbrella > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Edwina Lu >Priority: Major > Attachments: ExecutorsTab.png, ExecutorsTab2.png, > MemoryTuningMetricsDesignDoc.pdf, SPARK-23206 Design Doc.pdf, StageTab.png > > > At LinkedIn, we have multiple clusters, running thousands of Spark > applications, and these numbers are growing rapidly. We need to ensure that > these Spark applications are well tuned – cluster resources, including > memory, should be used efficiently so that the cluster can support running > more applications concurrently, and applications should run quickly and > reliably. > Currently there is limited visibility into how much memory executors are > using, and users are guessing numbers for executor and driver memory sizing. > These estimates are often much larger than needed, leading to memory wastage. > Examining the metrics for one cluster for a month, the average percentage of > used executor memory (max JVM used memory across executors / > spark.executor.memory) is 35%, leading to an average of 591GB unused memory > per application (number of executors * (spark.executor.memory - max JVM used > memory)). Spark has multiple memory regions (user memory, execution memory, > storage memory, and overhead memory), and to understand how memory is being > used and fine-tune allocation between regions, it would be useful to have > information about how much memory is being used for the different regions. 
> To improve visibility into memory usage for the driver and executors and > different memory regions, the following additional memory metrics can be > tracked for each executor and driver: > * JVM used memory: the JVM heap size for the executor/driver. > * Execution memory: memory used for computation in shuffles, joins, sorts > and aggregations. > * Storage memory: memory used for caching and propagating internal data across > the cluster. > * Unified memory: sum of execution and storage memory. > The peak values for each memory metric can be tracked for each executor, and > also per stage. This information can be shown in the Spark UI and the REST > APIs. Information for peak JVM used memory can help with determining > appropriate values for spark.executor.memory and spark.driver.memory, and > information about the unified memory region can help with determining > appropriate values for spark.memory.fraction and > spark.memory.storageFraction. Stage memory information can help identify > which stages are most memory intensive, and users can look into the relevant > code to determine if it can be optimized. > The memory metrics can be gathered by adding the current JVM used memory, > execution memory and storage memory to the heartbeat. SparkListeners are > modified to collect the new metrics for the executors, stages and Spark > history log. Only interesting values (peak values per stage per executor) are > recorded in the Spark history log, to minimize the amount of additional > logging. > We have attached our design documentation with this ticket and would like to > receive feedback from the community for this proposal. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
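The waste estimate quoted in the description (number of executors * (spark.executor.memory - max JVM used memory)) is easy to sanity-check. A small illustrative sketch follows; all figures are made up for the example, not LinkedIn's actual numbers.

```python
# Illustrative back-of-envelope check of the unused-memory formula from the
# SPARK-23206 description; every number below is invented for the example.
def unused_memory_gb(num_executors, executor_memory_gb, max_jvm_used_gb):
    """number of executors * (spark.executor.memory - max JVM used memory)"""
    return num_executors * (executor_memory_gb - max_jvm_used_gb)

# e.g. 100 executors with 9 GB heaps, of which at most ~3.15 GB was ever
# used (~35% utilization, matching the cluster-wide average quoted above):
print(round(unused_memory_gb(100, 9.0, 3.15), 2))  # -> 585.0 (GB left idle)
```

Numbers of this magnitude are why the ticket argues for per-region peak metrics rather than guessing executor sizes.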
[jira] [Updated] (SPARK-23888) speculative task should not run on a given host where another attempt is already running on
[ https://issues.apache.org/jira/browse/SPARK-23888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-23888: - Description: There's a bug in: {code:java} /** Check whether a task is currently running an attempt on a given host */ private def hasAttemptOnHost(taskIndex: Int, host: String): Boolean = { taskAttempts(taskIndex).exists(_.host == host) } {code} This will ignore hosts which have finished attempts, so we should check whether the attempt is currently running on the given host. And it is possible for a speculative task to run on a host where another attempt failed here before. Assume we have only two machines: host1, host2. We first run task0.0 on host1. Then, due to a long time waiting for task0.0, we launch a speculative task0.1 on host2. And, task0.1 finally failed on host1, but it can not re-run since there's already a copy running on host2. After another long time, we launch a new speculative task0.2. And, now, we can run task0.2 on host1 again, since there's no more running attempt on host1. After discussion in the PR, we simply make the comment consistent with the method's behavior. See details in PR#20998. was: There's a bug in: {code:java} /** Check whether a task is currently running an attempt on a given host */ private def hasAttemptOnHost(taskIndex: Int, host: String): Boolean = { taskAttempts(taskIndex).exists(_.host == host) } {code} This will ignore hosts which have finished attempts, so we should check whether the attempt is currently running on the given host. And it is possible for a speculative task to run on a host where another attempt failed here before. Assume we have only two machines: host1, host2. We first run task0.0 on host1. Then, due to a long time waiting for task0.0, we launch a speculative task0.1 on host2. And, task0.1 finally failed on host1, but it can not re-run since there's already a copy running on host2. After another long time, we launch a new speculative task0.2. 
And, now, we can run task0.2 on host1 again, since there's no more running attempt on host1. > speculative task should not run on a given host where another attempt is > already running on > --- > > Key: SPARK-23888 > URL: https://issues.apache.org/jira/browse/SPARK-23888 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.3.0 >Reporter: wuyi >Priority: Major > Labels: speculation > Fix For: 2.3.0 > > > > There's a bug in: > {code:java} > /** Check whether a task is currently running an attempt on a given host */ > private def hasAttemptOnHost(taskIndex: Int, host: String): Boolean = { >taskAttempts(taskIndex).exists(_.host == host) > } > {code} > This will ignore hosts which have finished attempts, so we should check > whether the attempt is currently running on the given host. > And it is possible for a speculative task to run on a host where another > attempt failed here before. > Assume we have only two machines: host1, host2. We first run task0.0 on > host1. Then, due to a long time waiting for task0.0, we launch a speculative > task0.1 on host2. And, task0.1 finally failed on host1, but it can not re-run > since there's already a copy running on host2. After another long time, we > launch a new speculative task0.2. And, now, we can run task0.2 on host1 > again, since there's no more running attempt on host1. > After discussion in the PR, we simply make the comment consistent with the > method's behavior. See details in PR#20998. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
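The check quoted in the report boils down to "does any attempt, running or finished, sit on this host". A minimal Python sketch of the distinction being discussed follows; the TaskAttempt shape and field names are invented for illustration (the real code is Scala in TaskSetManager, and per the description the final PR only aligned the comment with the existing behavior rather than changing the predicate).

```python
# Hypothetical illustration of SPARK-23888: the existing check counts
# finished attempts, whereas "is an attempt currently running" would not.
from dataclasses import dataclass

@dataclass
class TaskAttempt:
    host: str
    running: bool  # False once the attempt finished (succeeded or failed)

def has_attempt_on_host(attempts, host):
    # Mirrors `taskAttempts(taskIndex).exists(_.host == host)`:
    # a finished attempt on `host` still makes this return True.
    return any(a.host == host for a in attempts)

def has_running_attempt_on_host(attempts, host):
    # Alternative reading: only attempts still running count.
    return any(a.host == host and a.running for a in attempts)

attempts = [TaskAttempt("host1", running=False),  # finished earlier
            TaskAttempt("host2", running=True)]
print(has_attempt_on_host(attempts, "host1"))          # True: finished attempt counts
print(has_running_attempt_on_host(attempts, "host1"))  # False: nothing runs there now
```

The sketch makes it concrete why the comment and the method's behavior had drifted apart.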
[jira] [Updated] (SPARK-24001) Multinode cluster
[ https://issues.apache.org/jira/browse/SPARK-24001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Direselign updated SPARK-24001: --- Attachment: Screenshot from 2018-04-17 22-47-39.png > Multinode cluster > -- > > Key: SPARK-24001 > URL: https://issues.apache.org/jira/browse/SPARK-24001 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Direselign >Priority: Major > Attachments: Screenshot from 2018-04-17 22-47-39.png > > > I was trying to configure an Apache Spark cluster on two Ubuntu 16.04 machines > using YARN, but it is not working when I submit a task to the cluster -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23963) Queries on text-based Hive tables grow disproportionately slower as the number of columns increase
[ https://issues.apache.org/jira/browse/SPARK-23963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440942#comment-16440942 ] Ruslan Dautkhanov commented on SPARK-23963: --- Thanks a lot [~bersprockets] Would it be possible to backport this to Spark 2.3 as well ? Thanks! > Queries on text-based Hive tables grow disproportionately slower as the > number of columns increase > -- > > Key: SPARK-23963 > URL: https://issues.apache.org/jira/browse/SPARK-23963 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Assignee: Bruce Robbins >Priority: Minor > Fix For: 2.4.0 > > > TableReader gets disproportionately slower as the number of columns in the > query increase. > For example, reading a table with 6000 columns is 4 times more expensive per > record than reading a table with 3000 columns, rather than twice as expensive. > The increase in processing time is due to several Lists (fieldRefs, > fieldOrdinals, and unwrappers), each of which the reader accesses by column > number for each column in a record. Because each List has O\(n\) time for > lookup by column number, these lookups grow increasingly expensive as the > column count increases. > When I patched the code to change those 3 Lists to Arrays, the query times > became proportional. > > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
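The root cause described above is positional lookup in a linked list: Scala's List has O(n) `apply`, so per-record, per-column lookups grow quadratically with column count, while an Array lookup is O(1). A rough Python illustration follows (Python's built-in list is already array-backed, so a tiny hand-rolled linked list stands in for Scala's List):

```python
# Illustration of the SPARK-23963 fix: positional lookup in a linked list is
# O(index), so per-record per-column access grows quadratically with column
# count; switching the lookup structure to an array makes each access O(1).
class Node:
    def __init__(self, value, nxt=None):
        self.value, self.nxt = value, nxt

def make_linked(values):
    head = None
    for v in reversed(values):
        head = Node(v, head)
    return head

def linked_get(head, index):
    # O(index) walk, like Scala's List.apply(index)
    node = head
    for _ in range(index):
        node = node.nxt
    return node.value

cols = list(range(6000))      # stand-in for 6000 column accessors
head = make_linked(cols)
print(linked_get(head, 5999))  # walks 5999 links -> 5999
print(cols[5999])              # array-backed, one step -> 5999
```

Summing the walk over every column is what made 6000-column tables four times more expensive per record than 3000-column tables, rather than twice.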
[jira] [Created] (SPARK-24001) Multinode cluster
Direselign created SPARK-24001: -- Summary: Multinode cluster Key: SPARK-24001 URL: https://issues.apache.org/jira/browse/SPARK-24001 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.3.0 Reporter: Direselign I was trying to configure an Apache Spark cluster on two Ubuntu 16.04 machines using YARN, but it is not working when I submit a task to the cluster -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22371) dag-scheduler-event-loop thread stopped with error Attempted to access garbage collected accumulator 5605982
[ https://issues.apache.org/jira/browse/SPARK-22371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402271#comment-16402271 ] Ben Doerr edited comment on SPARK-22371 at 4/17/18 2:04 PM: We've seen this for the first time on 2.3.0. The scenario that [~mayank.agarwal2305] described of "jobs running in parallel on datasets and union of all datasets and full gc triggered" sounds exactly like our scenario. We've been unable to upgrade because of this issue. was (Author: craftsman): We've seen this for the first time on 2.3.0. > dag-scheduler-event-loop thread stopped with error Attempted to access > garbage collected accumulator 5605982 > - > > Key: SPARK-22371 > URL: https://issues.apache.org/jira/browse/SPARK-22371 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Mayank Agarwal >Priority: Major > Attachments: Helper.scala, ShuffleIssue.java, > driver-thread-dump-spark2.1.txt, sampledata > > > Our Spark jobs are getting stuck on DagScheduler.runJob because the dagscheduler > thread is stopped with *Attempted to access garbage collected > accumulator 5605982*. > From our investigation, it looks like the accumulator is cleaned by GC first, and > the same accumulator is then used for merging the results from executors on the task > completion event. > As java.lang.IllegalAccessError is a LinkageError, which is treated as > a fatal error, the dag-scheduler loop finishes with the below exception. 
> ---ERROR stack trace -- > Exception in thread "dag-scheduler-event-loop" java.lang.IllegalAccessError: > Attempted to access garbage collected accumulator 5605982 > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:253) > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:249) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:249) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1083) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1080) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1080) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1183) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1647) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > I am attaching the thread dump of driver as well -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23948) Trigger mapstage's job listener in submitMissingTasks
[ https://issues.apache.org/jira/browse/SPARK-23948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440887#comment-16440887 ] Apache Spark commented on SPARK-23948: -- User 'squito' has created a pull request for this issue: https://github.com/apache/spark/pull/21085 > Trigger mapstage's job listener in submitMissingTasks > - > > Key: SPARK-23948 > URL: https://issues.apache.org/jira/browse/SPARK-23948 > Project: Spark > Issue Type: New Feature > Components: Scheduler, Spark Core >Affects Versions: 2.3.0 >Reporter: jin xing >Assignee: jin xing >Priority: Major > Fix For: 2.4.0 > > > When SparkContext submits a map stage via "submitMapStage" to DAGScheduler, > "markMapStageJobAsFinished" is called only in > (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L933 > and > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1314); > But consider the below scenario: > 1. stage0 and stage1 are both "ShuffleMapStage"s and stage1 depends on stage0; > 2. We submit stage1 by "submitMapStage"; > 3. While stage1 is running, a "FetchFailed" happens, and stage0 and stage1 get > resubmitted as stage0_1 and stage1_1; > 4. While stage0_1 is running, speculative tasks from the old stage1 come back as > succeeded, but stage1 is no longer inside "runningStages". So even though all > splits (including the speculative tasks) in stage1 succeeded, the job listener > for stage1 will not be called; > 5. stage0_1 finishes and stage1_1 starts running. In "submitMissingTasks", there > are no missing tasks, but in the current code the job listener is still not triggered -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23948) Trigger mapstage's job listener in submitMissingTasks
[ https://issues.apache.org/jira/browse/SPARK-23948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reassigned SPARK-23948: Assignee: jin xing > Trigger mapstage's job listener in submitMissingTasks > - > > Key: SPARK-23948 > URL: https://issues.apache.org/jira/browse/SPARK-23948 > Project: Spark > Issue Type: New Feature > Components: Scheduler, Spark Core >Affects Versions: 2.3.0 >Reporter: jin xing >Assignee: jin xing >Priority: Major > Fix For: 2.4.0 > > > When SparkContext submits a map stage via "submitMapStage" to DAGScheduler, > "markMapStageJobAsFinished" is called only in > (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L933 > and > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1314); > But consider the below scenario: > 1. stage0 and stage1 are both "ShuffleMapStage"s and stage1 depends on stage0; > 2. We submit stage1 by "submitMapStage"; > 3. While stage1 is running, a "FetchFailed" happens, and stage0 and stage1 get > resubmitted as stage0_1 and stage1_1; > 4. While stage0_1 is running, speculative tasks from the old stage1 come back as > succeeded, but stage1 is no longer inside "runningStages". So even though all > splits (including the speculative tasks) in stage1 succeeded, the job listener > for stage1 will not be called; > 5. stage0_1 finishes and stage1_1 starts running. In "submitMissingTasks", there > are no missing tasks, but in the current code the job listener is still not triggered -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23948) Trigger mapstage's job listener in submitMissingTasks
[ https://issues.apache.org/jira/browse/SPARK-23948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-23948. -- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21019 [https://github.com/apache/spark/pull/21019] > Trigger mapstage's job listener in submitMissingTasks > - > > Key: SPARK-23948 > URL: https://issues.apache.org/jira/browse/SPARK-23948 > Project: Spark > Issue Type: New Feature > Components: Scheduler, Spark Core >Affects Versions: 2.3.0 >Reporter: jin xing >Assignee: jin xing >Priority: Major > Fix For: 2.4.0 > > > When SparkContext submits a map stage via "submitMapStage" to DAGScheduler, > "markMapStageJobAsFinished" is called only in > (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L933 > and > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1314); > But consider the below scenario: > 1. stage0 and stage1 are both "ShuffleMapStage"s and stage1 depends on stage0; > 2. We submit stage1 by "submitMapStage"; > 3. While stage1 is running, a "FetchFailed" happens, and stage0 and stage1 get > resubmitted as stage0_1 and stage1_1; > 4. While stage0_1 is running, speculative tasks from the old stage1 come back as > succeeded, but stage1 is no longer inside "runningStages". So even though all > splits (including the speculative tasks) in stage1 succeeded, the job listener > for stage1 will not be called; > 5. stage0_1 finishes and stage1_1 starts running. In "submitMissingTasks", there > are no missing tasks, but in the current code the job listener is still not triggered -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22676) Avoid iterating all partition paths when spark.sql.hive.verifyPartitionPath=true
[ https://issues.apache.org/jira/browse/SPARK-22676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22676. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 19868 [https://github.com/apache/spark/pull/19868] > Avoid iterating all partition paths when > spark.sql.hive.verifyPartitionPath=true > > > Key: SPARK-22676 > URL: https://issues.apache.org/jira/browse/SPARK-22676 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: jin xing >Assignee: jin xing >Priority: Major > Fix For: 2.4.0 > > > In the current code, Spark will scan all partition paths when > spark.sql.hive.verifyPartitionPath=true. > e.g. for a table like below: > CREATE TABLE `test`( > `id` int, > `age` int, > `name` string) > PARTITIONED BY ( > `A` string, > `B` string) > load data local inpath '/tmp/data1' into table test partition(A='00', B='00') > load data local inpath '/tmp/data1' into table test partition(A='01', B='01') > load data local inpath '/tmp/data1' into table test partition(A='10', B='10') > load data local inpath '/tmp/data1' into table test partition(A='11', B='11') > If I query with SQL -- "select * from test where year=2017 and month=12 and > day=03", the current code will scan all partition paths, including > '/data/A=00/B=00', '/data/A=01/B=01', '/data/A=10/B=10', > '/data/A=11/B=11'. This costs much time and memory. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22676) Avoid iterating all partition paths when spark.sql.hive.verifyPartitionPath=true
[ https://issues.apache.org/jira/browse/SPARK-22676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-22676: --- Assignee: jin xing > Avoid iterating all partition paths when > spark.sql.hive.verifyPartitionPath=true > > > Key: SPARK-22676 > URL: https://issues.apache.org/jira/browse/SPARK-22676 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: jin xing >Assignee: jin xing >Priority: Major > Fix For: 2.4.0 > > > In the current code, Spark will scan all partition paths when > spark.sql.hive.verifyPartitionPath=true. > e.g. for a table like below: > CREATE TABLE `test`( > `id` int, > `age` int, > `name` string) > PARTITIONED BY ( > `A` string, > `B` string) > load data local inpath '/tmp/data1' into table test partition(A='00', B='00') > load data local inpath '/tmp/data1' into table test partition(A='01', B='01') > load data local inpath '/tmp/data1' into table test partition(A='10', B='10') > load data local inpath '/tmp/data1' into table test partition(A='11', B='11') > If I query with SQL -- "select * from test where year=2017 and month=12 and > day=03", the current code will scan all partition paths, including > '/data/A=00/B=00', '/data/A=01/B=01', '/data/A=10/B=10', > '/data/A=11/B=11'. This costs much time and memory. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
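The improvement in this ticket amounts to filtering partition paths by the query's partition predicate instead of listing every directory. A simplified Python sketch of the idea follows; the path layout is taken from the example table above, but the predicate handling is invented for illustration (the real change lives in Spark's Hive TableReader path verification).

```python
# Hypothetical sketch of SPARK-22676's idea: only verify the partition paths
# that can satisfy the query's partition predicate, instead of all of them.
all_partition_paths = [
    "/data/A=00/B=00", "/data/A=01/B=01",
    "/data/A=10/B=10", "/data/A=11/B=11",
]

def matching_paths(paths, predicate):
    """Keep only paths whose A=../B=.. partition spec satisfies predicate."""
    kept = []
    for p in paths:
        # "/data/A=01/B=01" -> {"A": "01", "B": "01"}
        spec = dict(kv.split("=") for kv in p.strip("/").split("/")[1:])
        if predicate(spec):
            kept.append(p)
    return kept

# A query equivalent to `... WHERE A = '01'` touches one directory, not four:
print(matching_paths(all_partition_paths, lambda s: s["A"] == "01"))
# -> ['/data/A=01/B=01']
```

With thousands of partitions, skipping the non-matching directories is where the time and memory savings come from.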
[jira] [Resolved] (SPARK-23835) When Dataset.as converts column from nullable to non-nullable type, null Doubles are converted silently to -1
[ https://issues.apache.org/jira/browse/SPARK-23835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-23835. - Resolution: Fixed Assignee: Marco Gaido Fix Version/s: 2.4.0 2.3.1 > When Dataset.as converts column from nullable to non-nullable type, null > Doubles are converted silently to -1 > - > > Key: SPARK-23835 > URL: https://issues.apache.org/jira/browse/SPARK-23835 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Marco Gaido >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > I constructed a DataFrame with a nullable java.lang.Double column (and an > extra Double column). I then converted it to a Dataset using ```as[(Double, > Double)]```. When the Dataset is shown, it has a null. When it is collected > and printed, the null is silently converted to a -1. > Code snippet to reproduce this: > {code} > val localSpark = spark > import localSpark.implicits._ > val df = Seq[(java.lang.Double, Double)]( > (1.0, 2.0), > (3.0, 4.0), > (Double.NaN, 5.0), > (null, 6.0) > ).toDF("a", "b") > df.show() // OUTPUT 1: has null > df.printSchema() > val data = df.as[(Double, Double)] > data.show() // OUTPUT 2: has null > data.collect().foreach(println) // OUTPUT 3: has -1 > {code} > OUTPUT 1 and 2: > {code} > ++---+ > | a| b| > ++---+ > | 1.0|2.0| > | 3.0|4.0| > | NaN|5.0| > |null|6.0| > ++---+ > {code} > OUTPUT 3: > {code} > (1.0,2.0) > (3.0,4.0) > (NaN,5.0) > (-1.0,6.0) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15703) Make ListenerBus event queue size configurable
[ https://issues.apache.org/jira/browse/SPARK-15703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440855#comment-16440855 ] Thomas Graves commented on SPARK-15703: --- this Jira is purely making the size of the event queue configurable which would allow you to increase it as long as you have sufficient driver memory. There is no current fix for it dropping events. There is a fix that went into 2.3 that makes it so the critical services aren't affected: https://issues.apache.org/jira/browse/SPARK-18838 > Make ListenerBus event queue size configurable > -- > > Key: SPARK-15703 > URL: https://issues.apache.org/jira/browse/SPARK-15703 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Web UI >Affects Versions: 2.0.0 >Reporter: Thomas Graves >Assignee: Dhruve Ashar >Priority: Minor > Fix For: 2.0.1, 2.1.0 > > Attachments: Screen Shot 2016-06-01 at 11.21.32 AM.png, Screen Shot > 2016-06-01 at 11.23.48 AM.png, SparkListenerBus .png, > spark-dynamic-executor-allocation.png > > > The Spark UI doesn't seem to be showing all the tasks and metrics. > I ran a job with 10 tasks but Detail stage page says it completed 93029: > Summary Metrics for 93029 Completed Tasks > The Stages for all jobs pages list that only 89519/10 tasks finished but > its completed. The metrics for shuffled write and input are also incorrect. > I will attach screen shots. > I checked the logs and it does show that all the tasks actually finished. > 16/06/01 16:15:42 INFO TaskSetManager: Finished task 59880.0 in stage 2.0 > (TID 54038) in 265309 ms on 10.213.45.51 (10/10) > 16/06/01 16:15:42 INFO YarnClusterScheduler: Removed TaskSet 2.0, whose tasks > have all completed, from pool -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
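For reference, the knob this ticket added is a plain Spark configuration property. A hedged example follows; note the key name varies by Spark version (spark.scheduler.listenerbus.eventqueue.size in 2.0-2.2, renamed to spark.scheduler.listenerbus.eventqueue.capacity around 2.3), and the class and jar names below are placeholders, so verify both against your release's configuration docs.

```shell
# Hedged sketch: raise the listener-bus event queue above its 10000 default
# so slow listeners drop fewer events (at the cost of more driver memory).
# Key name varies by Spark version -- check your release's docs.
spark-submit \
  --conf spark.scheduler.listenerbus.eventqueue.size=20000 \
  --class com.example.MyApp \
  my-app.jar
```

A larger queue only buys headroom; as the comment notes, SPARK-18838 in 2.3 is what isolates critical services from a backed-up listener.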
[jira] [Commented] (SPARK-24000) S3A: Create Table should fail on invalid AK/SK
[ https://issues.apache.org/jira/browse/SPARK-24000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440841#comment-16440841 ] Brahma Reddy Battula commented on SPARK-24000: -- Discussed with [~ste...@apache.org] offline: the create table code should be doing a getFileStatus() on the path to see what's there, and so fail on any no-permissions situation. Please forgive me if I am not clear; I am a newbie to Spark. > S3A: Create Table should fail on invalid AK/SK > -- > > Key: SPARK-24000 > URL: https://issues.apache.org/jira/browse/SPARK-24000 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.3.0 >Reporter: Brahma Reddy Battula >Priority: Major > > Currently, when we pass an invalid AK/SK, *create > table* will still be a *success*. > When the S3AFileSystem is initialized, *verifyBucketExists*() is called, > which will return *true*, as the status code 403 > (BUCKET_ACCESS_FORBIDDEN_STATUS_CODE) is treated in the following as "bucket exists": > {code:java} > public boolean doesBucketExist(String bucketName) > throws AmazonClientException, AmazonServiceException { > > try { > headBucket(new HeadBucketRequest(bucketName)); > return true; > } catch (AmazonServiceException ase) { > // A redirect error or a forbidden error means the bucket exists. So > // returning true. > if ((ase.getStatusCode() == Constants.BUCKET_REDIRECT_STATUS_CODE) > || (ase.getStatusCode() == > Constants.BUCKET_ACCESS_FORBIDDEN_STATUS_CODE)) { > return true; > } > if (ase.getStatusCode() == Constants.NO_SUCH_BUCKET_STATUS_CODE) { > return false; > } > throw ase; > > } > }{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23875) Create IndexedSeq wrapper for ArrayData
[ https://issues.apache.org/jira/browse/SPARK-23875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-23875. --- Resolution: Fixed Assignee: Liang-Chi Hsieh Fix Version/s: 2.4.0 > Create IndexedSeq wrapper for ArrayData > --- > > Key: SPARK-23875 > URL: https://issues.apache.org/jira/browse/SPARK-23875 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 2.4.0 > > > We don't have a good way to sequentially access {{UnsafeArrayData}} with a > common interface such as Seq. An example is {{MapObject}} where we need to > access several sequence collection types together. But {{UnsafeArrayData}} > doesn't implement {{ArrayData.array}}. Calling {{toArray}} will copy the > entire array. We can provide an {{IndexedSeq}} wrapper for {{ArrayData}}, so > we can avoid copying the entire array.
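The no-copy wrapper idea described above can be sketched in plain Java, with java.util.AbstractList standing in for Scala's IndexedSeq and a hypothetical FakeArrayData standing in for UnsafeArrayData. This is an illustration of the technique, not Spark's actual implementation.

```java
import java.util.AbstractList;
import java.util.List;

// Hypothetical stand-in for UnsafeArrayData: elements are reachable by
// index, but materialising them all (toArray) would copy the whole array.
class FakeArrayData {
    private final int[] backing;
    FakeArrayData(int[] backing) { this.backing = backing; }
    int numElements() { return backing.length; }
    int getInt(int i) { return backing[i]; }
}

// A thin indexed view that delegates per-element access to the underlying
// array data, so callers get a collection interface without any copying.
public class ArrayDataView extends AbstractList<Integer> {
    private final FakeArrayData data;
    ArrayDataView(FakeArrayData data) { this.data = data; }
    @Override public Integer get(int i) { return data.getInt(i); }
    @Override public int size() { return data.numElements(); }

    public static void main(String[] args) {
        List<Integer> view = new ArrayDataView(new FakeArrayData(new int[]{5, 3, 9}));
        System.out.println(view.get(1)); // prints: 3 -- no array copy performed
    }
}
```

The same shape in Scala (an `IndexedSeq[Any]` whose `apply` delegates to `ArrayData.get`) is what the ticket delivers for use in places like `MapObjects`.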
[jira] [Created] (SPARK-24000) S3A: Create Table should fail on invalid AK/SK
Brahma Reddy Battula created SPARK-24000: Summary: S3A: Create Table should fail on invalid AK/SK Key: SPARK-24000 URL: https://issues.apache.org/jira/browse/SPARK-24000 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 2.3.0 Reporter: Brahma Reddy Battula Currently, when we pass an invalid AK/SK, *create table* will still *succeed*. When the S3AFileSystem is initialized, *verifyBucketExists*() is called, which returns *true* for status code 403 (_BUCKET_ACCESS_FORBIDDEN_STATUS_CODE_), as in the following, because the bucket does exist. {code:java} public boolean doesBucketExist(String bucketName) throws AmazonClientException, AmazonServiceException { try { headBucket(new HeadBucketRequest(bucketName)); return true; } catch (AmazonServiceException ase) { // A redirect error or a forbidden error means the bucket exists. So // returning true. if ((ase.getStatusCode() == Constants.BUCKET_REDIRECT_STATUS_CODE) || (ase.getStatusCode() == Constants.BUCKET_ACCESS_FORBIDDEN_STATUS_CODE)) { return true; } if (ase.getStatusCode() == Constants.NO_SUCH_BUCKET_STATUS_CODE) { return false; } throw ase; } }{code}
[jira] [Created] (SPARK-23999) Is the Spark SQL shell stable? Can we use the Spark SQL shell in our production environment?
Prabhu Bentick created SPARK-23999: -- Summary: Is the Spark SQL shell stable? Can we use the Spark SQL shell in our production environment? Key: SPARK-23999 URL: https://issues.apache.org/jira/browse/SPARK-23999 Project: Spark Issue Type: Question Components: Spark Core, Spark Shell, Spark Submit Affects Versions: 2.3.0 Reporter: Prabhu Bentick Fix For: 3.0.0, 2.3.0 Hi Team, I understand from a developer working on Apache Spark that the Spark SQL shell is not stable and is not recommended for use in a production environment. Is this true? Is Apache Spark 3.0 going to release the stable version? Can anyone please clarify? Regards Prabhu Bentick
[jira] [Assigned] (SPARK-23747) Add EpochCoordinator unit tests
[ https://issues.apache.org/jira/browse/SPARK-23747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das reassigned SPARK-23747: - Assignee: Jose Torres > Add EpochCoordinator unit tests > --- > > Key: SPARK-23747 > URL: https://issues.apache.org/jira/browse/SPARK-23747 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Assignee: Jose Torres >Priority: Major > Fix For: 3.0.0 > >
[jira] [Resolved] (SPARK-23747) Add EpochCoordinator unit tests
[ https://issues.apache.org/jira/browse/SPARK-23747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-23747. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 20983 [https://github.com/apache/spark/pull/20983] > Add EpochCoordinator unit tests > --- > > Key: SPARK-23747 > URL: https://issues.apache.org/jira/browse/SPARK-23747 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Assignee: Jose Torres >Priority: Major > Fix For: 3.0.0 > >
[jira] [Comment Edited] (SPARK-22371) dag-scheduler-event-loop thread stopped with error Attempted to access garbage collected accumulator 5605982
[ https://issues.apache.org/jira/browse/SPARK-22371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440683#comment-16440683 ] Andrew Clegg edited comment on SPARK-22371 at 4/17/18 9:53 AM: --- Another data point -- I've seen this happen (in 2.3.0) during cleanup after a task failure: {code:none} Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: java.lang.IllegalStateException: Attempted to access garbage collected accumulator 365 at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2027) at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194) ... 31 more {code} was (Author: aclegg): Another data point -- I've seen this happen (in 2.3.0) during cleanup after a task failure: [the previous revision contained the same stack trace, formatted line by line with {{...}} markup] > dag-scheduler-event-loop thread stopped with error Attempted to access > garbage collected accumulator 5605982 > - > > Key: SPARK-22371 > URL: https://issues.apache.org/jira/browse/SPARK-22371 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Mayank Agarwal >Priority: Major > Attachments: Helper.scala, ShuffleIssue.java, > driver-thread-dump-spark2.1.txt, sampledata > > > Our Spark jobs are getting stuck on DAGScheduler.runJob as the dag-scheduler > thread is stopped because of *Attempted to access garbage collected > accumulator 5605982*. > from our
[jira] [Commented] (SPARK-22371) dag-scheduler-event-loop thread stopped with error Attempted to access garbage collected accumulator 5605982
[ https://issues.apache.org/jira/browse/SPARK-22371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440683#comment-16440683 ] Andrew Clegg commented on SPARK-22371: -- Another data point -- I've seen this happen (in 2.3.0) during cleanup after a task failure: {code:none} Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: java.lang.IllegalStateException: Attempted to access garbage collected accumulator 365 at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2027) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194) ... 31 more {code} > dag-scheduler-event-loop thread stopped with error Attempted to access > garbage collected accumulator 5605982 > - > > Key: SPARK-22371 > URL: https://issues.apache.org/jira/browse/SPARK-22371 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Mayank Agarwal >Priority: Major > Attachments: Helper.scala, ShuffleIssue.java, > driver-thread-dump-spark2.1.txt, sampledata > > > Our Spark jobs are getting stuck on DAGScheduler.runJob as the dag-scheduler > thread is stopped because of *Attempted to access garbage collected > accumulator 5605982*. > From our investigation it looks like the accumulator is cleaned by GC first and > the same accumulator is used for merging the results from the executor on task > completion events. > As the error java.lang.IllegalAccessError is a LinkageError, which is treated as > a fatal error, the dag-scheduler event loop finishes with the exception below. 
> ---ERROR stack trace -- > Exception in thread "dag-scheduler-event-loop" java.lang.IllegalAccessError: > Attempted to access garbage collected accumulator 5605982 > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:253) > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:249) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:249) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1083) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1080) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1080) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1183) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1647) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > I am attaching the thread dump of the driver as well
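The failure mode discussed in this thread can be sketched with a self-contained Java example: a registry of weak references, like Spark's AccumulatorContext, raises an error when a lookup finds an entry whose referent has already been collected. All names here are illustrative stand-ins; the real Spark code lives in AccumulatorV2.scala.

```java
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;

// Sketch of why "Attempted to access garbage collected accumulator" can be
// thrown: the driver-side registry holds accumulators only weakly, so a late
// task-completion event can look up an entry whose referent is already gone.
public class WeakRegistryDemo {
    final Map<Long, WeakReference<Object>> registry = new HashMap<>();

    void register(long id, Object acc) {
        registry.put(id, new WeakReference<>(acc));
    }

    Object get(long id) {
        WeakReference<Object> ref = registry.get(id);
        if (ref == null) return null;     // never registered
        Object acc = ref.get();
        if (acc == null) {
            // Mirrors Spark's behaviour when the accumulator was collected
            // before the DAGScheduler tried to merge task results into it.
            throw new IllegalStateException(
                "Attempted to access garbage collected accumulator " + id);
        }
        return acc;
    }

    public static void main(String[] args) {
        WeakRegistryDemo ctx = new WeakRegistryDemo();
        Object acc = new Object();
        ctx.register(42L, acc);
        System.out.println(ctx.get(42L) == acc); // prints: true (strongly held)

        // Simulate the accumulator being collected by clearing the weak ref.
        ctx.registry.get(42L).clear();
        try {
            ctx.get(42L);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The demo clears the weak reference explicitly to stay deterministic; in the real bug, GC does the clearing while the accumulator is still referenced from the registry alone.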
[jira] [Resolved] (SPARK-23687) Add MemoryStream
[ https://issues.apache.org/jira/browse/SPARK-23687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-23687. --- Resolution: Done Fix Version/s: 2.4.0 > Add MemoryStream > > > Key: SPARK-23687 > URL: https://issues.apache.org/jira/browse/SPARK-23687 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Assignee: Jose Torres >Priority: Major > Fix For: 2.4.0 > > > We need a MemoryStream for continuous processing, both in order to write less > fragile tests and to eventually use existing stream tests to verify > functional equivalence.
[jira] [Assigned] (SPARK-23687) Add MemoryStream
[ https://issues.apache.org/jira/browse/SPARK-23687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das reassigned SPARK-23687: - Assignee: Jose Torres > Add MemoryStream > > > Key: SPARK-23687 > URL: https://issues.apache.org/jira/browse/SPARK-23687 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Assignee: Jose Torres >Priority: Major > Fix For: 2.4.0 > > > We need a MemoryStream for continuous processing, both in order to write less > fragile tests and to eventually use existing stream tests to verify > functional equivalence.
[jira] [Resolved] (SPARK-23918) High-order function: array_min(x) → x
[ https://issues.apache.org/jira/browse/SPARK-23918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-23918. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21025 [https://github.com/apache/spark/pull/21025] > High-order function: array_min(x) → x > - > > Key: SPARK-23918 > URL: https://issues.apache.org/jira/browse/SPARK-23918 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Marco Gaido >Priority: Major > Fix For: 2.4.0 > > > Ref: https://prestodb.io/docs/current/functions/array.html > Returns the minimum value of the input array.
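As a plain-Java illustration of the semantics described above (in Spark SQL 2.4+ the equivalent would be `SELECT array_min(array(2, 1, 3))`; the class and method here are made up for the example):

```java
import java.util.Arrays;

// Plain-Java counterpart of the array_min semantics: the minimum
// value of the input array, rejecting an empty input.
public class ArrayMinDemo {
    static int arrayMin(int[] xs) {
        return Arrays.stream(xs)
                     .min()
                     .orElseThrow(() -> new IllegalArgumentException("empty array"));
    }

    public static void main(String[] args) {
        System.out.println(arrayMin(new int[]{2, 1, 3})); // prints: 1
    }
}
```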
[jira] [Assigned] (SPARK-23918) High-order function: array_min(x) → x
[ https://issues.apache.org/jira/browse/SPARK-23918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin reassigned SPARK-23918: - Assignee: Marco Gaido > High-order function: array_min(x) → x > - > > Key: SPARK-23918 > URL: https://issues.apache.org/jira/browse/SPARK-23918 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Marco Gaido >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Returns the minimum value of the input array.
[jira] [Assigned] (SPARK-23998) It may be better to add @transient to field 'taskMemoryManager' in class Task, for it is only set and used on the executor side
[ https://issues.apache.org/jira/browse/SPARK-23998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23998: Assignee: Apache Spark > It may be better to add @transient to field 'taskMemoryManager' in class > Task, for it is only set and used on the executor side > -- > > Key: SPARK-23998 > URL: https://issues.apache.org/jira/browse/SPARK-23998 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: eaton >Assignee: Apache Spark >Priority: Minor > > It may be better to add @transient to field 'taskMemoryManager' in class > Task, for it is only set and used on the executor side
[jira] [Assigned] (SPARK-23998) It may be better to add @transient to field 'taskMemoryManager' in class Task, for it is only set and used on the executor side
[ https://issues.apache.org/jira/browse/SPARK-23998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23998: Assignee: (was: Apache Spark) > It may be better to add @transient to field 'taskMemoryManager' in class > Task, for it is only set and used on the executor side > -- > > Key: SPARK-23998 > URL: https://issues.apache.org/jira/browse/SPARK-23998 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: eaton >Priority: Minor > > It may be better to add @transient to field 'taskMemoryManager' in class > Task, for it is only set and used on the executor side
[jira] [Commented] (SPARK-23998) It may be better to add @transient to field 'taskMemoryManager' in class Task, for it is only set and used on the executor side
[ https://issues.apache.org/jira/browse/SPARK-23998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16440598#comment-16440598 ] Apache Spark commented on SPARK-23998: -- User 'eatoncys' has created a pull request for this issue: https://github.com/apache/spark/pull/21084 > It may be better to add @transient to field 'taskMemoryManager' in class > Task, for it is only set and used on the executor side > -- > > Key: SPARK-23998 > URL: https://issues.apache.org/jira/browse/SPARK-23998 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: eaton >Priority: Minor > > It may be better to add @transient to field 'taskMemoryManager' in class > Task, for it is only set and used on the executor side